In this section, we discuss the different ways of adding text data sources.
If you are given a text file in which the fields are enclosed in quotes and separated by semicolons, you can easily add the text file as a data source. In this example, we use the EmpInfo.txt file from the Elixir Data Designer samples.
Choose a file system or folder and use the popup menu to select Add -> DataSource.
The DataSource Wizard appears. Select the Text DataSource and click Next.
The "Define Text DataSource" screen appears. Enter a unique name, such as Textsep.
Enter the path of the file in the text box provided. Alternatively, click the browse button to the right of the text field, and select the file from the Open dialog window.
Select ASCII as the Encoding option, as the EmpInfo.txt file is in ASCII format.
The Date format, Time format and Timestamp format text fields can retain their default values.
Select the First line is header check box.
Select the Separator Character Access type. After setting the properties, the screen appears as shown in Figure 6.6, “Sample Text DataSource”. Click Next.
Select Semicolon as the Separator and " as the Qualifier.
Click Infer Schema and click Finish.
You can open the Textsep.ds data source and click Load Data to view the output.
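The Separator access type described above behaves much like a standard delimited-text parser. The following Python sketch shows the same parsing rules (semicolon separator, " qualifier, first line as header); the sample data is hypothetical, since the contents of EmpInfo.txt are not shown here.

```python
import csv
import io

# Hypothetical sample in the layout described above: fields enclosed
# in quotes, separated by semicolons, with the first line as header.
sample = '"Eno";"Name";"Dept"\n"100";"Smith, J";"Sales"\n"101";"Lee";"IT"\n'

# delimiter plays the role of the Separator, quotechar the Qualifier.
reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar='"')
rows = list(reader)
header, records = rows[0], rows[1:]
print(header)   # ['Eno', 'Name', 'Dept']
print(records)
```

Note that the qualifier allows a separator character to appear inside a field ("Smith, J" above) without splitting it.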
Here is how to add a text file by passing a dynamic parameter to the URL.
Open the Add -> DataSource wizard to create a Text DataSource as before.
On the "Define Text DataSource" screen, enter a unique name, such as Text_URL.
Enter the URL as ${URL}. This indicates that a dynamic parameter will be inserted here when we use the data source.
Follow the rest of the steps as described in the previous section.
Click Infer Schema to bring up a dialog as shown in Figure 6.7, “Dynamic Parameters”.
Enter the URL (file path) as given below (modify according to the location of your EmpInfo.txt file; remember that in URLs all slashes are '/', even on Microsoft operating systems).
file:/C:/EmpInfo.txt
The columns in the data source are listed in the window. Click Finish.
The Text_URL data source is added to the repository.
Select and double-click the Text_URL.ds data source. Click the Load Data menu to display the Dynamic Parameters dialog window. Enter the URL in the text box. Click Finish to view the output.
You can substitute any other file URL that conforms to the same schema (i.e. has the same number and type of fields) and the data source will load it correctly.
Use the following procedure to add a text file with fixed-width columns.
Use the file system or folder popup menu to select Add -> DataSource.
The DataSource Wizard appears. Select the Text DataSource and click Next.
The "Define Text DataSource" screen appears.
Enter a unique name, such as Text_Width.
Enter the URL in the text box provided.
Alternatively, click the button to the right of the text field to invoke the selector. In this case, we select the DateJoin.txt file.
Select the First Line is Header option.
Select Fixed Width as the Access type and click Next.
Add a column named Eno with data type String and width 3.
Add a second column named Date Join with data type String and width 12.
The columns are added to the Wizard. The screen appears as shown in Figure 6.8, “Fixed Width Sample”. Click Finish.
The Text DataSource is added to the repository.
Select and double-click the Text_Width.ds data source. Click the Load Data menu to verify the output.
Perform the following steps to extract data from a log file such as the one shown in Figure 6.9, “Log file”. This can be achieved easily by using the Regular Expression Access type:
Use the filesystem or folder popup menu to select Add -> DataSource.
The DataSource Wizard appears. Select the Text DataSource and click Next.
The "Define Text DataSource" screen appears.
Enter a unique name, such as Text_RegExp.
The URL in this case will be the path to Server.log (you can use the Open dialog to find it).
Select the Access type as Regular Expression and click Next.
The "Define Text DataSource" screen appears. The Regular Expression details are entered in this screen.
Click the ... button on the right of the Regular Expression text box. The Regular Expression Designer Window appears.
In the RegExp text box, enter the Regular Expression as given below
([^:]*):(.*)
This breaks the log into two chunks. The first part, ([^:]*), reads up to the first colon. The second part, (.*), reads everything else. There is an explicit colon in the middle, so it is not part of either chunk. See a regular expression reference for more help in interpreting this syntax.
Click Test to check the syntax status. If there are no errors in the syntax, the columns are displayed in the window of the designer as shown in Figure 6.10, “Regular Expression Designer”. The first column shows the original line, the subsequent two columns (in this case) show the portion of text that matches the groups in the expression. Click Ok to assign the syntax to the Regular Expression text box.
Click Infer Schema, to infer the columns based on the groups in the regular expression.
Click Finish.
The Text_RegExp data source is added to the repository.
Select and double-click the Text_RegExp.ds data source. Click Load Data to ensure correct output.
We have split the data into two fields, but this particular log has two different formats embedded in it. The text at the bottom, below "**Unmatched Entries**", is in a different format. Notice that the line "**Unmatched Entries**" is not in the output: because it does not include a colon, it is automatically discarded. We can discard the lines below it with a Stop expression.
The following process describes how to add a log file using only part of the data from the file. To do this, the Start Expression and Stop Expressions are used along with the Regular Expression.
Create a data source as described in the previous example, or modify the one you have already created.
Click the ... button to the right of the Stop Expression text box. The Regular Expression Designer window appears.
In the Stop RegExp text box, enter "**Unmatched Entries**" (without the quotes). You will notice that a syntax error is reported, because * is a special character in regular expressions. Insert a backslash before each * to have them handled literally: "\*\*Unmatched Entries\*\*". This removes the syntax error warning. Click Test.
Only one record will be found by this expression: the line which marks the start of the second log section. Click Ok to assign the expression to the Stop Expression text box.
Click Infer Schema to infer the columns.
Click Finish.
If you view the output of this data source, you will see that only the lines up to the Stop expression are processed, and those with a different format below are now skipped. If you want only those items that are at the bottom, you could move the Stop expression to the Start expression, and now only records after the Start matches would be processed. The Text DataSource can support the extraction of multiple chunks by using combinations of start and stop.
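The combined behaviour of the record expression and the Stop expression can be sketched in a few lines of Python; the log lines below are hypothetical stand-ins for the Server.log contents:

```python
import re

record_re = re.compile(r"([^:]*):(.*)")          # one record per matching line
stop_re = re.compile(r"\*\*Unmatched Entries\*\*")  # escaped *, as in the text

lines = [
    "INFO: server started",
    "WARN: low disk space",
    "**Unmatched Entries**",
    "some line in a different format",
]

records = []
for line in lines:
    if stop_re.search(line):
        break                       # Stop expression: ignore everything after
    m = record_re.match(line)
    if m:                           # lines without a colon are discarded
        records.append((m.group(1), m.group(2)))
print(records)
```

Swapping the stop check for a "start" flag that must be seen before records are collected gives the complementary behaviour described above, where only the lines after the Start match are processed.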