The first Text DataSource properties screen is shown in Figure 6.1, “Define Text DataSource”.
The Name of the data source is entered in the Name text box and any extra description that is used to describe the data source can be entered in the Description text box.
By default, the text files are in ASCII format. If you need to access a text file with an alternate encoding, select the encoding type from the Encoding combo box.
The format of any dates in the text file is entered in the Date format text field.
The format of any times in the text file is entered in the Time format text field.
The format of any timestamps (date and time) in the text file is entered in the Timestamp format text field.
By default, the date, time and timestamp formats are specified according to ISO standards.
The URL of the file containing the text is specified in the
File URL text box. Alternatively, by clicking the browse button on the
right of the text box, you can select the file from
the Open dialog window.
Note that this URL can access not only
local files, but any files accessible through HTTP, FTP or from
the repository (use the repository: prefix - for example ftp://
).
There are three different ways of extracting data from a text file: Separator character, Fixed Width and Regular Expression.
Separator Character
On selecting the Separator Character access type and clicking Next, the screen as shown in Figure 6.2, “Separator Type Properties” appears. The separator details are entered in this screen.
The type of separator is chosen from among the different Separator options. If there is any special type of separator other than those specified, then that particular separator must be entered in the "Others" text box. The type of qualifier used in the text file can be selected or entered in the combo box.
Click Infer Schema to infer the schema.
Click Add Column to display the Add Column dialog window.
The name of the new column is entered in the text box and the data type of the field is selected from the combo box. The new column is added to the data source.
The column can be edited by selecting Edit Column. The Move Up and Move Down buttons are used to reorder the columns. Click Remove Column to delete a column.
Fixed Width
Select Fixed Width
as the access type and click
Next to display the screen shown in Figure 6.3, “Fixed Width Type Properties”.
Click Add Column to bring up the Add Column
dialog.
Enter the name of the column and select the data type of the field.
Enter the width of the column and click OK. The new column is then added to the data source.
Use the Edit Column, Move Up, Move Down and Remove Column buttons, to edit, re-order or delete the columns as necessary.
Regular Expression
Select the access type as Regular Expression
and click Next, to display the screen shown in Figure 6.4, “Regular Expression Type Properties”.
Enter the Regular Expression.
Alternatively, click the button to the right of the text box, to display a separate designer window as shown in Figure 6.5, “Regular Expression Designer”.
Enter the Regular expression in the RegExp
text box and
click the Test button to test the validity of the
expression. If the expression is not valid, syntax errors
are displayed. If there are no syntax errors, then the fields
are displayed in the window. Click Ok to
assign the expression in the Regular Expression
text box.
Regular expression processing requires the entry of up to
three regular expressions. The first one, named Regular
Expression is mandatory. This selects the records that are
included in the result and indicates how the field values are extracted
from these records. For example, a regular expression such as
^abc(.*)
will only pass through those records where the line
begins with abc
(^ marks the beginning of the line) and extracts
a single field, matched by the rest of the line (.*) meaning a
group which contains zero or more characters). Similarly,
(.*),(.*)
will extract two fields, separated by a comma. In
this case, all lines will be used, as there is no filter
criteria, like the ^abc
in the earlier example. There are
plenty of books and on-line resources which describe
regular expression syntax in detail.
So far we have only looked at one regular expression. There
are two more: Start
and Stop
.
These are optional, but if supplied will turn on, and turn off record processing. If a Start
expression is supplied, all lines will be discarded until this
Start expression is matched. Subsequent lines will be processed
as described previously until the Stop expression (if supplied)
is matched. Subsequent lines will again be discarded. A brief
example demonstrates this concept.
Summary Jon,Total Bill,Total Details Jon Jon,First Jon,Second Bill Bill,First Bill,Second Bill,Third Comments None,Nothing
Given the above text file, we can extract just the details by using these three expressions: Start = "Details", Stop = "Comments", Regular Expression = "(.*),(.*)". Of course, you leave out the quotation marks when entering expressions into the text fields. This configuration will ignore all the Summary values (even though they match the Regular Expression criteria of two fields separated by a comma). Once we reach the line which matches the start criteria "Details", the Regular Expression matching starts. The next line just contains "Jon", which does not match the Regular Expression requirements, and is therefore discarded. "Jon,First" and "Jon,Second" do match, so they are passed through as records. "Bill" does not match, so it is skipped, but "Bill,First", "Bill,Second" and "Bill,Third" are passed through. Finally, we reach a line which matches the Stop expression, so subsequent line processing is turned off (which means "None,Nothing" is discarded). To summarize, five records, each containing two fields are extracted:
Jon,First Jon,Second Bill,First Bill,Second Bill,Third
Enter the Start Expression in the Start Expression text box. Alternatively, click the button to the right of the text box, to display the Regular Expression Designer Window. The Start Expression can be entered in the RegExp text box, tested and added.
Enter the Stop Expression the Stop Expression text box. Alternatively, click the button to the right of the text box, to display the Regular Expression Designer Window. The Stop Expression can be entered in the RegExp text box, tested and added.
Click Infer Schema to infer the text data source columns based on the number of groups in the regular expression.