Text DataSource Wizard

The first Text DataSource properties screen is shown in Figure 6.1, “Define Text DataSource”.

Figure 6.1. Define Text DataSource

Define Text DataSource

The Name of the data source is entered in the Name text box and any extra description that is used to describe the data source can be entered in the Description text box.

By default, the text files are in ASCII format. If you need to access a text file with an alternate encoding, select the encoding type from the Encoding combo box.

The format of any dates in the text file is entered in the Date format text field.

The format of any times in the text file is entered in the Time format text field.

The format of any timestamps (date and time) in the text file is entered in the Timestamp format text field.

By default, the date, time and timestamp formats are specified according to ISO standards.

The URL of the file containing the text is specified in the File URL text box. Alternatively, by clicking the browse button on the right of the text box, you can select the file from the Open dialog window. Note that this URL can access not only local files, but any files accessible through HTTP, FTP or from the repository (use the repository: prefix - for example ftp://).

There are three different ways of extracting data from a text file: Separator character, Fixed Width and Regular Expression.

Separator Character

On selecting the Separator Character access type and clicking Next, the screen as shown in Figure 6.2, “Separator Type Properties” appears. The separator details are entered in this screen.

Figure 6.2. Separator Type Properties

Separator Type Properties

The type of separator is chosen from among the different Separator options. If there is any special type of separator other than those specified, then that particular separator must be entered in the "Others" text box. The type of qualifier used in the text file can be selected or entered in the combo box.

Click Infer Schema to infer the schema.

Click Add Column to display the Add Column dialog window.

The name of the new column is entered in the text box and the data type of the field is selected from the combo box. The new column is added to the data source.

The column can be edited by selecting Edit Column. The Move Up and Move Down buttons are used to reorder the columns. Click Remove Column to delete a column.

Fixed Width

Select Fixed Width as the access type and click Next to display the screen shown in Figure 6.3, “Fixed Width Type Properties”.

Click Add Column to bring up the Add Column dialog.

Figure 6.3. Fixed Width Type Properties

Fixed Width Type Properties

Enter the name of the column and select the data type of the field.

Enter the width of the column and click OK. The new column is then added to the data source.

Use the Edit Column, Move Up, Move Down and Remove Column buttons, to edit, re-order or delete the columns as necessary.

Regular Expression

Select the access type as Regular Expression and click Next, to display the screen shown in Figure 6.4, “Regular Expression Type Properties”.

Figure 6.4. Regular Expression Type Properties

Regular Expression Type Properties

Enter the Regular Expression.

Alternatively, click the button to the right of the text box, to display a separate designer window as shown in Figure 6.5, “Regular Expression Designer”.

Figure 6.5. Regular Expression Designer

Regular Expression Designer

Enter the Regular expression in the RegExp text box and click the Test button to test the validity of the expression. If the expression is not valid, syntax errors are displayed. If there are no syntax errors, then the fields are displayed in the window. Click Ok to assign the expression in the Regular Expression text box.

Regular expression processing requires the entry of up to three regular expressions. The first one, named Regular Expression is mandatory. This selects the records that are included in the result and indicates how the field values are extracted from these records. For example, a regular expression such as ^abc(.*) will only pass through those records where the line begins with abc (^ marks the beginning of the line) and extracts a single field, matched by the rest of the line (.*) meaning a group which contains zero or more characters). Similarly, (.*),(.*) will extract two fields, separated by a comma. In this case, all lines will be used, as there is no filter criteria, like the ^abc in the earlier example. There are plenty of books and on-line resources which describe regular expression syntax in detail.

So far we have only looked at one regular expression. There are two more: Start and Stop. These are optional, but if supplied will turn on, and turn off record processing. If a Start expression is supplied, all lines will be discarded until this Start expression is matched. Subsequent lines will be processed as described previously until the Stop expression (if supplied) is matched. Subsequent lines will again be discarded. A brief example demonstrates this concept.

 Summary Jon,Total Bill,Total Details Jon Jon,First 
			Jon,Second Bill Bill,First Bill,Second Bill,Third Comments 
			None,Nothing 

Given the above text file, we can extract just the details by using these three expressions: Start = "Details", Stop = "Comments", Regular Expression = "(.*),(.*)". Of course, you leave out the quotation marks when entering expressions into the text fields. This configuration will ignore all the Summary values (even though they match the Regular Expression criteria of two fields separated by a comma). Once we reach the line which matches the start criteria "Details", the Regular Expression matching starts. The next line just contains "Jon", which does not match the Regular Expression requirements, and is therefore discarded. "Jon,First" and "Jon,Second" do match, so they are passed through as records. "Bill" does not match, so it is skipped, but "Bill,First", "Bill,Second" and "Bill,Third" are passed through. Finally, we reach a line which matches the Stop expression, so subsequent line processing is turned off (which means "None,Nothing" is discarded). To summarize, five records, each containing two fields are extracted:

 Jon,First Jon,Second Bill,First Bill,Second 
			Bill,Third 

Enter the Start Expression in the Start Expression text box. Alternatively, click the button to the right of the text box, to display the Regular Expression Designer Window. The Start Expression can be entered in the RegExp text box, tested and added.

Enter the Stop Expression the Stop Expression text box. Alternatively, click the button to the right of the text box, to display the Regular Expression Designer Window. The Stop Expression can be entered in the RegExp text box, tested and added.

Click Infer Schema to infer the text data source columns based on the number of groups in the regular expression.