Working with Text DataSource

In this section, we discuss the different ways of adding text data sources.

Using Separator Characters

If you are given a text file in which the fields are enclosed within quotes and separated by semicolons, you can easily add the text file as a data source. In this example, we use the EmpInfo.txt file from the Elixir Data Designer samples.

  1. Choose a file system or folder and use the popup menu to select Add -> DataSource.

  2. The DataSource Wizard appears. Select the Text DataSource and click Next.

  3. The Define Text DataSource screen appears.

  4. Enter a unique name, such as Textsep.

  5. Enter the path of the file in the text box provided. Alternatively, click the browse button to the right of the text field, and select the file from the Open dialog window.

  6. Select ASCII as the Encoding option, as the EmpInfo.txt file is in ASCII format.

  7. The Date format, Time format and Timestamp format text fields can be left at their default values.

  8. Select the First line is header check box.

  9. Select the Separator Character Access type. After setting the properties, the screen appears as shown in Figure 6.6, “Sample Text DataSource”. Click Next.

    Figure 6.6. Sample Text DataSource

  10. Select Semicolon as the Separator.

  11. Select " as the Qualifier.

  12. Click Infer Schema and click Finish.

  13. You can open the Textsep.ds data source and click Load Data to view the output.
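The Separator and Qualifier settings above correspond to standard delimited-text parsing. As a rough sketch of what those settings mean (not Elixir's actual implementation), Python's csv module can parse the same format; the sample EmpInfo.txt contents and field names here are invented for illustration:

```python
import csv
import io

# Invented sample matching the format described above: fields quoted
# with " and separated by semicolons, first line is a header.
sample = '"Eno";"Name";"Dept"\n"100";"Alice";"Sales"\n"101";"Bob";"HR"\n'

# delimiter=';' plays the role of the Separator, quotechar='"' the Qualifier.
reader = csv.reader(io.StringIO(sample), delimiter=';', quotechar='"')
rows = list(reader)

header, records = rows[0], rows[1:]   # "First line is header"
print(header)   # ['Eno', 'Name', 'Dept']
print(records)  # [['100', 'Alice', 'Sales'], ['101', 'Bob', 'HR']]
```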

Defining a URL with a Dynamic Parameter

Here is how to add a text file by passing the URL as a dynamic parameter.

  1. Open the Add -> DataSource wizard to create a Text DataSource as before.

  2. On the "Define Text DataSource" screen, enter a unique name such as Text_URL.

  3. Enter the URL as ${URL}. This indicates that a dynamic parameter will be inserted here when we use the data source.

  4. Follow the rest of the steps as described in the previous section.

  5. Click Infer Schema to bring up a dialog as shown in Figure 6.7, “Dynamic Parameters”.

    Figure 6.7. Dynamic Parameters

  6. Enter the URL (file path) as given below (modify according to the location of your EmpInfo.txt file - remember in URLs all slashes are '/' even on Microsoft operating systems).

    file:/C:/EmpInfo.txt
  7. The columns in the data source are listed in the window. Click Finish.

  8. The Text_URL data source is added to the repository. Select and double click the Text_URL.ds data source. Click Load Data to display the Dynamic Parameters dialog window. Enter the URL in the text box. Click Finish to view the output. You can substitute any other file URL that conforms to the same schema (i.e. has the same number and type of fields) and the data source will load it correctly.
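The ${URL} placeholder is filled in at load time, when the Dynamic Parameters dialog prompts for a value. As a minimal sketch of this substitution idea (the mechanics here are an assumption, not Elixir's code), Python's string.Template uses the same ${name} syntax:

```python
from string import Template

# The data source definition stores a placeholder, e.g. ${URL}.
definition = Template("${URL}")

# At load time the user supplies the actual value, as in the
# Dynamic Parameters dialog described above.
url = definition.substitute(URL="file:/C:/EmpInfo.txt")
print(url)  # file:/C:/EmpInfo.txt
```

Any file URL with the same schema could be substituted instead, which is what makes the parameterized data source reusable.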

Using Fixed Width Columns

Use the following procedure to add a text file with fixed-width columns.

  1. Use the file system or folder popup menu to select Add -> DataSource.

  2. The DataSource Wizard appears. Select the Text DataSource and click Next.

  3. The "Define Text DataSource" screen appears.

  4. Enter a unique name, such as Text_Width.

  5. Enter the URL in the text box provided. Alternatively, click the button to the right of the text field to invoke the selector. In this case, we select the DateJoin.txt file.

  6. Select the First Line is Header option.

  7. Select Fixed Width as the Access type and click Next.

  8. Add a column named Eno with data type String and width 3.

  9. Add a second column named Date Join with data type String and width 12.

  10. The columns are added to the Wizard. The screen appears as shown in Figure 6.8, “Fixed Width Sample”. Click Finish.

    Figure 6.8. Fixed Width Sample

  11. The Text DataSource is added to the repository. Select and double click the Text_Width.ds data source. Click the Load Data menu to verify the output.
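Fixed-width access simply slices each line at the declared column widths. A minimal sketch of the idea, using the widths defined above (Eno, 3 characters; Date Join, 12 characters); the sample row is invented, and DateJoin.txt's real contents may differ:

```python
# Column definitions as entered in the wizard: (name, width).
widths = [("Eno", 3), ("Date Join", 12)]

def parse_fixed_width(line, widths):
    """Slice a line into fields at fixed character offsets."""
    record, pos = {}, 0
    for name, width in widths:
        record[name] = line[pos:pos + width].strip()
        pos += width
    return record

row = parse_fixed_width("101 12-Mar-2001", widths)
print(row)  # {'Eno': '101', 'Date Join': '12-Mar-2001'}
```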

Using Regular Expressions

Perform the following steps to extract data from a log file such as the one shown in Figure 6.9, “Log file”. This can be achieved easily using the Regular Expression Access type:

Figure 6.9. Log file

  1. Use the filesystem or folder popup menu to select Add -> DataSource.

  2. The DataSource Wizard appears. Select the Text DataSource and click Next.

  3. The "Define Text DataSource" screen appears.

  4. Enter a unique name such as Text_RegExp.

  5. The URL in this case will be the path to Server.log (you can use the Open dialog to find it).

  6. Select the Access type as Regular Expression and click Next.

  7. The "Define Text DataSource" screen appears. The Regular Expression details are entered in this screen.

  8. Click the ... button on the right of the Regular Expression text box. The Regular Expression Designer Window appears.

  9. In the RegExp text box, enter the Regular Expression as given below:

    ([^:]*):(.*)

    This breaks the log into two chunks. The first part ([^:]*) reads up to the first colon. The second part (.*) reads everything else. There is an explicit colon in the middle, so this is not part of either chunk. See a regular expression reference for more help in interpreting this syntax.

  10. Click Test to check the syntax status. If there are no errors in the syntax, the columns are displayed in the window of the designer as shown in Figure 6.10, “Regular Expression Designer”. The first column shows the original line, the subsequent two columns (in this case) show the portion of text that matches the groups in the expression. Click Ok to assign the syntax to the Regular Expression text box.

    Figure 6.10. Regular Expression Designer

  11. Click Infer Schema to infer the columns based on the groups in the regular expression.

  12. Click Finish. The Text_RegExp data source is added to the repository.

  13. Select and double click the Text_RegExp.ds data source. Click Load Data to ensure correct output. We have split the data into two fields, but this particular log has two different formats embedded in it. The text at the bottom, below "**Unmatched Entries**" is in a different format. Notice that the line "**Unmatched Entries**" is not in the output, because it does not include a colon, so it is automatically discarded. We can discard the lines below this with a Stop expression.
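The matching and discarding behaviour described above can be sketched with the same expression in Python's re module (an illustration of the concept, not Elixir's implementation; the log lines are invented):

```python
import re

# The expression from step 9: group 1 reads up to the first colon,
# group 2 reads everything after it.
pattern = re.compile(r"([^:]*):(.*)")

log_lines = [                      # invented sample lines
    "INFO: server started",
    "**Unmatched Entries**",       # no colon, so it cannot match
    "WARN: low disk space",
]

matched = []
for line in log_lines:
    m = pattern.fullmatch(line)
    if m:                          # non-matching lines are discarded
        matched.append((m.group(1), m.group(2).strip()))

print(matched)  # [('INFO', 'server started'), ('WARN', 'low disk space')]
```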

Using Start and Stop Expressions

The following process describes how to add a log file while using only part of the data from the file. To do this, the Start and Stop Expressions are used along with the Regular Expression.

  1. Create a data source as described in the previous example, or modify the one you have already created.

  2. On the "Define Text DataSource" screen, click the ... button to the right of the Stop Expression text box. The Regular Expression Designer window appears.

  3. In the Stop RegExp text box enter "**Unmatched Entries**" (without the quotes). You will notice a syntax error is reported, because * is a special character in regular expressions. Insert a backslash before each * to have them handled literally: "\*\*Unmatched Entries\*\*". This removes the syntax error warning. Click Test. Only one record is found by this expression: the line which marks the start of the second log section. Click Ok to assign the syntax to the Stop Expression text box.

  4. Click Infer Schema to infer the columns.

  5. Click Finish.

If you view the output of this data source, you will see that only the lines up to the Stop expression are processed, and those with a different format below are now skipped. If you want only those items that are at the bottom, you could move the Stop expression to the Start expression, and now only records after the Start matches would be processed. The Text DataSource can support the extraction of multiple chunks by using combinations of start and stop.
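The Stop expression behaviour described above can be sketched as follows (a conceptual illustration under the same assumptions as the earlier sketch, with invented log lines): processing records line by line halts as soon as the Stop pattern matches.

```python
import re

record_re = re.compile(r"([^:]*):(.*)")          # the record expression
stop_re = re.compile(r"\*\*Unmatched Entries\*\*")  # the Stop expression

lines = [                          # invented sample lines
    "INFO: first section",
    "WARN: still first section",
    "**Unmatched Entries**",
    "second-section line in another format",
]

records = []
for line in lines:
    if stop_re.search(line):
        break                      # Stop expression matched: skip the rest
    m = record_re.fullmatch(line)
    if m:
        records.append((m.group(1), m.group(2).strip()))

print(records)  # [('INFO', 'first section'), ('WARN', 'still first section')]
```

Swapping the pattern into a Start condition instead (skip lines until it matches) would keep only the records after it, mirroring the Start/Stop combinations described above.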