Processor

The Processor is a generic component that executes a specific data processing task. These tasks are designed to be easily customised in different versions of the tools.

The standard processors are:

Processor           Description
Cleansing
Remove Duplicates   Removes records where certain key fields are duplicates.
Validation          Checks records against column attribute constraints and appends columns detailing validation failures.
General
Attributes          Allows setting attributes on schema columns.
Debug               Outputs debugging messages to the log.
Invert Data         Allows record data to become columns in an inverted schema.
JavaScript          JavaScript-controlled record processor.
Map Field           Maps a field to a different name based on another datasource.
SQL                 Executes SQL commands upon chosen events in the flow.
Transform           Applies efficient transformations to records.
Trim                Trims all string fields.

Add Processor

To add a Processor onto the diagram, select the button on the menu bar and click on the location in the designer diagram. Use the Flow connector to connect the Processor and other processors together.

At this point the Processor is present but its processing task is not yet defined. The flow line from the Processor to the DataStore processor is dotted, which indicates that there is no output data because the Processor is not yet defined.

Edit Processor

After the processors are connected together, edit the Processor to define the type of processing task.

There are two ways to access the Processor.

  • Double-click on the Processor
  • Right-click on the Processor and select the “Properties” option from the pop-up menu

Either method launches the Processor Wizard.

The Processor Wizard allows you to select the type of processing task. The type of processing task can be divided into two groups:

  • Cleansing
  • General

Cleansing

Remove Duplicates

This task removes duplicate records. A duplicate record is determined by a set of key fields being the same. The set may contain a single field (e.g., identity card number), or a collection of fields (e.g., name and address). The fields are chosen by selecting a checkbox next to the appropriate items in the “Test” column of the schema table.

Whenever subsequent records are identified as duplicates of those already processed, the later records are always discarded. This means only the first record with a given set of key field values will be passed through. If the records are sorted so that duplicate records are adjacent, you should tick the “Input Sorted” checkbox so that the system can use an alternative algorithm that reduces memory usage. When “Input Sorted” is selected, each record is compared only against the previous record, instead of all previous records, resulting in faster performance and reduced memory requirements.
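The two modes described above can be sketched in plain JavaScript. This is an illustrative helper, not the product's actual implementation; `removeDuplicates` and its parameters are assumed names:

```javascript
// First-wins duplicate removal (illustrative sketch, not the product code).
// keyFields are the columns ticked in the "Test" column of the schema table.
function removeDuplicates(records, keyFields, inputSorted) {
  const keyOf = (r) => keyFields.map((f) => r[f]).join("\u0000");
  const out = [];
  if (inputSorted) {
    // Sorted input: compare each record only with the previous one.
    let prevKey = null;
    for (const r of records) {
      const k = keyOf(r);
      if (k !== prevKey) out.push(r);
      prevKey = k;
    }
  } else {
    // Unsorted input: remember every key seen so far (uses more memory).
    const seen = new Set();
    for (const r of records) {
      const k = keyOf(r);
      if (!seen.has(k)) { seen.add(k); out.push(r); }
    }
  }
  return out;
}
```

For example, deduplicating on name and address keeps only the first of two identical "Ann" records, matching the first-wins rule described above.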

Validation

This task checks records against column attribute constraints and reports error messages when column attributes fail validation. The output of the validation appears in two fields:

  • elx.validation.FailCount (Integer) - counts the number of failures in each row
  • elx.validation.FailMessages (String) - lists the failed attributes and column names
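The shape of this output can be sketched as follows. The helper and its constraint objects are hypothetical; the real processor derives the constraints from the column attributes, but the two appended field names are as documented above:

```javascript
// Illustrative sketch of the Validation task's output (assumed helper).
// Each constraint has a name and a test function; failures are appended
// to the record in the two documented validation fields.
function validate(record, constraints) {
  const failures = [];
  for (const [field, check] of Object.entries(constraints)) {
    if (!check.test(record[field])) failures.push(check.name + ":" + field);
  }
  return {
    ...record,
    "elx.validation.FailCount": failures.length,
    "elx.validation.FailMessages": failures.join("; "),
  };
}
```

With an enumeration constraint of “M”/“F” on Gender, a record containing “Female” would get a FailCount of 1 and a message naming the failed attribute and column.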

General

Attributes

This task allows the user to specify attributes in a convenient way.

To specify an attribute, click on the cell under the “Attributes” column of the desired field. A “Data Attribute” dialog box will appear, which allows you to add an attribute or edit/remove an existing attribute.

Click on the “Add” button to add a new attribute. Another dialog box will appear. Select the desired type from the drop-down list.

Attributes   Description
Comments     Enter text for comments in the textbox.
Enumeration  By defining the attributes of individual fields, the Data Designer can validate the record values during the cleansing process.
             For example, we might specify that the field “Gender” is a nominal enumeration of “M” and “F”. The cleansing process would then warn us of any records containing “m” or “Female”.
             Ordinal enumeration attributes allow us to specify the order in which to display data, while retaining the original order. For example, we might specify that the field “Fruit” is an ordinal enumeration of “Apple” and “Orange”. Only “Apple” and “Orange” records can pass validation in the cleansing process.
ForeignKey   This type of attribute is created by JDBC DataSources, and therefore cannot be edited.
Format       Define the format of the data.
Nullable     Specify whether the data can contain Null values.
PrimaryKey   Performs a uniqueness check: all key values must be unique.
Range        Specify the start and end values.
RegExp       Define and test a regular expression for the field syntax.
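The semantics of several attribute types above can be illustrated with simple predicates. These checks are assumptions about how the attributes behave during validation, not the product's internal code:

```javascript
// Illustrative predicates for some attribute types (assumed semantics).
const checks = {
  // Nullable=false: the value must be present.
  Nullable: (v) => v !== null && v !== undefined,
  // Range: value must lie between the start and end values.
  Range: (v, start, end) => v >= start && v <= end,
  // RegExp: value must match the defined pattern.
  RegExp: (v, re) => re.test(v),
  // Enumeration: value must be one of the permitted values.
  Enumeration: (v, values) => values.includes(v),
};
```

For instance, `checks.Enumeration("m", ["M", "F"])` fails, matching the “Gender” example above where lowercase “m” does not pass validation.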

Debug

This task allows the user to monitor the flow of records through the console without having to stop the flow. From the console output, the user can identify where the problem lies and resolve the issue.

Before being able to see the data flow through the console, the user will need to edit log-conf.xml. Look out for the following in the file and change INFO to DEBUG.

<logger name="com.elixirtech.data2">
  <level value="INFO" />
</logger>

This task is usually placed after another processor that is suspected of corrupting the data. Used this way, the Debug task makes it easier to pinpoint the source of the problem.

Invert Data

This task inverts the data records so that the rows become the columns and vice-versa. In order to convert the rows to columns, one original column must provide the names of the columns in the new schema. This column should contain unique values to ensure the resulting schema does not have duplicate column names. Select the column that will provide the column names at the top of the wizard.

Conversely, the old column names may be retained as row values. If so, enter a field name in the next text field; otherwise leave it blank. Specifying a name adds a String column to the output schema, whose values are the names of the kept columns.

The next step is to identify which old columns to keep, that is, those that should be mapped into the new inverted structure. The resulting dataset will contain only as many records as columns are marked “Keep”. Because each row becomes a column when inverted, the system will need to determine the appropriate column type in the output schema. If the selected columns contain mixed types, the String type will be used in the output schema to ensure all values can be represented.

As an example, assume a simple table like this:

A  B  C  D
1  2  3  4
5  6  7  8

Where A, B, C and D are column names. If we choose InvertColumn=B, FieldName=Field, Keep=A,D, then the output after Invert Data will be:

Field  2  6
  A    1  5
  D    4  8

As two columns were selected for keeping, there are two records in the output. The old column names become the values of the new column called “Field” (the FieldName value), and the unique values of the B column (“2” and “6”) become the column names in the new schema.
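The worked example above can be reproduced with a short sketch. `invertData` and its parameters are hypothetical names mirroring the wizard's InvertColumn, FieldName and Keep settings, not the product's API:

```javascript
// Illustrative inversion (assumed helper): invertColumn supplies the new
// column names, keep lists the old columns that become rows, and fieldName
// optionally preserves the old column names as values.
function invertData(rows, invertColumn, fieldName, keep) {
  return keep.map((col) => {
    const out = {};
    if (fieldName) out[fieldName] = col;              // old column name as value
    for (const row of rows) out[row[invertColumn]] = row[col];
    return out;
  });
}
```

Running this on the table above with InvertColumn=B, FieldName=Field and Keep=A,D yields the two output records shown, with “2” and “6” as the new column names.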

JavaScript

This task allows the user to perform any kind of record operation or schema re-arrangement that might be too complicated for regular flows. With this processor, a user can avoid many of the complex joins, filters and concatenations that would otherwise be required.

The scripts are to be entered in the text field in their respective tabs.

In the next page, you can define the output schema by either copying the input schema and/or adding new fields individually.
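The exact scripting API exposed to the script is product-specific, so as a plain JavaScript sketch only, a per-record function might reshape records in ways that would otherwise need joins or concatenations. The function name and fields here are illustrative assumptions:

```javascript
// Illustrative per-record script (assumed shape, not the product's API):
// concatenates two input fields and converts a string field to a number,
// producing the new output schema defined on the wizard's next page.
function processRecord(record) {
  return {
    FullName: record.FirstName + " " + record.LastName, // new concatenated field
    Age: Number(record.Age),                            // type conversion
  };
}
```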

Map Field

Using this option, you can reference another datasource and include its field(s) in the composite datasource. The field names can be amended if desired.

In the wizard, choose a datasource from the repository to reference. Then add the fields to map.

In the next page, click on the “Infer Schema” button to view the new output schema.
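Conceptually, the mapping takes fields from the referenced datasource's records and exposes them under (optionally renamed) names in the composite output. This is a hypothetical illustration of that idea; `mapFields` and the field names are assumptions:

```javascript
// Illustrative field mapping (assumed helper): mapping pairs a source
// field name with the name it should carry in the composite datasource.
function mapFields(record, mapping) {
  const out = {};
  for (const [src, dst] of Object.entries(mapping)) out[dst] = record[src];
  return out;
}
```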

SQL

This task allows the user to send any SQL commands to the database based on the standard flow events, after a JDBC connection is set up on the second page of the Processor Wizard.

The output of the SQL processor is the same as the input; records passing through are not modified, as the user is only interacting with the database. With this, the user can load data in bulk into the database faster.
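The pass-through behaviour can be sketched as follows. The event names (`onStart`, `onEnd`) and the helper are assumptions chosen for illustration, not the product's actual flow-event names:

```javascript
// Illustrative sketch of the SQL processor's pass-through behaviour
// (assumed event names): SQL commands run at chosen flow events, while
// the records themselves flow through completely unchanged.
function sqlProcessor(records, commands, execute) {
  if (commands.onStart) execute(commands.onStart); // e.g. disable indexes before a bulk load
  // ...records stream through here without modification...
  if (commands.onEnd) execute(commands.onEnd);     // e.g. rebuild indexes afterwards
  return records;                                  // output is identical to input
}
```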

Transform

This task allows you to transform the values of the fields.

Select the field that needs to be transformed and select the desired operation. Different operations require different settings. The screenshot above shows the fields required for the “Compare” operation.

Trim

This task removes the leading and trailing spaces of string data.
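The effect on a single record can be shown with a small sketch; `trimRecord` is an assumed helper, not the product's code:

```javascript
// Illustrative equivalent of the Trim task for one record: leading and
// trailing spaces are removed from string fields, other types untouched.
function trimRecord(record) {
  const out = {};
  for (const [k, v] of Object.entries(record))
    out[k] = typeof v === "string" ? v.trim() : v;
  return out;
}
```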

Delete Processor

To remove the Processor, simply select it and press the Delete key on your keyboard. Alternatively, right-click on the Processor to display the pop-up menu and select the “Delete Graphic” option.

Pop-up Menu

To access the pop-up menu of the Processor, right-click on the processor.

There are three functions in the pop-up menu.

Function        Description
Properties      Launches the Processor Wizard that allows you to edit the task of the Processor.
View Output     Switches to the “Data” tab, which displays the records of the datasource after the processing operation.
Delete Graphic  Deletes the Processor.