Processor

Elixir Data Designer provides a generic processor that executes specific data processing tasks. These tasks are designed to be easily customized in different versions of the tool, so you may find additional options available.

The standard processors are:

Cleansing

  • Remove Duplicates

    This task removes duplicate records. A duplicate record is determined by a set of key fields being the same. The set may contain a single field, for example an identity card number, or a collection of fields, for example name and address. The fields are chosen by selecting the checkbox next to the appropriate items in the Test column of the schema table, as shown in Figure 4.39, “Remove Duplicates Processor”.

    Figure 4.39. Remove Duplicates Processor

    Whenever subsequent records are identified as duplicates of those already processed, the later records are always discarded; only the first record with a given set of key field values is passed through. If the records are sorted so that duplicate records are adjacent, tick the Input Sorted checkbox so that the system can use an alternative algorithm: each record is then compared only against the previous record instead of all previous records, which is faster and uses less memory.
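
    The two modes can be illustrated outside the tool. The sketch below is not the processor's implementation; it is a minimal JavaScript illustration of the behaviour described above, assuming records are plain objects and the key fields are already known:

    // Illustrative only: duplicate removal keyed on a set of fields.
    // removeDuplicates remembers every key seen so far (Input Sorted unchecked);
    // removeDuplicatesSorted assumes duplicates are adjacent and keeps only the
    // previous key in memory (Input Sorted checked).
    function keyOf(record, keyFields) {
      return keyFields.map(function (f) { return String(record[f]); }).join("\u0000");
    }

    function removeDuplicates(records, keyFields) {
      var seen = {};
      var out = [];
      for (var i = 0; i < records.length; i++) {
        var k = keyOf(records[i], keyFields);
        if (!seen.hasOwnProperty(k)) {
          seen[k] = true;          // first record with this key is kept
          out.push(records[i]);
        }
      }
      return out;
    }

    function removeDuplicatesSorted(records, keyFields) {
      var out = [];
      var prev = null;             // only the previous key is remembered
      for (var i = 0; i < records.length; i++) {
        var k = keyOf(records[i], keyFields);
        if (k !== prev) {
          out.push(records[i]);
          prev = k;
        }
      }
      return out;
    }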

  • Validation

    This task checks records against column attribute constraints and reports error messages for values that fail validation. The validation output is written to two fields. One is the elx.validation.FailCount (Integer) field, which counts the number of failures in each row. The other is the elx.validation.FailMessages (String) field, which lists the failed attributes and column names. A sample output for the Range attribute is shown in Figure 4.40, “Validation Processor”.

    Figure 4.40. Validation Processor
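
    As a rough illustration of how these two fields behave, the sketch below appends them to a record after checking a single Range attribute. The field names are taken from the text above; the bounds, the checked column and the message format are illustrative assumptions, not the tool's exact output:

    // Illustrative only: append the two validation output fields to a record.
    function validateRange(record, column, min, max) {
      var failCount = 0;
      var failMessages = [];
      var value = record[column];
      if (value < min || value > max) {
        failCount++;
        failMessages.push("Range failed: " + column);  // assumed message text
      }
      record["elx.validation.FailCount"] = failCount;
      record["elx.validation.FailMessages"] = failMessages.join("; ");
      return record;
    }

    // validateRange({ Age: 150 }, "Age", 0, 120)
    // -> FailCount = 1, FailMessages = "Range failed: Age"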

General

  • Attributes

    This task allows the user to specify attributes in a convenient way. The following types of attributes are supported:

    • Comments: Enter text for comments.

    • Enumeration: By defining the attributes of individual fields, Repertoire Data Designer can then validate the record values during the cleansing process. For example, we might specify that the field "Gender" is a nominal enumeration of "M" and "F". The cleansing process would then warn us of any records containing "m" or "Female". Ordinal enumeration attributes allow us to specify the order in which to display data, while retaining the original order. For example, we might specify that the field "Fruit" is an ordinal enumeration of "Apple" and "Orange"; only "Apple" and "Orange" records can then pass validation in the cleansing process. A sketch of this kind of check appears after this list.

    • ForeignKey: This type of attribute is created by JDBC DataSources, and therefore cannot be edited.

    • Format: Define the format.

    • Nullable: Specify whether the data can contain null values.

    • PrimaryKey: Perform a uniqueness check to ensure that all key values are unique.

    • Range: Specify the start value and end value.

    • RegExp: Define a regular expression and test its syntax.
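
    As a sketch of the nominal enumeration check described above (illustrative only; the warning text is an assumption, not the tool's exact message):

    // Illustrative only: values are matched exactly, so "m" or "Female" would be flagged.
    function checkEnumeration(record, column, allowedValues) {
      var value = record[column];
      if (allowedValues.indexOf(value) === -1) {
        return "Enumeration failed: " + column + " = " + value;  // assumed message text
      }
      return null;  // value is one of the allowed members
    }

    // checkEnumeration({ Gender: "Female" }, "Gender", ["M", "F"])
    // -> "Enumeration failed: Gender = Female"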

  • Debug

    This task allows the user to monitor the flow of records through the console without having to stop the flow. From there, the user can see where the problem lies and resolve the issue.

    Before the data flow can be seen in the console, the user needs to edit log-config.xml. Look for the following entry in the file and change INFO to DEBUG.

    <logger name="com.elixirtech.data2">
      <level value="INFO" />
    </logger>
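
    After the change, the entry should read:

    <logger name="com.elixirtech.data2">
      <level value="DEBUG" />
    </logger>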

    This task is usually used after another processor that transforms the data, since that is the stage at which the data may go wrong. Debug therefore makes it easier to pinpoint the source of the problem.

  • Invert Data

    This task inverts the data records so that the rows become the columns and vice-versa. In order to convert the rows to columns, one original column must provide the names of the columns in the new schema. This column should contain unique values to ensure the resulting schema doesn't have duplicate column names. Select the column that will provide the column names at the top of the wizard, as shown in Figure 4.41, “Invert Data Processor”. Conversely, you might want the old column names to be retained as row values. If so, enter a field name in the next text field, otherwise leave it blank. Specifying a name will add a String column to the output schema, whose values will be the names of the Keep columns described below.

    Figure 4.41. Invert Data Processor

    The next step is to identify which of the old columns to keep - those that should be mapped into the new inverted structure. The resulting dataset will contain only as many records as are marked Keep. Because each row becomes a column when inverted, the system will need to determine the appropriate column type in the output schema. If the selected columns are all of the same type, then this type will be used. Alternatively, if the selected columns contain mixed types, then the String type will be used in the output schema to ensure all values can be represented.

    As an example, assume a simple table like this:

    A  B  C  D
    1  2  3  4
    5  6  7  8 

    Where A, B, C and D are the column names. If we choose InvertColumn=B, FieldName=Field, Keep=A,D then the output after Invert Data will be:

    Field 2 6
      A   1 5
      D   4 8 

    As you can see, because two columns were selected for keeping, there are two records in the output. The old column names are assigned to be the values of the new column called Field (the FieldName value), and the unique values of the B column (2 and 6) become the column names in the new schema.
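
    The same transformation can be sketched in JavaScript (illustrative only, not the processor's implementation), using the table above:

    // Illustrative only: invert rows and columns as described above.
    // invertColumn supplies the new column names, fieldName labels the old column names,
    // and keep lists the old columns that become rows.
    function invertData(records, invertColumn, fieldName, keep) {
      var out = [];
      for (var i = 0; i < keep.length; i++) {
        var row = {};
        if (fieldName) {
          row[fieldName] = keep[i];                  // old column name kept as a row value
        }
        for (var j = 0; j < records.length; j++) {
          var newColumn = records[j][invertColumn];  // 2 and 6, taken from column B
          row[newColumn] = records[j][keep[i]];
        }
        out.push(row);
      }
      return out;
    }

    // invertData([{A:1, B:2, C:3, D:4}, {A:5, B:6, C:7, D:8}], "B", "Field", ["A", "D"])
    // -> [{Field:"A", 2:1, 6:5}, {Field:"D", 2:4, 6:8}]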

  • Javascript

    This task allows the user to perform any kind of record operation or schema re-arrangement that might be too complicated for regular flows. With this processor, a user can avoid many of the complex joins, filters and concatenations that were previously required.

    The scripts are entered in the text fields of the respective tabs, as shown in Figure 4.42, “Javascript Processor”.

    Figure 4.42. Javascript Processor
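
    The objects and entry points exposed in each tab are specific to the processor and are not documented here. The snippet below is only a hypothetical illustration of the kind of per-record re-arrangement such a script could express directly, instead of building it from joins, filters and concatenations; the field names are invented for the example:

    // Hypothetical illustration only: not the Javascript Processor's actual API.
    function transform(record) {
      record.FullName = record.FirstName + " " + record.LastName;  // assumed field names
      delete record.FirstName;
      delete record.LastName;
      return record;
    }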

  • Map Field

    Using Map Field, we can reference another datasource and map its field(s) into our Composite datasource. The field names can be amended if desired.

    In the Map Field Wizard, choose a datasource from the repository to reference. Then add the fields to map.

    Figure 4.43. Map Field Wizard

  • SQL

    This task allows the user to send SQL commands to the database based on the standard flow events, once a JDBC connection has been set up on the second page of the Processor Wizard, as seen in Figure 4.44, “SQL Processor Wizard”.

    Figure 4.44. SQL Processor Wizard

    The output of the SQL Processor is the same as its input; records passing through are not modified, as the processor only interacts with the database. With this, the user can load data into the database in bulk more quickly.

  • Trim

    This task removes leading and trailing spaces from string data.