The Data Mixers page consists of three tabs, namely “Source”, “Target” and “Mixer”.
The Data Mixer tokenizes and channels the inputs. It then weighs similarity based on a number of factors, including word frequency and certain expected patterns as well as addressing common abbreviations, variations, typos and plurals. All these are kept under the user’s control, and the system provides a breakdown of the decision tree used to inform the identified matches. This makes it is easy for anyone with no technical training to understand the cause and effect of the matching system. It also ensures that the system can be configured to a wide variety of input data and does not need any training with already matched pairs.
The Data Mixer has two parts: an interactive and responsive mode for evaluating and exploring the data, and a scalable cluster/server-based “batch processing” runtime, called from ETL. It is a self-service tool that users can use on their local data and then scale out to parallel processing of large volumes once the appropriate level of confidence has been reached.
The Source and Target tabs allow you to get statistics on the text in a dataset, as well as perform some minor text manipulations.
Each tab consists of four parts:
- Three “Word Frequency” columns
- Column Definitions
- Matching Records panel
The Dataset allows you to select the desired dataset for the data mixer from the drop-down list. The Source and Target tabs should each select different datasets for comparison.
The Word Frequency columns are automatically filled with the first three fields in the dataset and list the words and their respective word counts. You can change the field of the Word Frequency by clicking on the field and select another field from the drop-down list.
You can select any word by clicking on the word itself. The subset of records containing that word will appear in the Matching Records panel.
Alternatively, you can search the list by clicking into the text box below the field and key in the search value. It is case-insensitive and displays the values that have the entered search value.
The Matching Records panel displays the subset of records containing the word that was selected in the any of the Word Frequency columns.
The Column Definitions allows you to define the data parser. There are four commands that are used in the Column Definitions:
These commands are case-sensitive.
This command allows you to remove characters or words deemed as noise.
PREPROCESS "[text-source]" "[text-target]"
For example, replacing characters “(” and “)” from the source text with space characters.
PREPROCESS "(" " " PREPROCESS ")" " "
This command allows you to perform pattern substitution using regular expressions. It allows you to turn variations into a consistent format.
PATTERN "[pattern-original]" "[pattern-target]"
For example, standardising address format, such as BLK 123, Unit 456, etc. Forming tokens which can include whitespace, so the BLK and Unit information are not split.
PATTERN "BLK*([0-9]+(\-[0-9]+)?)" Block "BLK $1" PATTERN "UNIT*([A-Z]?[0-9]+)[-]([A-Z]?[0-9]+(/[0-9]+)?)" Unit "#$1-$2"
This command allows you to replace tokens with others, a 1:1 replacement.
ALIAS text-origial text-alias
For example, replacing “M/M” with “MINIMART”.
ALIAS M/M MINIMART
This command allows you to merge tokens together, a N:1 replacement.
POSTPROCESS [separate-token] [combined-token]
For example, combining words “MINI” and “MART” into one word “MINIMART”.
POSTPROCESS "MINI MART" MINIMART
The Mixer tab allows you to configure and perform the data mixing.
It consists of several parts:
- Patch Bay
- Channel Selectors
- Master Control
- Playback Control
- Mixer Output Panel
The Patch Bay allows you to select which channel to be used for each selected field in the Source and Target tab.
The first three rows indicate the three fields selected for “Word Frequency” in the Source tab and the last three rows indicates the three fields selected for “Word Frequency” in the Target tab.
Click on the appropriate numeral to assign a channel to the column. For a fresh mix, no column is selected.
There are three Channel Selectors which allow you to select how the columns are to be mixed together. Each column in the Source and Target data sources are attached to a mixing selector.
The Master Control consists of two parameters, namely “Sample” and “Cutoff”, for the mixing in each channel. The “Sample” parameter denotes the fraction of the dataset used to generate the on-screen mix. The “Cutoff” parameter denotes the minimum certainty required before the mixed data record is displayed in the demo display.
The Playback Control is located on the right most of the Data Mixer page. It consists of buttons that allows you to run, stop or pause the playback.
By default only the “Run” button is displayed. The other two buttons will only appear when the mixing is in process.
Once the configuration in the Patch Bay, Channel Selectors and Master Control are set, click on the button to start the playback to obtain the mix. After clicking on the button, the “Pause” and “Stop” buttons will appear in the Playback Controls. The mixer output is displayed in the Mixer Output Panel. The process may take a few seconds.
You may click on the or button that appear to stop or pause and resume the playback respectively at any time.
The Mixer Output Panel displays the result of executing the mix. The leftmost column “Certainty” displays the predicted accuracy of the mix. Click on the row to see the details of the weights assigned to the various components below the selected row.
The hide the details, click on the row again.