Distinct node
Duplicate records in a data set must be removed before data mining can begin. For example, in a marketing database, individuals may appear multiple times with different address or company information. You can use the Distinct node to find or remove duplicate records in your data, or to create a single, composite record from a group of duplicate records.
To use the Distinct node, you must first define a set of key fields that determine when two records are considered to be duplicates.
If you do not pick all your fields as key fields, then two "duplicate" records may not be truly identical because they can still differ in the values of the remaining fields. In this case, you can also define a sort order that is applied within each group of duplicate records. This sort order gives you fine control over which record is treated as the first within a group. Otherwise, all duplicates are considered to be interchangeable and any record might be selected. The incoming order of the records is not taken into account, so it doesn't help to use an upstream Sort node (see "Sorting records within the Distinct node" on this page).
Mode. Specify whether to create a composite record, or to either include or exclude (discard) the first record.
- Create a composite record for each group. Provides a way for you to aggregate non-numeric fields. Selecting this option makes the Composite tab available where you specify how to create the composite records.
- Include only the first record in each group. Selects the first record from each group of duplicate records and discards the rest. The first record is determined by the sort order defined under the setting Within groups, sort records by, and not by the incoming order of the records.
- Discard only the first record in each group. Discards the first record from each group of duplicate records and selects the remainder instead. The first record is determined by the sort order defined under the setting Within groups, sort records by, and not by the incoming order of the records. This option is useful for finding duplicates in your data so that you can examine them later in the flow.
Key fields for grouping. Lists the field or fields used to determine whether records are identical. You can:
- Add fields to this list using the field picker button.
- Delete fields from the list by using the red X (remove) button.
Within groups, sort records by. Lists the fields used to determine how records are sorted within each group of duplicates, and whether they are sorted in ascending or descending order. You can:
- Add fields to this list using the field picker button.
- Delete fields from the list by using the red X (remove) button.
- Move fields using the up or down buttons, if you are sorting by more than one field.
You must specify a sort order if you have chosen to include or exclude the first record in each group, and it matters to you which record is treated as the first.
You may also want to specify a sort order if you have chosen to create a composite record, for certain options on the Composite tab.
Specify whether, by default, records are sorted in Ascending or Descending order of the sort key values.
Sorting records within the Distinct node
If the order of records within a group of duplicates is important to you, then you must specify the order using the Within groups, sort records by option in the Distinct node. Do not rely on an upstream Sort node. Remember that the incoming order of the records is not taken into account -- only the order specified within the node.
If you do not specify any sort fields (or you specify insufficient sort fields), then the records within each group of duplicates will be unordered (or incompletely ordered) and the results may be unpredictable.
For example, assume we have a very large set of log records pertaining to a number of machines. The log contains data such as the following:
Timestamp | Machine | Temperature |
---|---|---|
17:00:22 | Machine A | 31 |
13:11:30 | Machine B | 26 |
16:49:59 | Machine A | 30 |
18:06:30 | Machine X | 32 |
16:17:33 | Machine A | 29 |
19:59:04 | Machine C | 35 |
19:20:55 | Machine Y | 34 |
15:36:14 | Machine X | 28 |
12:30:41 | Machine Y | 25 |
14:45:49 | Machine C | 27 |
19:42:00 | Machine B | 34 |
20:51:09 | Machine Y | 36 |
19:07:23 | Machine X | 33 |
To reduce the number of records down to the latest record for each machine,
use Machine
as the key field and use Timestamp
as the sort field
(in descending order). The input order does not affect the result because the sort selection
specifies which of the many rows for a given Machine is to be returned, and the final data output
would be as follows.
Timestamp | Machine | Temperature |
---|---|---|
17:00:22 | Machine A | 31 |
19:42:00 | Machine B | 34 |
19:59:04 | Machine C | 35 |
19:07:23 | Machine X | 33 |
20:51:09 | Machine Y | 36 |