The function of a Merge node is to take multiple input records and create a single output record containing all or some of the input fields. This is a useful operation when you want to merge data from different sources, such as internal customer data and purchased demographic data.
You can merge data in the following ways.
- Merge by Order concatenates corresponding records from all sources in the order of input until the smallest data source is exhausted. It is important if using this option that you have sorted your data using a Sort node.
- Merge using a Key field, such as
Customer ID
, to specify how to match records from one data source with records from the other(s). Several types of joins are possible, including inner join, full outer join, partial outer join, and anti-join. - Merge by Condition means that you can specify a condition to be satisfied for the merge to take place. You can specify the condition directly in the node, or build the condition using the Expression Builder.
Types of joins
When using a key field for data merging, you may find it useful to spend some
time thinking about which records will be excluded and which will be included. There are a variety
of joins, which are discussed here. The two basic types of joins are referred to as inner and outer
joins. These methods are frequently used to merge tables from related datasets based on common
values of a key field, such as Customer ID
. Inner joins allow for clean merging and
an output dataset that includes only complete records. Outer joins also include complete records
from the merged data, but they also allow you to include unique data from one or more input tables.
The types of joins allowed are described in greater detail as follows.
An inner join includes only records in which a value for the key field is common to all input tables. That is, unmatched records won't be included in the output dataset.
A full outer join includes all records, both matching and nonmatching, from the input tables. Left and right outer joins are referred to as partial outer joins.
A partial outer join includes all records matched using the key field as well as unmatched records from specified tables. (Or, to put it another way, all records from some tables and only matching records from others.) You can select tables (such as A and B shown here) for inclusion in the outer join under the Merge options of the node properties. Partial joins are also called left or right outer joins when only two tables are being merged. Since SPSS Modeler allows the merging of more than two tables, we refer to this as a partial outer join.
An anti-join includes only unmatched records for the first input table (Table A shown here). This type of join is the opposite of an inner join and doesn't include complete records in the output dataset.
For example, if you have information about farms in one dataset and farm-related insurance claims in another, you can match the records from the first source to the second source using the Merge options. To determine if a customer in your farm sample has filed an insurance claim, use the inner join option to return a list showing where all IDs match from the two samples.
Using the full outer join option returns both matching and nonmatching records
from the input tables. The system-missing value ($null$
) will be used for any
incomplete values.
A partial outer join includes all records matched using the key field as well as unmatched records from specified tables. The table displays all of the records matched from the ID field as well as the records matched from the first dataset.
If you're using the anti-join option, the table returns only unmatched records for the first input table.