This tutorial provides an example of when you might need to reduce the input data string length. For binomial logistic regression, and auto classifier models that include a binomial logistic regression model, string fields are limited to a maximum of eight characters. Where strings are more than eight characters, you can recode them using a Reclassify node.
This example focuses on a small part of a flow to show the type of errors that might be generated with overlong strings, and explains how to use the Reclassify node to change the string details to an acceptable length. Although the example uses a binomial Logistic Regression node, you can also use the Auto Classifier node to generate a binomial Logistic Regression model.
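Before you build the flow, it can help to see which values exceed the limit. The following is a minimal sketch in plain Python, not part of SPSS Modeler, that lists the distinct string values longer than eight characters. The function name `overlong_values` and the sample list are assumptions made for illustration; the two cholesterol values are the ones used later in this tutorial.

```python
MAX_LEN = 8  # limit stated in this tutorial for binomial logistic regression


def overlong_values(values, max_len=MAX_LEN):
    """Return the distinct string values longer than max_len, in first-seen order."""
    seen = []
    for value in values:
        if isinstance(value, str) and len(value) > max_len and value not in seen:
            seen.append(value)
    return seen


# Hypothetical sample of the Cholesterol_long column (not read from the real file)
sample = [
    "High level of cholesterol",
    "Normal level of cholesterol",
    "High level of cholesterol",
]
too_long = overlong_values(sample)
print(too_long)
```

Any field for which this check returns values is a candidate for recoding with a Reclassify node.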
Try the tutorial
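In this tutorial, you will complete these tasks:
- Task 1: Open the sample project
- Task 2: Examine the Data Asset and Type node
- Task 3: Reclassify values
- Task 4: Check the Filter node
- Task 5: Define the target
- Task 6: Generate the model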
Sample modeler flow and data set
This tutorial uses the Reducing Input Data String Length flow in the sample project. The data file used is drug_long_name.csv. The following image shows the sample modeler flow.
Task 1: Open the sample project
The sample project contains several data sets and sample modeler flows. If you don't already have the sample project, then refer to the Tutorials topic to create the sample project. Then follow these steps to open the sample project:
- In Cloud Pak for Data, from the Navigation menu, choose Projects > View all Projects.
- Click SPSS Modeler Project.
- Click the Assets tab to see the data sets and modeler flows.
Check your progress
The following image shows the project Assets tab. You are now ready to work with the sample modeler flow associated with this tutorial.
Task 2: Examine the Data Asset and Type node
The Reducing Input Data String Length flow includes several nodes. Follow these steps to examine the Data Asset and Type node:
- From the Assets tab, open the Reducing Input Data String Length modeler flow, and wait for the canvas to load.
- Double-click the drug_long_name.csv node. This node is a Data Asset node that points to the drug_long_name.csv file in the project.
- Review the File format properties.
- Optional: Click Preview data to see the full data set.
- Double-click the Type node after the Data Asset node. This node specifies field properties, such as measurement level (the type of data that the field contains), and the role of each field as a target or input in modeling. The source data file uses three different measurement levels:
  - A Continuous field (such as the Age field) contains continuous numeric values.
  - A Nominal field (such as the Drug field) has two or more distinct values; in this case, drugA or drugB.
  - A Flag field (such as the Sex field) describes data with two distinct values; in this case, F and M.

  For each field, the Type node also specifies a role to indicate the part that each field plays in modeling. The Role is set to Target for the Cholesterol_long field, which indicates whether a customer has a Normal or High level of cholesterol. The target is the field for which you want to predict the value. The Role is set to Input for the other fields. Input fields are sometimes known as predictors, or fields whose values are used by the modeling algorithm to predict the value of the target field.
- Optional: Click Preview data to see the filtered data set.
Check your progress
The following image shows the Type node. You are now ready to view the Logistic node.
Task 3: Reclassify values
In this task, you run the model and discover an error. Follow these steps to reclassify the values to avoid the error:
- From the Modeling section in the palette, drag the Logistic node onto the canvas and connect it to the existing Type node after the Data Asset node.
- Double-click the Cholesterol_long node to see its properties.
- Select the Binomial procedure (instead of the default Multinomial procedure).
  - A Binomial model is used when the target field is a flag or nominal field with two discrete values.
  - A Multinomial model is used when the target field is a nominal field with more than two values.
- Click Save.
- Hover over the Cholesterol_long node, and click the Run icon. An error message warns you that the Cholesterol_long string values are too long. You can use a Reclassify node to transform the values to fix this issue. The Reclassify node is useful for collapsing categories or regrouping data for analysis.
- Double-click the Cholesterol (Reclassify) node to see its properties. Notice that the Reclassify Field is set to Cholesterol_long and the New Field Name is Cholesterol.
- Click Get values and then expand the Automatically Reclassify section. Add the Cholesterol_long values to the original value column.
- In the new value column, for the High level of cholesterol original value, type High, and for the Normal level of cholesterol original value, type Normal. These settings shorten the values to avoid the error message.
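Conceptually, the Reclassify settings in the preceding steps amount to a lookup from each original value to its shortened replacement, written to a new field. The following plain Python sketch illustrates that idea only; it is not the SPSS Modeler API. The `reclassify` function, the pass-through of unmapped values, and the sample rows (including the Age values) are assumptions made for illustration.

```python
# Original value -> new value, as entered in the Reclassify node in this task
RECLASSIFY_MAP = {
    "High level of cholesterol": "High",
    "Normal level of cholesterol": "Normal",
}


def reclassify(records, source_field, new_field, mapping):
    """Return copies of the records with new_field derived from source_field.

    Values that are missing from the mapping are passed through unchanged
    (an assumption of this sketch).
    """
    result = []
    for record in records:
        updated = dict(record)  # copy so the input rows are not modified
        original = record[source_field]
        updated[new_field] = mapping.get(original, original)
        result.append(updated)
    return result


# Hypothetical rows; only the cholesterol values come from the tutorial
rows = [
    {"Age": 23, "Cholesterol_long": "High level of cholesterol"},
    {"Age": 47, "Cholesterol_long": "Normal level of cholesterol"},
]
reclassified = reclassify(rows, "Cholesterol_long", "Cholesterol", RECLASSIFY_MAP)
print([row["Cholesterol"] for row in reclassified])
```

The new Cholesterol values are within the eight-character limit, while the original Cholesterol_long field is still present in each row.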
Check your progress
The following image shows the Reclassify node. You are now ready to check the Filter node.
Task 4: Check the Filter node
Follow these steps to see and check the Filter node:
- Double-click the Filter node to see its properties.
- Notice that this node filters out the Cholesterol_long field.
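The effect of the Filter node can be pictured as dropping one field from every record, so that only the shortened Cholesterol field reaches the model. The following is a minimal sketch in plain Python under that assumption; `filter_fields` and the sample row are hypothetical and are not part of SPSS Modeler.

```python
def filter_fields(records, drop):
    """Return copies of the records without the fields named in drop."""
    drop = set(drop)
    return [
        {name: value for name, value in record.items() if name not in drop}
        for record in records
    ]


# Hypothetical row after reclassification; the Age value is illustrative only
rows = [
    {
        "Age": 23,
        "Cholesterol_long": "High level of cholesterol",
        "Cholesterol": "High",
    },
]
filtered = filter_fields(rows, ["Cholesterol_long"])
print(sorted(filtered[0]))
```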
Check your progress
The following image shows the Filter node. You are now ready to define the target.
Task 5: Define the target
You can specify field properties in a Type node. Follow these steps to define the target in the Type node:
- Double-click the Type node after the Filter node to view its properties.
- Click Read values to read the values from your data source and set the field measurement types. The Role tells modeling nodes whether fields are Input (predictor fields) or Target (predicted fields) for a machine-learning process. Both and None are also available roles, along with Partition, which indicates a field that is used to partition records into separate samples for training, testing, and validation. The value Split specifies that separate models are built for each possible value of the field.
- For the Cholesterol field, set the role to Target.
- Click Save.
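The role assignment from these steps can be summarized as a mapping from field name to role. The following plain Python sketch is an illustration, not Modeler configuration syntax. It lists only the fields named in this tutorial, so the data file might contain others, and it assumes that exactly one field has the Target role. The helper `split_by_role` is hypothetical.

```python
# Roles for the fields that this tutorial names (the data file might contain more)
ROLES = {
    "Age": "Input",
    "Sex": "Input",
    "Drug": "Input",
    "Cholesterol": "Target",
}


def split_by_role(roles):
    """Return (input_fields, target_field); expects exactly one Target."""
    inputs = [name for name, role in roles.items() if role == "Input"]
    targets = [name for name, role in roles.items() if role == "Target"]
    if len(targets) != 1:
        raise ValueError(f"expected exactly one Target field, found {len(targets)}")
    return inputs, targets[0]


inputs, target = split_by_role(ROLES)
print(inputs, target)
```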
Check your progress
The following image shows the Type node. You are now ready to generate the model.
Task 6: Generate the model
Follow these steps to view the model output in table format:
- Hover over the Cholesterol (Logistic) node, and click the Run icon.
- From the Outputs section in the palette, drag the Table node onto the canvas, and connect it to the model nugget.
- Hover over the Table node that is connected to the Cholesterol model, and click the Run icon.
- In the Outputs and models pane, click the output results with the name Table to view the table output.
Check your progress
The following image shows the model output.
Summary
This example showed you the type of errors that might be generated with overlong strings, and explained how to use the Reclassify node to shorten the string values to an acceptable length. Although the example uses a binomial Logistic Regression node, it applies equally when you use the Auto Classifier node to generate a binomial Logistic Regression model.
Next steps
You are now ready to try other SPSS® Modeler tutorials.