This tutorial provides an example of preparing data for analysis. Preparing data is one of the most important steps in any data-mining project and, traditionally, one of the most time-consuming. The Auto Data Prep node handles the task for you: it analyzes your data and identifies fixes, screens out fields that are problematic or not likely to be useful, derives new attributes when appropriate, and improves performance through intelligent screening techniques.
You can use the Auto Data Prep node in a fully automated fashion, allowing the node to choose and apply fixes, or you can preview the changes before they're made and accept or reject them. With this node, you can ready your data for data mining quickly and easily, without the need for prior knowledge of the statistical concepts involved. If you run the node with the default settings, models tend to build and score more quickly.
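To make "screening out fields that are problematic or not likely to be useful" more concrete, the following is a minimal Python sketch of one kind of screening. It is not the Auto Data Prep node's actual algorithm. The function name `screen_fields`, the 50% missing-value threshold, and the sample records are all assumptions made up for illustration.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def screen_fields(records: List[Record], max_missing_pct: float = 50.0) -> List[str]:
    """Return the names of fields worth keeping.

    A field is dropped when too many of its values are missing, or when
    every non-missing value is identical (a constant carries no information).
    The 50% threshold is an illustrative assumption, not a product default.
    """
    kept = []
    for name in records[0]:
        values = [record[name] for record in records]
        present = [value for value in values if value is not None]
        missing_pct = 100.0 * (len(values) - len(present)) / len(values)
        if missing_pct > max_missing_pct:
            continue  # too sparse to be useful
        if len(set(present)) <= 1:
            continue  # constant field
        kept.append(name)
    return kept


# Made-up records for illustration only; None marks a missing value.
sample = [
    {"tenure": "13", "region": "2", "country": "US", "fax": None},
    {"tenure": "11", "region": "3", "country": "US", "fax": None},
    {"tenure": "68", "region": "3", "country": "US", "fax": "1"},
    {"tenure": "33", "region": "2", "country": "US", "fax": None},
]

print(screen_fields(sample))  # ['tenure', 'region']
```

In this sketch, `country` is dropped because it never varies, and `fax` is dropped because 75% of its values are missing.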
Try the tutorial
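In this tutorial, you will complete these tasks:
- Task 1: Open the sample project
- Task 2: Examine the Data Asset and Type nodes
- Task 3: Build the models
- Task 4: Compare the models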
Sample modeler flow and data set
This tutorial uses the Automated Data Preparation flow in the sample project. The data file used is telco.csv. This example demonstrates the increased accuracy that you can find by using the default Auto Data Prep node settings when building models. The following image shows the sample modeler flow.
Task 1: Open the sample project
The sample project contains several data sets and sample modeler flows. If you don't already have the sample project, then refer to the Tutorials topic to create the sample project. Then follow these steps to open the sample project:
- In Cloud Pak for Data, from the Navigation menu, choose Projects > View all Projects.
- Click SPSS Modeler Project.
- Click the Assets tab to see the data sets and modeler flows.
Check your progress
The following image shows the project Assets tab. You are now ready to work with the sample modeler flow associated with this tutorial.
Task 2: Examine the Data Asset and Type nodes
Automated Data Preparation includes several nodes. Follow these steps to examine the Data Asset and Type nodes:
- From the Assets tab, open the Automated Data Preparation modeler flow, and wait for the canvas to load.
- Double-click the telco.csv node. This node is a Data Asset node that points to the telco.csv file in the project.
- Review the File format properties.
- Optional: Click Preview data to see the full data set.
- Double-click the Type node. Notice that the measure for the churn field is set to Flag, and the role is set to Target. Make sure that the role for all other fields is set to Input.
- Optional: Click Preview data to see the data set with the Type properties applied.
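As a rough illustration of what the Type node settings mean, the Python sketch below treats a flag as a field with exactly two distinct values and assigns the Target role to one field and the Input role to the rest. The helper names `is_flag` and `assign_roles` and the sample rows are hypothetical. The Type node itself supports more measurement levels and roles than this sketch covers.

```python
from typing import Dict, List


def is_flag(values: List[str]) -> bool:
    """A flag field has exactly two distinct values (for example, 0 and 1)."""
    return len(set(values)) == 2


def assign_roles(field_names: List[str], target: str) -> Dict[str, str]:
    """Give the target field the Target role and every other field the Input role."""
    return {name: ("Target" if name == target else "Input") for name in field_names}


# Made-up rows for illustration; the real telco.csv has more fields and records.
rows = [
    {"tenure": "13", "income": "64", "churn": "1"},
    {"tenure": "11", "income": "136", "churn": "1"},
    {"tenure": "68", "income": "116", "churn": "0"},
]

churn_values = [row["churn"] for row in rows]
roles = assign_roles(list(rows[0]), target="churn")

print(is_flag(churn_values))  # True
print(roles)
```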
Check your progress
The following image shows the Type node. You are now ready to build the model.
Task 3: Build the models
You will build two models, one model without and one model with automated data preparation. Follow these steps to build the models:
- Double-click the No ADP - churn node that is connected to the Type node to see its properties.
- Expand the Model Settings section.
- Verify that the Procedure is set to Binomial.
- Verify that the Model Name is set to Custom, and the name is No ADP - churn.
- Hover over the No ADP - churn node, and click the Run icon.
- In the Outputs and models pane, click the model with the name No ADP - churn to view the results.
- View the Model summary page, which shows predictor fields that are used by the model and the percentage of the predictions that are correct.
- View the Case Processing Summary, which shows the number and percentage of records that are included in the analysis. In addition, it lists the number of missing cases (if any) where one or more of the input fields are unavailable and any cases that were not selected.
- Close the model details.
- Double-click the Auto Data Prep node that is connected to the Type node to see its properties. Automated Data Preparation handles the data preparation task for you, analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques.
- In the Objectives section, leave the default settings in place to analyze and prepare your data by balancing both speed and accuracy. Other Auto Data Prep node properties provide the option to specify that you want to concentrate more on accuracy, more on the speed of processing, or to fine-tune many of the processing steps for data preparation.
  Note: If you later adjust the node properties and want to run the flow again, you must first click Clear old analysis under Objectives, because the model already exists.
- Optional: Click Preview data to see the data set with the Auto Data Prep properties that are applied.
- Click Cancel.
- Double-click the After ADP - churn node that is connected to the Auto Data Prep node to see its properties.
- Expand the Model Settings section.
- Verify that the Procedure is set to Binomial.
- Verify that the Model Name is set to Custom, and the name is After ADP - churn.
- Hover over the After ADP - churn node, and click the Run icon.
- In the Outputs and models pane, click the model with the name After ADP - churn to view the results.
- View the Model summary page, which shows predictor fields that are used by the model and the percentage of the predictions that are correct.
- View the Case Processing Summary, which shows the number and percentage of records that are included in the analysis. In addition, it lists the number of missing cases (if any) where one or more of the input fields are unavailable and any cases that were not selected.
- Close the model details.
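The Case Processing Summary described in the steps above reports counts and percentages of included and missing cases. The following Python sketch shows one way to compute such a summary. It assumes that a case counts as missing when any input field is unavailable, and it ignores unselected cases. The function name `case_processing_summary` and the sample records are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def case_processing_summary(
    records: List[Record], input_fields: List[str]
) -> Dict[str, Tuple[int, float]]:
    """Return (count, percent) pairs for included, missing, and total cases."""
    total = len(records)
    if total == 0:
        raise ValueError("at least one record is required")
    # A case is missing when one or more of its input fields is unavailable.
    missing = sum(
        1 for record in records
        if any(record.get(field) is None for field in input_fields)
    )
    included = total - missing
    return {
        "included": (included, 100.0 * included / total),
        "missing": (missing, 100.0 * missing / total),
        "total": (total, 100.0),
    }


# Made-up records for illustration only; None marks an unavailable value.
records = [
    {"tenure": "13", "income": "64", "churn": "1"},
    {"tenure": None, "income": "136", "churn": "1"},
    {"tenure": "68", "income": "116", "churn": "0"},
    {"tenure": "33", "income": "33", "churn": "0"},
]

summary = case_processing_summary(records, ["tenure", "income"])
print(summary)
```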
Check your progress
The following image shows model details. You are now ready to compare the models.
Task 4: Compare the models
Now that both models are configured, follow these steps to generate and compare the models:
- Hover over the No ADP - LogReg (Analysis) node, and click the Run icon.
- Hover over the After ADP - LogReg (Analysis) node, and click the Run icon.
- In the Outputs and models pane, click the output results with the name No ADP - LogReg to view the results.
- Compare the models:
- Click Compare.
- In the Select output field, select After ADP - LogReg.
The analysis of the model that was built without Auto Data Prep shows that running the data through the Logistic Regression node with its default settings gives a model with low accuracy: just 10.6% correct. The analysis of the model that was built after Auto Data Prep shows that running the data through the default Auto Data Prep settings first produces a much more accurate model that is 78.3% correct.
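The "percent correct" figures compared here are the share of records whose predicted value matches the actual value. The Python sketch below shows that calculation on made-up values. The function name `percent_correct` and the sample lists are assumptions for illustration and do not reproduce the tutorial's results.

```python
from typing import List


def percent_correct(actual: List[str], predicted: List[str]) -> float:
    """Return the percentage of predictions that match the actual values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one record is required")
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * hits / len(actual)


# Made-up churn values and predictions from two hypothetical models.
actual = ["1", "0", "0", "1", "0"]
predicted_without_prep = ["0", "0", "1", "1", "0"]
predicted_with_prep = ["1", "0", "0", "1", "1"]

print(percent_correct(actual, predicted_without_prep))  # 60.0
print(percent_correct(actual, predicted_with_prep))     # 80.0
```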
Check your progress
The following image shows the model comparison.
Summary
By running the Auto Data Prep node to fine-tune the processing of your data, you were able to build a more accurate model with little direct data manipulation.
If you're interested in proving or disproving a certain theory, or want to build specific models, you might find it beneficial to work directly with the model settings. However, if you have limited time or a large amount of data to prepare, the Auto Data Prep node may give you an advantage.
The results in this example are based on the training data only. To assess how well models generalize to other data in the real world, you can use a Partition node to hold out a subset of records for purposes of testing and validation.
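The idea of holding out records for testing can be sketched in a few lines of Python. This is a generic random split, not the Partition node's implementation. The function name `partition`, the 30% test fraction, and the fixed seed are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def partition(
    records: List[T], test_fraction: float = 0.3, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Randomly split records into (training, testing) subsets."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(records)   # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten stand-in record IDs; a real run would use the rows of the data set.
records = list(range(10))
train, test = partition(records)
print(len(train), len(test))  # 7 3
```

You would build the model on the training subset only and then measure accuracy on the testing subset, which the model has not seen.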
Next steps
You are now ready to try other SPSS® Modeler tutorials.