Automate data preparation
Last updated: Dec 11, 2024

This tutorial provides an example of preparing data for analysis. Preparing data is one of the most important steps in any data-mining project, and traditionally, one of the most time-consuming. The Auto Data Prep node handles the task for you, analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques.

You can use the Auto Data Prep node in a fully automated fashion, allowing the node to choose and apply fixes, or you can preview the changes before they're made and accept or reject them. With this node, you can ready your data for data mining quickly and easily, without the need for prior knowledge of the statistical concepts involved. If you run the node with the default settings, models tend to build and score more quickly.
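Conceptually, the kinds of fixes the node applies can be sketched in a few lines of plain Python. The sketch below is illustrative only (the field names, thresholds, and median imputation are assumptions for the example, not the node's actual defaults): it screens out mostly-missing and constant fields and imputes the remaining missing values.

```python
import statistics

def auto_prep(records, max_missing_frac=0.5, min_variance=1e-9):
    """Illustrative sketch of automated data preparation: drop fields
    that are mostly missing or near-constant, and impute the remaining
    missing values with the field median (numeric fields only)."""
    fields = records[0].keys()
    prepared = {}
    for f in fields:
        values = [r[f] for r in records]
        present = [v for v in values if v is not None]
        # Screen out fields with too many missing values.
        if len(present) / len(values) < 1 - max_missing_frac:
            continue
        # Screen out near-constant fields (little predictive value).
        if len(present) > 1 and statistics.pvariance(present) < min_variance:
            continue
        # Impute the remaining missing values with the median.
        med = statistics.median(present)
        prepared[f] = [v if v is not None else med for v in values]
    return prepared

# Hypothetical records standing in for rows of a data set like telco.csv.
rows = [
    {"tenure": 12, "income": None, "constant": 1},
    {"tenure": 40, "income": 72.5, "constant": 1},
    {"tenure": None, "income": 30.0, "constant": 1},
    {"tenure": 8, "income": 55.0, "constant": 1},
]
prepped = auto_prep(rows)
# The "constant" field is screened out; missing values are imputed.
```

The real node applies many more transformations (date handling, outlier trimming, target-based feature construction); this sketch only conveys the screen-and-fix idea.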

Try the tutorial

In this tutorial, you will complete these tasks:

  - Task 1: Open the sample project
  - Task 2: Examine the Data Asset and Type nodes
  - Task 3: Build the models
  - Task 4: Compare the models

Sample modeler flow and data set

This tutorial uses the Automated Data Preparation flow in the sample project. The data file used is telco.csv. This example demonstrates the increased accuracy that you can achieve by using the default Auto Data Prep node settings when building models. The following image shows the sample modeler flow.

Figure 1. Sample modeler flow
Auto Data Prep example flow
The following image shows the sample data set.
Figure 2. Sample data set

Task 1: Open the sample project

The sample project contains several data sets and sample modeler flows. If you don't already have the sample project, then refer to the Tutorials topic to create the sample project. Then follow these steps to open the sample project:

  1. In Cloud Pak for Data, from the Navigation menu, choose Projects > View all Projects.
  2. Click SPSS Modeler Project.
  3. Click the Assets tab to see the data sets and modeler flows.

Check your progress

The following image shows the project Assets tab. You are now ready to work with the sample modeler flow associated with this tutorial.

Sample project


Task 2: Examine the Data Asset and Type nodes

Automated Data Preparation includes several nodes. Follow these steps to examine the Data Asset and Type nodes:

  1. From the Assets tab, open the Automated Data Preparation modeler flow, and wait for the canvas to load.
  2. Double-click the telco.csv node. This node is a Data Asset node that points to the telco.csv file in the project.
  3. Review the File format properties.
  4. Optional: Click Preview data to see the full data set.
  5. Double-click the Type node. Notice that the measure for the churn field is set to Flag, and the role is set to Target. Make sure that the role for all other fields is set to Input.
    Figure 3. Set the measurement level and role
  6. Optional: Click Preview data to see the data set with the Type properties applied.
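The Type node's settings amount to per-field metadata: each field carries a measurement level and a modeling role. A minimal sketch of that idea in Python (the field names other than churn are illustrative, not the full telco.csv schema):

```python
# Per-field metadata as the Type node conceives it: a measurement
# level plus a modeling role for each field.
type_spec = {
    "churn":  {"measure": "flag",       "role": "target"},
    "tenure": {"measure": "continuous", "role": "input"},
    "region": {"measure": "nominal",    "role": "input"},
}

def split_roles(type_spec):
    """Separate the target field(s) from the input fields, as a
    downstream modeling node would."""
    targets = [f for f, meta in type_spec.items() if meta["role"] == "target"]
    inputs = [f for f, meta in type_spec.items() if meta["role"] == "input"]
    return targets, inputs

targets, inputs = split_roles(type_spec)
# churn is the single target; every other field is an input.
```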

Check your progress

The following image shows the Type node. You are now ready to build the model.

Type node


Task 3: Build the models

You will build two models, one model without and one model with automated data preparation. Follow these steps to build the models:

  1. Double-click the No ADP - churn node that is connected to the Type node to see its properties.
    1. Expand the Model Settings section.
    2. Verify that the Procedure is set to Binomial.
    3. Verify that the Model Name is set to Custom, and the name is No ADP - churn.
      Figure 4. Logistic node Model Settings section
      Choose model options
  2. Hover over the No ADP - churn node, and click the Run icon.
  3. In the Outputs and models pane, click the model with the name No ADP - churn to view the results.
    1. View the Model summary page, which shows predictor fields that are used by the model and the percentage of the predictions that are correct.
    2. View the Case Processing Summary, which shows the number and percentage of records that are included in the analysis. In addition, it lists the number of missing cases (if any) where one or more of the input fields are unavailable and any cases that were not selected.
    3. Close the model details.
  4. Double-click the Auto Data Prep node that is connected to the Type node to see its properties. Automated Data Preparation handles the data preparation task for you, analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques.
    1. In the Objectives section, leave the default settings in place to analyze and prepare your data by balancing both speed and accuracy. Other Auto Data Prep node properties provide the option to specify that you want to concentrate more on accuracy, more on the speed of processing, or to fine-tune many of the processing steps for data preparation.
      Note: Because the model already exists, if you want to adjust the node properties and run the flow again in the future, you must first click Clear old analysis under Objectives before you run the flow again.
    2. Optional: Click Preview data to see the data set with the Auto Data Prep properties that are applied.
    3. Click Cancel.
  5. Double-click the After ADP - churn node that is connected to the Auto Data Prep node to see its properties.
    1. Expand the Model Settings section.
    2. Verify that the Procedure is set to Binomial.
    3. Verify that the Model Name is set to Custom, and the name is After ADP - churn.
  6. Hover over the After ADP - churn node, and click the Run icon.
  7. In the Outputs and models pane, click the model with the name After ADP - churn to view the results.
    1. View the Model summary page, which shows predictor fields that are used by the model and the percentage of the predictions that are correct.
    2. View the Case Processing Summary, which shows the number and percentage of records that are included in the analysis. In addition, it lists the number of missing cases (if any) where one or more of the input fields are unavailable and any cases that were not selected.
    3. Close the model details.
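Both nodes fit a binomial logistic regression, which models the probability of churn as a sigmoid of a weighted sum of the inputs. The sketch below is a toy stand-in trained by gradient descent on made-up data (one predictor, illustrative values) and is not SPSS Modeler's implementation:

```python
import math

def train_logistic(xs, ys, epochs=2000, lr=0.5):
    """Minimal binomial logistic regression with one predictor plus an
    intercept, fit by batch gradient descent. Illustrative only."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))  # predicted churn probability
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Toy data: longer tenure tends to mean less churn (1 = churned).
xs = [1, 2, 3, 10, 11, 12]
ys = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(xs, ys)
preds = [1 if 1 / (1 + math.exp(-(w * x + b))) > 0.5 else 0 for x in xs]
accuracy = sum(p == y for p, y in zip(preds, ys)) / len(ys)
```

The accuracy computed here on the training data corresponds to the "percentage of the predictions that are correct" figure in the model summary.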

Check your progress

The following image shows model details. You are now ready to compare the models.

Model details


Task 4: Compare the models

Now that both models are configured, follow these steps to generate and compare the models:

  1. Hover over the No ADP - LogReg (Analysis) node, and click the Run icon.
  2. Hover over the After ADP - LogReg (Analysis) node, and click the Run icon.
  3. In the Outputs and models pane, click the output results with the name No ADP - LogReg to view the results.
  4. Compare the models:
    1. Click Compare.
    2. In the Select output field, select After ADP - LogReg.
    The analysis of the model built without Auto Data Prep shows that simply running the data through the Logistic Regression node with its default settings gives a model with low accuracy: just 10.6% correct.
    Figure 5. Non ADP-derived model results
    The analysis of the Auto Data Prep-derived model shows that by running the data through the default Auto Data Prep settings, you have built a much more accurate model that is 78.3% correct.
    Figure 6. ADP-derived model results
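The "correct" percentage that the Analysis node reports is simply the fraction of records where the predicted churn value matches the actual one. A minimal sketch, with made-up predictions chosen only to illustrate the comparison:

```python
def accuracy(actual, predicted):
    """Fraction of records where the prediction matches the actual
    value -- the 'correct' percentage an Analysis node reports."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Illustrative churn values and two sets of hypothetical predictions.
actual    = ["yes", "no", "no", "yes", "no"]
no_adp    = ["no", "yes", "no", "no", "no"]
after_adp = ["yes", "no", "no", "yes", "yes"]
# accuracy(actual, no_adp) -> 0.4; accuracy(actual, after_adp) -> 0.8
```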

Check your progress

The following image shows the model comparison.

Compare models


Summary

By running the Auto Data Prep node to fine-tune the processing of your data, you were able to build a more accurate model with little direct data manipulation.

Obviously, if you're interested in proving or disproving a certain theory, or want to build specific models, you might find it beneficial to work directly with the model settings. However, if you have limited time or a large amount of data to prepare, the Auto Data Prep node may give you an advantage.

The results in this example are based on the training data only. To assess how well models generalize to other data in the real world, you can use a Partition node to hold out a subset of records for purposes of testing and validation.
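A Partition node's holdout can be sketched as a seeded shuffle-and-split; the 70/30 ratio and seed below are illustrative assumptions, not Modeler defaults:

```python
import random

def partition(records, train_frac=0.7, seed=42):
    """Minimal sketch of a Partition node: shuffle the records with a
    fixed seed and hold out a test subset for validation."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = partition(list(range(100)))
# 70 records for training, 30 held out for testing.
```

Scoring the held-out records with a model trained only on the training partition gives a less optimistic, more realistic accuracy estimate than the training-data figures above.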

Next steps

You are now ready to try other SPSS® Modeler tutorials.
