Automate data preparation
Last updated: Dec 11, 2024

This tutorial provides an example of preparing data for analysis. Preparing data is one of the most important steps in any data-mining project, and traditionally, one of the most time-consuming. The Auto Data Prep node handles the task for you, analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques.

You can use the Auto Data Prep node in a fully automated fashion, allowing the node to choose and apply fixes, or you can preview the changes before they're made and accept or reject them. With this node, you can ready your data for data mining quickly and easily, without the need for prior knowledge of the statistical concepts involved. If you run the node with the default settings, models tend to build and score more quickly.
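Conceptually, the kinds of fixes the node applies can be sketched in a few lines of plain Python. The sketch below is illustrative only (the field names, thresholds, and median imputation are assumptions for the example, not the node's actual defaults): it screens out mostly-missing and constant fields and imputes the remaining missing values.

```python
import statistics

def auto_prep(records, max_missing_frac=0.5, min_variance=1e-9):
    """Illustrative sketch of automated data preparation: drop fields
    that are mostly missing or near-constant, and impute the remaining
    missing values with the field median (numeric fields only)."""
    fields = records[0].keys()
    prepared = {}
    for f in fields:
        values = [r[f] for r in records]
        present = [v for v in values if v is not None]
        # Screen out fields with too many missing values.
        if len(present) / len(values) < 1 - max_missing_frac:
            continue
        # Screen out near-constant fields (little predictive value).
        if len(present) > 1 and statistics.pvariance(present) < min_variance:
            continue
        # Impute the remaining missing values with the median.
        med = statistics.median(present)
        prepared[f] = [v if v is not None else med for v in values]
    return prepared

# Hypothetical records standing in for rows of a data set like telco.csv.
rows = [
    {"tenure": 12, "income": None, "constant": 1},
    {"tenure": 40, "income": 72.5, "constant": 1},
    {"tenure": None, "income": 30.0, "constant": 1},
    {"tenure": 8, "income": 55.0, "constant": 1},
]
prepped = auto_prep(rows)
# The "constant" field is screened out; missing values are imputed.
```

The real node applies many more transformations (date handling, outlier trimming, target-based feature construction); this sketch only conveys the screen-and-fix idea.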

Try the tutorial

In this tutorial, you will complete these tasks:

  - Task 1: Open the sample project
  - Task 2: Examine the Data Asset and Type nodes
  - Task 3: Build the models
  - Task 4: Compare the models

Sample modeler flow and data set

This tutorial uses the Automated Data Preparation flow in the sample project. The data file used is telco.csv. This example demonstrates the increased accuracy that you can achieve by using the default Auto Data Prep node settings when building models. The following image shows the sample modeler flow.

Figure 1. Sample modeler flow
Auto Data Prep example flow
The following image shows the sample data set.
Figure 2. Sample data set

Task 1: Open the sample project

The sample project contains several data sets and sample modeler flows. If you don't already have the sample project, then refer to the Tutorials topic to create the sample project. Then follow these steps to open the sample project:

  1. In Cloud Pak for Data, from the Navigation menu, choose Projects > View all Projects.
  2. Click SPSS Modeler Project.
  3. Click the Assets tab to see the data sets and modeler flows.

Check your progress

The following image shows the project Assets tab. You are now ready to work with the sample modeler flow associated with this tutorial.

Sample project


Task 2: Examine the Data Asset and Type nodes

Automated Data Preparation includes several nodes. Follow these steps to examine the Data Asset and Type nodes:

  1. From the Assets tab, open the Automated Data Preparation modeler flow, and wait for the canvas to load.
  2. Double-click the telco.csv node. This node is a Data Asset node that points to the telco.csv file in the project.
  3. Review the File format properties.
  4. Optional: Click Preview data to see the full data set.
  5. Double-click the Type node. Notice that the measure for the churn field is set to Flag, and the role is set to Target. Make sure that the role for all other fields is set to Input.
    Figure 3. Set the measurement level and role
  6. Optional: Click Preview data to see the data set with the Type properties applied.
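The Type node's settings amount to per-field metadata: each field carries a measurement level and a modeling role. A minimal sketch of that idea in Python (the field names other than churn are illustrative, not the full telco.csv schema):

```python
# Per-field metadata as the Type node conceives it: a measurement
# level plus a modeling role for each field.
type_spec = {
    "churn":  {"measure": "flag",       "role": "target"},
    "tenure": {"measure": "continuous", "role": "input"},
    "region": {"measure": "nominal",    "role": "input"},
}

def split_roles(type_spec):
    """Separate the target field(s) from the input fields, as a
    downstream modeling node would."""
    targets = [f for f, meta in type_spec.items() if meta["role"] == "target"]
    inputs = [f for f, meta in type_spec.items() if meta["role"] == "input"]
    return targets, inputs

targets, inputs = split_roles(type_spec)
# churn is the single target; every other field is an input.
```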

Check your progress

The following image shows the Type node. You are now ready to build the model.

Type node


Task 3: Build the models

You will build two models, one model without and one model with automated data preparation. Follow these steps to build the models:

  1. Double-click the No ADP - churn node that is connected to the Type node to see its properties.
    1. Expand the Model Settings section.
    2. Verify that the Procedure is set to Binomial.
    3. Verify that the Model Name is set to Custom, and the name is No ADP - churn.
      Figure 4. Logistic node Model Settings section
      Choose model options
  2. Hover over the No ADP - churn node, and click the Run icon.
  3. In the Outputs and models pane, click the model with the name No ADP - churn to view the results.
    1. View the Model summary page, which shows predictor fields that are used by the model and the percentage of the predictions that are correct.
    2. View the Case Processing Summary, which shows the number and percentage of records that are included in the analysis. In addition, it lists the number of missing cases (if any) where one or more of the input fields are unavailable and any cases that were not selected.
    3. Close the model details.
  4. Double-click the Auto Data Prep node that is connected to the Type node to see its properties. Automated Data Preparation handles the data preparation task for you, analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques.
    1. In the Objectives section, leave the default settings in place to analyze and prepare your data by balancing both speed and accuracy. Other Auto Data Prep node properties provide the option to specify that you want to concentrate more on accuracy, more on the speed of processing, or to fine-tune many of the processing steps for data preparation.
      Note: Because the model already exists, if you want to adjust the node properties and run the flow again in the future, you must first click Clear old analysis under Objectives before you run the flow again.
    2. Optional: Click Preview data to see the data set with the Auto Data Prep properties that are applied.
    3. Click Cancel.
  5. Double-click the After ADP - churn node that is connected to the Auto Data Prep node to see its properties.
    1. Expand the Model Settings section.
    2. Verify that the Procedure is set to Binomial.
    3. Verify that the Model Name is set to Custom, and the name is After ADP - churn.
  6. Hover over the After ADP - churn node, and click the Run icon.
  7. In the Outputs and models pane, click the model with the name After ADP - churn to view the results.
    1. View the Model summary page, which shows predictor fields that are used by the model and the percentage of the predictions that are correct.
    2. View the Case Processing Summary, which shows the number and percentage of records that are included in the analysis. In addition, it lists the number of missing cases (if any) where one or more of the input fields are unavailable and any cases that were not selected.
    3. Close the model details.
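Both nodes fit a binomial logistic regression, which models the probability of churn as a sigmoid of a weighted sum of the inputs. The sketch below is a toy stand-in trained by gradient descent on made-up data (one predictor, illustrative values) and is not SPSS Modeler's implementation:

```python
import math

def train_logistic(xs, ys, epochs=2000, lr=0.5):
    """Minimal binomial logistic regression with one predictor plus an
    intercept, fit by batch gradient descent. Illustrative only."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))  # predicted churn probability
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Toy data: longer tenure tends to mean less churn (1 = churned).
xs = [1, 2, 3, 10, 11, 12]
ys = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(xs, ys)
preds = [1 if 1 / (1 + math.exp(-(w * x + b))) > 0.5 else 0 for x in xs]
accuracy = sum(p == y for p, y in zip(preds, ys)) / len(ys)
```

The accuracy computed here on the training data corresponds to the "percentage of the predictions that are correct" figure in the model summary.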

Check your progress

The following image shows model details. You are now ready to compare the models.

Model details


Task 4: Compare the models

Now that both models are configured, follow these steps to generate and compare the models:

  1. Hover over the No ADP - LogReg (Analysis) node, and click the Run icon.
  2. Hover over the After ADP - LogReg (Analysis) node, and click the Run icon.
  3. In the Outputs and models pane, click the output results with the name No ADP - LogReg to view the results.
  4. Compare the models:
    1. Click Compare.
    2. In the Select output field, select After ADP - LogReg.
    The analysis of the model built without Auto Data Prep shows that simply running the data through the Logistic Regression node with its default settings gives a model with low accuracy: just 10.6% correct.
    Figure 5. Non ADP-derived model results
    The analysis of the Auto Data Prep-derived model shows that by running the data through the default Auto Data Prep settings, you have built a much more accurate model that is 78.3% correct.
    Figure 6. ADP-derived model results
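The "correct" percentage that the Analysis node reports is simply the fraction of records where the predicted churn value matches the actual one. A minimal sketch, with made-up predictions chosen only to illustrate the comparison:

```python
def accuracy(actual, predicted):
    """Fraction of records where the prediction matches the actual
    value -- the 'correct' percentage an Analysis node reports."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Illustrative churn values and two sets of hypothetical predictions.
actual    = ["yes", "no", "no", "yes", "no"]
no_adp    = ["no", "yes", "no", "no", "no"]
after_adp = ["yes", "no", "no", "yes", "yes"]
# accuracy(actual, no_adp) -> 0.4; accuracy(actual, after_adp) -> 0.8
```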

Check your progress

The following image shows the model comparison.

Compare models


Summary

By running the Auto Data Prep node to fine-tune the processing of your data, you were able to build a more accurate model with little direct data manipulation.

Obviously, if you're interested in proving or disproving a certain theory, or want to build specific models, you might find it beneficial to work directly with the model settings. However, if you have limited time or a large amount of data to prepare, the Auto Data Prep node may give you an advantage.

The results in this example are based on the training data only. To assess how well models generalize to other data in the real world, you can use a Partition node to hold out a subset of records for purposes of testing and validation.
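A Partition node's holdout can be sketched as a seeded shuffle-and-split; the 70/30 ratio and seed below are illustrative assumptions, not Modeler defaults:

```python
import random

def partition(records, train_frac=0.7, seed=42):
    """Minimal sketch of a Partition node: shuffle the records with a
    fixed seed and hold out a test subset for validation."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = partition(list(range(100)))
# 70 records for training, 30 held out for testing.
```

Scoring the held-out records with a model trained only on the training partition gives a less optimistic, more realistic accuracy estimate than the training-data figures above.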

Next steps

You are now ready to try other SPSS® Modeler tutorials.
