Data imputation in AutoAI experiments
Data imputation replaces missing values in your data set with substituted values. If you enable imputation, you can specify how the missing values in your data are filled in.
Imputation by experiment type
Imputation methods depend on the type of experiment that you build.
- For classification and regression experiments, you can configure categorical and numerical imputation methods.
- For time series experiments, you can choose from a set of imputation methods to apply to numerical columns. When the experiment runs, the best-performing method from the set is applied automatically. You can also specify a fixed replacement value.
Enabling imputation
To view and set imputation options:
- Click Experiment settings when you configure your experiment.
- Click the Data source option.
- Click Enable data imputation. Note that if you do not explicitly enable data imputation but your data source has missing values, AutoAI warns you and applies default imputation methods. See imputation details.
- Select options in the Imputation section.
- Optionally, set a threshold for the percentage of missing values that is acceptable in a column of data. If the percentage of missing values in a column exceeds the specified threshold, the experiment fails. To resolve the failure, update the data source or adjust the threshold.
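Before you set a threshold, it can help to check how much of each column is missing. The following is a minimal sketch that uses pandas, not AutoAI itself; the function name `columns_over_threshold`, the sample data, and the 30% threshold are assumptions chosen for illustration.

```python
import pandas as pd


def columns_over_threshold(df, threshold_pct):
    """Return {column: percent missing} for columns above threshold_pct."""
    # isna() marks missing cells; the column mean of that mask is the
    # fraction of missing values, which is scaled to a percentage.
    missing_pct = df.isna().mean() * 100
    return missing_pct[missing_pct > threshold_pct].to_dict()


# Hypothetical sample data: "age" is 50% missing, "income" is 25% missing.
df = pd.DataFrame({
    "age": [34, None, 51, None],
    "income": [52000, 61000, None, 58000],
    "city": ["Austin", "Boston", "Denver", "Boston"],
})

over = columns_over_threshold(df, threshold_pct=30)
print(over)
```

Any column that this check reports is one you would either repair in the data source or accommodate by raising the threshold.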
Configuring imputation for classification and regression experiments
Choose one of these methods for imputing missing data in binary classification, multiclass classification, or regression experiments. You can choose one method for text-based (categorical) data and another for numerical data.
Method | Description |
---|---|
Most frequent | Replaces each missing value with the value that appears most frequently in the column. |
Median | Replaces each missing value with the value in the middle of the sorted column. |
Mean | Replaces each missing value with the average value for the column. |
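To show what each method in the table computes, here is a small sketch in pandas. It is an illustration under stated assumptions, not the code that AutoAI runs: the helper `impute_column`, its method names, and the sample columns are all hypothetical.

```python
import pandas as pd


def impute_column(series, method):
    """Fill missing values in one column with a single summary value."""
    if method == "most_frequent":
        # mode() ignores missing values; on a tie, take the first result.
        fill_value = series.mode().iloc[0]
    elif method == "median":
        fill_value = series.median()
    elif method == "mean":
        fill_value = series.mean()
    else:
        raise ValueError(f"Unknown imputation method: {method}")
    return series.fillna(fill_value)


# Hypothetical categorical and numerical columns with one gap each.
colors = pd.Series(["red", "blue", None, "red"])
sizes = pd.Series([1.0, 2.0, None, 9.0])

print(impute_column(colors, "most_frequent").tolist())
print(impute_column(sizes, "median").tolist())
print(impute_column(sizes, "mean").tolist())
```

In this sample, the median and the mean produce different replacements because the value 9.0 pulls the mean upward, which is one reason to prefer the median for skewed numerical columns.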
Configuring imputation for time series experiments
Choose some or all of these methods. When multiple methods are selected, the best-performing method is automatically applied for the experiment.
Method | Description |
---|---|
Cubic | Fills missing values by cubic interpolation, using the pandas/SciPy method. |
Fill | Replaces missing values with a numeric value that you specify. Choose value as the type. |
Flatten iterative | Flattens the data, and then applies the scikit-learn iterative imputer to estimate the missing values. |
Linear | Fills missing values by linear interpolation, using the pandas/SciPy method. |
Next | Replaces each missing value with the next value in the series. |
Previous | Replaces each missing value with the previous value in the series. |
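The following sketch shows how four of the methods in the table behave on a short series, using standard pandas calls. It is an approximation for illustration only and might differ from what AutoAI does internally; the sample series and the fill value of 0.0 are assumptions. Cubic and Flatten iterative are omitted to keep the example dependent on pandas alone.

```python
import pandas as pd

# Hypothetical series with one single gap and one double gap.
s = pd.Series([1.0, None, 3.0, None, None, 6.0])

candidates = {
    # Linear: draw a straight line between the neighboring known values.
    "linear": s.interpolate(method="linear"),
    # Previous: carry the last known value forward.
    "previous": s.ffill(),
    # Next: carry the next known value backward.
    "next": s.bfill(),
    # Fill: substitute a fixed numeric value (0.0 is an arbitrary choice).
    "fill": s.fillna(0.0),
}

for name, filled in candidates.items():
    print(name, filled.tolist())
```

For cubic interpolation, pandas also accepts `s.interpolate(method="cubic")`, which requires SciPy to be installed.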
Next steps
Data imputation implementation details for time series experiments
Parent topic: AutoAI overview