If the majority of missing values are concentrated in a small number of fields, you can address them at the field level rather than at the record level. This approach also allows you to experiment with the relative importance of particular fields before deciding on an approach for handling missing values. If a field is unimportant in modeling, it probably isn't worth keeping, regardless of how many missing values it has.
For example, a market research company may collect data from a general questionnaire containing 50 questions. Two of the questions address age and political persuasion, information that many people are reluctant to give. In this case, Age and Political_persuasion have many missing values.
Field measurement level
In determining which method to use, you should also consider the measurement level of fields with missing values.
Numeric fields. For numeric field types, such as Continuous, you should always eliminate any non-numeric values before building a model, because many models won't function if blanks are included in numeric fields.
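As an illustration outside the product, the idea of eliminating non-numeric values can be sketched in plain Python. This is a minimal sketch, not how any Modeler node is implemented: the helper name to_number_or_none and the sample values are hypothetical, and missing values are assumed to be represented as None.

```python
import math

def to_number_or_none(value):
    """Return value as a float, or None if it cannot be read as a number."""
    if value is None:
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Blanks and text such as "n/a" cannot be parsed as numbers.
        return None
    # float("nan") parses successfully, so treat NaN as missing too.
    return None if math.isnan(number) else number

# Hypothetical raw values for a continuous field such as Age.
raw_age = [34, "41", "", "n/a", None, 27.5]
clean_age = [to_number_or_none(v) for v in raw_age]
```

After this step every entry in clean_age is either a float or None, so a later stage can decide whether to discard or impute the missing entries.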
Categorical fields. For categorical fields, such as Nominal and Flag, altering missing values isn't necessary but will increase the accuracy of the model. For example, a model that uses the field Sex will still function with meaningless values, such as Y and Z, but removing all values other than M and F will increase the accuracy of the model.
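The Sex example can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the names VALID_SEX and clean_category are hypothetical, the sample data is invented, and values outside the accepted set are assumed to be recoded as None rather than handled by a Modeler node.

```python
# Assumed set of meaningful categories for the Sex field.
VALID_SEX = {"M", "F"}

def clean_category(value, valid):
    """Keep value if it is a recognized category; otherwise treat it as missing."""
    return value if value in valid else None

# Hypothetical raw values, including the meaningless codes Y and Z.
raw_sex = ["M", "F", "Y", "Z", "", None, "F"]
clean_sex = [clean_category(v, VALID_SEX) for v in raw_sex]
```

Recoding to None instead of deleting the records keeps the other fields of each record available, which fits the field-level approach described in this section.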
Screening or removing fields
To screen out fields with too many missing values, you have several options:
- You can use a Data Audit node to filter fields based on quality.
- You can use a Feature Selection node to screen out fields with more than a specified percentage of missing values and to rank fields based on importance relative to a specified target.
- Instead of removing the fields, you can use a Type node to set the field role to None. This will keep the fields in the data set but exclude them from the modeling processes.
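The screening by percentage of missing values can be sketched in plain Python. This is a rough illustration of the thresholding idea only, not the algorithm used by the Feature Selection or Data Audit node: the function names, the 50% threshold, the Region field, and the sample records are all hypothetical, and missing values are assumed to be None.

```python
def missing_fraction(records, field):
    """Return the share of records in which field is missing (None or absent)."""
    if not records:
        return 0.0
    missing = sum(1 for record in records if record.get(field) is None)
    return missing / len(records)

def screen_fields(records, fields, max_missing=0.5):
    """Return the fields whose missing share does not exceed max_missing."""
    return [f for f in fields if missing_fraction(records, f) <= max_missing]

# Hypothetical questionnaire data; None marks an unanswered question.
records = [
    {"Age": None, "Political_persuasion": None, "Region": "North"},
    {"Age": 42, "Political_persuasion": None, "Region": "South"},
    {"Age": None, "Political_persuasion": "Independent", "Region": None},
    {"Age": None, "Political_persuasion": None, "Region": "East"},
]

kept = screen_fields(
    records, ["Age", "Political_persuasion", "Region"], max_missing=0.5
)
```

In this invented sample, Age and Political_persuasion are each missing in three of four records and are screened out, while Region is missing in one of four and is kept. The screened-out fields are dropped from the list of candidates only, which mirrors setting a field's role to None rather than deleting it from the data.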