Last updated: Jan 17, 2024
The Feature Selection node screens input fields for removal based on a set of criteria (such as the percentage of missing values); it then ranks the importance of remaining inputs relative to a specified target. For example, given a data set with hundreds of potential inputs, which are most likely to be useful in modeling patient outcomes?
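The screening step described above can be illustrated with a small, self-contained sketch in plain Python. It is not the node's implementation: the function name `screen_fields`, the record layout (a list of dictionaries with `None` for missing values), and the strict "greater than" comparisons are assumptions made for illustration. The default thresholds of 80 and 95 simply mirror the values used for `max_missing_values` and `max_single_category` in this page's scripting example.

```python
def screen_fields(records, max_missing_pct=80.0, max_single_category_pct=95.0):
    """Return the names of fields that pass both screens (illustrative only)."""
    if not records:
        return []
    total = len(records)
    kept = []
    for name in records[0]:
        values = [row.get(name) for row in records]
        present = [v for v in values if v is not None]

        # Screen 1: too many missing values, as a percentage of all records.
        missing_pct = 100.0 * (total - len(present)) / total
        if missing_pct > max_missing_pct:
            continue

        # Screen 2: too many records in one category, relative to all records.
        if present:
            most_common = max(present.count(v) for v in set(present))
            if 100.0 * most_common / total > max_single_category_pct:
                continue

        kept.append(name)
    return kept


# Hypothetical sample data: "region" is constant, "notes" is 90% missing.
sample = [
    {"age": 30 + i, "region": "north", "notes": "ok" if i == 0 else None}
    for i in range(10)
]
passed = screen_fields(sample)
```

With this sample, only `age` survives: `region` falls to the single-category screen and `notes` to the missing-values screen.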
Example
stream = modeler.script.stream()
node = stream.create("featureselection", "My node")
node.setPropertyValue("screen_single_category", True)
node.setPropertyValue("max_single_category", 95)
node.setPropertyValue("screen_missing_values", True)
node.setPropertyValue("max_missing_values", 80)
node.setPropertyValue("criteria", "Likelihood")
node.setPropertyValue("unimportant_below", 0.8)
node.setPropertyValue("important_above", 0.9)
node.setPropertyValue("important_label", "Check Me Out!")
node.setPropertyValue("selection_mode", "TopN")
node.setPropertyValue("top_n", 15)
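The same properties can also be applied from a dictionary, which keeps a configuration in one place. The sketch below shows this for the `ImportanceLevel` selection mode. The helper `configure_feature_selection` and the `RecordingNode` class are hypothetical: `RecordingNode` is only a stand-in that records `setPropertyValue` calls so the snippet runs outside Modeler. Inside a Modeler script, the node returned by `stream.create("featureselection", ...)` would be passed instead.

```python
class RecordingNode(object):
    """Hypothetical stand-in for a Modeler node; it only records properties."""

    def __init__(self):
        self.properties = {}

    def setPropertyValue(self, name, value):
        self.properties[name] = value


def configure_feature_selection(node, settings):
    """Apply each property in settings through node.setPropertyValue."""
    for name in sorted(settings):
        node.setPropertyValue(name, settings[name])
    return node


# Keep important and marginal fields, drop unimportant ones.
IMPORTANCE_LEVEL_SETTINGS = {
    "selection_mode": "ImportanceLevel",
    "select_important": True,
    "select_marginal": True,
    "select_unimportant": False,
}

demo_node = configure_feature_selection(RecordingNode(), IMPORTANCE_LEVEL_SETTINGS)
```

Only property names listed in the table on this page are used; whether the order of the calls matters to the node is not stated here, so the helper simply applies them alphabetically.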
featureselectionnode Properties | Values | Property description |
---|---|---|
target | field | Feature Selection models rank predictors relative to the specified target. Weight and frequency fields are not used. See Common modeling node properties for more information. |
screen_single_category | flag | If True, screens fields that have too many records falling into the same category relative to the total number of records. |
max_single_category | number | Specifies the threshold used when screen_single_category is True. |
screen_missing_values | flag | If True, screens fields with too many missing values, expressed as a percentage of the total number of records. |
max_missing_values | number | |
screen_num_categories | flag | If True, screens fields with too many categories relative to the total number of records. |
max_num_categories | number | |
screen_std_dev | flag | If True, screens fields with a standard deviation of less than or equal to the specified minimum. |
min_std_dev | number | |
screen_coeff_of_var | flag | If True, screens fields with a coefficient of variation less than or equal to the specified minimum. |
min_coeff_of_var | number | |
criteria | Pearson, Likelihood, CramersV, Lambda | When ranking categorical predictors against a categorical target, specifies the measure on which the importance value is based. |
unimportant_below | number | Specifies the threshold p values used to rank variables as important, marginal, or unimportant. Accepts values from 0.0 to 1.0. |
important_above | number | Accepts values from 0.0 to 1.0. |
unimportant_label | string | Specifies the label for the unimportant ranking. |
marginal_label | string | |
important_label | string | |
selection_mode | ImportanceLevel, ImportanceValue, TopN | |
select_important | flag | When selection_mode is set to ImportanceLevel, specifies whether to select important fields. |
select_marginal | flag | When selection_mode is set to ImportanceLevel, specifies whether to select marginal fields. |
select_unimportant | flag | When selection_mode is set to ImportanceLevel, specifies whether to select unimportant fields. |
importance_value | number | When selection_mode is set to ImportanceValue, specifies the cutoff value to use. Accepts values from 0 to 100. |
top_n | integer | When selection_mode is set to TopN, specifies the cutoff value to use. Accepts values from 0 to 1000. |
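To make the interplay of `unimportant_below`, `important_above`, and `top_n` concrete, here is a rough plain-Python sketch. It assumes each field already has an importance score on a 0.0 to 1.0 scale where larger means more important, and that scores exactly on a threshold count as marginal. Both points are assumptions, as are the function names `rank_label` and `select_top_n` and the default label strings. The node's own computation may differ.

```python
def rank_label(importance, unimportant_below=0.8, important_above=0.9,
               labels=("Important", "Marginal", "Unimportant")):
    """Map a 0.0-1.0 importance score to one of three ranking labels."""
    important, marginal, unimportant = labels
    if importance > important_above:
        return important
    if importance < unimportant_below:
        return unimportant
    return marginal


def select_top_n(importances, n):
    """Return the n field names with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Hypothetical scores for three fields.
scores = {"age": 0.99, "region": 0.40, "income": 0.85}
ranking = dict((name, rank_label(value)) for name, value in scores.items())
best_two = select_top_n(scores, 2)
```

The thresholds 0.8 and 0.9 match the values set in the scripting example; passing a custom `labels` tuple plays the role of `important_label`, `marginal_label`, and `unimportant_label`.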