You can use Balance nodes to correct imbalances in datasets so they conform to specified test criteria.
For example, suppose that a dataset has only two values, low or high, and that 90% of the cases are low while only 10% of the cases are high. Many modeling techniques have trouble with such biased data because they tend to learn only the low outcome and ignore the high one, because it is rarer. If the data is well balanced, with approximately equal numbers of low and high outcomes, models have a better chance of finding patterns that distinguish the two groups. In this case, a Balance node is useful for creating a balancing directive that reduces the number of cases with a low outcome.
Balancing is carried out by duplicating and then discarding records based on the conditions you specify. Records for which no condition holds are always passed through. Because this process works by duplicating and/or discarding records, the original sequence of your data is lost in downstream operations. Be sure to derive any sequence-related values before adding a Balance node to the data stream.
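The behavior described above can be illustrated with a minimal Python sketch. This is not the Balance node's implementation or API. The `balance` function, the `(factor, condition)` directive format, and the `outcome` field name are hypothetical, chosen only for illustration. The sketch assumes that a factor below 1 keeps roughly that fraction of matching records, and that a factor above 1 duplicates them.

```python
import random

def balance(records, directives, seed=0):
    """Apply (factor, condition) directives to a list of records.

    Hypothetical helper for illustration only; not a product API.
    """
    rng = random.Random(seed)
    out = []
    for rec in records:
        # Use the first directive whose condition holds for this record.
        factor = next((f for f, cond in directives if cond(rec)), None)
        if factor is None:
            # No condition holds: the record is passed through unchanged.
            out.append(rec)
            continue
        # The whole part gives guaranteed copies; the fractional part is
        # the probability of one extra copy (so 0.11 keeps about 11%).
        whole, frac = divmod(factor, 1.0)
        copies = int(whole) + (1 if rng.random() < frac else 0)
        out.extend([rec] * copies)
    return out

# 90% low, 10% high, as in the example above.
records = [{"outcome": "low"}] * 900 + [{"outcome": "high"}] * 100

# Reduce low cases to roughly the number of high cases.
directives = [(100 / 900, lambda r: r["outcome"] == "low")]

balanced = balance(records, directives)
n_low = sum(r["outcome"] == "low" for r in balanced)
n_high = sum(r["outcome"] == "high" for r in balanced)
print(n_low, n_high)
```

Because discarding is random in this sketch, the number of low records that remain is only approximately equal to the number of high records. The high records match no condition, so all of them pass through.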