Sample stage in DataStage

Last updated: Mar 12, 2025

The Sample stage samples an input data set.

The Sample stage can have a single input link and any number of output links when operating in Percent mode, or a single input link and a single output link when operating in Period mode. It is one of several stages that IBM DataStage provides to help you sample data.

The Sample stage is a debug stage. It operates in two modes. In Percent mode, it extracts rows, selecting them by means of a random number generator, and writes a given percentage of these to each output data set. You specify the number of output data sets, the percentage written to each, and a seed value to start the random number generator. You can reproduce a given distribution by repeating the same number of outputs, the percentage, and the seed value.
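The Percent mode behavior described above can be sketched in Python. This is only an illustration, not the stage's actual implementation: the function name `sample_percent` is hypothetical, and the sketch assumes that each row goes to at most one output and that the percentages sum to no more than 100.

```python
import random


def sample_percent(rows, percents, seed):
    """Route each row to at most one output list (illustrative sketch).

    Assumption: percents[i] is the percentage of input rows intended for
    output i, and the percentages sum to at most 100.
    """
    if sum(percents) > 100:
        raise ValueError("percentages must sum to at most 100")
    rng = random.Random(seed)          # seeded generator makes runs repeatable
    outputs = [[] for _ in percents]
    for row in rows:
        draw = rng.random() * 100      # uniform draw in [0, 100)
        upper = 0.0
        for i, pct in enumerate(percents):
            upper += pct
            if draw < upper:           # draw falls in this output's band
                outputs[i].append(row)
                break                  # rows outside every band are dropped
    return outputs


first = sample_percent(range(1000), [10, 20], seed=42)
second = sample_percent(range(1000), [10, 20], seed=42)
print(first == second)  # same outputs, percentages, and seed: same split
```

Because the generator is seeded, repeating the same number of outputs, percentages, and seed reproduces the same distribution, which mirrors the reproducibility property described above.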

In Period mode, it extracts every Nth row from each partition, where N is the period that you supply. In this mode all sampled rows are written to a single data set, so the stage can have only one output link.
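Period mode can be sketched in the same way. The function name `sample_period` is hypothetical, partitions are modeled as plain Python lists, and the sketch assumes that counting starts so that the Nth row of each partition is the first one selected; the stage itself may start counting elsewhere.

```python
def sample_period(partitions, period):
    """Take every Nth row from each partition into one output (sketch).

    Assumption: the first selected row in a partition is its Nth row.
    """
    if period < 1:
        raise ValueError("period must be a positive integer")
    output = []
    for part in partitions:
        # Slice starting at index period - 1, stepping by period.
        output.extend(part[period - 1::period])
    return output


# Two partitions, period 3: rows 3 and 6 from the first, row 3 from the second.
print(sample_period([[1, 2, 3, 4, 5, 6], [10, 20, 30]], 3))  # [3, 6, 30]
```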

For both modes you can specify the maximum number of rows that you want to sample from each partition.
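The per-partition limit can be pictured as a cap applied to each partition's sampled rows. The sketch below uses a hypothetical helper, `cap_per_partition`, and assumes the limit counts rows emitted from each partition rather than rows read.

```python
from itertools import islice


def cap_per_partition(sampled_partitions, max_rows):
    """Keep at most max_rows sampled rows from each partition (sketch)."""
    return [list(islice(part, max_rows)) for part in sampled_partitions]


# Each inner list holds the rows already sampled from one partition.
print(cap_per_partition([[3, 6, 9], [30]], 2))  # [[3, 6], [30]]
```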

When you double-click the Sample stage, the properties panel opens. The properties panel has three tabs:
  • Stage. This is always present and is used to specify general information about the stage.
  • Input. This is where you specify details about the data set being sampled.
  • Output. This is where you specify details about the sampled data being output from the stage.

Input tab

The Columns section specifies the column definitions of incoming data.

Output tab

In Percent mode, the stage can have any number of output links; in Period mode, it can have only one. Choose the link you want to work on from the Output Link drop-down list.

The Columns section specifies the column definitions of outgoing data. Click Edit at the bottom of the Columns section to specify mapping information. Mapping specifies the relationship between the columns being input to the Sample stage and the output columns. The Advanced section allows you to change the default buffering settings for the output links.

Mapping output

Click Edit in the Columns section to map columns. The left pane shows the columns of the sampled data, that is, the metadata from the incoming link. These columns are read-only and cannot be modified on this tab.

The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.