IBM DataStage for Cloud Pak for Data connection
Last updated: Dec 03, 2024

To access your data in DataStage for Cloud Pak for Data, create a connection asset for it.

DataStage for Cloud Pak for Data is a data integration tool for designing, developing, and running jobs that move and transform data.

Create a connection to DataStage for Cloud Pak for Data

To create the connection asset, you need the following connection details:

  • Hostname or IP address
  • Port number
  • Credentials: specify username and password or API key
  • SSL certificate (if required by the database server). For example, specify a certificate if your DataStage for Cloud Pak for Data instance is not on the same cluster as the project where you want to create a metadata import job.

Choose the method for creating a connection based on where you are in the platform.

In a project
Click Assets > New asset > Connect to a data source. See Adding a connection to a project.

In the Platform assets catalog
Click New connection. See Adding platform connections.

Next step: Import lineage metadata from the connection

Run metadata import to import lineage data and see how data is connected on the lineage graph. See Importing metadata.

Where you can use this connection

You can use the DataStage for Cloud Pak for Data connection in the following workspaces and tools:

Catalogs

  • Platform assets catalog

Data lineage

  • Metadata import (lineage) (IBM Knowledge Catalog and Manta Data Lineage)

Configuring lineage metadata import for DataStage for Cloud Pak for Data

When you create a metadata import for the DataStage for Cloud Pak for Data connection, you can set options specific to this data source, and define the scope of data for which lineage is generated. For details about metadata import, see Designing metadata imports.

To import lineage metadata for DataStage for Cloud Pak for Data, complete these steps:

  1. Create a data source definition. Select IBM DataStage for Cloud Pak for Data as the data source type.
  2. Create a connection to the data source in a project.
    • To connect to DataStage for Cloud Pak for Data, provide a username and a password.
    • To connect to DataStage for Cloud Pak for Data as a Service, provide the API key. If you do not have one, from the navigation menu go to Administration > Access (IAM) > Manage identities > API keys and create a new API key. Use the generated API key value in the connection details.
    • Specify a certificate if your DataStage for Cloud Pak for Data instance is not on the same cluster as the project where you want to create a metadata import job.
  3. Create a metadata import. Learn more about the options that are specific to the DataStage for Cloud Pak for Data data source:
    • When you define a scope, you can analyze the entire data source or use the include and exclude options to define the exact projects and flows that you want to be analyzed. See Include and exclude lists.
    • Optionally, you can provide external input in the form of a .zip file. Add this file in the Add inputs from file field. You can add the external input in addition to the defined scope of extracted data, or you can import data from the external input only. To add an external input, complete these steps:
    1. Prepare a .zip file as an external input.
    2. Upload the .zip file to the project.
    3. Configure the import to use only the external input.

Include and exclude lists

You can include or exclude assets up to the flow level. Provide projects and flows in the format project/flow. Each part is evaluated as a regular expression. Assets that are added to the data source later are also included or excluded if they match the conditions that are specified in the lists. Example values:

  • myProject/: all flows in myProject project.
  • myProject3/myFlow1: myFlow1 flow from myProject3 project.
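
To picture how these patterns are applied, the following minimal Python sketch mimics matching project/flow paths against an include list. It is an illustration only: the pattern list is hypothetical and the actual evaluation is performed by the metadata import service.

# Minimal sketch of include-list matching; the patterns and paths are hypothetical.
import re

include_patterns = ["myProject/", "myProject3/myFlow1"]

def matches_any(asset_path: str, patterns: list[str]) -> bool:
    # Each part of a pattern (project and flow) is treated as a regular expression.
    project, _, flow = asset_path.partition("/")
    for pattern in patterns:
        p_project, _, p_flow = pattern.partition("/")
        if re.fullmatch(p_project, project) and re.fullmatch(p_flow or ".*", flow):
            return True
    return False

print(matches_any("myProject/anyFlow", include_patterns))     # True: all flows in myProject
print(matches_any("myProject3/myFlow1", include_patterns))    # True: exact project/flow match
print(matches_any("otherProject/myFlow1", include_patterns))  # False: project does not match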

Prepare a .zip file as an external input

You can provide DataStage flows as external inputs in a .zip file. The folder must have the following structure:

  • <project_export.zip> - A DataStage project exported to a .zip file.
  • DSParams - A file that contains the project- or environment-level parameters if applicable. You can get this file from the project directory.
  • datastageParameterOverride.txt - A file with parameter-set overrides if your jobs use parameter sets.
  • connection_definition/odbcConnectionDefinition.ini - A file with connection definitions for ODBC connections. Definitions of ODBC connections are not included in the DataStage XML exports and must be specified separately.
  • datastageComponentOverrides.csv - A file with component-lineage overrides.
Note: Even if you want to use only a project export in the .zip file, that project export .zip file must be compressed into another .zip file before you upload it to the project.
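
For example, a minimal Python sketch for packing the prepared inputs into the outer .zip might look like the following. The output name external_input.zip is a placeholder, and only the files that you actually prepared need to be listed.

# Minimal sketch, assuming the project export and optional override files
# already exist in the current directory. File names other than the ones
# documented above are hypothetical.
import zipfile

files_to_pack = [
    "project_export.zip",                                   # exported DataStage project
    "DSParams",                                             # optional project or environment parameters
    "datastageParameterOverride.txt",                       # optional parameter-set overrides
    "connection_definition/odbcConnectionDefinition.ini",   # optional ODBC connection definitions
    "datastageComponentOverrides.csv",                      # optional component-lineage overrides
]

# The project export .zip itself is nested inside the outer .zip,
# even when it is the only input that you provide.
with zipfile.ZipFile("external_input.zip", "w", zipfile.ZIP_DEFLATED) as outer:
    for path in files_to_pack:
        outer.write(path, arcname=path)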

The format of the .zip file with the exported DataStage project
When you export a DataStage project, it must have the following structure:

  • assets - required folder.
    • .METADATA - required folder.
      • data_intg_flow.*.json - required files that contain information about flows.
      • connection.*.json - optional files that contain information about connections.
      • parameter_set.*.json - optional files that contain information about parameter sets.
      • job.*.json - optional files that contain information about jobs.
      • job_run.*.json - optional files that contain information about particular executions of the job.
    • data_intg_flow - required folder.
      • At least one file that contains the string "schemas":[{, but does not end in px_executables.
  • assettypes - required folder.
  • project.json - required file. There might be multiple instances of this file as a result of ZIP decompression, which is correct.
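
A rough way to confirm that an export matches this layout is a short Python check like the following sketch. It is an illustration of the expected structure only, not an official validation tool, and it assumes the nesting shown above.

# Minimal sketch: sanity-check an exported project .zip against the documented layout.
import zipfile

def check_project_export(path: str) -> list[str]:
    problems = []
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        for folder in ("assets/", "assets/.METADATA/", "assets/data_intg_flow/", "assettypes/"):
            if not any(n.startswith(folder) for n in names):
                problems.append(f"missing required folder: {folder}")
        if not any(n.startswith("assets/.METADATA/data_intg_flow.") and n.endswith(".json") for n in names):
            problems.append("missing data_intg_flow.*.json files in assets/.METADATA/")
        if not any(n.endswith("project.json") for n in names):
            problems.append("missing project.json")
        flow_files = [n for n in names
                      if n.startswith("assets/data_intg_flow/") and not n.endswith("px_executables")]
        if not any(b'"schemas":[{' in zf.read(n) for n in flow_files):
            problems.append('no flow file contains the string "schemas":[{')
    return problems

# An empty list means the layout looks complete.
print(check_project_export("project_export.zip"))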

The datastageParameterOverride.txt file format
The datastageParameterOverride.txt file has the following content:

[ENVIRONMENT]
PARAM1_NAME = "param1_value"
PARAM2_NAME = "param2_value"
PARAM3_NAME = "param3_value"
[PARAMETER_SET/parameter_set_name]
param4_name  = "default_param4_value"
param5_name  = "default_param5_value"
$PARAM3_NAME = "$PROJDEF"
[VALUE_FILE/parameter_set_name/value_file1_name]
param4_name  = "some_param4_value"
param5_name  = "some_param5_value"
$PARAM3_NAME = "some_param3_value"
[VALUE_FILE/parameter_set_name/value_file2_name]
param4_name  = "other_param4_value"
param5_name  = "other_param5_value"
$PARAM3_NAME = "other_param3_value"
[JOB/job1_name]
param6_name = "param6_value"
param7_name = "param7_value"
[JOB/job2_name]
param7_name = "param8_value"

The connection_definition/odbcConnectionDefinition.ini file format
The connection_definition/odbcConnectionDefinition.ini file has the following content. Create a separate [Shortcut_Name] section for each connection.

[<Shortcut_Name>]
Type=<connection_type>
Connection_String=<connection_string>
Server_Name=<server_name>
Database_Name=<database_name>
Schema_Name=<schema_name>
User_Name=<user_name>
  • Shortcut_Name: The name of the connection or data server that is used by the data integration tool.
  • connection_type: The type of data source.
  • connection_string: A JDBC connection string or any identification of the database such as the system ID (SID) or the host name.
  • server_name: The value depends on the type of data source:
    • Db2, Microsoft SQL Server, Netezza Performance Server, SAP ASE (formerly Sybase), or Teradata: The server name.
    • FTP: The hostname.
    • Oracle and other databases: The value is ignored.
  • database_name: The value depends on the type of data source:
    • Oracle: The global database name.
    • Db2, Microsoft SQL Server, Netezza Performance Server, SAP ASE (formerly Sybase), Teradata, and other databases: The name of the default database.
  • user_name: The name of the user that logs in to the database.

Add a new line at the end of the parameters for each section.
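
If you prefer to generate the file programmatically, the following minimal Python sketch writes one hypothetical Db2 section with configparser. All values are placeholders; replace them with the details of your own connection.

# Minimal sketch: write a hypothetical connection definition to
# connection_definition/odbcConnectionDefinition.ini. All values are placeholders.
import configparser
import os

config = configparser.ConfigParser()
config.optionxform = str  # keep the case of option names such as Connection_String

config["MyDb2Connection"] = {
    "Type": "DB2",
    "Connection_String": "jdbc:db2://db2host.example.com:50000/SALESDB",
    "Server_Name": "db2host.example.com",
    "Database_Name": "SALESDB",
    "Schema_Name": "SALES",
    "User_Name": "dsuser",
}

os.makedirs("connection_definition", exist_ok=True)
with open("connection_definition/odbcConnectionDefinition.ini", "w") as f:
    config.write(f, space_around_delimiters=False)  # a blank line follows each section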

The datastageComponentOverrides.csv file format
The datastageComponentOverrides.csv file has the following content:

"Full path to Stage";"Input Link name";"Input Column name";"Output Link name";"Output Column name";"Edge Type (DIRECT | FILTER)";"Description (optional)"
"manual_mapping_job/Generic_3";"DSLink2";"a";"DSLink5";"b";"DIRECT";""

The path to the stage is in the format Job/[Shared and Local containers optional]/Stage.
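
The file is a semicolon-delimited CSV with every value quoted. As an illustration, the following minimal Python sketch writes the header and the example row shown above with the csv module.

# Minimal sketch: write datastageComponentOverrides.csv in the documented format.
import csv

header = ["Full path to Stage", "Input Link name", "Input Column name",
          "Output Link name", "Output Column name",
          "Edge Type (DIRECT | FILTER)", "Description (optional)"]
rows = [
    ["manual_mapping_job/Generic_3", "DSLink2", "a", "DSLink5", "b", "DIRECT", ""],
]

with open("datastageComponentOverrides.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";", quoting=csv.QUOTE_ALL)
    writer.writerow(header)
    writer.writerows(rows)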

Upload the .zip file to the project

To use the .zip file in the metadata import, you must add it to the project where you create the metadata import.

  1. In the project, click Import assets.
  2. In the Local file section, click Data asset.
  3. Add the .zip file with the DataStage project.

When you create the metadata import, you will be able to select this file in the Add inputs from file step.

Configure the import to use only the external input

If you want to import metadata only from the provided external input, and not directly from the connected DataStage for Cloud Pak for Data instance, complete these steps:

  1. Add the .zip file in the Add inputs from file section and click Next.
  2. Expand the Lineage import phases list, and disable the Transformations extraction phase.

Advanced import options

Analyze job runs
Specifies whether job runs are analyzed.
Analyze job runs since
Specifies the date after which runs are analyzed. If the value is empty, all runs are analyzed. Example value: 1970/01/01 00:00:00.000.
Analyze jobs separately
Specifies whether to analyze jobs separately, even when runs are associated with them.
Analyze flows without jobs
Specifies whether flows without jobs are analyzed.
Oracle proxy user authentication
You can use Oracle proxy user authentication. Set the value to true to change Oracle usernames in "USERNAME[SCHEMA_OWNER]" format to the "SCHEMA_OWNER" format. Otherwise, set the value to false.
Value files
Specify the names of value files to use in Parameter Sets in order of priority. For example, DEV1,TEST,PROD.

Learn more

Importing metadata

Parent topic: Supported connections
