Use the Amazon S3 connector in DataStage® to connect to the Amazon Simple Storage Service (S3) and perform various read and write functions.
DataStage properties
In the Properties section of the Stage tab, select Use DataStage properties to access properties that are specific to DataStage. These properties provide more features and granular control of the flow execution, similar to the "optimized" connectors.
If you select Use DataStage properties with a .CSV file, the column values must be enclosed in double quotation marks. If any customization is needed, use the connector File format properties to change the file format to Delimited. Then, select the field delimiter, row delimiter, quote character, and escape character. A sketch of the expected quoting follows.
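For reference, the following minimal sketch produces a .CSV file whose column values are all double-quoted, which is the layout the connector expects when Use DataStage properties is selected. The file name and rows are hypothetical examples, not part of the connector's configuration.

import csv

# Write a .CSV file with every column value wrapped in double quotation
# marks (csv.QUOTE_ALL). The file name and rows are illustrative only.
with open("transactions.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["id", "amount", "note"])
    writer.writerow(["1", "10.50", "contains, a comma"])

# Resulting file contents:
# "id","amount","note"
# "1","10.50","contains, a comma"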
Clear Use DataStage properties to access the Table format property selections.
Configuring the Amazon S3 connector as a source
The available properties for the Read mode depend on whether you select Use DataStage properties. Configure the read process for when you select Use DataStage properties (default).
Table 1. Reading data from Amazon S3 with "Use DataStage properties" selected

Read a single file:
Specify the bucket name that contains the file, and then specify the name of the file to read.

Read multiple files:
- Specify the bucket name that contains the files.
- In the File name field, specify a prefix that the files you want to read must have in their file path. For example, if you enter transactions as the prefix, the connector reads all the files in the transactions folder, such as transactions/january/day1.txt, as well as a file named transactions.txt (see the sketch after this table).

List buckets:
No additional configuration is needed.

List files:
- Specify the bucket name that contains the files.
- Optional: In the File name field, specify a prefix that the files you want to read must have in their file path. For example, if you enter transactions as the prefix, the connector lists all the files in the transactions folder, such as transactions/january/day1.txt, as well as a file named transactions.txt. If you do not specify a file name prefix, all the files in the bucket are listed.
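The prefix matching described above follows standard S3 key-prefix semantics. As a rough illustration, here is a minimal sketch of the equivalent listing outside DataStage, using the boto3 library and a hypothetical bucket name my-bucket; the connector performs this filtering for you.

import boto3

s3 = boto3.client("s3")

# List buckets (the "List buckets" read mode needs no further configuration).
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Keys that start with the prefix "transactions" include both
# transactions/january/day1.txt and transactions.txt, matching the
# behavior described for the "Read multiple files" and "List files" modes.
response = s3.list_objects_v2(Bucket="my-bucket", Prefix="transactions")
for obj in response.get("Contents", []):
    print(obj["Key"])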
Configure the read process for when you clear Use DataStage properties.
Table 2. Reading data from Amazon S3 with "Use DataStage properties" not selected

Read a single file:
Specify the bucket name that contains the file, and then specify the name of the file to read.

Read binary data:
Specify the bucket name that contains the file, and then specify the name of the file to read.

Read binary data from multiple files by using wildcards:
Specify a wildcard character in the file name for binary data. For example, in the File name field, enter test.*.gz. With this option, the connector reads multiple binary files one after another, and each file is read as a record.
If you select Read a file to a row, you must provide two column names on the Output tab of the source stage:
- The first column must be a string data type. This column holds the file name.
- The second column must be a binary data type. This column holds the file contents. The binary column precision value must be greater than or equal to the maximum file size.

Read multiple files by using a regular expression:
Specify the bucket name that contains the files. You can use a Java regular expression for the file name. Examples (see the sketch after this table):
^csv_write_datatypes_h.[0-9]$
csv_write_datatypes_h.[^12]

Read multiple files by using wildcards:
Specify an asterisk (*) to match zero or more characters. For example, specify *.txt to match all files with the .txt extension. Specify a question mark (?) to match one character. Examples:
csv_write_datatypes.*
?_abc_test*
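To clarify how the wildcard and regular-expression patterns select keys, and how the file-to-row output is shaped, here is a minimal sketch under assumed names (the bucket my-bucket, the test.*.gz pattern from above). It approximates the connector's matching with Python's fnmatch and re modules rather than DataStage's own pattern engine.

import fnmatch
import re
import boto3

s3 = boto3.client("s3")
keys = [obj["Key"]
        for obj in s3.list_objects_v2(Bucket="my-bucket").get("Contents", [])]

# Wildcard match: * matches zero or more characters, ? matches one.
wildcard_hits = [k for k in keys if fnmatch.fnmatch(k, "test.*.gz")]

# Regular-expression match against the whole file name (Java regex
# semantics are approximated here with Python's re module).
pattern = re.compile(r"^csv_write_datatypes_h.[0-9]$")
regex_hits = [k for k in keys if pattern.fullmatch(k)]

# "Read a file to a row": each matched file becomes one record with a
# string column for the file name and a binary column for the contents.
for key in wildcard_hits:
    body = s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()
    record = (key, body)  # (string file name, binary file contents)
    print(record[0], len(record[1]), "bytes")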
Configuring the Amazon S3 connector as a target
The available properties for the Write mode depend on whether you select Use DataStage properties. Configure the write process for when you select Use DataStage properties (default).
Table 3. Writing data to Amazon S3 with "Use DataStage properties" selected

Delete a file:
- Specify the bucket name that contains the files, or select Create bucket.
- In the File name field, specify the name of the file to delete.

Write to a file:
- Specify the bucket name that contains the files.
- If you want to create a bucket that contains the files to write to, set the Create bucket option to Yes. Then you can select the Append unique ID option to append a unique set of characters to the name of the bucket that is created.
- In the File name field, specify a file name to write to.
- Choose one of three options in If file exists: Do not overwrite file, Fail, or Overwrite file.
- In the Wave handling section, you can choose the Append unique identifier option. Use it to choose whether a unique identifier is appended to the file name. When set to Yes:
  - The unique identifier is appended to the file name, and a new file is written for every wave of data that is streamed into the stage.
  - The File size threshold option is enabled. Specify the threshold for the file size in megabytes. Processing nodes start a new file each time the size exceeds the specified value.
  When set to No, the file is overwritten on every wave.
- In the File attributes section, you can (see the sketch after this table):
  - Specify User metadata as a list of name-value pairs, for example Topic=News. Separate name-value pairs with a semicolon, for example Topic=Music;SubTopic=Pop.
  - Choose one of three options in Server-side encryption: None, AES-256, or AWS KMS.
  - Choose the Storage class for the file: Reduced redundancy or Standard.
  - Specify the Content type of the file to write, for example text/xml or charset=utf-8.
  - Set the Define lifecycle rules option to Yes. Then you can choose the Rule scope to apply the rule to the file only or to the files in the folder, and the Time period format to specify whether the lifecycle rule is based on the number of days (Days from creation date) or on a specific date. You can set the Expiration option to Yes and specify the number of days that the file will exist. You can set the Archive option to Yes to archive the file in Amazon Glacier, and specify the date of archiving.
- In Interval for progress messages, specify the amount of data in MB that the connector writes to Amazon S3 before the connector writes a progress message to the job log.
- Specify the Number of parallel writers.
- Specify the maximum Java Virtual Machine Heap size in megabytes.
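For orientation, the file attributes above map onto standard S3 request parameters. The following sketch shows roughly equivalent settings with boto3, under assumed names (the bucket my-bucket, the key data/output.xml); the connector sets these for you from the stage properties.

import boto3

s3 = boto3.client("s3")

# Rough boto3 equivalents of the File attributes: user metadata
# (Topic=Music;SubTopic=Pop), server-side encryption (AES-256),
# storage class, and content type. Bucket and key names are assumptions.
s3.put_object(
    Bucket="my-bucket",
    Key="data/output.xml",
    Body=b"<doc/>",
    Metadata={"Topic": "Music", "SubTopic": "Pop"},
    ServerSideEncryption="AES256",          # or "aws:kms" for AWS KMS
    StorageClass="REDUCED_REDUNDANCY",      # or "STANDARD"
    ContentType="text/xml",
)

# A lifecycle rule in the same spirit as "Define lifecycle rules":
# archive matching objects to Amazon Glacier after 30 days and expire
# them 365 days from the creation date. The rule ID and prefix are
# illustrative.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire",
            "Filter": {"Prefix": "data/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)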
Configure the write process for when you clear Use DataStage properties.
Table 4. Writing data to Amazon S3 with "Use DataStage properties" not selected

Delete a file:
- Specify the bucket name that contains the files.
- In Table action, choose one of three options: Append, Replace, or Truncate.
- In the File name field, specify the name of the file to delete.

Write to a file:
- Specify the bucket name that contains the files, or select Create bucket.
- In Table action, choose one of three options: Append, Replace, or Truncate.
- In Table format, choose one of three options: Deltalake, Flat file, or Iceberg. If you choose Flat file, the Partitioned option is available, which writes the file with multiple partitions (see the sketch after this table).
- In the File name field, specify a file name to write to.

Write binary data:
- Specify the bucket name that contains the files, or select Create bucket.
- In Table action, choose one of three options: Append, Replace, or Truncate.
- In the File name field, specify a file name to write to.
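As a rough picture of what a partitioned flat-file write produces, the sketch below writes each partition of data as its own object under a common key prefix. The bucket name, key layout, and part naming are assumptions for illustration; the connector's actual file layout may differ.

import boto3

s3 = boto3.client("s3")

# Hypothetical partitions of rows produced by parallel processing nodes.
partitions = [
    ["1,alpha", "2,beta"],
    ["3,gamma"],
]

# Write each partition as its own object under a shared prefix, so the
# logical file "output.csv" is made up of several part files.
for i, rows in enumerate(partitions):
    body = ("\n".join(rows) + "\n").encode("utf-8")
    s3.put_object(
        Bucket="my-bucket",
        Key=f"output.csv/part-{i:04d}.csv",
        Body=body,
    )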