Creating a virtualized table from files in Cloud Object Storage in Data Virtualization

Last updated: Nov 26, 2024

In Data Virtualization, you can virtualize and use data that is stored as files on object storage.

You can virtualize data in files in the following object storages data sources:

IBM® Cloud Object Storage
Amazon S3
Ceph®
Generic S3

Segment or combine data from one or more files to create a virtual table.

Before you begin

To access data in cloud object storage, you must create a connection to the data source where the files are located. For more information about object storage data sources, see Data sources in object storage in Data Virtualization.

About this task

Watch the following video for an overview of virtualized files in Cloud Object Storage in Data Virtualization.

This video provides a visual method as an alternative to following the written steps in this documentation.

Procedure

On the navigation menu, click Data > Data virtualization.
The service menu opens to the Data sources page by default.
On the service menu, click Virtualization > Virtualize and then click the Files tab.
The list of available data sources appears. You can narrow down the displayed assets by using the available filters.
If you specified a bucket name when you set up the data source connection, click to expand the object storage connection details to see the Service type and Bucket information. If you didn't specify a bucket name when you set up the data source connection, you can use the Bucket input field to find a specific bucket in the endpoint.

If the list of data sources does not appear, click Refresh.
Select the object storage endpoint where you want to browse files and file paths.
For Cloud Object Storage, the endpoint is the URL for the object storage.
A list of file paths or buckets in the endpoint appears. You can navigate through the file path structure or you can click to see details of the contents of the first file in the file path.
You cannot add the bucket to your cart. To add a file path to your cart, you must select the file path to preview the files in the path, and then click Add to cart. You cannot select a file at the bucket level, you must add the file to a file path in the bucket.
Select the file or file path that you want to virtualize and click Add to cart.
Important:
- You cannot virtualize a single file in a file path that contains multiple files. The URL resolves to the parent path where the file is located and the entire file path is virtualized. If you want to virtualize a single file, you can move it to a separate file path. The separate file path must not be a sub-file path of any other file path that is, or will be, virtualized.
- Files that you want to virtualize must be within a file path and not at the same level as the bucket. For example, you cannot virtualize a file s3a://mynewbigsqlbucket/mydata.csv; you must put mydata.csv into a file path and virtualize s3a://mynewbigsqlbucket/fi1epath1/mydata.csv because the virtualization process cannot create an external table by using only a bucket name without a path.
Click View cart to preview your file data selections as a virtual file.
From this window, you can edit schema names, preview files that participate in a merged table, or remove a selection from your cart.
If you have IBM Knowledge Catalog installed, you can publish your virtual table to a catalog. For more information, see Publishing virtual data to the catalog in Data Virtualization.
Recommended: Update the type of partitioned columns from STRING to something more appropriate. Manually inspect and specify proper types for partitioning columns for best performance.
Optional: Click and select Edit columns.
You can edit any column name that is not tagged as a Partitioning column and change column types by using the drop-down menu. When you are happy with your edits, click Apply. Updated column names are shown after you virtualize the table.
Note: When you virtualize JSON files with Japanese data on IBM Cloud Object Storage and the Japanese column names are not displayed correctly, you can use the allownonalphanumeric option to view the virtualized Japanese column headers properly. This option is disabled by default and you must enable it. For more information, see Japanese column names are not displayed correctly in virtualized data.

Select the appropriate option to assign the virtual table to be created from file data:

Assign to	When to use this option
Project	Select Project if you created the virtual table to use in a specific project. Then, choose the appropriate project. The table also appears in Virtualized data.
Virtualized data	Select Virtualized data if the table was not created to use in a specific project. This setting is the default if no projects exist.

Select Publish to catalog if you also want to publish to a selected catalog.
A list of available catalogs is shown in the drop-down menu. Each catalog is tagged as Governed or Not governed.
Note: You must have at least one catalog in IBM Knowledge Catalog.
You must have permission to publish to a catalog. An administrator can enable whether all virtual objects are published to a selected governed catalog, which prevents a user from publishing to a specified catalog.
Specify a schema in the Schema field.
You can also create a schema by following these steps.
- If you have the Data Virtualization Engineer or User role, leave the Schema field as default to create a schema with your user ID.
- If you have the Data Virtualization Manager role, leave the Schema field as default to create a schema with your user ID or enter the new schema name in the Schema field.
For more information, see Creating schemas for virtual objects.
Click Virtualize to complete the process.
When the status window appears, you can select to view your virtualized data or virtualize more data.

What to do next

View the table structure and metadata.
Manage access to the table.
Edit column names and the types of your object storage assets so that you can prepare accurate data for virtualization.
Collect statistics for your virtualized table to optimize query performance. For more information, see Collecting statistics in Data Virtualization.
Optionally, on the Virtualized data page, publish your virtual object to the catalog. For more information, see Publishing virtual data to the catalog in Data Virtualization.