Data is useful only if its quality is trusted and continuously evaluated. You can monitor the data quality of a data asset on its Data quality page.
The same information as on the Data quality page is available when you click a data quality score in a metadata enrichment asset.
Requirements and restrictions
You can view data quality information for assets under the following circumstances.
Required services
Data quality requires the IBM Knowledge Catalog service. However, the data quality output from data quality rules is available only in the Dallas and Frankfurt regions. See Regional limitations for Cloud Pak for Data as a Service.
Required permissions
Your roles determine how you can interact with data quality:
- To view the Data quality page, you need any collaborator role in the workspace.
- To change the way that the scores are calculated, you must have the Admin or Editor role in the project.
- To create new data quality checks, you must have the Admin or Editor role in the project and the Manage data quality assets permission.
- To view the data that caused data quality issues (the output table) from the Data quality page, you must have the Drill down to issue details permission. However, the data asset in the project that is created for the output table is accessible by anyone who can access the connection. To limit access to this data asset, set up the connection to the data source where the output table is stored with personal credentials.
Workspaces
You can view data quality information in these workspaces:
- Projects
- Catalogs
Types of assets
These types of assets have data quality information:
- Data assets from relational or nonrelational databases from a connection to the data sources
- Data assets from partitioned data sets, where a partitioned data set consists of multiple files and is represented by a single folder uploaded from the local file system or from file-based connections to the data sources
- Data assets from files uploaded from the local file system or from file-based connections to the data sources, with these formats:
  - CSV
  - XLS, XLSM, XLSX (Only the first sheet in a workbook.)
  - TSV
  - Avro
  - ORC
  - Parquet
- IBM Match 360 entity data assets
Overview
On the Data quality page, you find information about a data asset's quality:
- The asset's overall data quality score. This is the weighted average of the scores provided by its columns. For more information, see Data quality scores.
- The scores for the individual dimensions. For each dimension, this is the weighted average of the corresponding dimension scores provided by the individual checks. The predefined data quality checks that are run as part of metadata enrichment have default dimensions assigned. See Predefined data quality checks. For data quality rules, you assign dimensions as required. For more information, see Data quality dimensions and Data quality scores.
- Trend information that shows how the overall quality or the quality score for a dimension changed over 30, 90, or 180 days. For more information, see Data quality analysis results.
- The list of data quality checks that were applied to the asset and their results. For more information, see Data quality analysis results.
- Data quality information for the individual columns. For more information, see Data quality analysis results.
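The weighted averaging described above can be illustrated with a minimal Python sketch. The weights and score values here are hypothetical; this page does not state how the product derives weights, so treat the function as an illustration of a weighted mean, not as the product's actual calculation.

```python
from typing import Iterable, Tuple


def weighted_average(scores_and_weights: Iterable[Tuple[float, float]]) -> float:
    """Return the weighted mean of (score, weight) pairs.

    Scores are expressed as fractions between 0 and 1.
    The weights are an assumption made for illustration only.
    """
    pairs = list(scores_and_weights)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        raise ValueError("at least one positive weight is required")
    return sum(score * weight for score, weight in pairs) / total_weight


# Hypothetical column scores paired with made-up weights.
column_scores = [(0.90, 2.0), (0.60, 1.0), (1.00, 1.0)]
overall = weighted_average(column_scores)
print(f"Overall score: {overall:.0%}")
```

The same function could be applied per dimension by passing the dimension scores of the individual checks instead of the column scores.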
The Data quality page in projects
The Data quality page is populated after the first data quality check is run on the data asset in one of these ways:
- Data quality analysis runs on the asset as part of metadata enrichment.
- A data quality rule runs on the asset.
- A connected IBM Match 360 entity data asset is added.
When an asset is imported from a catalog, only profile information is copied to the project. Data quality information is not copied.
The quality scores are recalculated and the data on this page is refreshed in these cases:
- Data quality analysis is run in the context of metadata enrichment.
- Data quality rules are run on the asset.
- A data quality rule that contributed to the scores is deleted. All issues that were returned by this data quality rule are removed.
- The asset profile is deleted on the asset's Profile page. All issues that were returned by the predefined data quality checks are removed.
The overall and dimension scores are also updated every time you change the Contributes to overall score setting for a check or a column. For more information, see Data quality scores.
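To make the effect of the Contributes to overall score setting concrete, the following Python sketch recomputes an overall score after one column is excluded. It assumes that an excluded item is simply left out of the average, which this page does not spell out. The `ScoredItem` class, the column names, and the equal weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredItem:
    """A check or column with a score (hypothetical structure)."""
    name: str
    score: float              # fraction between 0 and 1
    weight: float = 1.0       # assumed equal weights for illustration
    contributes: bool = True  # mirrors the "Contributes to overall score" setting


def overall_score(items: List[ScoredItem]) -> Optional[float]:
    """Weighted mean over contributing items; None if nothing contributes."""
    included = [item for item in items if item.contributes]
    total_weight = sum(item.weight for item in included)
    if total_weight == 0:
        return None
    return sum(item.score * item.weight for item in included) / total_weight


items = [
    ScoredItem("customer_id", 1.0),
    ScoredItem("email", 0.5),
    ScoredItem("notes", 0.3),
]
before = overall_score(items)

# Exclude the free-text column from the calculation and recompute.
items[2].contributes = False
after = overall_score(items)
```

In this sketch, excluding the low-scoring `notes` column raises the result, which shows why the scores on the page change as soon as the setting is toggled.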
You can immediately see when the quality scores were last updated.
In the Data quality checks section, you can see the following information:
- Which checks were run on the asset, sorted by date with the most recent checks at the top
- To which dimension each check is tied
- Whether a check was applied to the entire asset or to columns in the asset
- Information about the number of issues found
- Which kind of sampling was applied if any
- The data quality score that a check generated
- Whether the data quality score of a check is considered in the calculation of the overall asset score and the dimension scores
- When the check was last run
You can drill down into the results of each check except for IBM Match 360 matching. As a project administrator or editor, you can change whether each check contributes to the overall data quality score, and you can create new data quality checks. For more information, see Data quality analysis results.
You can switch between the Checks view and the Columns view. The Column overview section shows the following information for each column that was subject to any of the data quality checks:
- The column name
- The column's quality score for any of the dimensions that are applicable to the asset
- The number of checks that were run on a column
- Whether the column's data quality score is considered in the calculation of the overall asset score and the dimension scores
- When the column was last checked
You can then drill down into the data quality details for each column. As a project administrator or editor, you can also change whether each column's quality score contributes to the overall data quality score. For more information, see Data quality analysis results.
The Data quality page in catalogs
The Data quality page is initially populated when a data asset that has data quality information is published to the catalog. The page is empty for any asset that you directly add as a connected asset or that you upload from your local file system. To generate data quality information for such assets, add them to a project and run metadata enrichment or data quality rules on the assets. Then, publish them to the catalog.
The quality scores are updated and the data on this page is refreshed every time that the asset is published from a project with new data quality information.
You can immediately see when the quality scores were last updated.
The Data quality checks and Column overview sections provide the same information as the Data quality page in projects. However, you can't drill down into check or column details.
Learn more
- Predefined data quality checks
- Data quality analysis results
- Data quality dimensions
- Data quality scores