“A recent survey by TDWI found that 66% of organizations are looking for ways to improve data quality and trust. Data Validation using Data Quality testing tools such as Datagaps DataOps suite is essential to ensure trust in your data and analytics.”
The Data Quality dimensions provide a way to categorize data validation rules and measure data quality. Seven dimensions are commonly used: Completeness, Conformity, Validity, Accuracy, Uniqueness, Consistency, and Timeliness.
Completeness refers to the existence of all required attributes in the population of data records. A data element is either:

  1. Always required, or
  2. Required based on the condition of another data element.

Example:

  1. Person record with a null First Name
  2. A Person record is missing a value for marital status: the Married (Y/N) field should contain a non-null value of ‘Y’ or ‘N’ but is populated with a null value instead.
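
The completeness rule above can be sketched as a simple check. A minimal illustration in Python, assuming records are dictionaries; the field names (`first_name`, `married`) are placeholders, not actual DataOps rule definitions:

```python
# Completeness check: a record is "bad" when any required field is
# missing, None, or empty. Field names below are illustrative.
REQUIRED_FIELDS = ("first_name", "married")

def completeness_violations(records, required=REQUIRED_FIELDS):
    """Return the records missing a value for any required field."""
    bad = []
    for rec in records:
        if any(rec.get(field) in (None, "") for field in required):
            bad.append(rec)
    return bad
```

In a real rule, the "required based on another element" case would add a condition (for example, only requiring a spouse name when Married is ‘Y’).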

Conformity means the data follows standard data definitions such as data type, size, and format. All data values conform to the requirements of their respective fields.

Example:

  1. Date of Birth is listed as “26/05/1990” but should be in the format “mm/dd/yyyy”
  2. Zip Code contains letters but it should be numeric
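
Conformity checks like these are typically format tests. A hedged sketch in Python; the regular expression and field rules are illustrative assumptions, not the product's built-in definitions:

```python
import re

# Conformity checks: a value conforms when it matches the expected
# format for its field. Patterns below are illustrative assumptions.
DATE_MM_DD_YYYY = re.compile(r"^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$")

def conforms_date(value: str) -> bool:
    """True when the value is a date in mm/dd/yyyy format."""
    return bool(DATE_MM_DD_YYYY.match(value))

def conforms_zip(value: str) -> bool:
    """True when the value is a 5-digit numeric Zip Code."""
    return value.isdigit() and len(value) == 5
```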

Validity is determined by how closely data values correspond to reference tables, lists of golden values documented in metadata, value ranges, and so on. All data values are valid in relation to the reference data.

Example:

  1. Country Code should be a valid value from the reference data for countries
  2. Age for a Person should be less than 100 years old
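
A minimal sketch of these validity checks in Python; the country set is a tiny stand-in for a real reference table, and the age bound follows the example above:

```python
# Validity checks against reference data and a value range.
# The country list is an illustrative stand-in for a reference table.
VALID_COUNTRY_CODES = {"US", "CA", "GB", "IN"}

def is_valid_country(code: str) -> bool:
    """True when the Country Code exists in the reference data."""
    return code in VALID_COUNTRY_CODES

def is_valid_age(age: int) -> bool:
    """True when Age falls in the expected range (under 100)."""
    return 0 <= age < 100
```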

Accuracy refers to the degree to which information accurately reflects what’s being described. It can be measured against either original documents or authoritative sources and validated against defined business rules.

Example:

  1. US Zip Codes should match a list of legal US postal codes
  2. Person name is spelled incorrectly
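
Accuracy can be sketched as a comparison against an authoritative source. An illustrative Python fragment; the `authoritative` dictionary and field names are assumptions standing in for original documents or a system of record:

```python
# Accuracy check: compare stored values to an authoritative source.
# The "authoritative" mapping is an illustrative stand-in for original
# documents or a golden source system; key/field names are assumptions.
def accuracy_violations(records, authoritative, key="person_id", field="name"):
    """Return records whose field value disagrees with the source of truth."""
    return [r for r in records
            if r[key] in authoritative
            and r[field] != authoritative[r[key]][field]]
```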

Uniqueness refers to the singularity of records and/or attributes. The objective is a single (unique) recording of each piece of data: a data element is unique when there are no duplicate values.

Example:

  1. Each Person should only have one record, but there are two instances of the same Person with different identifiers or spellings.
  2. SSN should be unique, but there are two Person records that have the same social security number.
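
A minimal uniqueness check in Python, assuming records are dictionaries keyed by an illustrative `ssn` field:

```python
from collections import Counter

# Uniqueness check: flag values of a key column (e.g. SSN) that occur
# more than once across the record set. The key name is illustrative.
def duplicate_values(records, key="ssn"):
    """Return the set of key values shared by two or more records."""
    counts = Counter(rec[key] for rec in records)
    return {value for value, n in counts.items() if n > 1}
```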

Consistency means data across all systems reflects the same information and is in sync across the enterprise: when two or more representations of a thing are compared against its definition, there is no difference.

Example:

  1. Employee status is terminated but pay status is active
  2. Employee start date cannot be later than the Employee end date
  3. N number for a Person record must be the same across systems
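
The first two consistency rules can be sketched as cross-field checks. An illustrative Python fragment; the field names are assumptions:

```python
from datetime import date

# Consistency rules compare two representations of the same fact.
# Both functions return the inconsistent records; field names are
# illustrative assumptions, not actual schema definitions.
def status_inconsistencies(records):
    """Employees whose status is terminated but whose pay status is active."""
    return [r for r in records
            if r["employee_status"] == "terminated"
            and r["pay_status"] == "active"]

def date_inconsistencies(records):
    """Employees whose start date falls after their end date."""
    return [r for r in records
            if r["end_date"] is not None and r["start_date"] > r["end_date"]]
```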

Timeliness refers to whether the information is available when it is expected and needed.

Example:

  1. For quarterly reporting, data must be up to date by the time the data is extracted
  2. Last Review Date for the policy must be within the last three years
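
A minimal timeliness check in Python for the policy-review example; approximating the three-year window as 3 × 365 days is an assumption:

```python
from datetime import date

# Timeliness check: the Last Review Date must fall within the last
# three years of a reference date. 3 * 365 days is an approximation.
def is_timely(last_review: date, as_of: date, max_age_days: int = 3 * 365) -> bool:
    """True when the review date is recent enough relative to as_of."""
    return (as_of - last_review).days <= max_age_days
```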

How To Measure Data Quality Using Data Quality Dimensions?

The Datagaps DataOps suite automatically computes a Data Quality Score for each rule based on the ratio of good to bad records. This score is rolled up to the Data Quality Dimensions at the Table, Data Model, and System levels.

A sample dashboard showing the Data Quality trend is shown below:


Figure: Shows the Data Quality Dimensions and their scores at the system level

DataOps Data Quality comes with the following rule types to make it easy to define Data Validation rules:

  1. A customized SQL Query to identify good or bad records.
  2. A Query Builder for providing a SQL Expression that identifies bad records.
  3. Automatic identification of duplicate records, useful for measuring the Uniqueness of data elements.
  4. Referential Integrity checks on the data, useful for measuring Data Integrity.
  5. A check that data in a column conforms to a list of acceptable values, useful for measuring the Validity of data attributes.
  6. Configurable rules for identifying data quality issues without the need to write any SQL queries.
  7. A rule similar to the SQL Query rule but using a ‘with’ clause (CTE), which makes it more efficient to execute.
  8. Comparison of the previous snapshot of the data with the latest values.
  9. DataOps dataflows and test cases for comparing data across different systems in the enterprise, useful for ensuring that data consistency is maintained.

How To Compute Data Quality Score?

The Data Quality Score allows us to quickly understand the current state of our data and to compare quality over time. The score is a percentage calculated as:

Data Quality Score = [1 – (# of bad records / # of total records)] × 100
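
A direct transcription of this formula in Python; the empty-table convention is an assumption, not from the source:

```python
# Data Quality Score: the percentage of good records, per the formula above.
def data_quality_score(bad_records: int, total_records: int) -> float:
    if total_records == 0:
        return 100.0  # no records means nothing to flag (a convention)
    return (1 - bad_records / total_records) * 100
```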


Figure: Shows the trend of Data Quality scores at the system level

Conclusion

Automated Data Quality testing can be done using Data Validation rules. Data Quality Score provides a means to measure and track the Data Quality of your enterprise data at rest and in motion. Data Quality dimensions help categorize the data validation rules into meaningful buckets. DataOps Data Quality is a simple data validation testing tool that can be used to automate the Data Quality testing process.

Datagaps

Established in 2010 with the mission of building trust in enterprise data & reports, Datagaps provides software for ETL Data Automation, Data Synchronization, Data Quality, Data Transformation, Test Data Generation, & BI Test Automation. An innovative company focused on the highest customer satisfaction, we are passionate about data-driven test automation. Our flagship solutions, ETL Validator, Data Flow, and BI Validator, are designed to help customers automate the testing of ETL, BI, Database, Data Lake, Flat File, & XML Data Sources. Our tools support data warehousing projects and BI platforms including Snowflake, Tableau, Amazon Redshift, Oracle Analytics, Salesforce, Microsoft Power BI, Azure Synapse, SAP BusinessObjects, and IBM Cognos.
Queries: contact@datagaps.com