At a recent TDWI virtual summit on “Data Integration and Data Quality”, I attended a session titled “Continuous Data Validation: Five Best Practices” by Andrew Cardno.
In this session, Andrew Cardno, one of the adjunct faculty at TDWI talked about the importance of validating data from the whole to the part, which means that the metrics or total should be validated before reconciling the detailed data or drill-downs. For example, revenue totals by product type should be the same in Finance, CRM, and Reporting systems.
Attending this talk reminded me of a Data Warehouse project I worked on at one of the federal agencies. The source system was a Case Management system with a Data Warehouse for reporting. We noticed that one of the key metrics “Number of Cases by Case Type” yielded different results when queried on the source database, the data warehouse, and the reports. Such discrepancies undermine the trust in the reports and the underlying data. The reason for the mismatch can be an unwanted filter or wrong join or error during the ETL process.
When it comes to the federal agency this report is sent to congress and they have a congressional mandate to ensure that the numbers are correct. For other industries such as Healthcare and Financial, compliance requirements require the data to be consistent across multiple systems in the enterprise. It is essential to reconcile the metrics and the underlying data across various systems in the enterprise.
Andrew talks about two primary methods for performing Data Validation testing techniques to help instill trust in the data and analytics.
- Glassbox Data Validation Testing
- Blackbox Data Validation Testing
I will go over these Data Validation testing techniques in more detail below and explain how the Datagaps DataOps suite can help automate Data Validation testing.
Glassbox Data Validation Testing
Glassbox data validation is in a way similar to the Whitebox testing in software testing. Whitebox testing in software engineering requires an understanding of the internal structure of the software. Glassbox data validation also requires the analyst to have a good understanding of the reports, the dataflows feeding the reports, and their dependencies. Based on this understanding, data validation tests can be written for comparing the data from the source to the target or reports.
Going back to the above example of “Number of Cases by Case Type” in a simplified version of the dataflow with a source (case management system), target (data warehouse) and reporting system, the numbers might look like below:
It is evident from the numbers that data between the Data Warehouse and the Source systems do not match at all. This is quite often the case because the transaction database is part of a live application where the data can get modified at any given time. Unless the ETL process to refresh the data warehouse is near real-time, the data in Data Warehouse will differ slightly from the transactional data expected. What we need to do is compare the numbers between the transactional data and data warehouse allowing for minor differences or variations. Assuming a 0.5% variance in numbers can still be considered as a pass, the same data comparison will result in the failures as shown below.
Datagaps DataOps suite and BI Validator can be used to automate this type of test. BI Validator has built-in integration to read the report data from popular BI tools such as Power BI, Tableau, etc. The Data Compare test case in BI Validator provides an easy-to-use wizard for reading the report data and comparing it with the SQL query output. Please reach out to us or download a trial to learn more information on BI Validator and DataOps suite.
Blackbox Data Validation Testing
Blackbox Data Validation like the name indicates is similar to Blackbox testing in the world of application testing. Instead of trying to understand the dataflows and how the data got to the target or reports, BlackBox testing is about validating the data using a set of data quality rules. For the previous example in a case management system, this might mean that validating the data in the data warehouse or the reports without comparing it with the source system. There are two broad categories of data validation testing that can be performed for this type of testing as listed below.
Data Quality Rules are SQL queries that validate the datasets for a predefined set of conditions to identify good or bad records. Data Observability is a new term that is being referred to as the usage of anomaly detection techniques for data validation.
- Data Quality testing using predefined Data Quality Rules
- Data Observability using AI-based anomaly detection
Data Validation using Data Quality Rules
Data Quality rules are data quality standards that can be validated using SQL queries in most cases. The simplest of all such rules is the “Not Null” rule. This rule checks if a particular column in a dataset has values for all the records. This rule can be categorized into a Data Quality dimension called “Completeness”. Some of the other types of rules include Duplicate checks, Referential Integrity checks, Delta rules to compare previous values to current values, and reference data checks. Data Quality and Data Governance go hand-in-hand.
What are Data Quality Dimensions?
Data Quality dimensions are a means to categorize Data Quality rules and measure data quality. Data Quality score is a measure of good data in the dataset that the rule is validating. There are seven Data Quality dimensions as listed below:
DataOps Data Quality makes it easy to define Data Quality rules and assess the quality of your datasets. It automatically computes a Data Quality score based on the results and displays a trend of Data Quality scores in a dashboard.
Data Observability and Anomaly Detection
While Data Quality Rules measure the quality of the data based on predefined criteria, Data Observability uses AI to identify anomalies in data based on historical data. Take, for example, the daily COVID cases for each city and the total number of COVID cases reported so far. When a data analyst looks at this data on a daily basis, it is easy to make the following observations:
- The number of cases on the weekend is generally less than on weekdays
- Each city has its own wave of increasing or decreasing number of cases on a daily basis
- The total number of cases has to be greater than or equal to the previous day’s number
It is difficult to predefine the range for the number of daily cases in a given city and create Data Quality rule. This is where Data Observability comes into play. Using AI-based algorithms or statistical-based methods such as Standard Deviation or IQR, the number of daily COVID cases can be automatically predicted. If the number of cases does not fall in the predicted range, the number can be considered an anomaly.
As you ingest incremental data into your Data Lake or Data Warehouse, it is important to use a combination of Data Quality rules and Data Observability to determine the quality of the data before it is processed further. It is easier and cheaper to fix the data quality issues while the data is getting ingested compared to fixing it after the data has been ingested into the data warehouse and data marts.
DataOps Data Quality can be used to automatically detect anomalies in the data being ingested. It uses apache spark and an AI-based time-series prediction algorithm to come with the prediction that takes into account the seasonality of the data.
A recent survey by TDWI found that 66% of organizations are looking for ways to improve data quality and trust. Using the aforementioned Glassbox data validation testing and Blackbox data validation testing can ensure that the data is consistent across systems and data quality is validated accurately. Datagaps DataOps suite is the most comprehensive data validation testing tool that can be used for Data Quality testing, Data Observability, and Data Reconciliation testing.
Established in the year 2010 with the mission of building trust in enterprise data & reports. Datagaps provides software for ETL Data Automation, Data Synchronization, Data Quality, Data Transformation, Test Data Generation, & BI Test Automation. An innovative company focused on providing the highest customer satisfaction. We are passionate about data-driven test automation. Our flagship solutions, ETL Validator, Data Flow and BI Validator are designed to help customers automate the testing of ETL, BI, Database, Data Lake, Flat File, & XML Data Sources. Our tools support Snowflake, Tableau, Amazon Redshift, Oracle Analytics, Salesforce, Microsoft Power BI, Azure Synapse, SAP BusinessObjects, IBM Cognos, etc., data warehousing projects, and BI platforms.