ETL is commonly associated with Data Warehousing projects, but in reality any form of bulk data movement from a source to a target can be considered ETL. Large enterprises often need to move application data from one system to another for data integration or data migration purposes. ETL testing is a data-centric testing process that validates that the data has been transformed and loaded into the target as expected.
ETL Testing is different from application testing because it requires a data-centric testing approach, which brings its own set of challenges.
Although the exact set of tests varies slightly from project to project, below are the most common types of tests performed in ETL Testing.
The purpose of Metadata Testing is to verify that the table definitions conform to the data model and application design specifications.
Verify that the table and column data type definitions are as per the data model design specifications.
Example: Data Model column data type is NUMBER but the database column data type is STRING (or VARCHAR).
Verify that the lengths of the database columns are as per the data model design specifications.
Example: Data Model specification for the ‘first_name’ column is of length 100 but the corresponding database table column is only 80 characters long.
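Where the database exposes a data dictionary, these data type and length checks can be scripted rather than inspected manually. A minimal sketch, assuming an Oracle-style dictionary (all_tab_columns) and an illustrative DW schema and CUSTOMER_DIM table; adjust the names and dictionary views for your database:
-- pull the actual column data types and lengths to compare against the data model specification
SELECT column_name, data_type, data_length, nullable
FROM all_tab_columns
WHERE owner = 'DW' AND table_name = 'CUSTOMER_DIM'
ORDER BY column_id;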
Verify that proper constraints and indexes are defined on the database tables as per the design specifications.
Example 1: A column was defined as ‘NOT NULL’ but it can be optional as per the design.
Example 2: Foreign key constraints were not defined on the database table resulting in orphan records in the child table.
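Constraint checks can also be driven from the data dictionary. A minimal sketch, assuming an Oracle dictionary and an illustrative DW.SALES_FACT table:
-- list the constraints defined on the table and the columns they cover
-- constraint_type: 'P' = primary key, 'R' = foreign key, 'C' = check / NOT NULL
SELECT c.constraint_name, c.constraint_type, cc.column_name
FROM all_constraints c
JOIN all_cons_columns cc ON cc.owner = c.owner AND cc.constraint_name = c.constraint_name
WHERE c.owner = 'DW' AND c.table_name = 'SALES_FACT'
ORDER BY c.constraint_type, cc.column_name;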
Verify that the names of the database metadata such as tables, columns, indexes are as per the naming standards.
Example: The naming standard for Fact tables is to end with an ‘_F’ but some of the fact table names end with ‘_FACT’.
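A simple dictionary query can flag tables that break the convention. A sketch, assuming an Oracle dictionary and a DW schema where fact tables should end in ‘_F’:
-- fact tables that end in '_FACT' instead of the standard '_F' suffix
SELECT table_name
FROM all_tables
WHERE owner = 'DW'
  AND table_name LIKE '%\_FACT' ESCAPE '\';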
Compare table and column metadata across environments to ensure that changes have been migrated appropriately.
Example: A new column added to the SALES fact table was not migrated from the Development to the Test environment resulting in ETL failures.
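When both environments are reachable from one session, the dictionary views can be diffed directly. A sketch, assuming an Oracle database link named DEV_DB pointing to the Development environment and a DW schema:
-- columns present in Development but missing or different in the current (Test) environment
SELECT table_name, column_name, data_type, data_length
FROM all_tab_columns@DEV_DB
WHERE owner = 'DW'
MINUS
SELECT table_name, column_name, data_type, data_length
FROM all_tab_columns
WHERE owner = 'DW';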
ETL Validator comes with Metadata Compare Wizard for automatically capturing and comparing Table Metadata.
The purpose of Data Completeness tests is to verify that all the expected data is loaded into the target from the source. Some of the tests that can be run are: comparing and validating counts, aggregates (min, max, sum, avg), and actual data between the source and target.
Compare the record counts of the primary source table and the target table. Check for any rejected records.
Example: A simple count of records comparison between the source and target tables.
Source Query
SELECT count(1) src_count FROM customer
Target Query
SELECT count(1) tgt_count FROM customer_dim
Column or attribute level data profiling is an effective way to compare source and target data without comparing the entire data set. It is similar to comparing the checksum of your source and target data. These tests are essential when testing large amounts of data.
Some of the common data profile comparisons that can be done between the source and target are:
Example 1: Compare the counts of non-null values between the source and target for each column, based on the mapping.
Source Query
SELECT count(row_id), count(fst_name), count(lst_name), avg(revenue) FROM customer
Target Query
SELECT count(row_id), count(first_name), count(last_name), avg(revenue) FROM customer_dim
Example 2: Compare the number of customers by country between the source and target.
Source Query
SELECT country, count(*) FROM customer GROUP BY country
Target Query
SELECT country_cd, count(*) FROM customer_dim GROUP BY country_cd
Compare data (values) between the source (database or flat file) and the target, effectively validating 100% of the data. In regulated industries such as finance and pharmaceuticals, 100% data validation might be a compliance requirement. It is also a key requirement for data migration projects. However, performing 100% data validation is a challenge when large volumes of data are involved. This is where ETL testing tools such as ETL Validator can be used, because they have an inbuilt ELV engine (Extract, Load, Validate) capable of comparing large volumes of data.
Example: Write a source query that matches the data in the target table after transformation.
Source Query
SELECT cust_id, fst_name, lst_name, fst_name||','||lst_name, DOB FROM Customer
Target Query
SELECT integration_id, first_name, last_name, full_name, date_of_birth FROM Customer_dim
ETL Validator comes with Data Profile Test Case, Component Test Case and Query Compare Test Case for automating the comparison of source and target data.
The purpose of Data Quality tests is to verify the accuracy of the data. Data profiling is used to identify data quality issues, and the ETL is designed to fix or handle these issues. However, source data keeps changing, and new data quality issues may be discovered even after the ETL is in production. Automating the data quality checks in the source and target systems is an important aspect of ETL execution and testing.
Look for duplicate rows with the same unique key column or the same unique combination of columns, as per the business requirements.
Example: Business requirement says that a combination of First Name, Last Name, Middle Name and Date of Birth should be unique.
Sample query to identify duplicates
SELECT fst_name, lst_name, mid_name, date_of_birth, count(1) FROM Customer GROUP BY fst_name, lst_name, mid_name, date_of_birth HAVING count(1)>1
Many database fields can contain a range of values that cannot be enumerated. However, there are reasonable constraints or rules that can be applied to detect situations where the data is clearly wrong. Instances of fields containing values violating the validation rules defined represent a quality gap that can impact ETL processing.
Example: Date of birth (DOB). This is defined as the DATE data type and can take any valid date. However, a DOB in the future, or more than 100 years in the past, is probably invalid. Also, the date of birth of a child should not be earlier than that of their parents.
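Such rules translate directly into data quality queries. A sketch for the DOB rule, assuming a Customer table with cust_id and date_of_birth columns and Oracle date functions:
-- customers with a date of birth in the future or more than 100 years in the past
SELECT cust_id, date_of_birth
FROM Customer
WHERE date_of_birth > sysdate
   OR date_of_birth < ADD_MONTHS(sysdate, -1200);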
This measurement addresses “keyed” relationships of entities within a domain. The goal of these checks is to identify orphan records in the child entity with a foreign key to the parent entity.
Example: In a data warehouse scenario, fact tables have foreign keys to the dimension tables. If an ETL process does a full refresh of the dimension tables while the fact table is not refreshed, the surrogate foreign keys in the fact table are not valid anymore. “Late arriving dimensions” is another scenario where a foreign key relationship mismatch might occur because the fact record gets loaded ahead of the dimension record.
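Orphan records can be detected with an outer join (or a MINUS) between the child and parent tables. A sketch, assuming an illustrative sales_fact table with a customer_wid foreign key to customer_dim.row_wid:
-- fact rows whose customer surrogate key has no matching dimension record
SELECT f.*
FROM sales_fact f
LEFT JOIN customer_dim d ON d.row_wid = f.customer_wid
WHERE d.row_wid IS NULL;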
Automate data quality testing using ETL Validator
ETL Validator comes with Data Rules Test Plan and Foreign Key Test Plan for automating the data quality testing.
Example: In a financial company, the interest earned on the savings account is dependent on the daily balance in the account for the month.
- Review the requirement and design for calculating the interest.
- Implement the logic using your favourite programming language.
- Compare your output with the data in the target table (see the sketch below).
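As a rough illustration of this white-box approach in SQL (the table names, column names, and the 2% annual rate are assumed for the example), the transformation can be re-implemented against the source data and the result compared with the target:
-- recompute the monthly interest from daily balances and flag mismatches against the target
SELECT s.account_id,
       ROUND(AVG(s.daily_balance) * 0.02 / 12, 2) AS expected_interest,
       t.interest_earned
FROM account_daily_balance s
JOIN interest_fact t ON t.account_id = s.account_id
GROUP BY s.account_id, t.interest_earned
HAVING ROUND(AVG(s.daily_balance) * 0.02 / 12, 2) <> t.interest_earned;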
Example: In a financial company, the interest earned on the savings account is dependent on the daily balance in the account for the month.
- Review the requirement for calculating the interest.
- Set up test data for various scenarios of daily account balance in the source system.
- Compare the transformed data in the target table with the expected values for the test data.
Automate data transformation testing using ETL Validator
ETL Validator comes with Component Test Case which can be used to test transformations using either the White Box approach or the Black Box approach.
Example 1: The length of a comments column in the source database was increased, but the ETL development team was not notified. Data started getting truncated in the production data warehouse for the comments column after this change was deployed in the source system.
Example 2: One of the indexes in the data warehouse was dropped accidentally, which resulted in performance issues in reports.
Example: In the data warehouse scenario, ETL changes are pushed on a periodic basis (e.g. monthly). The tester is tasked with regression testing the ETL. By following the steps outlined above, the tester can regression test the key ETLs.
Automate ETL regression testing using ETL Validator
ETL Validator comes with a Baseline and Compare Wizard which can be used to generate test cases for automatically baselining your target table data and comparing it with the new data. Using this approach, any changes to the target data can be identified. ETL Validator also comes with a Metadata Compare Wizard that can be used to track changes to table metadata over a period of time. This helps ensure that the QA and development teams are aware of the changes to table metadata in both the source and target systems.
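Without a tool, the same baseline-and-compare idea can be approximated by snapshotting the target before the change and diffing it after the new run. A minimal sketch, with the table name assumed:
CREATE TABLE customer_dim_baseline AS SELECT * FROM customer_dim;   -- snapshot before deploying the ETL change
-- ... deploy the ETL change and rerun the load ...
SELECT * FROM customer_dim_baseline
MINUS
SELECT * FROM customer_dim;   -- rows that changed or disappeared after the new run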
Example: Values in the country_code column should have a valid country code from a Country Code domain.
SELECT DISTINCT country_code FROM address MINUS SELECT country_code FROM country
Example: Compare Country Codes between development, test and production environments.
Example: A new country code has been added and an existing country code has been marked as deleted in the development environment, without approval from or notification to the data steward.
An ETL process is generally designed to run in either Full mode or Incremental mode. When running in Full mode, the ETL process truncates the target tables and reloads all (or most) of the data from the source systems. An Incremental ETL only loads the data that changed in the source system, using some kind of change capture mechanism to identify the changes. Incremental ETL is essential for reducing ETL run times and is the commonly used method for updating data on a regular basis. The purpose of Incremental ETL testing is to verify that updates on the sources are getting loaded into the target system properly.
While most of the data completeness and data transformation tests are relevant for incremental ETL testing, there are a few additional tests that apply. To start with, setting up test data for updates and inserts is key to testing Incremental ETL.
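As a simple illustration (the source table, column names, and key values are assumed for the example), the test data setup might include at least one update to an existing source record and one new record before triggering the incremental run:
-- update an existing customer so the change capture picks it up
UPDATE customer SET lst_name = 'Smith-Jones', updated_dt = sysdate WHERE cust_id = 1001;
-- insert a brand-new customer
INSERT INTO customer (cust_id, fst_name, lst_name, updated_dt) VALUES (2001, 'New', 'Customer', sysdate);
COMMIT;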
When a source record is updated, the incremental ETL should be able to look up the existing record in the target table and update it. If not, this can result in duplicates in the target table.
Example: Business requirement says that a combination of First Name, Last Name, Middle Name and Date of Birth should be unique.
Sample query to identify duplicates
SELECT fst_name, lst_name, mid_name, date_of_birth, count(1) FROM Customer GROUP BY fst_name, lst_name, mid_name, date_of_birth HAVING count(1)>1
Verify that the changed data values in the source are reflected correctly in the target data. Typically, the records updated by an ETL process are stamped by a run ID or the date of the ETL run. This date can be used to identify the newly updated or inserted records in the target system. Alternatively, all the records that got updated in the last few days in the source and target can be compared, based on the incremental ETL run frequency.
Example: Write a source query that matches the data in the target table after transformation.
Source Query
SELECT fst_name||','||lst_name FROM Customer WHERE updated_dt > sysdate-7
Target Query
SELECT full_name FROM Customer_dim where updated_dt>sysdate-7
Denormalization of data is quite common in a data warehouse environment. Source data is denormalized in the ETL so that the report performance can be improved. However, the denormalized values can get stale if the ETL process is not designed to update them based on changes in the source data.
Example: The Customer dimension in the data warehouse is denormalized to hold the latest customer address data. However, the incremental ETL for the Customer Dim was not designed to update the latest address data when the customer updates their address, because it was only designed to handle change capture on the Customer source table and not the Customer_address table. The customer address shown in the Customer Dim was correct when a Full ETL was run, but as customer address changes came in during the Incremental ETL runs, the data in the Customer Dim became stale.
Source Query
SELECT cust_id, address1, address2, city, state, country
FROM (
  SELECT cust_id, address1, address2, city, state, country,
         ROW_NUMBER() OVER (PARTITION BY cust_id ORDER BY created_date DESC NULLS LAST) addr_rank
  FROM Customer
)
WHERE addr_rank = 1
Target Query
SELECT cust_id, address1, address2, city, state, country FROM Customer_dim
While there are different types of slowly changing dimensions (SCD), testing an SCD Type 2 dimension presents a unique challenge since there can be multiple records with the same natural key. A Type 2 SCD is designed to create a new record whenever there is a change to a tracked set of columns. The latest record is tagged with a flag, and start date and end date columns indicate the period of relevance for each record. Typical checks for a Type 2 SCD verify, for example, that each natural key has no more than one current record and that the start and end dates of its historical records do not overlap.
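The current-record check, for instance, can be written directly against the dimension. A sketch, assuming a customer_dim with cust_id as the natural key and a current_flag column:
-- natural keys with more than one record flagged as current
SELECT cust_id, count(1) current_records
FROM customer_dim
WHERE current_flag = 'Y'
GROUP BY cust_id
HAVING count(1) > 1;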
ETL Validator comes with Benchmarking Capability in Component Test Case for automating the incremental ETL testing. Benchmarking capability allows the user to automatically compare the latest data in the target table with a previous copy to identify the differences. These differences can then be compared with the source data changes for validation.
Example: Let’s consider a data warehouse scenario for Case Management analytics using OBIEE as the BI tool. An executive report shows the number of Cases by Case type in OBIEE. However, during testing, when the number of cases was compared between the source, the target (data warehouse) and the OBIEE report, each of them showed a different value. As part of this testing, it is important to identify the key measures or data values that can be compared across the source, target and consuming application.
Example 1: A lookup might perform well when the data is small but can become a bottleneck that slows down the ETL task when there is a large volume of data. What can make it worse is that the ETL task may be running by itself for hours, causing the entire ETL process to run much longer than the expected SLA.
Example 2: An incremental ETL task was updating more records than it should. When the data volumes were low in the target table, it performed well, but when the data volumes increased, the updates slowed down the incremental ETL tremendously.