This article discusses how a migration from Oracle to Azure Synapse data warehouse can be validated from an End-to-End perspective.

Why migrate from Oracle to Azure Synapse Data Warehouse?

Many organizations continue to use Oracle databases for storing transactional and warehouse data in their on-premises data centers. However, with data growing exponentially, these environments struggle to handle the growing load, which requires more processing power and increased storage and adds to data costs. Migrating Oracle data to the cloud is therefore a very viable option for optimizing costs and performance, and migrating the Oracle database to Azure SQL Server (or Azure Synapse) is a popular alternative.

Migration from Oracle to Azure Synapse DWH is justified mainly by the following:

    • Reduced Licensing Costs (through server virtualization, database consolidation, and data compression, especially after exponential upscaling)
    • Superior Performance
    • Easier Implementation of In-Database Analytics

If you are looking to automate your ETL testing, try Datagaps ETL Validator, an ETL testing tool.

Request Free Trial

Primary Challenges in Migration and Points of Vulnerability

While Oracle and Azure Synapse are both RDBMSs with SQL as the main language, there are several differences that can cause issues during or after migration.
    • Oracle Specific Features
      • If any features such as Package Variables, Collections (Nested Tables, Objects, VARRAYs), System Packages, Wrapped Source Code, Row-level Triggers, and User-defined Types are used, they might not be properly translated in Azure Synapse.
    • Data Type Differences
      • Because the two systems interpret the ANSI SQL standard slightly differently, there are variations in data type conversions, especially for numeric and date data types and formats.
    • Index Migration
    • Performance Comparisons

Datatype Mapping

 

 
Azure Synapse | Oracle | Comment
BIGINT | NUMBER(19) | –
BINARY | RAW | –
BIT | NUMBER(3) | –
CHAR | CHAR | –
DATETIME | DATE | Fractional parts of a second are truncated
DECIMAL | NUMBER(p[,s]) | –
FLOAT | FLOAT(49) | –
IMAGE | LONG RAW | –
INTEGER | NUMBER(10) | NUMBER range is -2,147,483,647 to 2,147,483,647
MONEY | NUMBER(19,4) | –
NCHAR | NCHAR | –
NTEXT | LONG | –
NVARCHAR | NVARCHAR2 | –
NUMERIC | NUMBER(p[,s]) | –
REAL | FLOAT(23) | –
SMALLDATETIME | DATE | The value for seconds is returned as 0
SMALLMONEY | NUMBER(10,4) | –
SMALLINT | NUMBER(5) | NUMBER range is -32,767 to 32,767
TEXT | LONG | –
TIMESTAMP | RAW | –
TINYINT | NUMBER(3) | –
UNIQUEIDENTIFIER | CHAR(36) | –
VARBINARY | RAW | –
VARCHAR | VARCHAR2 | –

Pitfalls and Validation Techniques

Pitfalls/Drawbacks – Since retaining the accuracy of data is the utmost priority when loading data from a source to a target system, the data needs to be verified and kept consistent. If this is not taken care of, the overall ETL process can suffer from the following pitfalls, which can cost the business in different ways:

    • Cost of Migration
    • Time to Market
    • Poor Data Quality

Read: Pitfalls of Cloud Data Migration

Validation Techniques – Migrating data is not simply a process of moving data from the source system to the destination system; it is also the complex task of retaining data quality, which requires a dedicated validation team with quality assurance skills, trusted validation tools, and expertise in this domain.

From the validation point of view, this means that the data must be validated thoroughly end to end while being migrated from the existing system to the new destination system. Validation of already migrated data can be achieved by following the standard steps below.

Migration Validation Scope

Have a clear understanding and an accurate analysis of the scope of the migration test: what needs to be tested and when.

Understand the source system and its metadata.

Identify the subset of data to be validated. During validation, one must have a proper understanding of the volume of data that is moved from the source database to the destination database; this also helps in validating and analyzing performance issues. Key questions to ask (a simple profiling sketch follows the list):

    • What type of data is it?
    • What format is the data in? Are there multiple formats that you need to plan for?
    • What attributes does this data have?
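
To answer these questions, a quick profile of the source helps. The sketch below is run against an assumed Oracle source table named Customer; the table and column names are illustrative, not taken from any specific migration.

    -- Profile the assumed source table to understand volume and key attributes.
    SELECT COUNT(*)                     AS total_rows,
           COUNT(DISTINCT Customer_Key) AS distinct_keys,
           MIN(Customer_Key)            AS min_key,
           MAX(Customer_Key)            AS max_key
    FROM Customer;

    -- List the columns and data types of the source table from the Oracle data dictionary.
    SELECT column_name, data_type, data_length
    FROM user_tab_columns
    WHERE table_name = 'CUSTOMER';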

Data Transformation check

A data transformation check validates that the data transformation works in accordance with the mapping document and business rules, and that new tables and columns are created with the proper data types and lengths (as specified in the mapping document).
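
On the Azure Synapse side, the target metadata can be pulled and compared against the mapping document. The sketch below assumes an illustrative target table named dbo.Customer_Dim.

    -- List target column names, data types, lengths, precision, and scale
    -- for comparison against the mapping document.
    SELECT c.name AS column_name,
           t.name AS data_type,
           c.max_length,
           c.precision,
           c.scale
    FROM sys.columns c
    JOIN sys.types t ON t.user_type_id = c.user_type_id
    WHERE c.object_id = OBJECT_ID('dbo.Customer_Dim');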

Data Integrity and Uniqueness check

A. Data Integrity check: In data integrity validation, the key attributes are validated to check whether the referential integrity between parent and child tables is mapped correctly and whether any invalid data has been removed or loaded.

B. Data Uniqueness check: A data uniqueness check is essentially a duplicate check; it verifies the uniqueness of the data in a table/entity as per the data model rules (as specified in the mapping document).

Example: verify that the customer table holds authentic data and does not contain any duplicate records, based on ‘Customer_Key’ as the key column.
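
A minimal sketch of both checks, assuming an illustrative dimension table Customer_Dim keyed on Customer_Key and a child table Orders that references it:

    -- Uniqueness check: any Customer_Key returned here is duplicated.
    SELECT Customer_Key, COUNT(*) AS occurrences
    FROM Customer_Dim
    GROUP BY Customer_Key
    HAVING COUNT(*) > 1;

    -- Integrity check: child rows whose Customer_Key has no matching parent row.
    SELECT o.Customer_Key
    FROM Orders o
    LEFT JOIN Customer_Dim d ON d.Customer_Key = o.Customer_Key
    WHERE d.Customer_Key IS NULL;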

Data Completeness check

Traditionally, data completeness in a data warehouse is evaluated through ETL testing that uses aggregate functions (SUM, MAX, MIN, COUNT) to evaluate the completeness of a column or record. Throughout the evaluation process, it also needs to be verified that the data in the target system matches expectations after loading, which can be done through the checks below (a count-and-aggregate sketch follows the list):

  • Record count verification. 
  • Comparing source and target data sets. 
  • Aggregated records validation.  
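
The same query can be run on the Oracle source and on the Azure Synapse target and the two result sets compared. The table name Customer is illustrative.

    -- Record count and aggregate comparison: run on the source and on the target,
    -- then compare the outputs.
    SELECT COUNT(*)          AS row_count,
           SUM(Customer_Key) AS key_sum,
           MIN(Customer_Key) AS key_min,
           MAX(Customer_Key) AS key_max
    FROM Customer;
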
Data Quality Check

A data quality check is critical. In order to make sure that the data is correctly loaded into the destination tables/fields, it must be ensured that the application correctly rejects records, substitutes default values, corrects issues, ignores unwanted characters, and reports invalid records, if any.
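
A minimal sketch of simple quality probes on the target, again using the illustrative Customer_Dim table and its Name and City columns:

    -- NULL check on a mandatory column.
    SELECT COUNT(*) AS null_names
    FROM Customer_Dim
    WHERE Name IS NULL;

    -- Unwanted-character check: values with leading or trailing whitespace.
    SELECT Customer_Key, City
    FROM Customer_Dim
    WHERE City <> LTRIM(RTRIM(City));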

Thus, by following the above six steps, data migration validation can deliver error-free data without creating impediments to business decision-making.

Validation of Initial Data Migration

Before evaluating and checking the initial data migration, it is important to understand what initial data migration means. An initial load (also called a full load) truncates the target and loads the entire data set into the data warehouse whenever the ETL process/job runs. In this process, historical data is not maintained; only current data is kept in the data warehouse, and the old data is erased and replaced with new data on the next ETL run. This type of load is performed for Type 1 SCDs (Slowly Changing Dimensions).

Therefore, to validate the initial migration, follow the Validation Techniques as mentioned above.

Validation of ETL Changes

Since the data warehouse doesn’t contain any data initially, after the full/initial load, when the ETL process runs for the second, third, or fourth time, data is loaded incrementally: changes are appended to the data warehouse to keep historical/modified records.

A record is inserted only if it does not already exist in the data warehouse; otherwise, an update is applied only to the modified records.
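
This insert-or-update pattern is often expressed as a single MERGE statement, which Azure Synapse dedicated SQL pools support. The sketch below is illustrative only and assumes a staging table Customer_Stage and the target Customer_Dim keyed on Customer_Key; the actual load logic of a given ETL tool may differ.

    -- Upsert changed and new customers from staging into the dimension.
    MERGE INTO Customer_Dim AS tgt
    USING Customer_Stage AS src
        ON tgt.Customer_Key = src.Customer_Key
    WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.City <> src.City) THEN
        UPDATE SET tgt.Name = src.Name,
                   tgt.City = src.City
    WHEN NOT MATCHED THEN
        INSERT (Customer_Key, Name, City)
        VALUES (src.Customer_Key, src.Name, src.City);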

Validate data sources – Perform a record count check and verify that the table and column data types meet the specifications of the data model. Make sure keys are in place and duplicate data is removed. If this is not done correctly, the aggregate report could be inaccurate or misleading.

Validate constraints – This involves validating the constraints and making sure they are applied to the expected tables, including checking for misuse of integrity constraints such as foreign keys.

Validate nulls – This involves validating the full data set in the source and target tables using a minus query. Perform both source-minus-target and target-minus-source; if the minus query returns any rows, they should be treated as mismatched rows.
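
A minimal sketch of the minus-query comparison. Oracle uses MINUS while Azure Synapse uses EXCEPT, and in practice the two result sets usually have to be extracted and compared by a validation tool because the source and target sit in different systems. Table names are illustrative.

    -- Source minus target (Oracle syntax): rows present in the source but missing in the target.
    SELECT Customer_Key, Name, City FROM Customer
    MINUS
    SELECT Customer_Key, Name, City FROM Customer_Dim;

    -- Target minus source (Azure Synapse / T-SQL syntax): rows present in the target but not in the source.
    SELECT Customer_Key, Name, City FROM Customer_Dim
    EXCEPT
    SELECT Customer_Key, Name, City FROM Customer;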

Validation approach for SCD types

What are SCDs – Slowly Changing Dimensions?

    • A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse.
    • This applies to cases where the attribute for a record varies over time.

There are, in general, a few standard ways to solve this type of problem, and they are categorized as follows.

SCD Type 1:

SCD Type 1 is used to keep only the latest data; the old data is overwritten. In other words, no historical data is maintained.

Advantage – This is the easiest way to handle the Slowly Changing Dimension problem, as there is no requirement to keep track of old/historical records (data).

Disadvantage – With this methodology, it is not possible to track back through the history of changes made to a record over time. For example, in the scenario below, the company/organization will not be able to know that James was in New York earlier.

Example:

Source Table: Customer

 
Customer_Key 🔑 | Name | City
10001 | James | New York
10002 | Luis | Arizona
10003 | Alexander | Texas

 

After Migration:  

Let’s assume customer James has moved from New York City to California. In this case, SCD Type 1 overwrites the existing record with the newly updated record, as shown in the dimension table below.

 Target Table:  Customer_Dim 

 
Customer_Key 🔑 | Name | City
10001 | James | California
10002 | Luis | Arizona
10003 | Alexander | Texas

Validation to be performed:

Count Check – Validate that the count of updated records in the target table matches the count of updated records in the source table.

Duplicate Check – Validate that the uniqueness of key columns in the target (dimension) table is maintained.

Transformation Check – Validate that the transformation logic applied to individual columns is correct and that the records are loaded as per the mapping document.

Referential Integrity Check – Validate that the parent table (Primary Key column) has the same count as the child table (Foreign Key column).
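
A minimal sketch of these checks for the Type 1 example above, with Customer as the source and Customer_Dim as the target (names are illustrative):

    -- Count check: run the same COUNT on the source and compare the results.
    SELECT COUNT(*) AS target_rows FROM Customer_Dim;

    -- Duplicate check: the dimension key must remain unique after the overwrite.
    SELECT Customer_Key, COUNT(*) AS occurrences
    FROM Customer_Dim
    GROUP BY Customer_Key
    HAVING COUNT(*) > 1;

    -- Overwrite/transformation check: the target must hold the latest source values.
    SELECT s.Customer_Key
    FROM Customer s
    JOIN Customer_Dim d ON d.Customer_Key = s.Customer_Key
    WHERE s.City <> d.City OR s.Name <> d.Name;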

When to use SCD Type 1:  

A Slowly Changing Dimension Type 1 should be used only when it is not necessary for the data warehouse to keep track of historical changes.

SCD Type 2:

SCD Type 2 is used to keep the history of old data by adding a new record to the target table whenever a record is updated in the source table. Therefore, both the original and the new record are present, and the new record gets its own primary key.

Advantages – SCD type 2 allows us to accurately keep all historical information.  

Disadvantage – This causes the size of the table to grow quickly. In cases where the number of rows is very high to start with, storage and performance can become a concern.

Example:

Source Table: Customer  

 
Customer_Key 🔑 | Name | City
10001 | James | New York
10002 | Luis | Arizona
10003 | Alexander | Texas

  

After Migration:  

Let’s assume customer James has moved from New York City to California, and then from California to Bloomberg. In this case, SCD Type 2 adds new records alongside the existing record, as shown in the dimension table below.

Target Table: Customer_Dim  

Customer_Key 🔑 | Name | City
10001 | James | New York
10002 | Luis | Arizona
10003 | Alexander | Texas
10004 | James | California
10005 | James | Bloomberg

Validation to be done: 

Delta Check – Verify that the historical data is preserved and that new records are created for changed source rows.

Duplicate Check – Verify the uniqueness of records (minus query).

Referential Integrity Check – Validate that the parent table (Primary Key column) has the same count as the child table (Foreign Key column).
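
A minimal sketch of the Type 2 checks for the example above. In this simplified schema the business key is approximated by Name; a real Type 2 dimension would normally also carry a natural key, effective/expiry dates, or a current-record flag.

    -- Delta check: customers that changed in the source should have more than one version.
    SELECT Name, COUNT(*) AS version_count
    FROM Customer_Dim
    GROUP BY Name
    HAVING COUNT(*) > 1;

    -- Duplicate check: the surrogate key must still be unique.
    SELECT Customer_Key, COUNT(*) AS occurrences
    FROM Customer_Dim
    GROUP BY Customer_Key
    HAVING COUNT(*) > 1;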

When to use SCD Type 2: 

SCD Type 2 should be used when it is necessary for the data warehouse to track the historical changes.

SCD Type 3:

In a Type 3 Slowly Changing Dimension, a few technical columns such as Original_City, Current_City, and Effective_Date are added as new attributes to store the changed values.

The attribute Original_City indicates the original city of the customer (before the change), while the attribute Current_City indicates the present city (after the latest change). There is also a column that indicates when the current value became active.

Advantages:

      • SCD Type 3 does not increase the size of the table, since new information is updated in place in individual attributes.
      • This allows us to keep some part of history.

Note: SCD Type 3 is minimally used in an organization.  

Disadvantage – Type 3 cannot keep all historical changes when an attribute is changed more than once. For example, if James later moves to Texas on December 31, 2010, the California information will be lost.

Example:  

Source Table: Customer  

 
Customer_Key 🔑 | Name | City
10001 | James | New York
10002 | Luis | Arizona
10003 | Alexander | Texas

  

If any updates or changes are made in the source table, then the SCD Type 3 table will capture the records as shown below.  

Let’s assume James moved from New York to California; the original information gets updated, and the records are populated as in the table below (assuming the effective date of the change is January 01, 2010).

 Target Table: Customer_Dim  

 
Customer_Key 🔑 | Name | Original_City | Current_City | Effective_Date
10001 | James | New York | California | 01-01-2010
10002 | Luis | Arizona | |
10003 | Alexander | Texas | |

Validation to be done:

Delta Check – Verify that when an attribute changes, the previous value is retained in Original_City, the new value is loaded into Current_City, and the Effective_Date is populated.

Duplicate Check – Verify the uniqueness of key columns in the target (dimension) table.

Referential Integrity Check – Validate that the parent table (Primary Key column) has the same count as the child table (Foreign Key column).
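
A minimal sketch of the Type 3 delta check, using the column names from the example above:

    -- Changed rows must carry the old value, the new value, and an effective date.
    SELECT Customer_Key
    FROM Customer_Dim
    WHERE Current_City IS NOT NULL
      AND (Original_City IS NULL OR Effective_Date IS NULL);

    -- Unchanged rows should not carry an effective date.
    SELECT Customer_Key
    FROM Customer_Dim
    WHERE Current_City IS NULL AND Effective_Date IS NOT NULL;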

When to use SCD Type 3:

A Type 3 Slowly Changing Dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will occur only a finite (limited) number of times.

 

Conclusion:

Data warehouse migration validation is as important as retiring old technologies and migrating to the cloud. During the post-migration testing phase, the count of records in the source is first compared against the target; then it must be confirmed that the records were transformed during migration as per the business requirements (referring to the mapping documents), since any deviation can potentially harm the business.

Then, data reconciliation, duplicate checks, NULL checks, and referential integrity checks must be performed to ensure the quality of the data.

Datagaps
Established in 2010 with the mission of building trust in enterprise data and reports, Datagaps provides software for ETL Data Automation, Data Synchronization, Data Quality, Data Transformation, Test Data Generation, and BI Test Automation. We are an innovative company focused on providing the highest customer satisfaction, and we are passionate about data-driven test automation. Our flagship solutions, ETL Validator, DataFlow, and BI Validator, are designed to help customers automate the testing of ETL, BI, Database, Data Lake, Flat File, and XML data sources. Our tools support Snowflake, Tableau, Amazon Redshift, Oracle Analytics, Salesforce, Microsoft Power BI, Azure Synapse, SAP BusinessObjects, IBM Cognos, and other data warehousing projects and BI platforms. www.datagaps.com

Queries: contact@datagaps.com