Oracle to Azure Synapse Data Warehouse Migration Validation


This article discusses how a migration from an Oracle database to an Azure Synapse data warehouse can be validated from an end-to-end perspective.

Why migrate from Oracle to Azure Synapse Data Warehouse?

Many organizations continue to use Oracle databases to store transactional and warehouse data in their on-premises data centers. However, with data growing exponentially, these environments struggle to handle the increasing load, which requires more processing power and more storage and adds to data costs. Migrating Oracle data to the cloud is therefore an extremely viable option for optimizing cost and performance, and migrating an Oracle database to Azure SQL Server (or Azure Synapse) is a popular alternative.

Migration from Oracle to an Azure Synapse DWH is mainly justified by the following:

  • Reduced Licensing Costs (especially after significant scaling, through server virtualization, database consolidation, and data compression)
  • Superior Performance
  • Easier Implementation of In-Database Analytics
If you are looking to automate your ETL testing, try the Datagaps ETL Validator testing tool.

Primary Challenges in Migration and Points of Vulnerability

While Oracle and Azure Synapse are both relational database management systems (RDBMS) with SQL as the main language, there are several differences that can cause issues during or after migration.
  • Oracle-Specific Features
    • If features such as Package Variables, Collections (Nested Tables, Objects, VARRAYs), System Packages, Wrapped Source Code, Row-Level Triggers, or User-Defined Types are used, they might not be translated properly in Azure Synapse.
  • Data Type Differences
    • Because the two systems interpret the ANSI SQL standard slightly differently, there are variations in data type conversions, especially for numeric and date data types and formats (see the mapping table below).
  • Migration of Indexes
  • Performance Comparisons
Azure Synapse | Oracle | Comment
BIGINT | NUMBER(19) | –
BINARY | RAW | –
BIT | NUMBER(3) | –
CHAR | CHAR | –
DATETIME | DATE | Fractional parts of a second are truncated
DECIMAL | NUMBER(p[,s]) | –
FLOAT | FLOAT(49) | –
IMAGE | LONG RAW | –
INTEGER | NUMBER(10) | INTEGER range is -2,147,483,648 to 2,147,483,647
MONEY | NUMBER(19,4) | –
NCHAR | NCHAR | –
NTEXT | LONG | –
NVARCHAR | NCHAR | –
NUMERIC | NUMBER(p[,s]) | –
REAL | FLOAT(23) | –
SMALLDATETIME | DATE | The value for seconds is returned as 0
SMALLMONEY | NUMBER(10,4) | –
SMALLINT | NUMBER(5) | SMALLINT range is -32,768 to 32,767
TEXT | LONG | –
TIMESTAMP | RAW | –
TINYINT | NUMBER(3) | –
UNIQUEIDENTIFIER | CHAR(36) | –
VARBINARY | RAW | –
VARCHAR | VARCHAR2 | –
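Because these conversions are a frequent source of silent truncation or precision loss, it helps to compare column metadata on both sides after migration. Below is a minimal sketch of such a check in Python; the oracledb/pyodbc drivers, the connection strings, and the SCOTT.CUSTOMER / dbo.CUSTOMER names are assumptions for illustration only.

```python
# Minimal sketch: compare column data types for one table between the
# Oracle source and the Azure Synapse target. Connection details,
# schema, and table names are placeholders.
import oracledb
import pyodbc

oracle_conn = oracledb.connect(
    user="scott", password="tiger", dsn="oracle-host/ORCLPDB1"
)
synapse_conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=dwh;UID=loader;PWD=secret"
)

def oracle_columns(conn, owner, table):
    # Oracle keeps column metadata in ALL_TAB_COLUMNS.
    cur = conn.cursor()
    cur.execute(
        "SELECT column_name, data_type FROM all_tab_columns "
        "WHERE owner = :1 AND table_name = :2",
        [owner, table],
    )
    return {name.upper(): dtype for name, dtype in cur.fetchall()}

def synapse_columns(conn, schema, table):
    # Synapse exposes column metadata through INFORMATION_SCHEMA.COLUMNS.
    cur = conn.cursor()
    cur.execute(
        "SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS "
        "WHERE TABLE_SCHEMA = ? AND TABLE_NAME = ?",
        schema, table,
    )
    return {name.upper(): dtype for name, dtype in cur.fetchall()}

source_types = oracle_columns(oracle_conn, "SCOTT", "CUSTOMER")
target_types = synapse_columns(synapse_conn, "dbo", "CUSTOMER")

# Print the mapping side by side so unexpected conversions stand out.
for column in sorted(set(source_types) | set(target_types)):
    print(f"{column}: oracle={source_types.get(column)} synapse={target_types.get(column)}")
```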

Pitfalls and Validation Techniques

Pitfalls/Drawbacks – Since retaining the accuracy of data is the top priority when loading data from a source system into a target system, the data must be verified and kept consistent. If this is not taken care of during the overall ETL process, the business can lose out in several areas, such as:

  • Cost of Migration
  • Time to Market
  • Poor Data Quality
Validation Techniques – Migrating data is not simply a process of moving data from the source system to the destination system; it is also the complex task of retaining the quality of that data, which requires a dedicated validation team with quality assurance skills, trusted validation tools, and expertise in this domain.
From the validation point of view, this means the data must be validated thoroughly end to end while it is migrated from the existing system to the new destination system. Validation of the migrated data can be achieved by following six standard steps: record count reconciliation, transformation checks against the mapping document, data reconciliation (source-minus-target comparison), duplicate checks, NULL checks, and referential integrity checks.
By following these six steps, data migration validation can be achieved with error-free data and without any impediments to business decision-making.

Validation of Initial Data Migration

Initial Data Migration Validation – Before evaluating and checking the initial data migration, it is important to understand what an initial data migration is. An initial load (also called a full load) truncates and reloads the entire data set into the data warehouse each time the ETL process/job runs. In this process, historical data is not maintained; only current data is kept in the data warehouse. The old data is erased and replaced with new data on the next ETL run. This type of load is typically performed for Type 1 SCDs (Slowly Changing Dimensions).
Therefore, to validate the initial migration, follow the Validation Techniques as mentioned above.
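As an illustration, a minimal row-count reconciliation between the Oracle source and the Synapse target might look like the sketch below, reusing the oracle_conn and synapse_conn connections from the earlier data type sketch; the table list is a placeholder.

```python
# Minimal sketch: row-count reconciliation after a full (initial) load,
# reusing oracle_conn and synapse_conn from the earlier sketch.
TABLES = ["CUSTOMER", "ORDERS", "PRODUCT"]  # placeholder table list

def row_count(conn, table):
    # COUNT(*) behaves the same in Oracle and Synapse SQL.
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]

for table in TABLES:
    source_count = row_count(oracle_conn, table)
    target_count = row_count(synapse_conn, table)
    status = "OK" if source_count == target_count else "MISMATCH"
    print(f"{table}: source={source_count} target={target_count} -> {status}")
```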

Validation of ETL Changes

Since the data warehouse contains no data initially, after the full/initial load the subsequent ETL runs load data incrementally: changes are appended to the data warehouse so that historical/modified records are kept.
A record is inserted only if it does not already exist in the data warehouse; otherwise, an update is performed only for the modified records.
Validate data sources – Perform a record count check and verify that table and column data types meet the specifications of the data model. Make sure keys are in place and remove duplicate data. If this is not done correctly, aggregate reports could be inaccurate or misleading.
Validate constraints – Validate the constraints and make sure they are applied to the expected tables. This includes checking for misuse of integrity constraints such as foreign keys.
Validate nulls and completeness – Validate the full data set in the source and target tables using a minus query. Perform both source-minus-target and target-minus-source comparisons; any rows returned by the minus query should be treated as mismatched rows (see the sketch below).
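A cross-database minus comparison cannot be run as a single SQL statement, so the sketch below pulls both data sets and diffs them in memory, again reusing the connections from the earlier sketches; the column list follows the Customer example used later in this article, and large tables would need hashing or partitioning instead.

```python
# Minimal sketch: source-minus-target and target-minus-source checks for
# one table, done in memory. Reuses oracle_conn and synapse_conn from the
# earlier sketches; table and column names are illustrative.
def fetch_rows(conn, query):
    cur = conn.cursor()
    cur.execute(query)
    return {tuple(row) for row in cur.fetchall()}

source_rows = fetch_rows(oracle_conn, "SELECT customer_key, name, city FROM customer")
target_rows = fetch_rows(synapse_conn, "SELECT Customer_Key, Name, City FROM Customer_Dim")

missing_in_target = source_rows - target_rows     # source MINUS target
unexpected_in_target = target_rows - source_rows  # target MINUS source

print(f"Rows missing in target: {len(missing_in_target)}")
print(f"Unexpected rows in target: {len(unexpected_in_target)}")
```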

Validation approach for SCD types

What are SCDs – Slowly Changing Dimensions?

  • A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse.
  • This applies to cases where the attribute for a record varies over time.
There are, in general, three ways to solve this type of problem, and they are categorized as follows.

SCD Type 1 is used to keep only the latest data; the old data is overwritten. In other words, no historical data is maintained.

Advantage – This is the easiest way to handle the Slowly Changing Dimension problem, as there is no requirement to keep track of old/historical records.

Disadvantage – With this methodology, it is not possible to track the history of changes made to a record over time. For example, in the tables below, the organization will not be able to know that James was previously in New York.

Example:

Source Table: Customer

Customer_Key 🔑  Name  City 
10001  James  New York 
10002  Luis  Arizona 
10003  Alexander  Texas 

After Migration:

Let’s assume customer James has moved from New York to California. In this case, SCD Type 1 will overwrite the existing record with the newly updated values, as shown in the dimension table below.

Target Table: Customer_Dim

Customer_Key 🔑  Name  City 
10001  James  California 
10002  Luis  Arizona 
10003  Alexander  Texas 

Validation to be performed:

Count Check – Validate that the count of updated records in the target table matches the count of updated records in the source table.

Duplicate Check – Validate that the uniqueness of key columns in the target (dimension) table is maintained.

Transformation Check – Validate that the transformation logic applied to individual columns is applied correctly and that records are loaded as per the mapping document.

Referential Integrity Check – Validate that the parent table (primary key column) and the child table (foreign key column) have matching counts (see the sketch below).
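The duplicate and referential integrity checks can be run directly against the Synapse target; a minimal sketch follows, reusing synapse_conn from the earlier sketches. The Customer_Dim table comes from the example above, while Orders_Fact is a hypothetical child fact table used only for illustration.

```python
# Minimal sketch: duplicate and referential integrity checks on the
# Synapse target, reusing synapse_conn from the earlier sketches.
checks = {
    # A dimension key should never appear more than once.
    "duplicate_keys": """
        SELECT Customer_Key, COUNT(*) AS cnt
        FROM Customer_Dim
        GROUP BY Customer_Key
        HAVING COUNT(*) > 1
    """,
    # Orders_Fact is a hypothetical child table referencing Customer_Dim;
    # any row returned here is an orphaned foreign key.
    "orphan_foreign_keys": """
        SELECT f.Customer_Key
        FROM Orders_Fact f
        LEFT JOIN Customer_Dim d ON f.Customer_Key = d.Customer_Key
        WHERE d.Customer_Key IS NULL
    """,
}

for name, sql in checks.items():
    cur = synapse_conn.cursor()
    cur.execute(sql)
    rows = cur.fetchall()
    print(f"{name}: {'PASS' if not rows else 'FAIL (' + str(len(rows)) + ' rows)'}")
```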

When to use SCD Type 1:

A Slowly Changing Dimension Type 1 should be used only when it is not necessary for the data warehouse to keep track of historical changes.

SCD Type 2 is used to keep the history of old data by adding a new record to the target table whenever a record is updated in the source table. Both the original and the new record are present, and the new record gets its own primary key.

Advantage – SCD type 2 allows us to accurately keep all historical information.

Disadvantages – This causes the size of the table to grow quickly. In cases where the number of rows is very high to start with, storage and performance can become a concern.

Example:

Source Table: Customer

 
Customer_Key 🔑  Name  City 
10001  James  New York 
10002  Luis  Arizona 
10003  Alexander  Texas 

After Migration:

Let’s assume customer James has moved from New York to California and later from California to Bloomberg. In this case, SCD Type 2 will add new records while keeping the existing record, as shown in the dimension table below.

Target Table: Customer_Dim

Customer_Key 🔑  Name  City 
10001  James  New York 
10002  Luis  Arizona 
10003  Alexander  Texas 
10004  James  California 
10005  James  Bloomberg 

Validation to be done:

Delta Check – Verify that historical data is preserved and that new records are created for changed rows.

Duplicate check – Verify the uniqueness of records (Minus Query).

Referential Integrity Check – Validate that the parent table (primary key column) and the child table (foreign key column) have matching counts (see the sketch below).
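As an illustration of the delta check, the sketch below compares the total number of dimension rows with the number of distinct business keys; with SCD Type 2 the difference is the number of preserved history versions. It reuses synapse_conn, and treating Name as the business key is an assumption taken from the simplified Customer_Dim example.

```python
# Minimal sketch: SCD Type 2 delta check on the Synapse target, reusing
# synapse_conn from the earlier sketches. 'Name' stands in for the
# natural/business key in this simplified example.
cur = synapse_conn.cursor()
cur.execute("""
    SELECT COUNT(*) AS total_rows,
           COUNT(DISTINCT Name) AS distinct_customers
    FROM Customer_Dim
""")
total_rows, distinct_customers = cur.fetchone()

# With SCD Type 2, total_rows >= distinct_customers; the difference is
# the number of history versions that have been preserved.
history_rows = total_rows - distinct_customers
print(f"dimension rows={total_rows}, customers={distinct_customers}, "
      f"history versions kept={history_rows}")
```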

When to use SCD Type 2:

SCD Type 2 should be used when it is necessary for the data warehouse to track the historical changes.

In a Type 3 Slowly Changing Dimension, a few technical columns such as Original_City, Current_City, and Effective_Date are added as new attributes to store the previous and current values.

The attribute Original_City indicates the original city of the customer (before the change), while the attribute Current_City indicates the present city (the latest change). There is also a column that indicates when the current value became active.

Advantage:

  • SCD Type 3 does not increase the size of the table, since new information is updated in an individual attribute.
  • This allows us to keep some part of history.

Note: SCD Type 3 is minimally used in an organization.

Disadvantages – Type 3 will not be able to keep all historical changes where an attribute is changed more than once. For example, if James later moves to Texas on December 31, 2010, the California information will get lost.

Example:

Source Table: Customer

 
Customer_Key 🔑 Name City 
10001 James New York 
10002 Luis Arizona 
10003 Alexander Texas 

If any updates or changes are made in the source table, then the SCD Type 3 table will capture the records as shown below.

Let’s assume James moved from New York to California; the original information gets updated, and the records are populated as shown in the table below (assuming the effective date of change is January 01, 2010).

Target Table: Customer_Dim

Customer_Key 🔑 Name Original_City Current_City Effective_Date 
10001 James New York California 01-01-2010 
10002 Luis Arizona     
10003 Alexander Texas     

Validation to be done:

Count Check – Validate that the count of records in the target table matches the count of records in the source table.

Duplicate Check – Validate that the uniqueness of key columns in the target (dimension) table is maintained.

Attribute Check – Validate that Original_City retains the prior value, Current_City holds the latest value, and Effective_Date is populated for changed records (see the sketch below).
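A minimal sketch of the attribute check is shown below, again reusing synapse_conn; the rule that a populated Current_City must be accompanied by an Effective_Date (and vice versa) is an assumption based on the example table above.

```python
# Minimal sketch: SCD Type 3 attribute check on the Synapse target,
# reusing synapse_conn from the earlier sketches. Flags rows where a
# change was recorded without an effective date, or the reverse.
cur = synapse_conn.cursor()
cur.execute("""
    SELECT Customer_Key, Original_City, Current_City, Effective_Date
    FROM Customer_Dim
    WHERE (Current_City IS NOT NULL AND Effective_Date IS NULL)
       OR (Current_City IS NULL AND Effective_Date IS NOT NULL)
""")
bad_rows = cur.fetchall()
print(f"SCD Type 3 attribute check: "
      f"{'PASS' if not bad_rows else 'FAIL (' + str(len(bad_rows)) + ' rows)'}")
```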

When to use SCD Type 3:

A Type 3 Slowly Changing Dimension should only be used when it is necessary for the data warehouse to track historical changes and when such changes will occur only a finite (limited) number of times.

Conclusion:

Data warehouse migration validation is as important as retiring old technologies and migrating to cloud technology. During the post-migration testing phase, the count of records in the source and target is compared first; then it must be confirmed that the records were transformed during migration as per the business requirements (referring to the mapping documents), since errors here can potentially harm the business.
Then data reconciliation, duplicate checks, NULL checks, and referential integrity checks must be performed to ensure the quality of the data.

Established in 2010 with the mission of building trust in enterprise data and reports, Datagaps provides software for ETL Data Automation, Data Synchronization, Data Quality, Data Transformation, Test Data Generation, and BI Test Automation. An innovative company focused on providing the highest customer satisfaction, we are passionate about data-driven test automation. Our flagship solutions, ETL Validator, DataFlow, and BI Validator, are designed to help customers automate the testing of ETL, BI, Database, Data Lake, Flat File, and XML data sources. Our tools support Snowflake, Tableau, Amazon Redshift, Oracle Analytics, Salesforce, Microsoft Power BI, Azure Synapse, SAP BusinessObjects, IBM Cognos, and more, across data warehousing projects and BI platforms. www.datagaps.com
