This article discusses why and how to use both together, and dives into the challenges of Bulk Data Migration to Snowflake.
Why and How?
People are rapidly adopting cloud architectures for Data Warehouses and Machine Learning projects due to the economies of scale in the cloud. One obstacle in achieving rapid success is the data and data types inconsistencies between on-premise structures and the modern data stacks in the cloud.
When you migrate vast amounts of data to the cloud, the opportunity to introduce mistakes is multiplied due to these reasons and others. The earlier you catch the issues, the less costly it is to resolve the discrepancies.
This is where Datagaps come in to play.
This article describes ways in which we can test the data movement between Snowflake and on-premises data and between instances of Snowflake itself. We will also provide some benchmarks for doing comparison testing for large volumes of data moved into Snowflake. Another benefit of the Datagaps approach is, as additional data is moved into the Snowflake structure from other sources, Datagaps provides the ability to monitor the quality of your data structures by continually providing an up to date scoring so that you can determine when data is becoming corrupted.
Benefits of using Datagaps to test data movement into Snowflake
We recently sat down with one of our clients that uses our Dataflow product for testing the data migration from on-Prem SQL Server to Snowflake in the cloud running in AWS.
They started the initial migration by performing bulk loads from 400 SQL Server tables to Snowflake with minimal transformations. This was stage 0, where they could perform source and target data comparison for over 500 million rows of data per table. Making use of the Data Migration wizard, the client was able to generate comparison tests for 400 tables in just a few hours. Even though there were few changes at this stage, they still encountered errors that were surfaced by our
Examples: Issues include numeric precision differences, null value inconsistencies, truncation issues and character interpolations.
Next, they began to perform incremental new data migrations where they continued to find similar issues that had to be corrected. As this continued, they wanted to transition from this incremental new data migration from SQL Server to loading the new data directly into Snowflake to reap the benefits stated earlier. To accomplish this, their initial ETL processes needed to be migrated to an ELT process aimed at Snowflake.
Dataflow was used once again to check the accuracy between the two systems once the new processes were in place. The validations exposed issues in the new ELT process through several iterations until the transformation were in sync. After a short period of testing, they could cut over to the new system and deprecate the SQL server environment. Now Dataflow continued to validate the incremental data as it was moved into Snowflake, finding issues earlier in the cycle where they are less costly to fix in time and lost credibility.
How it makes a difference?
Many of our clients take this one step further by testing their BI tools against the old warehouse and the new Snowflake implementation. They do regression, performance, and stress testing using our BI Validator tool to compare the old with the new. They can find differences in the look and feel in the output generation of the reports and dashboards. Differences are exposed between the report queries when compared to a database query. Often this is the last task necessary to validate the migration process.
Making use of the inbuilt cluster integration with AWS EMR in the DataOps suite, the client was able to automatically scale the EMR cluster on demand to over 30 nodes depending on the data volumes being compared and scale down once the testing has been completed. This capability helped reduce the cost of the testing while still achieving high performance when comparing tables of size 500 Million Records
Reduction in Testing Time by
Improved Testing ROI by
Amount of data tested increased from manual testing of 10,000 sample records to complete testing of 500 M.
In conclusion, the goals of the migration project of agility, cost savings and performance improvements were achieved.
They also realized these benefits months earlier as a result of the improvement in the migration process due to the impact of the dataflow products contribution in an estimated 50% test cycle reduction.
One of our clients reports comparing a file against a Snowflake instance with
Node EMR Cluster
Established in the year 2010 with the mission of building trust in enterprise data & reports. Datagaps provides software for ETL Data Automation, Data Synchronization, Data Quality, Data Transformation, Test Data Generation, & BI Test Automation. An innovative company focused on providing the highest customer satisfaction. We are passionate about data-driven test automation. Our flagship solutions, ETL Validator, Data Flow and BI Validator are designed to help customers automate the testing of ETL, BI, Database, Data Lake, Flat File, & XML Data Sources. Our tools support Snowflake, Tableau, Amazon Redshift, Oracle Analytics, Salesforce, Microsoft Power BI, Azure Synapse, SAP BusinessObjects, IBM Cognos, etc., data warehousing projects, and BI platforms.