
Pitfalls of Cloud Data Migration

“36% of data migration projects kept to the forecasted budget, and only 46% were delivered on time”

-Forbes

This statistic paints a grim picture of wasted effort. Yet cloud data migration, and cloud migration in general, remains a popular way to replace CapEx with OpEx. Given the volume of data being generated and collected, and the rise of AI and ML workloads, enterprises are looking for ways to scale their data storage.

Migrating terabytes or even petabytes of data from one location to another is a daunting task, and the challenge goes beyond the sheer number of bytes. As noted above, data migration is not without risk. Being aware of the common hurdles – data security, privacy, compliance, availability, and performance – that could potentially derail your project will increase the likelihood of a successful cloud data migration.

Let’s look at the common hurdles that lead to this dismal success rate.

Common Hurdles of Cloud Data Migration

Dealing With Dirty Data

Accept it: from the outset, almost every organization’s data has a variety of issues. The main culprits are duplicates and inconsistent or incomplete data. Things start to “get dirty” when you have more than 5% duplicates, along with broken referential integrity, truncated values, and mismatched data formats. When moving data into cloud data structures, traditional data types often do not match the destination formats. To reduce the overall cost of the migration, data needs to be validated before it is moved. For your records to be mapped accurately, address data inconsistencies and gaps before migration; fixing them in the new data structures is far more expensive and challenging.
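As a quick illustration (the table and column names below are hypothetical, not from any specific system), a couple of simple profiling queries run against the source before migration can surface duplicates and broken references early:

-- Customer records sharing the same natural key are candidate duplicates
SELECT email, COUNT(*) AS dup_count
FROM src_customer
GROUP BY email
HAVING COUNT(*) > 1;

-- Orders pointing to a customer that no longer exists indicate broken referential integrity
SELECT o.order_id
FROM src_order o
LEFT JOIN src_customer c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;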

Failure To Choose The Proper Storage

Selecting the wrong storage option hinders the effective operation of applications. When an enterprise migrates its data and applications to the cloud, the team must vet the cloud storage options suitable for its operations. When deciding on a provider, enterprises typically look to simplify data management, accelerate insights, or lower costs. AWS, Azure, and Google Cloud all provide Infrastructure-as-a-Service (IaaS) cloud storage with several storage classes to choose from.


Mapping Old Data with New Cloud Applications

Modern data platforms support multiple workloads, such as data warehousing and AI/ML, on a single cloud offering like Databricks or Snowflake. This is an entirely new way of thinking about where workloads run and on which data platform. To take advantage of these modern platforms, companies need to know how to move to these architectures and what testing is required to ensure the process runs effectively. For example, these modern data stacks handle new data elements such as social media content, blobs, video, and other data types.

Business workflows can change in many ways: a single field in your legacy database may become two fields in the new database (or vice versa), and field names may change, which can create enormous confusion during the migration. Before it gets more complex, the best way forward is to choose a destination in the new database for each piece of data, map it from where it currently lives in the legacy system, and keep that mapping documented.
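As a small, hypothetical example of such a mapping (the legacy FULL_NAME column is split into two target fields and CUST_NO is renamed; the string functions shown are Oracle-style and vary by database):

-- Legacy FULL_NAME becomes FIRST_NAME and LAST_NAME; CUST_NO becomes CUSTOMER_ID
SELECT
  CUST_NO AS CUSTOMER_ID,
  SUBSTR(FULL_NAME, 1, INSTR(FULL_NAME, ' ') - 1) AS FIRST_NAME,
  SUBSTR(FULL_NAME, INSTR(FULL_NAME, ' ') + 1) AS LAST_NAME
FROM LEGACY_CUSTOMER;

Documenting each such mapping up front makes it possible to test the migrated data against the rule later.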


Security and Compliance Adjustments

61% of companies listed security as a primary concern for not moving to the cloud.

While migrating to the cloud, enterprises must comply with new standards and acquire the necessary licenses to avoid security infringements down the road. To hold data in the cloud, enterprises have to develop a comprehensive security clearance system for various levels of users. There is always high-value intellectual property that may be leaked, lost, or otherwise accessed by unauthorized users. Moving data from one platform to another carries significant risk without the right protocols and plans in place, and a breach can seriously damage a company’s reputation or invite lawsuits.


Why is Quality Assurance playing a crucial role in Cloud Data Migration?

How do you verify that all data was moved and loaded, and that it meets all rules for data accuracy? This has to be planned and implemented before the data is moved to the cloud. Think about what can go wrong, and about the cost of recovery when issues surface only after the move.

Whether you are migrating your data from legacy systems to a new system, to the cloud, or from one vendor’s software to another’s, it has always been one of the most challenging initiatives for IT managers. Data accuracy is a key aspect that should be validated through planned testing when loading data from a source into a target system.

Here are a few alarming data migration statistics:

  • 30% of migrations have missing or lost data
  • 40% have some form of data corruption
  • 64% of migration projects run into unexpected outages or downtime, and no one disputes how costly that downtime can be

 

So, how do you mitigate these risks?

  • Check whether all the required data was transferred according to the requirements (a simple reconciliation sketch follows this list).
  • Make sure destination tables are populated with accurate values.
  • Validate that there is no data loss, except where the requirements call for it.
  • Verify the behavior and performance of any custom scripts.
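A minimal reconciliation sketch for the first two points, assuming a hypothetical orders table that exists in both the source and the cloud target, looks like this:

-- Row counts should match between source and target for the same filter criteria
SELECT COUNT(*) FROM source_db.orders;
SELECT COUNT(*) FROM cloud_db.orders;

-- Simple aggregates on key columns catch silent truncation or corruption that counts alone miss
SELECT SUM(order_amount), MIN(order_date), MAX(order_date) FROM source_db.orders;
SELECT SUM(order_amount), MIN(order_date), MAX(order_date) FROM cloud_db.orders;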

These are also the objectives of data migration testing. Migrating data is complex work for a QA team, requiring skill, expertise, tools, and resources; it is not a simple transfer of information from one storage system to another. You need a thorough validation and testing strategy to reduce risk and confirm that the data has been migrated and transformed correctly. The sooner the QA team starts analyzing, the sooner issues can be revealed and resolved.

Plan A Seamless Cloud Data Migration With Customized Testing Approach

Undoubtedly, the process of cloud migration is fraught with challenges, technical, economic, and personnel-related alike. Engaging with Datagaps can help you overcome these hurdles. At Datagaps, we leverage our experience testing large-scale data warehousing and business intelligence applications to help you perform comprehensive testing and verify that your data remains functional, stable, scalable, and compatible in the target cloud environment.

Our differentiators are our products: Datagaps ETL Validator and DataOps Dataflow. ETL Validator generates hundreds of test cases automatically using its Data Migration wizards, while Dataflow, built on Apache Spark, can compare billions of records for cloud Big Data testing. Together they strengthen your cloud data migration testing strategy and help ensure data accuracy and reliability.

To achieve 100% data validation for data migration projects, talk to our experts


Datagaps

Datagaps was established in 2010 with the mission of building trust in enterprise data and reports. We provide software for ETL data automation, data synchronization, data quality, data transformation, test data generation, and BI test automation. An innovative company focused on the highest customer satisfaction, we are passionate about data-driven test automation. Our flagship solutions, ETL Validator, Data Flow, and BI Validator, are designed to help customers automate the testing of ETL, BI, database, data lake, flat file, and XML data sources. Our tools support data warehousing projects and BI platforms including Snowflake, Tableau, Amazon Redshift, Oracle Analytics, Salesforce, Microsoft Power BI, Azure Synapse, SAP BusinessObjects, and IBM Cognos.

 

Queries: contact@datagaps.com


How Do You Automate Big Data Testing? Everything To Know

Big Data has passed the tipping point. It’s no longer a fringe interest being practiced by data scientists. The Big Data industry is currently on track to be worth $77 billion by 2023.

There’s a saying in programming and data analysis – “Garbage In, Garbage Out.” That is to say, your data analytics are only as good as the data that fuels them. This is why Big Data testing is so important.

Testing your Big Data by hand rather defeats the purpose of data-driven strategies in the first place. It might not even be possible to assess all of your data, depending on how much there is. That’s where Big Data testing tools come into play.

Guide To Big Data Testing

Back in 2018, 92% of businesses reported wanting to incorporate Big Data automation testing tools by 2020. Clearly, this is something that’s been on tech-savvy business owners’ minds for some time. Luckily, with today’s Big Data testing tools, this is more feasible than ever for businesses of all sizes.

Data testing is fairly simple and straightforward for routine data applications. Repetitive business practices like forms are highly predictable. A simple program would likely be enough to catch any potential errors in structured data.

Much business data, however, is unstructured or semi-structured; it is estimated that around 80% of the data businesses collect is unstructured or semi-structured, such as JSON.

Here are some steps you can take to incorporate an automated Cloud Big Data testing tool in your data pipeline.

Incorporate an ETL Testing Tool

At the beginning of your data pipeline, it’s highly recommended you incorporate an extract, transform, and load (ETL) testing tool. An ETL testing tool can be configured to monitor an incoming data stream for data relevant to your business.

Once this data is gathered, an ETL testing tool will transform the data into a format suitable for your Big Data cloud platform. Once it’s clean, it’s loaded into your data analytics environment.
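A conformity check is one example of the kind of rule you might configure at this stage. The sketch below assumes a hypothetical staging table with raw date strings; the regular-expression syntax varies by database:

-- Staged rows whose date strings do not conform to the expected YYYY-MM-DD format
SELECT record_id, event_date_raw
FROM stg_events
WHERE NOT REGEXP_LIKE(event_date_raw, '^\d{4}-\d{2}-\d{2}$');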

Implement A Big Data Testing Strategy

You’ll also want to put a solution in place to make sure your Big Data testing tools are functioning properly. This presents certain challenges when dealing with the monumental amount of data that Big Data involves.

A Big Data testing strategy usually involves putting conditions in place to make sure you’re getting the data you need. Some examples of common data constraints could include:

  • Conformity
  • Accuracy
  • Duplication
  • Consistency
  • Completeness

Trying to assess every byte of data, however, could slow your Big Data analytics to a crawl. You’ll also want to decide on a testing scope that is representative of the entire body of data. You might test every 10th entry, for instance, and have a subroutine kick in if the error rate rises above a certain threshold.
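For instance, a sampling check along these lines (the table and columns are hypothetical, and MOD syntax varies by database) tests every 10th record and reports the error rate for one completeness rule:

-- Percentage of sampled records that are missing a required field
SELECT
  SUM(CASE WHEN customer_email IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_missing_email
FROM events
WHERE MOD(event_id, 10) = 0;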


Structure Each Source

To get the most accurate Big Data testing, you should configure each data entry point so that the data arrives correctly formatted. Say you wanted to collect data from your blog for analysis. Examples of data you might collect from blog posts include:

  • Publication date
  • Word count
  • Time published
  • SEO metadata
  • Social shares

You should spend some time figuring out where you want to collect data from when you’re compiling your data testing strategy. Once you’ve got a list of where your data is coming from, you should then think about what data you want to harvest from that particular source.

Taking the time to answer these questions will help you set up your ETL tool properly. When all of these steps have been handled correctly, your Big Data pipeline can truly deliver automated insights!

Consolidate Data

Some of your data streams are likely to contain repeated data or to monitor the same assets. Leaving all of that data unstructured will bog down your data analytics platform significantly, so you may want to implement an additional abstraction layer for further processing.

Say you’re analyzing temperature data from a list of cities. This data is often entered as a key-value pair, with the name of the city acting as the key and the temperature as the value.

Depending on where this data is coming from, these values could be returned at a specified rate; if it’s coming from a scientific sensor, it might be a continuous stream of data. For starters, you’ll want to determine the scope of the data you want returned to your testing platform.

Setting up an additional layer for each city makes this problem relatively simple to solve. All of the data for that particular city would be returned to that city’s specific layer. You can put additional constraints in place to make sure the data is in a usable and useful form.

Say you put a filter in place to return only the highest and lowest temperature for a particular city. Now it doesn’t matter whether the data arrives as a continuous stream from a sensor or is collected periodically; all of your data will work nicely together.
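A rough sketch of that consolidation, assuming the readings land in a single hypothetical table, is just an aggregation per city:

-- One row per city, regardless of how many readings each source produced
SELECT city, MIN(temperature) AS low_temp, MAX(temperature) AS high_temp
FROM temperature_readings
GROUP BY city;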

It also makes it so that your Big Data testing platform can receive data from however many sources you like. This is essential for making your Big Data testing solution scalable and adaptable to any solution you apply it to!

These are just a few examples of how the right data testing tools and the proper foresight set you and your data-driven business up for success. They ensure your data is formatted properly no matter how much there is or where it comes from.

Are You Looking For Big Data Testing Tools?

Big Data is quickly making science fiction become science fact. Disciplines like machine learning and artificial intelligence were still in the realm of sci-fi even 10 years ago. Now they’re available for anybody to benefit from!

If you’re ready to find out how data-driven tools like Big Data testing can empower you and your business, Sign Up for a Demo today!



Testing Type 2 Slowly Changing Dimensions using ETL Validator

Type 2 Slowly Changing Dimensions are used in data warehouses to track changes to the data by preserving historical values. This is achieved by creating a new record in the dimension whenever a value in the set of key columns is modified, and by maintaining start and end dates for the records. The latest records are identified either by querying for records that are not end-dated or by maintaining a flag (e.g. current_flg) to easily identify them.

Testing SCD Type 2 Dimensions

Testing SCD Type 2 dimensions is tricky because it cannot be done with a simple comparison of the source and target data. In this article, we will examine different aspects of a Type 2 SCD that can be tested using ETL Validator.

For the sake of this article, let’s consider an Employee dimension (EMPLOYEE_DIM) of SCD Type 2 which is sourced from a table called EMPLOYEE in the source system.

EMPLOYEE Table has the following columns
ROW_ID
EMP_NO
FIRST_NAME
LAST_NAME
SSN
DOB
JOB_TITLE
SALARY

EMPLOYEE_DIM SCD Type 2 Dimension Table has the following columns
ROW_WID
EMP_NO
FIRST_NAME
LAST_NAME
SSN
DOB
JOB_TITLE
SALARY
START_DT
END_DT
CURRENT_FLG
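Before looking at the tests, here is a simplified sketch (not generated by ETL Validator; the syntax and bind variables are for illustration only) of what the ETL process typically does when a tracked value changes for an employee:

-- Expire the existing current record for the changed employee
UPDATE EMPLOYEE_DIM
SET END_DT = CURRENT_DATE, CURRENT_FLG = 'N'
WHERE EMP_NO = :emp_no AND CURRENT_FLG = 'Y';

-- Insert a new current record carrying the changed values
INSERT INTO EMPLOYEE_DIM (ROW_WID, EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY, START_DT, END_DT, CURRENT_FLG)
VALUES (:new_row_wid, :emp_no, :first_name, :last_name, :ssn, :dob, :job_title, :salary, CURRENT_DATE, NULL, 'Y');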

Test 1: Verifying the Current Data

Use a Query Compare test case in ETL Validator to compare the current data records in the SCD Type 2 Employee_Dim with the data in the source Employee table.

Source Query : select EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY from EMPLOYEE

Target Query : select EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY from EMPLOYEE_DIM where CURRENT_FLG = 'Y'

Test 2: Verifying the uniqueness of the key columns in the SCD

The combination of the key columns in the SCD should be unique. In the above example, the columns EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, and SALARY comprise a unique key in the EMPLOYEE_DIM dimension. This can be easily verified using the Duplicate Check Rule in the Data Rules test plan of ETL Validator. The query generated by ETL Validator for the Duplicate Check Rule should look something like this:

Select EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY, COUNT(*) CNT from EMPLOYEE_DIM group by EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY having COUNT(*)>1

This query should not return any rows.

Test 3: Verifying that historical data is preserved and new records are getting created

Whenever there is a change to the values in the key columns, a new record should be inserted into EMPLOYEE_DIM and the old record should be end-dated. ETL Validator’s Component Test Case can be used to verify this functionality: its Baseline and Compare feature identifies the changes in EMPLOYEE_DIM.
Below are the steps:

  1. Create a Component test case and take a snapshot of the current values in the EMPLOYEE_DIM (called Baseline).
  2. Modify a few records in the source EMPLOYEE table by updating the values in the key columns such as SALARY, LAST_NAME.
  3. Execute the ETL process so that the EMPLOYEE_DIM has the latest data.
  4. Run the Component test case to compare the Baseline data with the Result table and identify the differences. Verify that the differences are as expected (a couple of complementary SQL spot checks are sketched below).
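The spot checks referenced in step 4 can be written directly against EMPLOYEE_DIM; both queries should return no rows:

-- No employee should have more than one current record
SELECT EMP_NO, COUNT(*) AS current_rows
FROM EMPLOYEE_DIM
WHERE CURRENT_FLG = 'Y'
GROUP BY EMP_NO
HAVING COUNT(*) > 1;

-- Historical (non-current) records should always be end-dated
SELECT EMP_NO
FROM EMPLOYEE_DIM
WHERE CURRENT_FLG = 'N' AND END_DT IS NULL;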

ETL Validator thus provides a complete framework for automating the testing of SCD Type 2 dimensions.


Connecting to Mongo DB using Apache Drill in ETL Validator

Over the last few years, we have been working with a number of customers across various industries such as Life Sciences, Financial Services, and Higher Education. While the problems related to data testing are similar across industries, and JSON and flat files are very common, there are many differences in file formats. In a few industries Avro is popular, while in others, depending on the use case, Parquet is more commonly used.

At Datagaps, one of the key challenges is empowering customers to test any of these formats with ease. Of course, we could build native connections for each format, but that is not always the best option. We prefer to leverage open standards and open source frameworks to support our customers, and this is where Apache Drill (https://drill.apache.org/) comes into play.

Over the last six months, we have been drilling around a bit and absolutely love the speed and flexibility that Apache Drill provides. As of version 3.4.5, we use Drill as the interface between ETL Validator and any of the file formats mentioned above (except flat files, which are much more common and deserve native connectors). In this blog, I want to take a few minutes to explain how easy it is to get started with Drill and integrate it with ETL Validator.

Assumptions

1. You are running this proof of concept on Windows.

2. You have ETL Validator Complete 3.4.5 or higher installed on the machine.

3. You have Mongo running on the same machine.

Apache Drill Installation

Step 1: Go to https://drill.apache.org/ and get the latest version of Drill. In this blog, we assume Drill version 1.9.

Step 2: Using 7z or a similar tool, un-archive the file.

Step 3: Navigate to the bin folder of Apache Drill and execute the following command: "sqlline -u jdbc:drill:zk=local". This starts the Apache Drill server, and you are now ready to define your Mongo database. Once Drill has started successfully, you should see something like this:

Step 4: Launch the web interface to Drill using http://localhost:8047. This interface should be available once the command in step 3 has run successfully and should look like this:

Step 5: Navigate to the "Storage" tab and enable the mongo storage plugin in Apache Drill. Since Mongo is running on the same machine, you do not have to make any changes. Your Mongo setting should look like this:

Setting up Mongo connection in ETL Validator using Drill

Step 1: Launch ETL Validator

Step 2: Create a new “Mongo through Drill” Connection.

  • Name: Can be anything you want.
  • User Name & Password: You can enter null/null if there is no username/password. This is a known issue and will be fixed in upcoming versions.
  • Connection Type: Leave it as Auto
  • Host Name: You can provide localhost.
  • Port: Leave the default port on which Drill is running.
  • Retrieve and select a schema from Mongo.

Step 3: Test the Connection.

Now you are ready to execute Query Comparison or other test plans in ETL Validator. A few sample queries are below. Depending on the complexity of the JSON structure, you may need more time to work out the right query.

  • SELECT * from mongo.test.`addresses`;
  • SELECT  ad.address.building FROM mongo.test.addresses as ad
  • SELECT  ad.address.coord FROM mongo.test.addresses as ad

The above queries assume that the JSON is as follows and these documents are in the mongo.test schema.

{"address": {"building": "1007", "coord": [-73.856077, 40.848447], "street": "Morris Park Ave", "zipcode": "10462"}, "borough": "Bronx", "cuisine": "Bakery", "grades": [{"date": {"$date": 1393804800000}, "grade": "A", "score": 2}, {"date": {"$date": 1378857600000}, "grade": "A", "score": 6}, {"date": {"$date": 1358985600000}, "grade": "A", "score": 10}, {"date": {"$date": 1322006400000}, "grade": "A", "score": 9}, {"date": {"$date": 1299715200000}, "grade": "B", "score": 14}], "name": "Morris Park Bake Shop", "restaurant_id": "30075445"}

{"address": {"building": "469", "coord": [-73.961704, 40.662942], "street": "Flatbush Avenue", "zipcode": "11225"}, "borough": "Brooklyn", "cuisine": "Hamburgers", "grades": [{"date": {"$date": 1419897600000}, "grade": "A", "score": 8}, {"date": {"$date": 1404172800000}, "grade": "B", "score": 23}, {"date": {"$date": 1367280000000}, "grade": "A", "score": 12}, {"date": {"$date": 1336435200000}, "grade": "A", "score": 12}], "name": "Wendy'S", "restaurant_id": "30112340"}

In the next blog, we will show you an example of comparing JSON data with table data using the Query Compare Test Plan. Stay tuned.


Data Testing Automation for Salesforce using ETL Validator

Over the last few years, the Salesforce platform has become an incredible force in the market for various reasons. Of course, the most obvious use case is for the CRM capabilities. In addition, many organizations have started using the power of the force platform to build and deploy custom applications in the cloud at an incredibly fast pace.

While this journey is truly exciting, there will always be an underlying need to test the data and ensure it is always as expected. In this blog, I want to highlight a few use cases and how ETL Validator can help you address them.

Use Case 1: Comparing the data between a Salesforce Org and an OnPremise database.

Consider a simple order capture application that has been moved from an on-premise system to Salesforce. Also assume that the application has a few basic objects: Account, Product, Order, and Order Line Item. Now, after you move the data, a few questions emerge:

  1. Did I get ALL the accounts from On Premise application to the Salesforce app?
  2. What is the best way for me to compare the counts?
  3. Are all the records between Salesforce and On Premise system matching from a data integration perspective?
  4. Is referential integrity between accounts, orders and order line items properly maintained in Salesforce?

These are simple questions, but they are extremely important for building confidence in the data migration process to the Salesforce platform. Using ETL Validator, you can easily create connections to Salesforce and your on-premise database and execute the above tests in no time.
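As a rough illustration of what the first and fourth questions reduce to (the table names below are hypothetical; ETL Validator runs the Salesforce side of each comparison through its Salesforce connection rather than plain SQL):

-- Source-side account count, to be compared with the count of Account records in Salesforce
SELECT COUNT(*) FROM onprem_accounts;

-- Source-side referential integrity: order line items that do not point to a valid order
SELECT li.line_item_id
FROM onprem_order_line_items li
LEFT JOIN onprem_orders o ON li.order_id = o.order_id
WHERE o.order_id IS NULL;

The same orphan-record pattern, run against the migrated Order and Order Line Item objects, answers the referential integrity question on the Salesforce side.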

Use Case 2: Baselining the Product Catalog

Let’s say there are 1,000 products in the catalog and it is important to ensure that the metadata of this product catalog does not get accidentally modified. How would you do that?

In ETL Validator, you can baseline a table and then run tests on an ongoing basis to ensure that the data does not get accidentally modified. If it does, the platform can send out notifications listing the records that do not have the expected data.
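Conceptually, the baseline comparison boils down to a set-difference query like the one below (the table names are hypothetical; MINUS is Oracle syntax, EXCEPT elsewhere):

-- Baseline rows that have since changed or disappeared from the product catalog
SELECT product_id, product_name, list_price FROM product_baseline
MINUS
SELECT product_id, product_name, list_price FROM product_current;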

Use Case 3: Tracking Changes to Profiles, Roles, and Access Privileges

Over time, it is important to understand changes to profiles, roles, access privileges, and other metadata in the Salesforce platform, so that only the expected changes go in with each internal release and nothing slips through the cracks.

Using ETL Validator, you can take a baseline of the object metadata and then compare it over time. If there are any differences, ETL Validator can, as in the previous use case, send out notifications and alert administrators to unexpected changes.

To learn how to get started with Salesforce, check out the link below:

http://datagaps.freshdesk.com/support/solutions/articles/1000241490-connecting-to-salesforce


Database (Backend) Testing for Oracle R12 Upgrade

Oracle E-Business Suite (EBS) R12 is a significant new version with valuable new features and capabilities. Although there is an upgrade path from EBS 11i to R12, most companies reimplement R12 and migrate the data from their 11i instance. Reimplementation can be a complex project but it also gives them the option to improve their implementation.

When transitioning from EBS R12.1 to R12.2, companies generally perform an inplace upgrade. One of our customers was upgrading from EBS R12.1 to R12.2 and wanted to verify that the upgrade did not cause any issues with the data in their data warehouse. While testing the data warehouse and the dashboards can help identify data issues during the upgrade, it is also important to test the data in the EBS R12 instance from the backend. This type of testing is called database testing.

Database Testing for Reimplementation (eg. 11i to R12 transition)

Database (or backend) testing for a reimplementation project is similar to testing data migration projects, where data is migrated from a legacy application to a new application.

The main goals of reimplementation testing are:

  • Verify that the data has been fully migrated from the 11i to the R12 instance. Most customers want to perform 100% data validation, which may be required in regulated industries such as Pharma and Financial Services.
  • Verify the referential data integrity after the migration from the 11i to the R12 instance. For example, some records in a child table may get mapped to a different parent record or become orphan records during the migration.
  • In case of any data cleanup during the migration, verify that the R12 data follows the data accuracy and consistency rules laid out for the cleanup effort.

Our ETL and data testing tool, ETL Validator, comes with several types of test cases and test plans for simplifying and automating database testing for a reimplementation project (or data migration testing):

  • Query Compare Test Case : Makes it easy to compare large volumes of data between 11i and R12 databases (backend).
  • Data Profile Test Case : Compares the aggregates (or checksum) for the data between 11i and R12 databases.
  • Foreign Key Test Plan : Validates the referential integrity of the migrated data in the R12 database (a representative orphan-record query is sketched after this list).
  • Data Rules Test Plan : Validates the migrated data for data accuracy and consistency rules in the R12 database.
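For example, the orphan-record queries behind a Foreign Key Test Plan follow a pattern like the one below (the table names are purely illustrative, not actual EBS tables):

-- Child rows whose parent record no longer exists after the migration
SELECT c.invoice_id
FROM r12_invoices c
LEFT JOIN r12_customers p ON c.customer_id = p.customer_id
WHERE p.customer_id IS NULL;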

Database Testing for Inplace Upgrades (eg. R12.1 to R12.2 upgrade)

When performing an inplace R12 upgrade, it is important to understand the impact of the upgrade on the data and to run regression tests on it. Another aspect of the testing is to verify the impact of the R12 upgrade on the ETL and the data warehouse. One way to test data for inplace upgrades is to take a snapshot of the data prior to the upgrade and compare it with the data after the upgrade. Depending on the volume of data in the tables, the snapshot can cover the entire data in the tables or just the results of SQL queries against the R12 database. Any differences found between the snapshot data and the post-upgrade data need to be analyzed and validated.
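A manual version of this snapshot-and-compare approach, using a hypothetical summary view and Oracle-style MINUS syntax, might look like this:

-- Before the upgrade: persist the result of a representative query as a snapshot
CREATE TABLE pre_upgrade_snapshot AS
SELECT org_id, invoice_count, total_amount FROM my_invoice_summary_view;

-- After the upgrade: any rows returned here differ from the snapshot and need analysis
-- (run the comparison in both directions to also catch rows added after the upgrade)
SELECT org_id, invoice_count, total_amount FROM pre_upgrade_snapshot
MINUS
SELECT org_id, invoice_count, total_amount FROM my_invoice_summary_view;

ETL Validator's Baseline & Compare capability, described next, automates this snapshot-and-compare pattern.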

ETL Validator can be used for database (or backend) testing of R12 inplace upgrades using its Baseline & Compare capabilities. Baseline and Compare is a unique feature of ETL Validator that allows the user to take a snapshot of database query results and compare the latest results with the snapshot to identify differences. ETL Validator also provides a Baseline and Compare wizard to automatically generate test cases for multiple tables in one easy-to-use wizard.