DataOps Suite Sandbox
Getting Started – Sandbox Documentation.

Introduction to DataOps Suite Sandbox
Welcome to the DataOps Suite Sandbox. Before diving into your specific use cases, we highly recommend going through the video below and this documentation. This introduction starts with an overview of how test cases are created from scratch, followed by a few exercises. The solutions to the exercises are already present in the Sandbox for reference.
ETL Testcase Creation Flow
DataOps Suite has a full-fledged ETL tool as its backbone to support its ETL Validation capabilities, so it helps to look at the product as akin to typical ETL tools.
First, a source and a target connection need to be defined so that datasets can be pulled into and pushed out of the Suite.
These connections are defined as “Sources” in the Sources Tab and can range from JDBC Connections to Flat Files to APIs and much more.
Once the connections are defined, data can be pulled into the Suite and then transformed, tested, compared, profiled, analyzed, and/or pushed back into a different target. This is achieved in the Dataflows Tab. After test cases (dataflows) are created, they can be bundled together, sequenced, and run as Pipelines in the Pipelines Tab. The Scheduler lets you schedule pipeline runs. All dataflow and pipeline runs are recorded in a History section so you can look back at previous runs.
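To ground the ETL analogy, here is a minimal conceptual sketch of the same source → transform/test → target flow written in plain Python. This is not the Suite's API: the connection strings, the table name, and the "CustomerKey" column are assumptions used only to illustrate what the Sources and Dataflows Tabs model for you.

```python
# Conceptual sketch only -- NOT the Suite's API. Connection strings, the table
# name, and the "CustomerKey" column are assumed purely for illustration.
import pandas as pd
import sqlalchemy

# "Sources" step: define source and target connections.
oracle = sqlalchemy.create_engine("oracle+oracledb://user:pwd@oracle-host:1521/?service_name=ORCL")
sqlserver = sqlalchemy.create_engine("mssql+pyodbc://user:pwd@sqlserver_dsn")

# "Dataflow" step: pull the dataset, then test/profile it.
df = pd.read_sql("SELECT * FROM DimCustomer", oracle)
null_keys = df["CustomerKey"].isna().sum()      # a simple data quality test
print(f"Rows: {len(df)}, null CustomerKey values: {null_keys}")

# Target step: push the dataset to a different connection.
df.to_sql("DimCustomer_copy", sqlserver, if_exists="replace", index=False)
```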
Many more application uses build on this foundation, such as:
- REST API Connections to DataOps Suite for automation and integration
- Parameterization
- Plugins
- Password Vault
- CLI Commands
- In-Suite Wizards for Automation of Test Case Creation
For these, we highly recommend tinkering with the application, going through the documentation, and connecting with our support desk (accessible via the application). We also provide a full-fledged training certification for the Suite.
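As one illustration of the automation and integration point above, a pipeline run could be triggered from a CI job over REST. The sketch below is hypothetical: the base URL, endpoint path, payload fields, and authentication scheme are assumptions, so refer to the in-app REST API documentation for the actual calls.

```python
# Hypothetical sketch of triggering a pipeline over REST from a CI job.
# The base URL, endpoint path, payload, and auth header are ASSUMED -- refer
# to the Suite's REST API documentation for the real interface.
import requests

BASE_URL = "https://dataops.example.com/api"       # assumed base URL
API_TOKEN = "replace-with-your-token"              # assumed token-based auth

response = requests.post(
    f"{BASE_URL}/pipelines/run",                   # assumed endpoint
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"pipelineName": "MigrationValidation"},  # assumed payload field
    timeout=60,
)
response.raise_for_status()
print("Pipeline run started:", response.json())
```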
Sandbox Exercises
Before testing specific use cases, kindly attempt the following exercises in the order provided to understand the Suite better. The exercises are based on the business use cases below.
Our architecture involves a SQL Server, an Oracle DB, and an AWS S3 bucket as our three main data sources. We want to perform the following validations –
- The Dataset “DimCustomer” is being Migrated from Oracle to SQL Server.
- Validate the Metadata of the dataset between Oracle and SQL Server.
- Validate all records between the two datasets to ensure data has been migrated correctly.
- The records in the file “Customers.csv” in the S3 location have to be validated for the following –
- Null checks on Primary key
- Check for duplicates in the Primary key
- Validate if the Year of Birth is between 1930 and 1990
- Validate the Email Addresses to contain “@company.com”
- Validate that the average credit limit is below 6500
- ETL validation (White Box Testing) – A developer has updated the DimEmployee Table with the logic as directed by the business team. You need to validate the transformation using the white box testing technique.
- You are provided with the data dictionary of a table
- You are provided with the transformation logic
- You are provided with the updated file
- Validate the transformation by re-creating the transformation
Source Validation
Note: The first step of the process, the creation of Sources, is already done in the Sandbox.
Before proceeding, kindly validate your Data Sources. To do so –
- Go to the “Sources” Tab
- Click on “Data Source(s)”
- Click on each source and click on “Test” (top-right)
- This will validate the connection
After this, the exercises are focused on the “Dataflows” tab. Here you should see three flows already built as reference solutions for the exercises.
Exercise 1 – Migration Validation
Context – Validation of migrated assets is a primary use case in many migrations and DQ processes.
The Dimension Table “DimProduct” was transferred from an Oracle DB to a SQL Server. We first need to validate the metadata of both tables; then we need to validate all the records of the two tables against each other.
Create a new dataflow within the Dataflow Tab using the “Create Dataflow” button (top-right).
Nodes used – JDBC Connector, Metadata Compare, Data Compare
Steps –
- First, let’s compare the metadata of the table in Oracle and SQL Server.
- Add the Metadata Compare Node to the tray
- Select the Sources as intended
- In the “Objects” tray, drag and drop only “DimProduct”
- Its counterpart should automatically be found and added
- Click “Run” to see the results. This wraps up the metadata comparison part.
- Next, let’s pull the dataset DimProduct from the Oracle DataSource.
- Add the JDBC node to the tray
- Select Oracle as the Data Source
- In the Query Window, either use the query builder, drag the table from the metadata window after searching for it, or type out the required query
- You can preview the results and check the row count of your query before clicking Run
- This dataset is now loaded
- Repeat the process for the “DimProduct” from SQL Server
- Now, we do the data comparison test –
- Add the “Data Compare” Node. Make sure this node is dependent on both the JDBC connectors.
- Select the two datasets from the two distinct data sources
- Update the options based on the metadata results. Note – Since there are inherent differences between the databases, if you want to validate the data while converting the data types to a common middle ground, define those conversions and other options here and in the Mapping Tab
- In the Mapping Tab, make the necessary datatype conversions (in most cases, simple conversion to integer or double would suffice). Remove the columns which you don’t want to compare.
- Run the Node to see the results. (A conceptual sketch of this record-level comparison is shown after these steps.)
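For reference, the record-level comparison that the Data Compare node performs can be sketched outside the Suite roughly as follows. This assumes both tables have already been loaded into pandas DataFrames and that the key column is named "ProductKey" (an assumption based on the table name); it mirrors the column removals and datatype mappings described above.

```python
# Rough, illustrative equivalent of the Data Compare step. The "ProductKey"
# key column and the float cast for "Weight" are assumptions; adjust them to
# the actual metadata. Nulls on either side count as mismatches here.
import pandas as pd

def compare_tables(oracle_df: pd.DataFrame, sql_df: pd.DataFrame, key: str = "ProductKey") -> None:
    # Drop the columns we chose not to compare in the Mapping Tab.
    oracle_df = oracle_df.drop(columns=["LargePhoto", "FinishedGoodsFlag"], errors="ignore")
    sql_df = sql_df.drop(columns=["LargePhoto", "FinishedGoodsFlag"], errors="ignore")

    # Cast numeric columns to a common type, like the datatype mapping.
    for col in ("Weight",):
        oracle_df[col] = oracle_df[col].astype(float)
        sql_df[col] = sql_df[col].astype(float)

    # Full outer join on the key, then count cell-level mismatches per column.
    common = [c for c in oracle_df.columns if c in sql_df.columns and c != key]
    merged = oracle_df.merge(sql_df, on=key, how="outer", suffixes=("_ora", "_sql"))
    for col in common:
        mismatches = (merged[f"{col}_ora"] != merged[f"{col}_sql"]).sum()
        print(f"{col}: {mismatches} mismatched records")
```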
Results –
- The metadata is very different and filled with mismatches. This is expected when working with different databases, as auto-import behavior and database settings change how the data is defined.
- The datatype differences and the missing columns between the two sets cause many mismatches to be reported.
- After using the advanced functions to map a few columns to integers and doubles, and removing the “LargePhoto” and “FinishedGoodsFlag” columns, we get a full match.
- We removed the flag column because its “true/false” values would need to be converted to “1/0”. This is possible using the custom expression in the mapping (a small example follows this list); for the sake of simplicity, we remove this column for now.
- We removed “LargePhoto” because the column does not exist in the SQL Server table.
- We see that the “Weight” column has 14 mismatched records. Note that these mismatched records can be pulled as a dataset in subsequent nodes for data reconciliation, processing, or sinking.
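For reference, the kind of conversion such a custom mapping expression would perform – normalising one side's “true/false” flags to the other side's “1/0” before comparing – looks conceptually like this (the sample values are made up):

```python
# Illustrative only: normalise "true"/"false" flag values to 1/0 so the two
# sides can be compared. The sample values are made up.
import pandas as pd

flags = pd.Series(["true", "false", "true"])
as_ints = flags.str.lower().map({"true": 1, "false": 0})
print(as_ints.tolist())   # [1, 0, 1]
```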
Exercise 2 – Data Quality
Context – Validation of the data based on integrity, completeness, business requirements, and logical checks is part of any type of data validation test case. One might want to validate a view, a function, the response from APIs, or any other type of data. In this example, we will validate the data present in the file “Customers.csv”.
We want to validate the following –
- Null checks on Primary key
- Check for duplicates in the Primary key
- Validate if the Year of Birth is between 1930 and 1990
- Validate the Email Addresses to contain “@company.com”
- Validate that the average credit limit is below 6500
Create a new dataflow within the Dataflow Tab using the “Create Dataflow” button (top-right).
Nodes used – File Connector, Data Rule
Steps –
- First, pull the file using the File Connector in Sources.
- Select AWS S3 as the source
- In the File Tab, browse and find the file “Customers.csv”
- Update the read parameters if needed
- Run the Node to load the dataset
- Now, we add the validation rules.
- Add the “Data Rule” Node. Make sure this node is dependent on the File connector.
- In the Rules Tab, add the rules as per the requirements. As a hint, the types of rules you will need for these business use cases are:
- Duplicate Check
- Attribute Check
- Metric Compare
- Run the Node to view the results and the bad records. (A conceptual sketch of these checks is shown after these steps.)
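For comparison, the same five rules can be expressed in a few lines of pandas. This is only a conceptual sketch: the column names “CustomerID”, “YearOfBirth”, “EmailAddress”, and “CreditLimit” are assumptions, so use the actual header names from Customers.csv.

```python
# Conceptual equivalent of the Data Rule node's checks. Column names are
# ASSUMED; adjust them to the actual header row of Customers.csv.
import pandas as pd

df = pd.read_csv("Customers.csv")

checks = {
    "Null check on primary key": df["CustomerID"].notna().all(),
    "Duplicate check on primary key": not df["CustomerID"].duplicated().any(),
    "Year of Birth between 1930 and 1990": df["YearOfBirth"].between(1930, 1990).all(),
    "Email contains @company.com": df["EmailAddress"].str.contains("@company.com", regex=False).all(),
    "Average credit limit below 6500": df["CreditLimit"].mean() < 6500,
}

for rule, passed in checks.items():
    print(f"{rule}: {'PASS' if passed else 'FAIL'}")
```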
Results –
- Most of the records pass the test with a score of 99.1
- The Year of Birth test fails because records from the 1920s exist in the dataset
- The Email Address test fails because “company2.com” addresses exist in the dataset
Exercise 3 – Transformation Validation
Context – Based on business requirements, the DimEmployee table has to be updated. The mappings are provided on the right. A developer has already implemented the changes, and the updated dataset is saved as a file named “Dim_Employee_Updated” in AWS S3. You are a QA tester and will perform white box testing to validate this transformation. Using the mapping spreadsheet, re-create the transformation on the raw data from SQL Server and compare your result to the updated CSV file to validate the transformation.
Create a new dataflow within the Dataflow Tab using the “Create Dataflow” button (top-right).
Nodes used – File Connector, JDBC Connector, SQL
Steps –
- First, using the JDBC Connection, create a node and connect to SQL Server
- Pull the DimEmployee Table
- Using the SQL or Code node, apply the transformation logic
- Add a File Connector and pull the file from the folder – “Dim_Employee_Updated”
- Add a Data Compare node that compares the Transformed Data and the Updated CSV File with the correct options. (A conceptual sketch of this check is shown after these steps.)
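A conceptual version of this white box check, outside the Suite, might look like the sketch below. The connection string, the “EmployeeKey” join column, the assumption that the documented logic rounds BaseRate, and reading the updated file locally are all illustrative assumptions.

```python
# Illustrative white box check: re-create the documented transformation and
# compare it to the developer's output. The connection string, "EmployeeKey"
# column, and the rounding logic are ASSUMED for illustration.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("mssql+pyodbc://user:pwd@sqlserver_dsn")
expected = pd.read_sql("SELECT * FROM DimEmployee", engine)
expected["BaseRate"] = expected["BaseRate"].round(0)     # re-create the documented logic (assumed: rounding)

updated = pd.read_csv("Dim_Employee_Updated.csv")        # developer's output, downloaded from S3

merged = expected.merge(updated, on="EmployeeKey", suffixes=("_expected", "_developer"))
mismatch = merged["BaseRate_expected"] != merged["BaseRate_developer"]
print(f"BaseRate mismatch rate: {mismatch.mean():.1%}")
```

A mismatch concentrated on rows whose decimal part is 0.5 or higher is the classic signature of a floor having been applied where a round was intended, which is exactly the finding in the results below.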
Results –
- We see a ~19% mismatch only on the BaseRate column.
- Further inspection reveals that the Developer likely used a “Floor” function instead of a “Round” function.
Data Quality Monitor
The sandbox also comes with a Data Model in the Data Quality Tab, with rules and test cases already defined, to help new users see how business rules and data quality test cases can be defined. New users are encouraged to explore DQM by –
- Creating a New Data Model
- Importing Tables for the Model
- Adding Rules to tables
- Importing and Exporting Rules
- Using the Wizards for automation of Rules Creation
These activities are left for the user as an exercise. The video can be used as a reference.
Conclusion
Thank you for completing the exercises. With the basics covered, you and your team should have a better understanding of the plethora of nodes and toolsets the application provides. Feel free to tinker with the environment to fit your specific use cases and explore all the nodes and applications. The application can also be connected to via CLI Commands, REST APIs, and integrated plugins. Our documentation, which can be easily accessed in the bottom left corner of the Suite, will provide your team with ample starting resources. The application comes with wizards, pipelines, and monitoring capabilities that new users can easily configure and deploy. If you need any specific examples, feel free to contact us directly.
In case of any questions or errors that your team cannot resolve, feel free to contact our support team either by email or through the support portal in the bottom left corner of the Suite.