DataOps Suite: Validate Rest API Response

DataOps Suite supports data validation for many kinds of data sources, including REST APIs.

DataFlow is a powerful application that lets you automate a data migration process end to end. A DataFlow is built from different kinds of components, each serving a different purpose. One of them is the Code component, which supports three languages: Spark SQL, Scala, and Python. Using the Code component, you can write queries or code on top of the datasets created in the current DataFlow, which gives you the flexibility to perform DML-style operations on the existing datasets. This document covers reading data from a REST API, converting it into a dataset, and comparing it with another dataset read from a file.

Please go through the following steps:

    1. On the left menu, select ‘DataFlows’.
    2. Click the ‘New dataflow’ button in the top right corner.

DataOps-Suite Validation of Rest API data-1.jpg

    3. In the ‘New Dataflow’ dialog, fill in the details and save.
      1. Name: Any name that identifies the dataflow.
      2. Livy server: The application ships with a default Livy server. Select any configured Livy server.

DataOps-Suite Validation of Rest API data-2

    4. Since this is the first component, the full list of components is displayed (as shown in the image below). Select the Code component from the Processor bucket.

DataOps-Suite Validation of Rest API data-3

    5. A new Code component opens with the ‘Properties’ tab selected. Fill in the details and then go to the next step –
      1. Name: Any name.
      2. Dependency: Not required, as this is the first component.
      3. Description: Optional. Any useful information about the Code component.
      4. Dataset name: Name of the dataset you want to create using the code. You can give multiple names, separated by commas.

    6. Select the kind. Scala is selected by default. The Code component supports Scala, Python and SparkR. Clicking the ‘Sample API code’ button at the top right populates sample code that reads data from a REST API. The sample code used in this example is shown below.

DataOps-Suite Validation of Rest API data-5
Code:

// Needed for the toDS() conversion below.
import spark.implicits._

// Read the REST API response into a JSON string.
val jsonStr = scala.io.Source.fromURL("http://192.168.6.42:9080/DataPrepRest/api/v1.0/templates/table?containerId=81&userName=sh&password=******&url=jdbc:oracle:thin:@192.168.6.76:1521:orcl&schema=sh&table=customers").mkString

// Convert the JSON string into a Spark DataFrame.
val df = spark.read.json(Seq(jsonStr).toDS())

// Register the DataFrame as a dataset named code_ds and cache it.
df.createOrReplaceTempView("code_ds")
df.cache()
After this code executes, a dataset named code_ds is created. You can write multiple such blocks of code and create multiple datasets; each dataset name must also be listed in the Properties tab described in step 5.
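
As an illustration of the "multiple datasets" point above, the following sketch derives a second dataset from code_ds with Spark SQL inside the same Code component. This is only a sketch: the column name CUST_FIRST_NAME is hypothetical and depends on the columns your REST API actually returns.

Code:

// Derive a second dataset from code_ds using Spark SQL.
// CUST_FIRST_NAME is a hypothetical column name.
val codeDsFiltered = spark.sql("SELECT * FROM code_ds WHERE CUST_FIRST_NAME IS NOT NULL")

// Register it as an additional dataset and cache it.
// Remember to also list this name in the Properties tab.
codeDsFiltered.createOrReplaceTempView("code_ds_filtered")
codeDsFiltered.cache()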

    7. Create a new File component. Fill in the details and move to the next step –
      1. Name: Any name.
      2. Data Source: The list of file data sources is shown here. Select a data source.
      3. Dependency: Not needed in this case.
      4. Description: Optional. Basic information about the component.
      5. Dataset name: A File component creates only one dataset. A default name is populated based on the component name; you can change it to any name you like.

DataOps-Suite Validation of Rest API data-8

    8. In the File step, fill in the details as shown below –
      1. File Name: The name of the file you want to read. Enter it manually or pick a file from the Files panel on the right.
      2. Encode: Optional (file encoding type).
      3. Options: Spark file read options. The most important options are pre-populated with default values (see the Spark sketch after the screenshot below). For more information, see
        https://docs.databricks.com/data/data-sources/read-csv.html.

DataOps-Suite Validation of Rest API data-9
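
For reference, these options map directly onto Spark's DataFrameReader. The following is a minimal sketch only; the file name customers.csv is a placeholder, not a file from this example.

Code:

// Read a CSV file with a few common Spark options.
val fileDf = spark.read
  .option("header", "true")       // first line contains column names
  .option("inferSchema", "true")  // let Spark infer the column data types
  .option("delimiter", ",")       // field separator
  .csv("customers.csv")           // placeholder file name

// Register the result so it can be used as a dataset in later components.
fileDf.createOrReplaceTempView("file_ds")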

    9. code_ds is the dataset created by reading REST API data in the Code component. By default, all of its columns are read as strings. If you want to change these data types or the column names, use the Attribute component; it can change the data types of any dataset. Create a new Attribute component using Add component (a conceptual Spark sketch of the same conversion follows the screenshot below).


Fill in the details –

      1. Name: Any name.
      2. Source Dataset: The dataset whose data types and column names you want to change. In this example, we select code_ds, the output of the Code component.
      3. Dependency: Because this component can run only after the code_ds dataset has been created by the Code component, add the Code component to the dependency list.
      4. Description: Optional (description of the component).
      5. Dataset Name: Output dataset name. After the data types and column names are converted, a new dataset is created as the output. The default is the component name.

DataOps-Suite Validation of Rest API data-11
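
The data type and column name changes performed by the Attribute component are conceptually equivalent to a cast-and-rename in Spark. A minimal sketch, assuming hypothetical column names CUST_ID and CUST_CREDIT_LIMIT that come back from the REST API as strings:

Code:

import org.apache.spark.sql.functions.col

// Cast string columns to proper types and rename one column.
// CUST_ID and CUST_CREDIT_LIMIT are hypothetical column names.
val typedDf = spark.table("code_ds")
  .withColumn("CUST_ID", col("CUST_ID").cast("int"))
  .withColumn("CUST_CREDIT_LIMIT", col("CUST_CREDIT_LIMIT").cast("double"))
  .withColumnRenamed("CUST_CREDIT_LIMIT", "CREDIT_LIMIT")

// Register the converted dataset under a new name.
typedDf.createOrReplaceTempView("attribute_ds")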

    10. In the Rename step, enter the desired column names and data types.

DataOps-Suite Validation of Rest API data-12

    11. Save and Run the component.
    12. Now click Add component and select ‘Data Compare’ from the ‘Data Quality’ bucket, as shown below –

DataOps-Suite Validation of Rest API data-13

    13. The ‘Data Compare’ component takes two datasets as input and compares them. Fill in the details –
      1. Name: Any name.
      2. Dataset A: In this example, the output of the Attribute component.
      3. Dataset B: In this example, the output of the File component.
      4. Dependency: This example consumes datasets from the File component and the Attribute component, so add both as dependencies.
      5. Compare type: The comparison types you want to run.
      6. Description: Description of the component.
      7. Dataset Name: A default name is populated.

DataOps-Suite Validation of Rest API data-14

    14. In the Mapping step, the columns of dataset A and dataset B are mapped by position by default. If required, click the “Remap by Name” button to remap them by column name. You can also select unique key columns, in which case the comparison is based on those keys; multiple key columns are allowed. Move to the next step once the changes are done.

DataOps-Suite Validation of Rest API data-15

    15. Run the component. Each component executes as a set of statements; the Data Compare component contains more statements than most, and the progress of the execution can be seen in the ‘Run’ tab at the bottom. After the run completes, you can see the component results as shown in the following images. At the bottom are the passed and failed statements, covering duplicate counts, rows only in Dataset A, rows only in Dataset B, differences, and so on. There is one statement per calculation, and clicking a statement's link shows its details (a conceptual Spark sketch of these calculations follows the screenshot below).

DataOps-Suite Validation of Rest API data-16
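
For intuition, the calculations the Data Compare component reports can be expressed with plain Spark operations. This is only a sketch: it assumes the two datasets were registered as attribute_ds and file_ds (placeholder names), that both have the same columns and data types, and that CUST_ID is a hypothetical key column.

Code:

import org.apache.spark.sql.functions.col

val a = spark.table("attribute_ds")  // output of the Attribute component
val b = spark.table("file_ds")       // output of the File component

// Duplicate calculation: full rows that occur more than once in dataset A.
val duplicatesInA = a.groupBy(a.columns.map(col): _*).count().filter(col("count") > 1)

// Rows only in Dataset A / only in Dataset B (full-row comparison).
val onlyInA = a.except(b)
val onlyInB = b.except(a)

// Differences: key values present on both sides whose remaining values differ.
val differingKeys = onlyInA.select("CUST_ID").intersect(onlyInB.select("CUST_ID"))

println(s"duplicates=${duplicatesInA.count()}, onlyInA=${onlyInA.count()}, " +
  s"onlyInB=${onlyInB.count()}, differences=${differingKeys.count()}")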

    16. Clicking the difference count statement opens a pop-up window, as shown in the following image.

DataOps-Suite Validation of Rest API data-17

    17. The design of the dataflow is now complete. This is a one-time step; you can run the dataflow whenever you want by clicking the “Run dataflow” button at the top of the dataflow window. The following window then opens –

DataOps-Suite Validation of Rest API data-18
This diagram is generated from the dependencies you defined and shows the execution order of the components. Each color indicates the progress of a component –

    • Green: Completed successfully; the status is passed.
    • Blue: In queue.
    • Yellow: Running.
    • Red: Completed, but the status is failure.
Data Validation for Snowflake

Snowflake Data Migration

Migrating to Snowflake Cloud Data Warehouse

Migration from a legacy data warehouse such as Netezza to a cloud-based Snowflake data warehouse requires multiple steps. Data validation is the key to the success of data migration projects. Datagaps Data Flow can be used to validate data at each step of the migration as well as in end-to-end data validation scenarios.

Step 1: Extract data from Legacy data warehouse

Data is typically extracted in CSV or Parquet format and moved to a landing zone in AWS S3. Depending on the data volumes, AWS offers multiple options for moving the files to S3. Once the data has been moved to AWS S3, data validations need to be performed to ensure that all the data was properly extracted and migrated. Since there is not much transformation in this step, these tests are typically one-to-one comparisons of the tables in the legacy data warehouse with the files in the AWS S3 landing zone:

– Compare table to file row counts
– Compare data encoding
– Compare data completeness
– Compare data values

A sample test case diagram is shown to the right. The JDBC component can be used to read data from the legacy data warehouse, the File component to read data from AWS S3, and the Data Compare component to compare the two datasets. Sample output for a Data Compare component is shown below, followed by a rough Spark sketch of the same comparison.

Data comparison test case
Output of data comparison
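
As a rough illustration of what such a test does, the sketch below reads one table over JDBC and the corresponding extract from S3, then compares row counts and values with Spark. The connection URL, credentials, table name, and S3 path are placeholders, not values from this article.

Code:

// Read the source table from the legacy warehouse over JDBC (placeholder details).
val legacyDf = spark.read.format("jdbc")
  .option("url", "jdbc:netezza://legacy-host:5480/SALESDB")
  .option("dbtable", "CUSTOMERS")
  .option("user", "etl_user")
  .option("password", "********")
  .load()

// Read the extracted files from the S3 landing zone (placeholder path).
// inferSchema keeps the column types comparable with the JDBC read.
val landingDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://landing-zone/customers/")

// Compare table-to-file row counts.
println(s"legacy=${legacyDf.count()} landing=${landingDf.count()}")

// Compare data values: rows present on one side but not the other
// (assumes the two schemas line up column for column).
println(s"missing in landing zone: ${legacyDf.except(landingDf).count()}")
println(s"extra in landing zone:   ${landingDf.except(legacyDf).count()}")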

Step 2: Transform data

Transformations such as data type conversions can be performed in this step. Data curation can also be done to improve data quality before the data is loaded into Snowflake. Before curating the data, it is important to profile it and run data quality tests to identify data quality issues. Data Flow can be used to perform these tasks:

– Compare data between landing zone and staging (curated) zone in S3
– Use Data Profile and Data Rules components to identify data quality issues
– Curate data and sync to the staging zone

Data Profile component
Data Rules component
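
The kinds of checks a profiling step performs can be approximated directly in Spark. A minimal sketch, assuming the curated data has been registered as a dataset named staging_ds and using a hypothetical column CUST_EMAIL (both are placeholder names):

Code:

import org.apache.spark.sql.functions.{col, count, countDistinct, lit, sum, when}

val stagingDf = spark.table("staging_ds")  // placeholder dataset name

// Simple profile: row count, distinct values, and null count for one column.
stagingDf.agg(
  count(lit(1)).as("row_count"),
  countDistinct(col("CUST_EMAIL")).as("distinct_emails"),
  sum(when(col("CUST_EMAIL").isNull, 1).otherwise(0)).as("null_emails")
).show()

// Simple data rule: email values must contain an '@' character.
val badEmails = stagingDf.filter(!col("CUST_EMAIL").contains("@"))
println(s"rows violating the email rule: ${badEmails.count()}")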

Step 3: Copy data to Snowflake

Assuming that the Snowflake tables have been created, the last step is to copy the data into Snowflake. Use the VALIDATE function to validate the loaded data files and identify any errors. Data Flow can be used to compare the data between the staging zone (S3) files and Snowflake after the load:

– Compare table to file row counts
– Compare data encoding
– Compare data completeness
– Compare data values
– End-to-end data validation (Legacy data warehouse to Snowflake)

Data Flow can be used to perform end-to-end data validation in a single test, as shown to the right. A single data flow can compare data between the legacy data warehouse and S3 as well as between the legacy data warehouse and Snowflake.

End-to-end test case
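
One way to sketch the Snowflake side of such a comparison is with the Snowflake Spark connector. The sfURL/sfUser/sfPassword/sfDatabase/sfSchema/sfWarehouse options below are the connector's documented parameters, but every value shown (account, credentials, table name, S3 path) is a placeholder, not a value from this article.

Code:

// Snowflake connection options (all values are placeholders).
val sfOptions = Map(
  "sfURL"       -> "xy12345.snowflakecomputing.com",
  "sfUser"      -> "etl_user",
  "sfPassword"  -> "********",
  "sfDatabase"  -> "SALESDB",
  "sfSchema"    -> "PUBLIC",
  "sfWarehouse" -> "LOAD_WH")

// Read the loaded table from Snowflake.
val snowflakeDf = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "CUSTOMERS")
  .load()

// Read the curated files from the S3 staging zone (placeholder path).
val stagingDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://staging-zone/customers/")

// Row count and value comparison, as in the earlier steps
// (except() assumes the two schemas line up column for column).
println(s"snowflake=${snowflakeDf.count()} staging=${stagingDf.count()}")
println(s"differences=${snowflakeDf.except(stagingDf).count()}")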

Step 4: Modify reports to use Snowflake

While Snowflake provides JDBC/ODBC drivers and supports most of the commonly used SQL functions, there are going to be some differences between the way reports are developed and executed against the legacy data warehouse and against Snowflake. Once these changes are made, thorough testing needs to be performed between the reports using the legacy data warehouse and the equivalent reports using Snowflake:

– Compare report data
– Compare report layout
– Compare report performance
– Stress test reports in the new environments by simulating concurrent user loads
– Compare security

Datagaps BI Validator is a no-code BI testing tool that can help automate all of these tests for the supported BI tools.

Testing Type 2 Slowly Changing Dimensions using ETL Validator

Type 2 Slowly Changing Dimensions are used in data warehouses to track changes to data by preserving historical values. Whenever a value in the set of key columns is modified, a new record is created in the dimension, and start and end dates are maintained for each record. The latest records are identified either by querying for records that are not end-dated or by maintaining a flag (e.g. current_flg) to easily identify them.

Testing SCD Type 2 Dimensions

Testing SCD Type 2 dimensions is tricky because it cannot be achieved by a simple comparison of the source and target data. In this article we will examine the different aspects of a Type 2 SCD that can be tested using ETL Validator.

For the sake of this article, let’s consider an Employee dimension (EMPLOYEE_DIM) of SCD Type 2 which is sourced from a table called EMPLOYEE in the source system.

EMPLOYEE Table has the following columns
ROW_ID
EMP_NO
FIRST_NAME
LAST_NAME
SSN
DOB
JOB_TITLE
SALARY

EMPLOYEE_DIM SCD Type 2 Dimension Table has the following columns
ROW_WID
EMP_NO
FIRST_NAME
LAST_NAME
SSN
DOB
JOB_TITLE
SALARY
START_DT
END_DT
CURRENT_FLG

Test 1: Verifying the Current Data

Use a Query Compare test case in ETL Validator to compare the current data records in the SCD Type 2 EMPLOYEE_DIM with the data in the source EMPLOYEE table.

Source Query : select EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY from EMPLOYEE

Target Query : select EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY from EMPLOYEE_DIM where CURRENT_FLG = 'Y'

Test 2: Verifying the uniqueness of the key columns in the SCD

The combination of the key columns in the SCD should be unique. In the above example, the columns EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY together form a unique key in the EMPLOYEE_DIM dimension. This can be easily verified using the Duplicate Check Rule in the Data Rules test plan of ETL Validator. The query generated by ETL Validator for the Duplicate Check Rule looks similar to the following:

Select EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY, COUNT(*) CNT from EMPLOYEE_DIM group by EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY having COUNT(*)>1

This query should not return any rows.

Test 3: Verifying that historical data is preserved and new records are getting created

Whenever there is a change to the values in the key columns, a new record should be inserted into EMPLOYEE_DIM and the old record should be end-dated. ETL Validator's Component test case can be used to verify this functionality: its Baseline and Compare feature identifies the changes in EMPLOYEE_DIM.
Below are the steps:

  1. Create a Component test case and take a snapshot of the current values in EMPLOYEE_DIM (called the Baseline).
  2. Modify a few records in the source EMPLOYEE table by updating values in the key columns, such as SALARY or LAST_NAME.
  3. Execute the ETL process so that EMPLOYEE_DIM has the latest data.
  4. Run the Component test case to compare the Baseline data with the result table and identify the differences. Verify that the differences are as expected.

ETL Validator thus provides a complete framework for automating the testing of SCD Type 2 dimensions.

Connecting to Mongo DB using Apache Drill in ETL Validator

Over the last few years, we have been working with a number of customers across various industries such as Life Sciences, Financial Services, and Higher Education. While the problems related to data testing are similar across industries, and JSON and flat files are very common, there are many differences in file formats. In some industries Avro is popular, while in others, depending on the use case, Parquet is more commonly used.

At Datagaps, one of the key challenges is our ability to empower customers to test any of these formats with ease. Of course, we have the option to build native connections for each format, but that may not always be the best option. We try to leverage open standards and open source frameworks to support our customers, and this is where Apache Drill (https://drill.apache.org/) comes into play.

Over the last 6 months, we have been drilling around a bit and absolutely love the speed and flexibility that Apache Drill provides. As of version 3.4.5, we use Drill as the interface between ETL Validator and any of the file formats mentioned above (except flat files, since flat files are much more common and deserve native connectors). In this blog, I want to take a few minutes to explain how easy it is to get started with Drill and integrate it with ETL Validator.

Assumptions

1. You are running this proof of concept on Windows.

2. You have ETL Validator Complete 3.4.5 or higher installed on the machine.

3. You have Mongo running on the same machine.

Apache Drill Installation

Step 1: Go to https://drill.apache.org/ and get the latest version of Drill. In this blog, we assume Drill version 1.9.

Step 2: Using 7z or a similar tool, un-archive the file.

Step 3: Navigate to the bin folder of Apache Drill and execute the command “sqlline -u jdbc:drill:zk=local”. This starts the Apache Drill server, and you are then ready to define your Mongo database. Once Drill has started successfully, you should see something like this:

Step 4: Launch the Drill web interface at https://localhost:8047. This interface should be available once the command in step 3 has run successfully, and should look like this:

Step 5: Navigate to the “Storage” tab and enable the Mongo storage plugin in Apache Drill. Since Mongo is running on the same machine, you do not have to make any changes. Your Mongo setting should look like this:

Setting up Mongo connection in ETL Validator using Drill

Step 1: Launch ETL Validator

Step 2: Create a new “Mongo through Drill” Connection.

  • Name: Can be anything you want.
  • User Name & Password: You can enter null/null if there is no username/password. This is a known issue and will be fixed in upcoming versions.
  • Connection Type: Leave it as Auto.
  • Host Name: You can provide localhost.
  • Port: Leave the default port on which Drill is running.
  • Retrieve and select a schema from Mongo.

Step 3: Test the Connection.

Now you are ready to execute Query Comparison or other test plans in ETL Validator. A few sample queries are below. Depending on the complexity of the JSON structure, you may need more time to work out the right query.

  • SELECT * from mongo.test.`addresses`;
  • SELECT ad.address.building FROM mongo.test.addresses as ad;
  • SELECT ad.address.coord FROM mongo.test.addresses as ad;

The above queries assume that the JSON documents are as follows and reside in the mongo.test schema.

{"address": {"building": "1007", "coord": [-73.856077, 40.848447], "street": "Morris Park Ave", "zipcode": "10462"}, "borough": "Bronx", "cuisine": "Bakery", "grades": [{"date": {"$date": 1393804800000}, "grade": "A", "score": 2}, {"date": {"$date": 1378857600000}, "grade": "A", "score": 6}, {"date": {"$date": 1358985600000}, "grade": "A", "score": 10}, {"date": {"$date": 1322006400000}, "grade": "A", "score": 9}, {"date": {"$date": 1299715200000}, "grade": "B", "score": 14}], "name": "Morris Park Bake Shop", "restaurant_id": "30075445"}

{"address": {"building": "469", "coord": [-73.961704, 40.662942], "street": "Flatbush Avenue", "zipcode": "11225"}, "borough": "Brooklyn", "cuisine": "Hamburgers", "grades": [{"date": {"$date": 1419897600000}, "grade": "A", "score": 8}, {"date": {"$date": 1404172800000}, "grade": "B", "score": 23}, {"date": {"$date": 1367280000000}, "grade": "A", "score": 12}, {"date": {"$date": 1336435200000}, "grade": "A", "score": 12}], "name": "Wendy'S", "restaurant_id": "30112340"}
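
For readers who want to try the same queries outside ETL Validator, Drill also exposes a standard JDBC interface. The following is a minimal sketch, assuming Drill is running locally in embedded mode (as started in step 3 above) and the Drill JDBC driver jar is on the classpath.

Code:

import java.sql.DriverManager

// Make sure the Drill JDBC driver is registered.
Class.forName("org.apache.drill.jdbc.Driver")

// Connect to the local Drill instance (same URL used by sqlline above).
val conn = DriverManager.getConnection("jdbc:drill:zk=local")
val stmt = conn.createStatement()

// Run one of the sample queries against the Mongo storage plugin.
val rs = stmt.executeQuery("SELECT ad.address.building FROM mongo.test.addresses AS ad")
while (rs.next()) {
  println(rs.getString(1))
}

rs.close(); stmt.close(); conn.close()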

In the next blog, we will show you an example of comparing JSON data with table data using the Query Compare Test Plan. Stay tuned.

7 Incredible Test Automation Plans for Tableau using BI Validator

It is nice to see Tableau maintain its leadership position in Gartner’s Magic Quadrant for the 3rd consecutive year.

At Datagaps, this is pretty important for us, as the test automation needs of various BI platforms follow a similar trend. It is interesting that we do not see the traditional BI players in any of the quadrants, and this picture is reflective of the number of BI Validator downloads we are seeing on a regular basis. Based on customer demand and our experience supporting other BI platforms, we currently support the following test plans for Tableau.

Test Plan #1: Regression Testing of Workbooks / Views

Business analysts and testing teams can baseline and compare workbook PDF snapshots to identify data and layout differences. This drastically cuts down the time it takes to test for regressions in Tableau workbooks. Similar to the baseline and comparison of workbooks, PDF snapshots of individual views can be compared as well, identifying visual as well as data differences with ease.

Test Plan #2: Upgrade Testing

This test plan in BI Validator empowers testing teams to compare pre- and post-upgrade workbooks and views, or to compare them across environments, with ease. This is helpful for out-of-place upgrades, when teams want to compare a QA environment to a production environment, and similar use cases.

Test Plan #3: Migration Testing

The Magic Quadrant has been truly magical over the last few years. Customer preferences have been changing at a rapid pace, and so has the landscape. When I saw the above image a few weeks back, I was surprised to see that the traditional BI players do not even show up on the map. Many customers have already deployed Tableau at enterprise scale, or are at least in the process of choosing BI platforms such as Tableau or Qlik for their ease of rollout and other factors.

To aid this migration, BI Validator can come in extremely handy because you can easily compare the data of reports generated by OBIEE, Business Objects, or Cognos with Tableau. This minimizes the risks associated with large-scale migration projects.

Test Plan #4: Performance Testing

Often, it is important to understand how a specific report performs over time. A performance test plan can be defined to ensure that the report responds within an acceptable time whenever end users try to garner insights from it in a production environment.

Test Plan #5: Security Testing

During internal rollouts or upgrades, the access levels of projects, workbooks, and views may unexpectedly change, and it is important to catch this early in the development life cycle to prevent compliance issues or other operational oversights. This test plan can baseline and compare the access levels of Tableau users and groups.

Test Plan #6: Stress Testing

This is an amazing test plan that can eliminate the need for tools such as JMeter and LoadRunner. No more scripting: you can easily simulate concurrent user load and gather view performance statistics under varying user loads, all from a simple user interface.

Test Plan #7: End to End Testing

We advise our customers to do end-to-end testing where applicable. Using the Data Compare test plan, you can connect to your database, write a query, and then compare the result set to the data in a Tableau view in a matter of minutes. This is truly remarkable.

With these 7 test plans, you are already way ahead in your test automation journey for the Tableau platform. So why wait? Download BI Validator and test drive it today. World-class support is just an email away (contact@datagaps.com).