
DataFlow is a powerful application that lets you automate a data migration process from end to end. DataFlow provides different kinds of components for different purposes; one of them is the Code component. It supports three languages: Spark SQL, Scala, and Python. Using the Code component, you can write queries or code on top of the datasets created in the current DataFlow, which gives you the flexibility to perform DML operations on the existing datasets. This document covers reading data from a REST API, converting it into a dataset, and comparing it with another dataset read from a file.
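As a quick illustration of what the Code component lets you do, the sketch below runs a Spark SQL query against a dataset that has already been registered in the DataFlow. It is only a sketch: code_ds is the dataset created later in this example, and the filter column is a placeholder.

// Minimal sketch: query a dataset registered earlier in the DataFlow.
// "code_ds" is created later in this example; "CUST_CITY" is a placeholder column name.
val filtered = spark.sql("SELECT * FROM code_ds WHERE CUST_CITY = 'Los Angeles'")
// Register the result so downstream components can use it as a new dataset.
filtered.createOrReplaceTempView("filtered_ds")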
Please go through the following steps:
- On the left menu, select ‘DataFlows’.
- Click the ‘New Dataflow’ button in the top right corner.
- In the ‘New Dataflow’ dialog, fill in the details and save.
- Name: Any name to identify the dataflow.
- Livy server: The application comes with a default Livy server; select any configured Livy server.
- During the first run, the full list of components is displayed (as shown in the image below). Select the Code component from the Processor bucket.
- A new Code component opens with the ‘Properties’ tab selected. Fill in the details and then go to the next step –
- Name: Any name.
- Dependency: Not required, as this is the first component.
- Description: Optional. You can give any useful information about the Code component.
- Dataset name: Name of the dataset you want to create using the code. You can give multiple names, separated by commas.
- In the Code step, select the kind. Scala is selected by default; the Code component supports Scala, Python, and SparkR. Clicking the ‘Sample API code’ button at the top right populates sample code that reads data from a REST API. A sample is provided below.
Code:
// Needed for the Seq(...).toDS() conversion below.
import spark.implicits._
// Fetch the REST API response as a JSON string.
val jsonStr = scala.io.Source.fromURL("http://192.168.6.42:9080/DataPrepRest/api/v1.0/templates/table?containerId=81&userName=sh&password=******&url=jdbc:oracle:thin:@192.168.6.76:1521:orcl&schema=sh&table=customers").mkString
// Parse the JSON string into a DataFrame.
val df = spark.read.json(Seq(jsonStr).toDS())
// Register the DataFrame as a temp view so it becomes the dataset "code_ds".
df.createOrReplaceTempView("code_ds")
// Cache the data, since downstream components will reuse it.
df.cache()
After executing the code above, a dataset named code_ds is created. You can write multiple such blocks of code to create multiple datasets; these datasets should be listed in the Properties tab, as mentioned in the 5th step.
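For example, a second block in the same Code component could register an additional dataset. This is only a sketch: the URL and dataset name below are hypothetical placeholders.

// Hypothetical second REST call; the URL and view name are placeholders.
val jsonStr2 = scala.io.Source.fromURL("http://example.com/api/v1.0/orders").mkString
val df2 = spark.read.json(Seq(jsonStr2).toDS())
// Register a second dataset; list both dataset names (comma-separated) in the Properties tab.
df2.createOrReplaceTempView("code_ds_orders")
df2.cache()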
- Create a new File component. Fill in the details and move to the next step –
- Name: Any name.
- Data Source: The list of file data sources is shown here. Select a data source.
- Dependency: In the present case, there is no need to give any dependency.
- Description: Optional. Write some basic information about the component.
- Dataset name: For File components, only one dataset is created. A default name is populated based on the component name; you can enter your desired dataset name.
- In the File step, fill in the details as shown below –
- File Name: The file you want to read. Enter the filename manually or select a file in the Files panel on the right.
- Encode: Optional (file encoding type).
- Options: Spark file read options. Some important options are pre-populated with default values; a rough sketch of the equivalent Spark read call follows this list. Please go through the following link for further information –
https://docs.databricks.com/data/data-sources/read-csv.html.
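The options in this step correspond to standard Spark CSV read options. The sketch below shows roughly what such a read looks like in Spark; it is only an illustration, and the file path, option values, and dataset name are placeholders, not the component's actual implementation.

// Illustrative only: the path, option values, and view name are placeholders.
val fileDs = spark.read
  .option("header", "true")        // first line contains column names
  .option("delimiter", ",")        // field separator
  .option("inferSchema", "false")  // keep all columns as string
  .csv("/data/customers.csv")
// Register it under the dataset name given in the Properties tab.
fileDs.createOrReplaceTempView("file_ds")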
- code_ds is the dataset created by reading the REST API data in the Code component. By default, all of its data types are treated as string. If you want to change the data types or column names of any dataset, you can use the Attribute component. Create a new Attribute component using Add component (a sketch of the equivalent Spark operation follows these steps).
Fill the details –
- Name: Any name.
- Source Dataset: The dataset whose data types and column names you want to change. In the current example, we select code_ds, the output of the Code component.
- Dependency: As this component can run only after the code_ds dataset is created by the Code component, you must add the Code component to the dependency list.
- Description: Optional (description of the component).
- Dataset Name: Output dataset name. After the data types and column names are converted, a new dataset is created as the output. The default is the component name.
- In the Rename step, enter the desired column names and data types.
- Save and Run the component.
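As a rough idea of what this conversion amounts to in Spark, the sketch below casts and renames columns of code_ds. It is only an illustration: the column names, target types, and output view name are placeholders, not the Attribute component's actual code.

// Illustrative sketch: cast string columns to the desired types and rename them.
import org.apache.spark.sql.functions.col
val typedDs = spark.table("code_ds")
  .withColumn("CUST_ID", col("CUST_ID").cast("long"))   // placeholder column and type
  .withColumnRenamed("CUST_FIRST_NAME", "first_name")   // placeholder rename
// Register the converted data under the output dataset name (placeholder here).
typedDs.createOrReplaceTempView("attribute_ds")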
- Now click Add component and select ‘Data Compare’ from the Data Quality bucket, as shown below –
- In the ‘Data Compare’ component, you need to give two datasets as input; the comparison is between these two datasets. Fill in the details –
- Name: Any name.
- Dataset A: In this example we select the output of the Attribute component.
- Dataset B: In this example we select the output of the File component.
- Dependency: In this example we consume datasets from the File component and the Attribute component, so give both of them as dependencies.
- Compare type: The comparison types you want to run.
- Description: Description of the component.
- Dataset Name: A default name is populated.
- In the Mapping step, dataset A and dataset B are mapped by column order by default. If required, you can click the ‘Remap by Name’ button to remap the columns by name. You can also select unique keys, in which case the comparison is based on those keys; multiple key columns are allowed. Move to the next step once the changes are done.
- Run the component. Each component executes as a set of statements; the Data Compare component contains a larger number of statements, and the progress of execution can be seen at the bottom in the ‘Run’ tab. After the run completes, you can see the component results as shown in the following images. At the bottom, you can see the failed and passed statements. These statements cover the duplicate calculation, Only in Dataset A, Only in Dataset B, Differences, and so on; there is one statement per calculation. Clicking a statement's link shows its details (a conceptual sketch of these calculations follows the next step).
- Clicking the difference count statement opens a pop-up window. Please check the following image.
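Conceptually, these statements correspond to standard set-style comparisons between the two datasets. The sketch below expresses them in Spark purely as an illustration; the dataset names match this example, but the key column is a placeholder and this is not the component's actual implementation.

// Conceptual sketch of the Data Compare calculations (not the product's implementation).
// "attribute_ds" and "file_ds" stand for dataset A and dataset B; "CUST_ID" is a placeholder key.
import org.apache.spark.sql.functions.col
val a = spark.table("attribute_ds")
val b = spark.table("file_ds")
// Duplicate calculation: rows that occur more than once within dataset A.
val duplicatesInA = a.groupBy(a.columns.map(col): _*).count().filter("count > 1")
// Only in Dataset A / Only in Dataset B.
val onlyInA = a.except(b)
val onlyInB = b.except(a)
// Differences: rows whose key matches but whose other columns differ.
val differences = a.alias("a")
  .join(b.alias("b"), Seq("CUST_ID"))
  .filter(a.columns.filter(_ != "CUST_ID")
    .map(c => s"a.`$c` <> b.`$c`")
    .mkString(" OR "))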
- The design of the dataflow is now complete. This is a one-time step; you can run the dataflow whenever you want just by clicking the ‘Run dataflow’ button at the top of the dataflow window. The following window then opens –
This image is built from the dependencies you defined and shows the execution order of the components. Each color indicates the progress of a component –
- Green: Successfully completed and the status is Passed.
- Blue: In queue.
- Yellow: Running.
- Red: Completed, but the status is Failure.

Datagaps was established in 2010 with the mission of building trust in enterprise data and reports. Datagaps provides software for ETL Data Automation, Data Synchronization, Data Quality, Data Transformation, Test Data Generation, and BI Test Automation. An innovative company focused on providing the highest customer satisfaction, we are passionate about data-driven test automation. Our flagship solutions, ETL Validator, DataFlow, and BI Validator, are designed to help customers automate the testing of ETL, BI, Database, Data Lake, Flat File, and XML data sources. Our tools support Snowflake, Tableau, Amazon Redshift, Oracle Analytics, Salesforce, Microsoft Power BI, Azure Synapse, SAP BusinessObjects, IBM Cognos, and other platforms used in data warehousing and BI projects.