Over the last few years, we have been working with a number of customers across various industries such as Life Sciences, Financial Services, Higher Education etc.  While the problems related to data testing are similar across industries and JSON and flat files are very common, there are many differences in the file formats.  In few industries, Avro is popular while in others, based on use case, Parquet is more commonly used.

At Datagaps, one of the key challenges is our ability to empower customers to test any of the formats with ease. Of course, we have an option to build native connections for each format but that may not be the best option sometimes.  At Datagaps, we try to leverage open standards, open source frameworks to support our customers. This is where Apache Drill (https://drill.apache.org/) comes into play.

Over the last 6 months, we have been drilling around a bit and absolutely love the speed and flexibility that Apache Drill provides. As of version 3.4.5, we use Drill as the interface between ETL Validator and any of the file formats mentioned above (except flat files since flat files are much more common and deserve native connectors). In this blog, I wanted to take few minutes and explain how easy it is for you to get started with Drill and integrate with ETL Validator.


1. You are running this proof of concept on Windows.

2. You have ETL Validator Complete 3.4.5 or higher installed on the machine.

3. You have Mongo running on the same machine.

Apache Drill Installation

Step 1:  Go to https://drill.apache.org/ and the get the latest version of  Drill. In this blog, we assume version 1.9 of Drill.

Step 2: Using 7z or a similar tool, un-archive the file.

Step 3: Navigate to the bin folder of Apache Drill and then execute the following command “sqlline -u jdbc:drill:zk=local”. This will start the Apache Drill server and now you are ready to define your Mongo database.  Once Drill is started successfully, you should see something like this:

Step 4: Launch the web interface to Drill using https://localhost:8047. This interface should be available once the command in step # 3 is run successfully and should look like this:

Step 5: Navigate to  “Storage” tab and enable the Mongo storage in Apache Drill. Since Mongo is running on the same machine, you do not have to make any changes. Your Mongo setting should look like this:

Setting up Mongo connection in ETL Validator using Drill

Step 1: Launch ETL Validator

Step 2: Create a new “Mongo through Drill” Connection.

  • Name: Can be anything you want.
  • User Name & Password: You can put null/null if there is no username/password. This is a know issue and will be fixed in upcoming versions.
  • Connection Type: Leave it as Auto
  • Host Name: You can provide localhost.
  • Port: Leave the default port on which Drill must be running.
  • Retrieve and select a schema from Mongo.

Step 3: Test the Connection.

Now, you are ready to execute Query Comparison or other test plans in ETL Validator. Few sample queries are below. Depending on the complexity of the JSON structure, you may need more time to understand how to get the right query.

  • SELECT * from mongo.test.`addresses`;
  • SELECT  ad.address.building FROM mongo.test.addresses as ad
  • SELECT  ad.address.coord FROM mongo.test.addresses as ad

The above queries assume that the JSON is as follows and these documents are in the mongo.test schema.

{“address”: {“building”: “1007”, “coord”: [-73.856077, 40.848447], “street”: “Morris Park Ave”, “zipcode”: “10462”}, “borough”: “Bronx”, “cuisine”: “Bakery”, “grades”: [{“date”: {“$date”: 1393804800000}, “grade”: “A”, “score”: 2}, {“date”: {“$date”: 1378857600000}, “grade”: “A”, “score”: 6}, {“date”: {“$date”: 1358985600000}, “grade”: “A”, “score”: 10}, {“date”: {“$date”: 1322006400000}, “grade”: “A”, “score”: 9}, {“date”: {“$date”: 1299715200000}, “grade”: “B”, “score”: 14}], “name”: “Morris Park Bake Shop”, “restaurant_id”: “30075445”}

{“address”: {“building”: “469”, “coord”: [-73.961704, 40.662942], “street”: “Flatbush Avenue”, “zipcode”: “11225”}, “borough”: “Brooklyn”, “cuisine”: “Hamburgers”, “grades”: [{“date”: {“$date”: 1419897600000}, “grade”: “A”, “score”: 8}, {“date”: {“$date”: 1404172800000}, “grade”: “B”, “score”: 23}, {“date”: {“$date”: 1367280000000}, “grade”: “A”, “score”: 12}, {“date”: {“$date”: 1336435200000}, “grade”: “A”, “score”: 12}], “name”: “Wendy’S”, “restaurant_id”: “30112340”}

In the next blog, we will show you an example of comparing JSON data with table data using the Query Compare Test Plan. Stay tuned.