Complex, Deeply Hierarchical JSON and XML Files

Why are structurally complex and deeply hierarchical JSON and XML files so difficult to validate?

While most Google searches for “JSON Validation” lead to tools and articles that focus on structure, a severely overlooked topic is validating the data within these complex files. The same applies to XML files. Created by and for applications to communicate a multitude of variables interlinked in non-linear, many-to-many ways, these files become increasingly complicated to untangle and validate properly.

Fig. Raw JSON

A common use case, beyond standalone JSON/XML data validation, is the inclusion of these file types in more traditional datasets. A typical database with a good number of related views and tables suddenly has to ingest a set of complex JSON files. While parsing and pulling individual data values out of the JSON is highly situation-specific and requires special functions, validating those data values is usually a nightmare for developers.

As seen in the image, even loading a JSON or XML file as a DataFrame in Spark results in a complex dataset that cannot readily be combined with traditional datasets.

Fig. Raw Loaded Tables
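To make this concrete, here is a minimal PySpark sketch of that loading step. The file name and donut-style schema are assumptions for illustration; they are not part of the Suite.

```python
# Minimal sketch: loading a nested JSON document into Spark.
# "donuts.json" and its fields are assumed for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-json-load").getOrCreate()

# multiLine=True lets Spark read JSON documents that span multiple lines
raw = spark.read.option("multiLine", True).json("donuts.json")

# The inferred schema is full of structs and arrays rather than flat columns,
# which is exactly what makes the raw DataFrame hard to join with normal tables.
raw.printSchema()
# root
#  |-- id: string
#  |-- name: string
#  |-- batters: struct
#  |    |-- batter: array of struct<id, type>
#  |-- topping: array of struct<id, type>
```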

As XMLs and JSONs get more complex, the number of structured columns consisting of nested arrays, key-value pairs, and lists only increases. While this is a start, we cannot validate any of these values or use the specifics unless we apply complex functions to extract individual values. This is where the auto-parsing capabilities of the DataOps Suite come in.
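Before turning to those capabilities, it helps to see what such manual extraction involves. The sketch below (plain PySpark, continuing the assumed donut example above) unnests one level just to reach a single batter value:

```python
# Manual extraction sketch (not the Suite): each nested level needs its own
# explode() and field access before a value such as a batter type can be checked.
from pyspark.sql.functions import col, explode

batters = (
    raw.select(col("id").alias("donut_id"),
               explode("batters.batter").alias("batter"))
       .select("donut_id",
               col("batter.id").alias("batter_id"),
               col("batter.type").alias("batter_type"))
)

# Only after this unnesting can a simple rule, such as a null check, be applied.
batters.filter(col("batter_type").isNull()).count()
```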

Auto-Parsing and Validation of JSON and XML

The DataOps Suite includes a few custom libraries that parse JSON and XML files with a single command. These commands flatten different types of JSONs and XMLs into a set of relational datasets. By converting the complex sub-structures of these files into separate views with related primary keys, validation, data rules, profiling, and analysis can be applied just as they would be to ordinary views.

In the example, the two complex sub-structures are broken into separate views with a key-relation table. Because the key pairs for the various bases (names) remain consistent, the functions ensure that no redundant data is created. Each type of donut has a subset of the pre-existing list of batters, and in the tables created we see only the distinct values plus a combination of IDs that correctly tags each list of values.
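The Suite does this with a single command from its own library; purely to illustrate the idea, a hand-rolled PySpark sketch (continuing the example above, with toppings handled the same way) might look like this:

```python
# Not the Suite's API: a hand-rolled sketch of the same idea. Sub-structures
# become separate views plus a key-relation (bridge) table, and only the
# distinct batter values are stored once.
batter_dim       = batters.select("batter_id", "batter_type").distinct()
donut_batter_map = batters.select("donut_id", "batter_id").distinct()

raw.select("id", "name").createOrReplaceTempView("donuts_v")
batter_dim.createOrReplaceTempView("batters_v")
donut_batter_map.createOrReplaceTempView("donut_batter_map_v")

# The views can now be profiled and validated like any other relational dataset.
spark.sql("SELECT batter_type, COUNT(*) FROM batters_v GROUP BY batter_type").show()
```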

The segregated views can be analyzed and validated separately, to check the complete list of values and key-pair combinations, or combined with filters to perform node-specific validation.

Fig. Breaking sub-structures into Separate Views

One of the most important aspects, and one that is easy to overlook, is that the tables and sub-tables created by the function are part of the relational DB system; these datasets have connection keys and relations assigned the instant they are created. This means any query builder, ETL pipeline, or DB system can instantly relate them and work with them, without the user having to keep track of the complex relations or joining keys.

In the example, the user builds a view that considers only the combinations of “Batters” and “Toppings” for an “Old Fashioned” donut. This view is then run against a set of data rules, which include a domain check on the toppings, a null check, a duplicate check, and a space-character check.
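Inside the Suite these rules are configured as nodes; purely for comparison, hand-written equivalents of the same four checks against the sketched views above could look like this (a toppings_v view is assumed, flattened the same way as batters_v, and the allowed-topping domain is illustrative):

```python
# Hand-written equivalents of the four rules (not the Suite's rule engine).
from pyspark.sql.functions import col, trim

old_fashioned = spark.sql("""
    SELECT d.name, b.batter_type, t.topping_type
    FROM donuts_v d
    JOIN donut_batter_map_v m ON d.id = m.donut_id
    JOIN batters_v b          ON m.batter_id = b.batter_id
    JOIN toppings_v t         ON d.id = t.donut_id       -- toppings_v assumed
    WHERE d.name = 'Old Fashioned'
""")

allowed = ["None", "Glazed", "Sugar", "Chocolate with Sprinkles", "Maple"]  # illustrative domain

domain_fail = old_fashioned.filter(~col("topping_type").isin(allowed)).count()                 # domain check
null_fail   = old_fashioned.filter(col("topping_type").isNull()).count()                       # null check
dup_fail    = old_fashioned.count() - old_fashioned.dropDuplicates().count()                   # duplicate check
space_fail  = old_fashioned.filter(trim(col("topping_type")) != col("topping_type")).count()   # space-character check
```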

Performing a similar set of tests with custom tools on specific nodes would require a framework to be developed and maintained, and each new set of rules, analysis, and profiling would demand far more setup time and maintenance resources. That is before counting the reporting system that would have to be built and connected to the customized validation system.

Try DataOps Suite – Free Trial

Fig. Donut Rules

In the Suite, an XML file is imported into the flow, converted into relational views, and then validated against a set of rules and profiling systems. XML files are notoriously hard to use outside their intended applications, so there are fewer validation systems that can check their values. The flow showcases how a typical validation and analysis flow for an XML file looks in the Suite.
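Outside the Suite, a comparable starting point for XML is the open-source spark-xml reader (assumed to be installed); after loading, the same view-splitting and rule steps sketched above apply unchanged:

```python
# Sketch only: reading XML with the open-source spark-xml package
# (the com.databricks:spark-xml dependency is assumed to be on the classpath).
xml_df = (
    spark.read.format("xml")
         .option("rowTag", "item")   # rowTag = whichever element repeats in the file
         .load("orders.xml")         # file name assumed for illustration
)

# The result is again a DataFrame of nested structs and arrays.
xml_df.printSchema()
```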

Fig. Overall Data Flow

APIs with Query Builder

The most common way JSONs or XMLs end up in a relational DB or warehouse is via APIs built around specific use cases. And with complex JSONs or XMLs come complex queries. In these cases, there are a few reasons why using the query builder provided by Datagaps can be really beneficial.

First, query builders typically provide a user-friendly interface that allows users to easily access and manipulate data without having to write complex SQL code. This can be especially helpful for non-technical users who may not be familiar with SQL or may not have the technical expertise to write queries themselves.

Second, query builders often support a wide range of data manipulation functions, such as filtering, sorting, and aggregation, which can be used to transform and analyze data in various ways. This can be useful for exploring and understanding large datasets, and for identifying trends and patterns within the data.

In addition, query builders can also be used to extract and flatten nested data from JSON or XML files. This can be useful for breaking down complex data structures and transforming the data into a more manageable format that is easier to work with.
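As a plain-Python illustration of that last point (the endpoint, payload shape, and field names are hypothetical), flattening a nested API payload can look like this:

```python
# Hypothetical API pull followed by flattening with pandas.json_normalize.
import requests
import pandas as pd

payload = requests.get("https://api.example.com/donuts").json()  # hypothetical endpoint

flat = pd.json_normalize(
    payload,                            # a list of nested JSON records
    record_path=["batters", "batter"],  # the nested array to unroll into rows
    meta=["id", "name"],                # parent keys repeated onto each row
    record_prefix="batter_",
)
print(flat.head())
```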

Overall, using a query builder to work with data from JSON or XML files pulled via an API provides a more convenient and powerful way to access and manipulate data, and can help users unlock the full potential of that data.

Fig. Query Builder with Relational Datasets in the JSON

Conclusion

Extracting and parsing JSONs and XMLs that are under constant updates, with structural and value-type changes, calls for a flexible solution that can easily be adapted to handle any of those changes, be it a structural change, a data-type change, or anomaly detection. As most DBs work entirely with default-style views, the addition of complex columns disrupts the flow of quality checks and pipelines.

The combination of the parsing and extraction functions that create relational views, working in tandem with the data quality nodes provided by the Datagaps DataOps Suite, ensures that even with routine changes to complex XML and JSON files, validating and analyzing the datasets, as well as detecting anomalies in them, is only a few clicks away.

Get a Free POC scheduled today!

Request Demo

Datagaps

Established in 2010 with the mission of building trust in enterprise data & reports, Datagaps provides software for ETL Data Automation, Data Synchronization, Data Quality, Data Transformation, Test Data Generation, & BI Test Automation. We are an innovative company focused on the highest customer satisfaction, and we are passionate about data-driven test automation. Our flagship solutions, ETL Validator, Data Flow, and BI Validator, are designed to help customers automate the testing of ETL, BI, Database, Data Lake, Flat File, & XML data sources. Our tools support Snowflake, Tableau, Amazon Redshift, Oracle Analytics, Salesforce, Microsoft Power BI, Azure Synapse, SAP BusinessObjects, IBM Cognos, and other data warehousing projects and BI platforms. www.datagaps.com

Queries: contact@datagaps.com