What Is Data Drift?
Within the data space, the only constant is change. Data drift refers to a range of changes in the input data, primarily in frequency, aggregates, and heterogeneity. These changes are not errors: the shifts are genuine and representative of real-world data, and they show how that data now differs from the benchmarks on which a pipeline, analysis, or ML model was originally based.
Model drift is the other side of the coin. It is closely related to data drift and concerns how the statistical relationships, the probabilities, and the logic the model was meant to capture have changed. While model drift is specific to AI/ML models, data drift affects every pipeline built on past production data.
A quick way to understand data drift is through a few real-world examples:
An ML model that predicts house prices from property attributes such as the number of rooms, area, location, and floor, and that was trained in 2019, will not work correctly in 2020 because the relationship between those attributes and price has changed. Demand rose in certain areas, as it did for certain bedroom counts. If the model is not retrained or otherwise corrected, its predicted prices cannot be relied on.
Assume a statistical regression-based model predicts whether a customer might default on a loan. When the model was built, most of the bank’s clients were new families. A few months after the model goes live, the marketing department launches a new campaign targeted at young students. The campaign is successful, but the model is no longer accurate because the inputs it receives now follow different distributions. As a result, its default predictions can no longer be trusted.
A reporting system that shows mean temperature forecasts across multiple regions suddenly reports a higher mean temperature than expected. Under the hood, a few areas had replaced their sensors with a different brand that records measurements in Fahrenheit rather than the Celsius the system was built on.
Data drift can be distinguished by the type of drift and by its cadence. The examples above focus on the type of drift. By cadence, drift falls into four categories: Sudden Drift, Gradual Drift, Incremental Drift, and Recurring Drift. These are usually defined in terms of data distribution over time, but the concept carries over to specific aggregates of the metrics themselves.
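The sensor example can be sketched in a few lines of Python. The readings below are invented for illustration, and the assumption is that two of five regions switch to Fahrenheit while the reporting system still treats every value as Celsius. Even a partial switch is enough to pull the overall mean noticeably upward.

```python
from statistics import mean

# Hypothetical Celsius readings from five regions (illustrative values only).
celsius_readings = [21.0, 23.5, 19.0, 25.0, 22.5]


def to_fahrenheit(celsius):
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Assumption: the last two regions now report Fahrenheit, but the
# reporting system has no unit column and reads every value as Celsius.
mixed_readings = celsius_readings[:3] + [to_fahrenheit(c) for c in celsius_readings[3:]]

baseline_mean = mean(celsius_readings)
drifted_mean = mean(mixed_readings)

print(f"mean before sensor change: {baseline_mean:.1f}")
print(f"mean after sensor change:  {drifted_mean:.1f}")
```

No individual record here is malformed, which is why this kind of drift tends to surface only when aggregates are compared against a baseline.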
Figure 1. Different Classifications of Drift
Figure 2. The above graph showcases a “Sudden” Drift in Yearly Income where the overall values of the metric have increased sharply
Profiling as Drift Detection in Data Drift
Data Profiling is an integral part of the DataOps Suite. It helps users create profiles that capture the information that can be derived from a dataset: aggregates such as mean, deviations, min-max, and null counts, along with Frequency and Pattern Analysis.
A dataset can be directly pulled into a profiling node. The DataOps Suite Profile node provides a variety of aggregation and pattern analysis options.
Figure 3. DataOps Profile Node
Each of the aggregations contributes to a profile of the dataset, recording an average value, upper and lower bounds, deviations, patterns, null counts, and so on. This helps create a baseline of what to expect from the dataset and gives users something to compare against. Let’s have a closer look at a few real-life examples.
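To make the idea of a baseline profile concrete, here is a minimal Python sketch. It is not the DataOps Suite Profile node; the function name `profile_column` and the sample incomes are hypothetical, and it covers only a handful of the aggregates mentioned above.

```python
from statistics import mean, pstdev


def profile_column(values):
    """Build a simple profile for one numeric column; None counts as a null."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "mean": mean(present),
        "min": min(present),
        "max": max(present),
        "std": pstdev(present),  # population standard deviation
    }


# Hypothetical yearly incomes with one missing value.
yearly_income = [42000, 55000, None, 61000, 48000]
baseline = profile_column(yearly_income)
print(baseline)
```

Storing one such dictionary per run gives later runs something to be compared against.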
Data Drift & Variety of Drift Detection – Covariate Drift or Drift in Metrics Stats
Every numerical metric has statistical aggregates that can serve as a baseline for the dataset. The most basic of these are the average, min-max values, and standard deviation. Skewness and Kurtosis also help keep the distribution in check.
A change in the mean implies that the average value of the metric has shifted. In the example below, the yearly income has increased overall.
Figure 4. Mean
Min-max values show the hard upper and lower bounds of a metric, while the standard deviation shows its variability, that is, how far values typically spread from the mean. In the example, the min and max values of the yearly income have shifted up with the mean, but the standard deviation has decreased, implying that customers' incomes are now more alike.
Figure 5. Minimum Value
Figure 6. Maximum Value
Figure 7. Standard Deviation [The decrease showcases that most of the values in the past 2 runs are much closer to the mean]
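A comparison like the one in the figures above can be approximated with a relative-change check between two runs. This is a sketch under stated assumptions: the 10% tolerance, the function names, and the income values are all hypothetical, chosen so that mean, min, and max rise while the standard deviation falls.

```python
from statistics import mean, pstdev


def summarize(values):
    """Basic stats used as the comparison baseline."""
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "std": pstdev(values),
    }


def drifted_stats(baseline, current, tolerance=0.10):
    """Return stats whose relative change from the baseline exceeds tolerance."""
    flagged = {}
    for name, base_value in baseline.items():
        if base_value == 0:
            continue  # relative change is undefined for a zero baseline
        change = (current[name] - base_value) / abs(base_value)
        if abs(change) > tolerance:
            flagged[name] = round(change, 3)
    return flagged


# Hypothetical yearly incomes from two consecutive runs.
previous_run = [40000, 50000, 60000, 70000, 80000]
latest_run = [70000, 75000, 80000, 85000, 90000]

flags = drifted_stats(summarize(previous_run), summarize(latest_run))
print(flags)
```

A fixed relative tolerance is the simplest possible rule; in practice the threshold would likely be tuned per metric.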
Skewness measures how asymmetric a distribution is, that is, whether values trail off further on one side of the mean than the other, while Kurtosis measures how heavy the tails of the distribution are. Any change in these statistics represents a change in the shape of the distribution and can therefore critically affect statistical tests that assume a particular distribution, such as the t-test. In our example, these do not change much; in more sensitive settings such as an AI/ML model, however, even small changes can affect the results more drastically.
Figure 8. Skewness
Figure 9. Kurtosis
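For reference, skewness and kurtosis can be computed from the central moments of a column. The sketch below uses the population (biased) definitions and reports excess kurtosis, so a normal distribution would score close to 0; profiling tools may use sample-corrected variants instead, so the exact numbers can differ. The sample lists are illustrative.

```python
from statistics import mean


def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    mu = mean(values)
    return sum((v - mu) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment; 0 for perfectly symmetric data."""
    # Note: undefined (division by zero) for a constant column.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; higher means heavier tails."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```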
Change in Keys / GUID
A GUID or a primary key is one of the most important columns in a relational dataset. In delta datasets, keys are critical for ensuring that duplicates do not enter the system. Any change in key patterns will result in incorrect aggregations and reports, especially when checked against pre-change datasets.
In our example, the pattern of the Customer Key was a 5-digit number that was suddenly changed to an alphanumeric key.
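One common way to profile key patterns is to reduce each value to a shape, mapping digits to `9` and letters to `A`, and then compare the set of shapes between runs. The sketch below assumes that convention; the key values and the `CU` prefix are made up for illustration.

```python
import re
from collections import Counter


def value_pattern(value):
    """Map digits to '9' and letters to 'A', keeping other characters as-is."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)


def pattern_profile(values):
    """Count how often each pattern occurs in a column."""
    return Counter(value_pattern(v) for v in values)


# Hypothetical customer keys before and after the change.
baseline_keys = ["10234", "10235", "10236"]
latest_keys = ["10237", "CU10238", "CU10239"]

new_patterns = set(pattern_profile(latest_keys)) - set(pattern_profile(baseline_keys))
print(new_patterns)
```

Any pattern that appears in the latest run but not in the baseline is a candidate for review before it reaches joins or deduplication logic.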
Domain Shift or Addition of New Values
As in the loan example from the introduction of this blog post, if a new campaign type or a new geography id is added to a system, the corresponding joins have to be checked. Additionally, if geography id is one of the grouping columns for any aggregations, those aggregates are affected as well. While a value outside the expected domain would normally be rejected as a bad record, separation between teams can result in newly validated domain values (LOVs, or lists of values) that the analysis team might not be aware of.
Figure 10. Distinct Count
In the example, we see the addition of a few geography ids, which changes both the number of distinct values and the distribution of customers across geographies.
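A domain check of this kind comes down to comparing distinct values and their shares between runs. The following sketch uses hypothetical geography ids and helper names; it reports newly seen values, the distinct count, and how the share of one existing id changes.

```python
from collections import Counter


def new_domain_values(baseline_values, current_values):
    """Return values present in the current run but absent from the baseline."""
    return set(current_values) - set(baseline_values)


def distribution(values):
    """Share of rows per distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


# Hypothetical geography ids, one per customer row.
baseline_geo = [101, 101, 102, 103]
latest_geo = [101, 102, 103, 104, 104, 105]

added = new_domain_values(baseline_geo, latest_geo)
print("distinct count:", len(set(baseline_geo)), "->", len(set(latest_geo)))
print("new geography ids:", sorted(added))
print("share of 101:", distribution(baseline_geo)[101],
      "->", round(distribution(latest_geo)[101], 3))
```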
Model Drift and Final Thoughts
Model Drift is the other side of the coin and is caused mainly by data drift. It refers to degradation in model performance due to changes in the data and outdated model parameters. In a machine learning system, fixing data drift alone is not sufficient; separate techniques are needed to detect model drift against the production data and model.
Data Drift affects not just ML models but any system that works with functions and aggregates or performs statistical tests. Gradual changes creep into datasets over time, resulting in lower data and model quality. If drift is not tracked consistently, figuring out exactly what has changed becomes a much more complex and difficult task.
Detection of Data Drift is often de-prioritized, yet it can be easily deployed, documented, and monitored using the DataOps Profiling Nodes. This ensures that any type of drift, be it in metrics, domains, patterns, or stats, is identified early, before it causes severe dips in model quality.
Established in 2010 with the mission of building trust in enterprise data & reports, Datagaps provides software for ETL Data Automation, Data Synchronization, Data Quality, Data Transformation, Test Data Generation, & BI Test Automation. We are an innovative company focused on providing the highest customer satisfaction, and we are passionate about data-driven test automation. Our flagship solutions, ETL Validator, Data Flow, and BI Validator, are designed to help customers automate the testing of ETL, BI, Database, Data Lake, Flat File, & XML Data Sources. Our tools support data warehousing projects and BI platforms, including Snowflake, Tableau, Amazon Redshift, Oracle Analytics, Salesforce, Microsoft Power BI, Azure Synapse, SAP BusinessObjects, and IBM Cognos. www.datagaps.com