Data Sentinel: Automating data validation
March 10, 2020
Co-authors: Arun Swami, Sriram Vasudevan, Sailesh Mittal, Jiefu Zheng, Joojay Huyn, Audrey Alpizar, Changling Huang, Maneesh Varshney, Adrian Fernandez
Data’s value is best realized when the data is prepared and treated correctly. However, when you’re working with data at an extensive scale, it’s not easy to make sure that every dataset has been cleaned and validated. Back in October 2018, we had an incident at LinkedIn in which data quality problems affected the job recommendations platform. Client job views and usage declined by 40 to 60% for a short period of time. Once this decline in views was detected, it took a total of 5 engineers 8 days to identify the root cause and 11 days to resolve the issue. This incident illustrates several takeaways about data quality:
- Poor data quality is difficult to detect and can have significant business impact
- Data debugging is difficult and requires significant engineering resources
- Resolving poor data quality requires timely and correct intervention, and it carries significant opportunity costs because it can delay other projects and deliverables
This led us to develop Data Sentinel, a platform that automatically validates the quality of large-scale data in production environments through advanced data mining, data management, and software engineering techniques. Today, we’ve expanded the use of Data Sentinel to validate over 800 datasets, saving countless developer hours at LinkedIn.
What is data quality?
The previously mentioned story conveys the importance of data quality. But what exactly is data quality? In short, it captures the fitness of data to be used to meet business requirements.
Data quality spans many dimensions: accuracy, integrity, validity, accessibility, access security, relevancy, timeliness, completeness, consistency, conciseness, and interpretability, to name a few. Below, we’ve grouped these dimensions into 4 subfields:
|Types of Data Quality|Dimensions|
|Intrinsic|Accuracy, integrity, validity|
|Accessible|Accessibility, access security|
|Contextual|Relevancy, timeliness, completeness|
|Representational|Consistency, conciseness, interpretability|
The subfields of data quality, as defined in a paper from Communications of the ACM
In this blog post, we will focus on intrinsic data quality. Real-life examples of intrinsic data quality checks include:
- Given a list of names, verify that no provided name is blank (e.g., if name = “”, then this is an invalid name)
- Given a list of people’s ages, verify that all provided ages are valid (e.g., no negative ages)
- Given a list of country names, verify that all provided country names refer to countries that exist (e.g., “foo” is an invalid country name)
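As a minimal sketch, the three examples above can be written as small validity predicates. Everything named here is an assumption for illustration: the function names, the tiny `KNOWN_COUNTRIES` reference set, and the upper bound of 150 on ages (the examples above only rule out negative ages).

```python
# Hypothetical reference set; a real check would use a complete country list.
KNOWN_COUNTRIES = {"Canada", "India", "Japan"}


def is_valid_name(name):
    """A name is valid when it is a string that is not blank."""
    return isinstance(name, str) and name.strip() != ""


def is_valid_age(age, max_age=150):
    """An age is valid when it is a non-negative integer up to an assumed bound."""
    return isinstance(age, int) and not isinstance(age, bool) and 0 <= age <= max_age


def is_valid_country(country):
    """A country name is valid when it appears in the reference set."""
    return country in KNOWN_COUNTRIES


def invalid_values(values, is_valid):
    """Return the values that fail the given validity predicate."""
    return [value for value in values if not is_valid(value)]
```

Calling `invalid_values(["Japan", "foo"], is_valid_country)` returns the offending entries, which is the kind of diagnostic a validation report can surface alongside a pass or fail result.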
Why is data quality important?
Many organizations process big data for important business operations and decisions. As a metric of success, quantity of data is not enough—data quality must also be prioritized.
A study from The Data Warehousing Institute estimated that data quality problems cost U.S. businesses more than $600 billion a year. According to Communications of the ACM, three proprietary studies estimated that the total cost of poor data quality ranged from 8 to 12% of revenue, and poor data may consume 40 to 60% of a service organization’s expenses. Despite these staggering costs, the problem persists for businesses of all lines and sizes. Many case studies from this same article, which measured errors at the data field level, reported field error rates ranging from 0.5% all the way up to 30%.
Addressing data quality with Data Sentinel
With data mining, Data Sentinel discovers properties, anomalies, and insights from data. Data mining, according to Data Mining: Concepts and Techniques, 3rd Edition by Han et al., is defined as “the process of discovering interesting patterns and knowledge from large amounts of data.” See below for examples of how specific data mining methods and techniques are leveraged in Data Sentinel.
|Propositional logic|Discovering and asserting that a set of propositions (e.g., all numeric values of a particular field fall within a specified range) holds true|
|Statistical independence testing|Comparing the distributions of values between 2 fields|
|Computational engineering|Implementing AI and statistical methods in a scalable manner with SQL to run in a database or data-intensive system|
|Query optimization|Efficiently computing multiple statistics or propositional statements (these describe properties of the data)|
|Data visualization|Visualizing discovered properties, anomalies, and insights from data|
A few of the data mining methods and techniques employed in Data Sentinel
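To make the propositional logic and query optimization rows concrete, here is a minimal sketch that computes several statistics in one scan and then asserts propositions over them. It uses Python's built-in `sqlite3` module as a stand-in for a SQL engine; the `members` table, the sample ages, and the assumed valid range of 0 to 150 are illustrative and not taken from Data Sentinel.

```python
import sqlite3


def profile_and_check(ages):
    """Compute several statistics in one table scan, then evaluate propositions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE members (age INTEGER)")
    conn.executemany("INSERT INTO members VALUES (?)", [(age,) for age in ages])

    # A single query gathers every statistic, instead of one query per statistic.
    total, non_null, min_age, max_age, out_of_range = conn.execute(
        "SELECT COUNT(*), COUNT(age), MIN(age), MAX(age), "
        "SUM(CASE WHEN age < 0 OR age > 150 THEN 1 ELSE 0 END) "
        "FROM members"
    ).fetchone()
    conn.close()

    statistics = {"total": total, "non_null": non_null, "min": min_age, "max": max_age}
    # Each proposition is a true/false statement about the whole field.
    propositions = {
        "age is never null": non_null == total,
        "age is within [0, 150]": out_of_range == 0,
    }
    return statistics, propositions


statistics, propositions = profile_and_check([25, 41, None, -3])
```

With the sample input, both propositions evaluate to false: one value is null and one is negative.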
After mining this knowledge, Data Sentinel compares it with the properties expected of good-quality data. From these comparisons, Data Sentinel generates validation reports describing whether the data is of sufficient quality.
To validate large-scale data in production environments, Data Sentinel leverages data management and software engineering concepts that make it easy to use, understand, and operate. Some of these concepts include the following:
- Declarative configurations: users declaratively specify data checks to perform on a dataset of interest in a simple and understandable configuration file
- Parsing, interpretation, and dynamic code generation: Data Sentinel parses the configuration to generate optimized SQL queries that perform the data checks specified in the configuration
- Distributed computing: Data Sentinel leverages Apache Spark to perform the specified data checks on large-scale datasets in a scalable manner. (If you are not familiar with Spark, think of it as a SQL database-like system for the purposes of this blog)
- Schemas: The input configuration and output validation reports conform to a schema that enables other software systems to parse and consume these files (users can also view the validation reports in a UI—example shown below)
UI display for part of an example validation report
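The idea of turning a declarative configuration into SQL can be sketched as follows. This is an illustration under stated assumptions, not Data Sentinel's actual configuration schema or generated queries: the check names `not_null` and `non_negative`, the `VIOLATION_PREDICATES` mapping, and the `employees` table name are all hypothetical.

```python
# Hypothetical check names mapped to SQL predicates that flag a violating row.
VIOLATION_PREDICATES = {
    "not_null": "{field} IS NULL",
    "non_negative": "{field} < 0",
}


def generate_query(table, config):
    """Build one query that counts violations for every configured check."""
    columns = ["COUNT(*) AS total_rows"]
    for field, checks in config.items():
        for check in checks:
            predicate = VIOLATION_PREDICATES[check].format(field=field)
            columns.append(
                f"SUM(CASE WHEN {predicate} THEN 1 ELSE 0 END) AS {field}__{check}"
            )
    # Names are interpolated directly, so the configuration must be trusted input.
    return f"SELECT {', '.join(columns)} FROM {table}"


# The user declares what to check; the generator decides how to compute it.
config = {"age": ["not_null", "non_negative"]}
query = generate_query("employees", config)
```

Folding all checks into a single `SELECT` means the table is scanned once no matter how many checks the configuration lists, which is one simple form of the optimization described above.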
The following steps describe important processes in a typical Data Sentinel workflow:
- Users identify a dataset of interest to validate
- Users declaratively specify data checks to perform on this dataset in a configuration file
- Data Sentinel loads the dataset into main memory and parses the configuration file
- Data Sentinel generates and executes optimized SQL queries that perform the data checks specified in the configuration
- Based on the results of the performed data checks, Data Sentinel generates a dataset profile (contains statistical summaries of discovered properties and insights of the dataset) and a validation report (contains the results of the performed data checks and their diagnostics—e.g., why the data check passed or failed, faulty records that caused the data check to fail).
- Users or downstream systems intervene based on, or further process, the dataset profile and validation report
A simplified Data Sentinel workflow diagram
Let’s take a look at how Data Sentinel works. Suppose you would like to validate the quality of the following dataset containing employee records:
To do so, you propose the following data checks:
- Employee ids are unique and not null
- Email addresses are valid (in this case, let’s say that valid email addresses follow the format email@example.com)
- Ages are valid
Then, you declaratively specify these data checks in an input configuration to Data Sentinel. This configuration will look like this:
Note that this configuration communicates the following to Data Sentinel:
- The intent to validate the values of the dataset fields employee_id, email_address, and age.
- A command to perform a corresponding set of 1 or more data checks for each field.
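A minimal end-to-end sketch of this example is shown here, with Python's built-in `sqlite3` standing in for Spark. The records, the `CONFIG` dictionary, the check names, the email pattern, and the age bound of 150 are all invented for illustration; they are not Data Sentinel's real dataset, configuration format, or checks.

```python
import re
import sqlite3

# Invented records for illustration: (employee_id, email_address, age).
RECORDS = [
    (1, "ana@example.com", 34),
    (2, "bo@example", 29),         # malformed email address
    (2, "cy@example.com", -5),     # duplicate id and negative age
    (None, "di@example.com", 41),  # null id
]

# Hypothetical configuration: field name -> data checks to perform.
CONFIG = {
    "employee_id": ["unique", "not_null"],
    "email_address": ["valid_email"],
    "age": ["valid_age"],
}

EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_email(value):
    """Return 1 when the value looks like name@domain.tld, else 0."""
    return int(isinstance(value, str) and EMAIL_PATTERN.fullmatch(value) is not None)


# Each check is a SQL query that returns the number of violating rows.
CHECK_SQL = {
    "unique": "SELECT COUNT({f}) - COUNT(DISTINCT {f}) FROM employees",
    "not_null": "SELECT COUNT(*) FROM employees WHERE {f} IS NULL",
    "valid_email": "SELECT COUNT(*) FROM employees WHERE {f} IS NULL OR NOT is_email({f})",
    "valid_age": "SELECT COUNT(*) FROM employees WHERE {f} IS NULL OR {f} < 0 OR {f} > 150",
}


def validate(records, config):
    """Load the records, run every configured check, and return a report."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("is_email", 1, is_email)
    conn.execute(
        "CREATE TABLE employees (employee_id INTEGER, email_address TEXT, age INTEGER)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", records)
    report = {}
    for field, checks in config.items():
        for check in checks:
            violations = conn.execute(CHECK_SQL[check].format(f=field)).fetchone()[0]
            report[(field, check)] = {"passed": violations == 0, "violations": violations}
    conn.close()
    return report


report = validate(RECORDS, CONFIG)
```

For simplicity this sketch runs one query per check; an optimized plan would combine checks so that the dataset is scanned as few times as possible.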
Given the configuration and dataset, Data Sentinel executes the corresponding data validation job. First, Data Sentinel loads a subset of the dataset into a special Apache Spark data structure, whose contents are distributed across the underlying Spark cluster. Note that this subset contains only the specified fields to be validated. Next, Data Sentinel parses this configuration and generates an execution plan containing a sequence of optimized SQL queries. With these queries, Data Sentinel scans the dataset to mine two bodies of knowledge:
- Dataset profile: Statistical summaries of the dataset fields specified in the configuration
- Validation report: Results of data checks specified in the configuration
For efficiency, Data Sentinel will first compute the dataset profile. Then, it uses the profile’s statistical summaries to compute the validation report.
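The benefit of computing the profile first can be sketched in a few lines: once the summaries exist, many checks reduce to comparisons between summary values and need no further pass over the data. The statistic names and the three checks here are assumptions chosen for illustration.

```python
def profile_field(values):
    """Summarize the values of one field."""
    non_null = [value for value in values if value is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null, default=None),
        "max": max(non_null, default=None),
    }


def validate_from_profile(profile):
    """Evaluate data checks using only the profile, without rescanning the data."""
    non_null_count = profile["count"] - profile["null_count"]
    return {
        "not_null": profile["null_count"] == 0,
        "unique": profile["distinct_count"] == non_null_count,
        "non_negative": profile["min"] is None or profile["min"] >= 0,
    }


ages = [34, 29, None, 29, -5]
profile = profile_field(ages)
report = validate_from_profile(profile)
```

Here the profile is built once from the raw values, and every check in `validate_from_profile` reads only from that small dictionary.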
After Data Sentinel computes the dataset profile and validation report, users have a few options for next steps. They can simply examine the contents of the profile and report. If one or more of the specified data checks failed, users can instruct Data Sentinel to block the dataset from flowing further downstream in the workflow of jobs. Alternatively, other programs and software systems can further process the profile and report.
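Blocking a dataset when checks fail can be pictured as a gate placed in front of the downstream job. The sketch below is a generic illustration of that pattern, not Data Sentinel's interface: `DataValidationError`, `gate`, and `run_workflow` are hypothetical names, and the report is assumed to be a simple mapping from check name to pass or fail.

```python
class DataValidationError(Exception):
    """Raised to stop a workflow when a dataset fails validation."""


def gate(report):
    """Raise if any check in the report failed; otherwise return the report."""
    failed = sorted(name for name, passed in report.items() if not passed)
    if failed:
        raise DataValidationError("failed checks: " + ", ".join(failed))
    return report


def run_workflow(report, downstream_job):
    """Run the downstream job only when the dataset passes every check."""
    gate(report)
    return downstream_job()
```

Raising an exception, rather than returning a flag, means a workflow that forgets to inspect the result still stops before bad data reaches the next job.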
Data Sentinel adoption at LinkedIn
With data mining and software engineering techniques, Data Sentinel has identified bugs in development and production workflows by flagging poor-quality data and has prevented software systems at LinkedIn from consuming bad data. It has also caught more insidious issues, such as data skew and duplicate examples in datasets. These issues can degrade both data analytics and statistical machine learning models.
Some success stories from our teams include:
- Data Sentinel was used to discover duplicated work anniversary data and primary keys for organization data.
- Data Sentinel helped a team that works with member and jobs data discover duplicated data, then intervene and prevent corrupted data from being pushed to a database.
- Data Sentinel helped a team that works with recruiter data discover duplicate records, thus preventing the learning of biased machine learning models from this corrupted data.
Following its widespread success and added value at LinkedIn, Data Sentinel continues to undergo intense development to improve its capabilities. These exciting efforts include the following:
- Implementing more data mining methods based on AI, statistics, and machine learning to perform data checks
- Discovering and recommending data checks for users
- Validating data in an online streaming fashion (as opposed to the current offline batch-processing approach)
- Leveraging self-driving database techniques, as referenced in this paper, to improve the performance of data validation jobs
At LinkedIn, we are excited to continue pushing the frontiers of data mining, data management, and software engineering to address data quality problems. We hope that Data Sentinel will not only raise awareness around the importance of data quality, but also inspire concepts of “testing coverage” and health metrics for datasets to be incorporated into software engineering and big data analytics.