Towards data quality management at LinkedIn
June 9, 2022
Data is at the heart of all our products and decisions at LinkedIn, and the quality of our data is vital to our success. While data quality is not an uncommon problem, our scale (hundreds of thousands of pipelines and streams, and over an exabyte of data in our data lake alone) presents unique challenges. Most pipelines have complex logic involving a multitude of upstream datasets, and common data quality issues in these datasets can be broadly classified into two categories: metadata and semantic.
The metadata category of data quality issues concerns data availability, freshness, schema changes, and data completeness (volume deviation), which can be determined without checking the content of the dataset. This kind of data health information can be collected from the metadata repository (e.g., Hive metastore and related events) or storage services (e.g., HDFS), i.e., purely from the metadata of the datasets. The semantic category of data quality issues concerns the content of the datasets, such as column value nullability, duplication, distribution, exceptional values, etc. This kind of data health information can be collected from the data profile (if it exists) or by scanning the dataset itself, i.e., the data content. To lay a solid foundation for tackling data quality, we must attack the problem at the root. In this blog post, we focus on the metadata category of data quality issues and demonstrate how we can monitor the data quality of datasets at scale: data availability, data freshness, data completeness, and schema changes.
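To make the distinction concrete, here is a minimal sketch of metadata-category checks that never touch the data content. The partition-metadata fields, thresholds, and check names are illustrative assumptions for this post, not DHM's actual data model:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PartitionMetadata:
    """Metadata-only view of a partition, as might be read from a Hive
    metastore event or an HDFS listing (illustrative fields)."""
    created_at: datetime
    row_count: int
    schema_fields: frozenset


def metadata_checks(meta: PartitionMetadata,
                    expected_by: datetime,
                    baseline_rows: int,
                    expected_schema: frozenset,
                    volume_tolerance: float = 0.5) -> list:
    """Run the metadata-category checks (freshness, volume, schema)
    without ever scanning the data itself."""
    issues = []
    if meta.created_at > expected_by:
        issues.append("late_arrival")
    if baseline_rows and abs(meta.row_count - baseline_rows) / baseline_rows > volume_tolerance:
        issues.append("volume_deviation")
    if meta.schema_fields != expected_schema:
        issues.append("schema_change")
    return issues
```

Semantic-category checks, by contrast, would need to read `meta`'s underlying rows (or a precomputed profile), which is far more expensive at scale.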
A typical AI use case
We are using a typical AI use case to illustrate the challenges in addressing data quality issues. The following figures show the actual arrival time and data volume of a dataset that is being consumed by the AI use case. The dataset is expected to arrive on a daily basis before 7 a.m., and the data volume follows a weekly pattern (i.e., seasonality) in which the volume during the weekend is lower.
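One simple way to account for such weekly seasonality is to compare each day's volume only against the history for the same weekday. The following sketch (with an assumed 30% tolerance) illustrates the idea:

```python
from statistics import mean


def volume_ok(history_same_weekday, todays_volume, tolerance=0.3):
    """Compare today's row count against the average for the same
    weekday, so a naturally low weekend volume is not flagged just
    because weekdays run higher. The 30% tolerance is an illustrative
    default, not a DHM setting."""
    if not history_same_weekday:
        return True  # no baseline yet; nothing to compare against
    baseline = mean(history_same_weekday)
    return abs(todays_volume - baseline) / baseline <= tolerance
```

A per-weekday baseline is the simplest seasonality model; more elaborate approaches (e.g., time-series decomposition) trade simplicity for sensitivity.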
It is quite common that a complex pipeline consumes tens of (if not more than a hundred) upstream datasets like the one we illustrated. These upstream datasets may be ingested into the storage or created by different pipelines at different times with different arrival frequencies. For example, some datasets are produced on a monthly, weekly, daily, or hourly basis while some datasets are produced several times a day so that the latest version can be used. Some are static in nature (updated very occasionally) and others are continuously ingested into the system.
All of these complexities make it difficult to reason about upstream dataset quality, and in many cases, awareness of the following aspects can greatly help the operation of these complex pipelines:
Data freshness/staleness: An older version of the dataset may be used unintentionally in the pipeline logic (due to late arrival of the latest version), or the pipeline may be stopped due to a late arrival. Either case can degrade the overall data quality. In many cases, the pipeline may accept reading slightly stale source data, but the users must be aware that the quality of the output may degrade over time.
Data volume deviation: A sudden drop in data volume could mean only partial data was generated. If the situation goes undetected, the output dataset will be inaccurate. A steep increase in data volume might cause the pipeline logic to fail due to insufficiently allocated system resources (e.g., memory), delaying delivery of the output dataset.
Schema change: A schema change (e.g., adding a field to a structure) may break some downstream pipeline logic unintentionally, leading to pipeline failure.
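As a sketch of the last point, a schema change can be mechanically classified by diffing the old and new field maps; which changes actually break a given reader depends on the format and the consumer, so the "breaking" heuristic below is an illustrative assumption:

```python
def schema_diff(old_fields: dict, new_fields: dict) -> dict:
    """Classify a schema change by comparing field-name -> type maps.
    Added fields are often backward compatible for readers that select
    columns by name; removed fields or type changes usually break
    downstream logic (a simplifying assumption for illustration)."""
    added = sorted(set(new_fields) - set(old_fields))
    removed = sorted(set(old_fields) - set(new_fields))
    retyped = sorted(f for f in set(old_fields) & set(new_fields)
                     if old_fields[f] != new_fields[f])
    return {"added": added, "removed": removed, "retyped": retyped,
            "breaking": bool(removed or retyped)}
```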
The data health can be greatly improved if the following challenges can be tackled systematically while addressing these issues.
At LinkedIn, there are hundreds of thousands of datasets residing on multiple Hadoop clusters. Some datasets are replicated across clusters, either by copying or by being created in parallel for redundancy, while other datasets are served for different purposes (e.g., all sensitive information is obfuscated). A monitoring solution must address this scalability issue and offer low overhead, short latency, and a monitoring-by-default (i.e., no manual onboarding process) capability.
LinkedIn has a huge number of data pipelines, and each consumes a large number of datasets and has different expectations on dataset availability. Let's consider a pipeline reading a dataset that has an SLA of daily arrival at 8 a.m. If the pipeline won't start until 2 p.m., does missing this SLA (with its actual arrival time at 10 a.m.) have any business significance to the pipeline? If the dataset is backfilled by noon, this particular SLA miss is irrelevant to the consuming pipeline, but receiving an alert of the SLA miss at 10 a.m. can be annoying, leading to email alert fatigue (discussed further in this blog). There are multiple consumers, each with its own expectation of the input dataset's availability. In other words, even if producers miss the SLA commitment, the misses may not impact all of the consumers; the real impact depends on the consumption usage.
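This consumer-relative view of SLA misses can be sketched as follows; the function and its parameters are illustrative, not DHM's API:

```python
from datetime import datetime
from typing import Optional


def alert_needed(arrival: Optional[datetime],
                 consumer_needs_by: datetime,
                 now: datetime) -> bool:
    """Alert a consumer only if the data has still not arrived by the
    time that consumer actually needs it. A producer SLA miss that is
    backfilled before the consumer's deadline generates no alert for
    this subscription."""
    if now < consumer_needs_by:
        return False  # too early to judge for this consumer
    return arrival is None or arrival > consumer_needs_by
```

In the example above, a 10 a.m. arrival against an 8 a.m. producer SLA is invisible to a consumer who only needs the data by 2 p.m.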
There are many different types of datasets at LinkedIn, with offline datasets classified as either snapshot or partitioned. However, datasets can have different arrival frequencies: monthly, weekly, daily, multiple times per day, or hourly. When a pipeline consumes any of these datasets, defining the freshness/staleness of the dataset can be a challenge. For example, if a pipeline reads a dataset that is refreshed several times a day and there is a delay for the latest arrival (its producer missed the SLA), does this indicate that the pipeline is reading stale data, even though the pipeline can read an earlier version that arrived on the same day? Reading an earlier version may have little business impact on the output, but it is important that the consumers be aware of the situation.
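One way to make such a definition precise is to grade staleness relative to the dataset's own arrival interval rather than using a fixed clock time; the grading levels below are illustrative assumptions:

```python
from datetime import datetime, timedelta


def staleness_level(latest_arrival: datetime,
                    expected_interval: timedelta,
                    now: datetime) -> str:
    """Grade staleness relative to the dataset's arrival frequency.
    For a dataset refreshed several times a day, missing one interval
    means a slightly older same-day version is still usable
    ('slightly_stale'), while several missed intervals indicate a
    harder problem ('stale'). Thresholds are illustrative."""
    lag = now - latest_arrival
    if lag <= expected_interval:
        return "fresh"
    if lag <= 2 * expected_interval:
        return "slightly_stale"
    return "stale"
```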
Engineers receive various email alerts throughout the day, some of which might be false or duplicate alarms. Due to this, engineers are faced with email alert fatigue and sometimes don't pay close attention to email alerts that indicate issues or failures. Additionally, different expectations on dataset availability also magnify the fatigue issue when the consumers receive too many alerts that are technically valid but not relevant. This leads to the mishandling or overlooking of real data quality issues. Consequently, reducing email alert fatigue is a paramount goal, and at the same time, including possible remedial action or information in the email alert is critical to the success of the monitoring solution.
Existing solutions often require the dataset owners to grant explicit read permission to the monitoring service. This extra manual step increases the friction in adopting the solution at scale.
Data Health Monitor (DHM) architecture
Figure 1. Data Health Monitor system architecture
The high level system architecture of DHM is shown in Figure 1 and is divided into three phases:
Observation: DHM leverages low-latency Hive metadata and HDFS audit logs as the sources of truth. Data health vital signs are collected automatically at scale, such as dataset or partition arrival time, freshness timestamps, and schema. DHM continuously gathers new audit logs and derives new vital signs, which will be persisted on HDFS. With this approach, all Hive/HDFS datasets will be monitored by default without additional onboarding steps.
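The core of the Observation phase is folding a stream of low-level write events into per-partition vital signs. A minimal sketch of that derivation, with an assumed `(dataset, partition, event_time)` event shape rather than the real audit-log format:

```python
from datetime import datetime


def derive_vital_signs(audit_events):
    """Fold a stream of (dataset, partition, event_time) write events,
    such as might be derived from Hive metastore events or HDFS audit
    logs, into per-partition arrival times (one kind of vital sign).
    The event shape is an illustrative assumption."""
    vital = {}
    for dataset, partition, ts in audit_events:
        key = (dataset, partition)
        # Keep the latest write as the partition's arrival time.
        if key not in vital or ts > vital[key]:
            vital[key] = ts
    return vital
```

Because the input is metadata the platform already produces, every dataset is covered by default, with no per-dataset onboarding.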
Understanding: After automatically collecting data health vital signs in the repository for several days, DHM can safely infer several key dataset properties with high accuracy, including the arrival frequency and average arrival time. Most datasets are ingested or created on HDFS on a regular basis, such as daily, several times a day, or hourly; other cadences include monthly, weekly, static, or unknown. For example, DHM can infer that a new dataset partition has been created daily at 8 a.m. (on average). The inferred arrival time and frequency are then used as a default assertion on the dataset's expected arrival. Dataset consumers can easily revise the expected arrival time according to their needs via a self-service UI. Using the previous example, a consumer may want to be alerted if the partition did not arrive before 11 a.m., and users can even change the arrival frequency. For example, users may want to receive alerts once a day for hourly datasets. Based on our experience, some data pipelines may be less sensitive to the freshness of the datasets. For example, a dataset may arrive once a day, but few of the changes would actually affect the output dataset's quality. Instead of potentially being spammed with email alerts on a daily basis, users may revise the assertion to receive an alert at a different cadence, e.g., an alert if there wasn't a new arrival in the past 10 days. This capability requires assertion evaluation just before the alert and cannot be scheduled based on the dataset arrival pattern. The architecture offers a lot of flexibility to reduce email alert fatigue.
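Inferring the arrival pattern from a few days of observed timestamps can be sketched as follows; the frequency buckets and thresholds are illustrative assumptions, not DHM's actual inference logic:

```python
from datetime import datetime
from statistics import mean, median


def infer_arrival_pattern(arrivals):
    """Infer (frequency, average arrival hour) from observed
    partition-arrival timestamps sorted ascending (at least two).
    The median gap is robust to an occasional late arrival; the
    bucket boundaries are illustrative."""
    gaps_hours = [(b - a).total_seconds() / 3600
                  for a, b in zip(arrivals, arrivals[1:])]
    gap = median(gaps_hours)
    if gap <= 2:
        freq = "hourly"
    elif gap <= 12:
        freq = "several_times_a_day"
    elif gap <= 36:
        freq = "daily"
    elif gap <= 9 * 24:
        freq = "weekly"
    else:
        freq = "monthly_or_static"
    avg_hour = mean(t.hour + t.minute / 60 for t in arrivals)
    return freq, round(avg_hour, 2)
```

The inferred pair then seeds the default assertion (e.g., "daily, by 8 a.m."), which consumers can loosen or tighten in the self-service UI.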
Reasoning: DHM monitors all datasets for several key data health events (like arrival and schema changes) and reasons about the health status based on the inferred key properties, such as arrival frequency and time. The reasoning results form the basis for email alerting. However, before sending out an email alert, DHM determines whether the identified health issue is a duplicate; if so, the email alert is not sent, to avoid spamming. Users who want to receive email alerts may create a subscription to the dataset via the DHM self-service UI. As stated earlier, users may revise the inferred properties (such as changing the expected arrival time) according to their expectations. The email alert contains relevant information, including the reasoning basis and related context, for users to take action.
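Duplicate suppression of the kind described above can be sketched with a simple cooldown per (dataset, issue) pair; the class and the 24-hour window are illustrative assumptions:

```python
from datetime import datetime, timedelta


class AlertDeduper:
    """Suppress duplicate alerts for the same (dataset, issue) within
    a cooldown window. The window length is an illustrative default,
    not a DHM setting."""

    def __init__(self, cooldown=timedelta(hours=24)):
        self.cooldown = cooldown
        self._last_sent = {}

    def should_send(self, dataset: str, issue: str, now: datetime) -> bool:
        key = (dataset, issue)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # duplicate within the window: suppress
        self._last_sent[key] = now
        return True
```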
These three phases compose the data health monitoring service, which automatically discovers all of the datasets in the ecosystem and collects their data quality metrics. The collected data quality metrics in the data warehouse provide a source of truth for data quality information that can be leveraged by users, such as for creating a data health dashboard or performing analytics on data health, like trend analysis.
The Data Health Monitor project became generally available (GA) in August 2021 for the LinkedIn community and has been widely adopted by many organizations, such as the marketing and AI/ML teams. As of now, all HDFS/Hive datasets are being monitored by DHM across several Hadoop clusters. DHM collects 1B data health vital signs for about 150k critical datasets on a daily basis, demonstrating DHM's scalability and performance.
Dataset consumers can create a subscription on any dataset and will receive email alerts according to the subscription settings. There are several important metrics that we use to measure the success of DHM:
Subscriptions (with an email alert) reached more than 2,000 in several months after becoming generally available
On average, DHM sent out roughly 1,500 alerts weekly
Over 98% of the alerts are true positives according to the subscriptions, validated against real-time dataset status
The DHM alert SLA (the time between detecting a health issue according to the subscription and sending out an email) has been reduced to roughly 30 minutes after GA optimization
Conclusion and future work
DHM is a scalable data health monitoring service that has been deployed widely across the LinkedIn community and can handle large-scale dataset subscriptions with low latency and high-precision alerts. DHM is a self-service solution that requires no manual onboarding steps and offers a simple UI for alert configuration.
DHM is still considered a "monitoring-after-the-fact" or diagnosis solution, meaning that a data health issue has already occurred and some damage might have been done, such as computing a new dataset from partial or stale data, flow failure, etc. With our ability to shorten the latency of detection and accurate diagnosis, we plan to incorporate several preventive care capabilities, such as flow failure prediction due to data volume changes or severe data skew. We also intend to provide a complete data quality solution that addresses both the metadata and semantic categories.
We would like to thank all current and former DHM team members (especially Ting-Kuan Wu, Silvester Yao, and Yen-Ting Liu) for their design/implementation and contributions to the project. Furthermore, we thank our early partners who provided us with valuable and timely feedback.