Data Management

Shifting left on governance: DataHub and schema annotations

Co-authors: Joshua Shinavier and Shirshanka Das

Data governance is easy… as long as the data to be governed is small and simple. A handful of developers creating a startup company can get away with relatively lightweight solutions for managing their data, but things change as scale and complexity increase. Like a hermit crab outgrowing its shell, we constantly have to re-evaluate the tools and platforms we use for data management, as what worked well yesterday may or may not be adequate today. LinkedIn, for example, has literally millions of datasets in its DataHub data catalog, which is over two orders of magnitude more than there are engineers to manage them. These datasets span dozens of platforms including Hadoop, Kafka, MySQL, and in-house systems, and are embedded in a complex ecosystem of data centers and data fabrics, each with its own requirements and constraints. LinkedIn is serious about putting members first. And because LinkedIn is an international company, the data we manage on behalf of our members is subject to a variety of data privacy regulations such as the GDPR and CCPA, in addition to our own Privacy Policy. Compliance, security, and anti-abuse concerns all require a high degree of insight into not only what data we have and where it lives, but also what the data means and what responsibilities it entails for the company.

All of this calls for data governance solutions which rely heavily on automation, while making the most of the domain knowledge of our engineers. A complete, correct, and consistent inventory of data assets is an ideal starting point for any such solution, though this ideal can be quite difficult to achieve in practice, and requires constant diligence to maintain. At LinkedIn, two of the major challenges we have faced involve annotations, which deal with the meaning of data and its relationship with data policies, and ownership, i.e., confirming that we have a domain expert and steward for each of those millions of datasets. In this blog series, we will take a closer look at the challenges and the solutions in progress. Let’s start by talking about schemas.

Schemas and semantics

A schema, at a minimum, captures the structure of a dataset at a sufficient level of detail for indexing and querying. For example, it may specify a certain set of columns or fields, each with a name, a basic data type like integer or string, and possibly other, format- or platform-specific metadata which helps us store and access the data efficiently. A good schema also contains human-readable documentation as a guide to developers working with the dataset. However, a typical enterprise schema does not capture the semantics of the dataset in a machine-readable way; it does not tell us what the data represents, including the sorts of things in the real world which the data makes reference to – whether people, events, infrastructure, etc. – or the sort of statements the dataset makes about these things. In other words, a typical schema is simply not cut out for knowledge representation, which ultimately is what we need in order to decide which policies apply to the data and what we need to do in order to comply with them – to say nothing of deriving richer insights from the actual data.

[Figure: schema of a person]

For this, we need a common controlled vocabulary, or ontology, for describing the semantics of data in each domain of interest, and a way of annotating domain-specific schemas with terms from that ontology, which in turn are associated with data policies. In addition to classifying data assets along the dimensions of one particular data policy (such as LinkedIn’s GDPR compliance policies), we need to know the types of entities and relationships which are present in the data, and in turn connect those types to all of the policies which might apply.

To illustrate, the following is a very simple schema for a person table with four columns. The columns have simple types (here, int and string), but they correspond to business concepts such as name and email which are common to many tables.

[Figure: schema for a person table with four columns]
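For concreteness, here is one way such a schema might be written down, expressed as an Avro record (one of the schema languages discussed later in this post). The field names, types, and documentation strings are illustrative rather than copied from the figure above:

    {
      "type": "record",
      "name": "Person",
      "namespace": "com.example",
      "doc": "One row per person known to the system.",
      "fields": [
        { "name": "id",    "type": "int",    "doc": "Numeric identifier for the person" },
        { "name": "name",  "type": "string", "doc": "Full name" },
        { "name": "email", "type": "string", "doc": "Contact email address" },
        { "name": "age",   "type": "int",    "doc": "Age in years" }
      ]
    }

The schema tells us how to store and query the data, but nothing in it says that "email" is personal contact information, or which policies govern it; that is the gap the controlled vocabulary is meant to fill.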

A typical solution

A common solution for attaching terms from a controlled vocabulary to datasets and their columns is to use a data catalog like DataHub. The data catalog is responsible for extracting technical metadata from the source systems and representing the underlying assets as datasets (or tables), and humans (data stewards, governance teams) are then responsible for attaching business metadata to these entities. Here is what the information flow looks like, in the form of a diagram:

[Figure: the common solution for attaching terms from a controlled vocabulary via the data catalog]
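To make the steward's side of this flow concrete, here is a minimal sketch of attaching business metadata programmatically rather than through a UI, using DataHub's Python emitter. The platform, dataset name, glossary term, and server address are placeholders, and emitting the aspect this way replaces whatever terms were already attached; treat it as an illustration of the post-hoc annotation step, not a recommended workflow.

    # Minimal sketch: a data steward (or a script acting on their behalf) attaches
    # a glossary term to an already-ingested dataset via DataHub's Python emitter.
    # The platform, dataset name, term, and server URL below are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        GlossaryTermAssociationClass,
        GlossaryTermsClass,
    )

    dataset_urn = make_dataset_urn(platform="hive", name="example_db.person", env="PROD")

    terms_aspect = GlossaryTermsClass(
        terms=[GlossaryTermAssociationClass(urn=make_term_urn("Classification.Confidential"))],
        auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:data_steward"),
    )

    # Note: emitting the glossaryTerms aspect overwrites any terms already on the dataset.
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=terms_aspect))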

Challenges

The solution described above works well for creating a single unified schema that merges technical elements with business elements, so that data can be discovered and governed in a holistic manner. However, there is room for improvement. The attachment of business metadata is after the fact; it happens only once the dataset has been discovered and indexed in the catalog. Data and metadata are constantly changing in a modern data enterprise. In the time it takes a data steward to check, validate, and annotate a dataset, many more datasets may have been introduced into the warehouse, and the existing dataset may itself have evolved by adding new columns or, heaven forbid, dropping existing columns or repurposing an old column to carry new kinds of information. This constant change means that business annotations are often stale and inaccurate, so making important decisions based on them is risky: it may result in reporting incorrect business metrics to customers or improperly sharing sensitive data with third parties. This erodes trust in the data catalog, and is a common reason organizations abandon data catalogs despite investing in them.

Shifting left

Instead of thinking about governance and annotation as activities that happen post hoc, we suggest embedding annotations directly into the schemas of our datasets as they are being created and updated. Software engineers have known for decades that documentation should live right next to the code it describes, which led to solutions like Javadoc and Pydoc being embedded in the program source itself. It is not hard to apply similar ideas to the schema and data definition languages associated with SQL, Avro, Protobuf, Thrift, etc.
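As a sketch of what this could look like, here is the earlier Person record with an annotation embedded on each field. The Avro specification permits extra attributes beyond the standard ones, and Avro tooling generally preserves them as metadata on the field, so downstream systems can read them; the attribute names and values used here ("compliance", "glossaryTerm") are illustrative and are not LinkedIn's actual annotation format:

    {
      "type": "record",
      "name": "Person",
      "namespace": "com.example",
      "fields": [
        { "name": "id", "type": "int", "doc": "Numeric identifier for the person",
          "compliance": { "fieldType": "MEMBER_ID" } },
        { "name": "name", "type": "string", "doc": "Full name",
          "compliance": { "fieldType": "NAME" },
          "glossaryTerm": "Classification.Confidential" },
        { "name": "email", "type": "string", "doc": "Contact email address",
          "compliance": { "fieldType": "EMAIL" },
          "glossaryTerm": "Classification.Confidential" },
        { "name": "age", "type": "int", "doc": "Age in years",
          "compliance": { "fieldType": "AGE" } }
      ]
    }

Because the annotations live in the same file as the field definitions, any change to the schema goes through the same code review and build process as the annotations themselves.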

At LinkedIn, we primarily use PDL for service APIs and Avro for data streaming and storage APIs. In some cases, we support lossless transformation from PDL-structured data into Avro for storage and streaming. Over the past few years, we have built annotation capabilities directly into these schema languages, so that developers have all the tools they need to attach metadata to their data definitions at the point of origin. This ensures that as data is first defined, and then as it evolves through feature additions and bug fixes, the metadata annotations are kept up to date. It also allows us to push policies as far left as possible. For example, for our event tracking schemas we require that every field have business metadata attached to it, and we fail the build if it is not provided; this is illustrated in the following diagram.

[Figure: diagram illustrating event tracking schemas with required business metadata]
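A build-time check of this kind does not need much machinery. The sketch below is our illustration rather than LinkedIn's actual tooling: it walks an Avro schema file and exits with a non-zero status if any field is missing the hypothetical "compliance" annotation from the previous example, which is enough to fail a typical CI build.

    # Hypothetical CI check: fail the build if any field of an Avro schema lacks
    # a "compliance" annotation. The attribute name and CLI shape are illustrative.
    import json
    import sys


    def unannotated_fields(schema, path=""):
        """Return the paths of record fields that have no 'compliance' attribute."""
        missing = []
        if isinstance(schema, dict) and schema.get("type") == "record":
            for field in schema.get("fields", []):
                field_path = f"{path}.{field['name']}" if path else field["name"]
                if "compliance" not in field:
                    missing.append(field_path)
                # Recurse into nested records declared inline.
                if isinstance(field.get("type"), dict):
                    missing.extend(unannotated_fields(field["type"], field_path))
        return missing


    if __name__ == "__main__":
        with open(sys.argv[1]) as f:
            avro_schema = json.load(f)
        problems = unannotated_fields(avro_schema)
        if problems:
            print("Fields missing a compliance annotation: " + ", ".join(problems))
            sys.exit(1)  # non-zero exit fails the build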

Rethinking the role of the data catalog

So what happens to the data catalog? Is it superfluous, then? Quite the opposite. As the schemas are checked in and deployed to production, the CI/CD machinery produces metadata change events which are consumed by the data catalog in a streaming fashion, producing an accurate and up-to-date reflection of the unified schema for every dataset owned by the enterprise. We designed DataHub to be a stream-based metadata platform for exactly this reason: so that we could connect it to a wide variety of metadata producers and consumers.

Here is what the end-to-end metadata pipeline looks like:

[Figure: the end-to-end metadata pipeline]
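Putting the pieces together, one shape such a pipeline step could take is a CI job that, once the validation check passes, parses the checked-in schema and publishes the embedded annotations to the catalog as glossary terms. As with the earlier sketches, the file path, dataset coordinates, annotation key, and server address are assumptions made for illustration:

    # Hypothetical CI step: after the schema check passes, read the embedded
    # annotations from the Avro file and publish them to DataHub as glossary terms.
    # Paths, dataset coordinates, and the "glossaryTerm" key are illustrative.
    import json

    from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        GlossaryTermAssociationClass,
        GlossaryTermsClass,
    )

    with open("schemas/Person.avsc") as f:
        schema = json.load(f)

    # Collect the glossary terms named by the field-level annotations.
    term_names = sorted({f["glossaryTerm"] for f in schema["fields"] if "glossaryTerm" in f})

    event = MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="kafka", name="PersonEvent", env="PROD"),
        aspect=GlossaryTermsClass(
            terms=[GlossaryTermAssociationClass(urn=make_term_urn(t)) for t in term_names],
            auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:ci_pipeline"),
        ),
    )
    DatahubRestEmitter(gms_server="http://datahub-gms:8080").emit(event)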

Notice how the data steward has shifted left and merged into the data owner or producer! Our experience has been that data quality and accountability are best served by putting data ownership and stewardship in the hands of the team that actually produces the data. This approach has allowed us to scale our governance efforts by federating them across all of our engineering teams.

Interested? Join us at Metadata Day!

We are excited to see strategies like the above take off in the DataHub community. We have previously heard from companies like Saxo Bank and Zendesk that they are managing business metadata embedded alongside their Protobuf schemas. For people familiar with the highly popular dbt project, DataHub also supports embedding metadata alongside dbt model definitions using the meta keyword. We are sure that there are many great ideas like this which have yet to be implemented.
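For reference, this is roughly how such metadata looks in a dbt schema file. The key names under meta, and how a DataHub ingestion recipe maps them to tags, terms, or owners via its meta mapping configuration, are choices an individual team would make rather than anything prescribed by dbt or DataHub:

    # models/schema.yml (dbt) -- illustrative key names; a DataHub ingestion recipe
    # can be configured to map these meta entries to tags, terms, or owners.
    version: 2

    models:
      - name: person
        description: "One row per registered person"
        meta:
          owner: "identity-team"
          classification: "Confidential"
        columns:
          - name: email
            description: "Contact email address"
            meta:
              contains_pii: true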

As organizations move to the cloud and apply code-first principles to their infrastructure, we are seeing increasing interest in applying the same principles to the age-old problems of data governance. To facilitate a broader conversation about this topic, LinkedIn is organizing a two-day conference, Metadata Day 2022, together with Acryl Data and the DataHub community on May 17th and 18th. Join us for a packed agenda including a hackathon, our yearly panel of experts drawn from industry and academia, and invited talks by data leaders from small and large organizations. Click here to reserve your spot!