DataHub: A generalized metadata search & discovery tool
August 14, 2019
Co-authors: Mars Lan, Seyi Adebajo, Shirshanka Das
Editor’s note: Since publishing this blog post, the team open sourced DataHub in February 2020. You can read more on the journey of open sourcing the platform here.
As the operator of the world’s largest professional network and the Economic Graph, LinkedIn’s Data team is constantly working on scaling its infrastructure to meet the demands of our ever-growing big data ecosystem. As the data grows in volume and richness, it becomes increasingly challenging for data scientists and engineers to discover the data assets available, understand their provenances, and take appropriate actions based on the insights. To help us continue scaling productivity and innovation in data alongside this growth, we created a generalized metadata search and discovery tool, DataHub.
To increase the productivity of LinkedIn’s data team, we had previously developed and open sourced WhereHows, a central metadata repository and portal for datasets. The type of metadata stored includes both technical metadata (e.g., location, schema, partitions, ownership) and process metadata (e.g., lineage, job execution, lifecycle information). WhereHows also featured a search engine to help locate the datasets of interest.
Since our initial release of WhereHows in 2016, there has been growing interest in the industry in improving the productivity of data scientists by using metadata. For example, tools developed in this space include Airbnb’s Dataportal, Uber’s Databook, Netflix’s Metacat, Lyft’s Amundsen, and most recently Google’s Data Catalog. At LinkedIn, we have also been busy expanding our scope of metadata collection to power new use cases while preserving fairness, privacy, and transparency. However, we came to realize WhereHows had fundamental limitations that prevented it from meeting our evolving metadata needs. Here is a summary of the lessons we learned from scaling WhereHows:
- Push is better than pull: While pulling metadata directly from the source seems like the most straightforward way to gather metadata, developing and maintaining a centralized fleet of domain-specific crawlers quickly becomes a nightmare. It is more scalable to have individual metadata providers push the information to the central repository via APIs or messages. This push-based approach also ensures a more timely reflection of new and updated metadata.
- General is better than specific: WhereHows is strongly opinionated about what the metadata for a dataset or a job should look like. This results in an opinionated API, data model, and storage format. A small change to the metadata model leads to a cascade of changes required up and down the stack. It would have been more scalable had we designed a general architecture that is agnostic to the metadata model it stores and serves. This in turn would have allowed us to focus on onboarding and evolving strongly opinionated metadata models without worrying about the lower layers of the stack.
- Online is as important as offline: Once the metadata has been collected, it’s natural to want to analyze that metadata to derive value. One simple solution is to dump all the metadata to an offline system, like Hadoop, where arbitrary analyses can be performed. However, we soon discovered that supporting offline analyses alone wasn’t enough. There are many use cases, such as access control and data privacy handling, that must query against the latest metadata online.
- Relationships really matter: Metadata often conveys important relationships (e.g., lineage, ownership, and dependencies) that enable powerful capabilities like impact analysis, data rollup, better search relevance, etc. It is critical to model all these relationships as first-class citizens and support efficient analytical queries over them.
- Multi-center universe: We realized that it is not enough to simply model metadata centered around a single entity (a dataset). There is an entire ecosystem of data, code, and human entities (datasets, data scientists, teams, code, microservice APIs, metrics, AI features, AI models, dashboards, notebooks, etc.) that needs to be integrated and connected through a single metadata graph.
About a year ago, we went back to the drawing board and re-architected WhereHows from the ground up based on these learnings. At the same time, we realized the growing need within LinkedIn for a consistent search and discovery experience across various data entities, along with a metadata graph that connects them together. As a result, we decided to expand the scope of the project to build a fully generalized metadata search and discovery tool, DataHub, with an ambitious vision: connecting LinkedIn employees with data that matters to them.
We broke the monolithic WhereHows stack into two distinct stacks: a Modular UI frontend and a Generalized Metadata Architecture backend. The new architecture enabled us to rapidly expand our scope of metadata collection beyond just datasets and jobs. At the time of writing, DataHub already stores and indexes tens of millions of metadata records that encompass 19 different entities, including datasets, metrics, jobs, charts, AI features, people, and groups. We also plan to onboard metadata for machine learning models and labels, experiments, dashboards, microservice APIs, and code in the near future.
The DataHub web app is how most users interact with the metadata. The app is written using Ember Framework and runs atop a Play middle tier. To make the development scalable, we leverage various modern web technologies, including ES9, ES.Next, TypeScript, Yarn with Yarn Workspaces, and code quality tools like Prettier and ESLint. The presentation, control, and data layers are modularized into packages so that specific views in the app are built from a composition of relevant packages.
Component service framework
In applying a modular UI infrastructure, we’ve built the DataHub web app as a series of cohesive, feature-aligned components that are grouped into installable packages. This package architecture employs Yarn Workspaces and Ember add-ons at the foundation, and is componentized using Ember’s components and services. You can think of this as a UI that’s built using small building blocks (i.e., components and services) to create larger building blocks (i.e., Ember add-ons and npm / Yarn packages) that, when put together, eventually constitute the DataHub web app.
With components and services at the core of the app, this framework allows us to pull apart different aspects and put together other features in the application. Additionally, segmentation at each layer provides a highly customizable architecture that allows consumers to scale or streamline their applications, using only the features and onboarding only the new metadata models that are relevant to their domain.
Interacting with DataHub
At the highest level, the frontend provides three types of interactions: (1) search, (2) browse, and (3) view/edit metadata. Here are some example screenshots from the actual app:
DataHub app screenshots
Similar to a typical search engine experience, a user can search for one or multiple types of entities by providing a list of keywords. They can further slice and dice the results by filtering through a list of facets. Advanced users can also utilize operators, such as OR, NOT, and regex, to perform complex searches.
The data entities in DataHub can be organized and browsed in a tree-like fashion, where each entity is allowed to appear at multiple places in the tree. This gives users the ability to browse the same catalog in different ways, e.g., by physical deployment configuration or business functional organization. There can even be a dedicated part of the tree showing only “certified entities,” which are curated through a separate governance process.
The final interaction, view/edit metadata, is also the most complicated one. Each data entity has a “profile page” that shows all the associated metadata. For example, a dataset profile page may contain its schema, ownership, compliance, health, and lineage metadata. It can also show how the entity is related to others, e.g., a job that produced the dataset, metrics or charts that are computed from this dataset, etc. For metadata that is editable, users can also update it directly through the UI.
Generalized metadata architecture
In order to fully realize the vision of DataHub, we needed an architecture capable of scaling with the metadata. The scalability challenges come in four different forms:
- Modeling: Model all types of metadata and relationships in a developer-friendly fashion.
- Ingestion: Ingest large volumes of metadata changes at scale, both through APIs and streams.
- Serving: Serve the collected raw and derived metadata, as well as a variety of complex queries against the metadata at scale.
- Indexing: Index the metadata at scale, as well as automatically update the indexes when the metadata changes.
Metadata modeling
Simply put, metadata is “data that provides information about other data.” This brings two distinct requirements when it comes to metadata modeling:
- Metadata is also data: To model metadata, we need a language that is at least as feature-rich as the ones used for general purpose data modeling.
- Metadata is distributed: It is unrealistic to expect that all metadata comes from a single source. For example, the system that governs the Access Control List (ACL) of a dataset is very likely to be different from the one that stores the schema metadata. A good modeling framework should allow multiple teams to evolve their metadata models independently, while presenting a unified view of all metadata associated with a data entity.
Instead of inventing a new way to model metadata, we chose to leverage Pegasus, an open-source and well-established data schema language created by LinkedIn. Pegasus is designed for general purpose data modeling and thus works well for most metadata. However, since Pegasus doesn’t provide an explicit way to model relationships or associations, we introduced some custom extensions to support these use cases.
To demonstrate how to use Pegasus to model metadata, let’s look at a simple example illustrated by the following modified Entity-Relationship Diagram (ERD).
The example contains three types of entities—User, Group, and Dataset—represented by the blue circles in the diagram. We use arrows to denote the three types of relationships between these entities, namely OwnedBy, HasMember, and HasAdmin. In other words, a Group is made up of one admin and multiple members of User, who can in turn own one or multiple Datasets.
Unlike a traditional ERD, we place the attributes of an entity and relationship directly inside the circle and below the relationship name, respectively. This allows us to attach a new type of component, known as “metadata aspects,” to the entities. Different teams can own and evolve different aspects of metadata for the same entity without interfering with one another, thus fulfilling the distributed metadata modeling requirement. Three types of metadata aspects (Ownership, Profile, and Membership) are included in the above example as green rectangles. The association of a metadata aspect with an entity is denoted using a dotted line. For instance, a Profile can be associated with a User, and an Ownership can be associated with a Dataset.
You may have noticed that the entity and relationship attributes overlap with the metadata aspects, e.g., the firstName attribute of a User should be the same as the firstName field of the associated Profile. The reason for this repeated information will be explained later in this post, but for now it’s sufficient to treat attributes as the “interesting part” of metadata aspects.
To model the example in Pegasus, we’ll translate each of the entities, relationships, and metadata aspects into an individual Pegasus Schema (PDSC) file. For brevity, we’ll only include one model from each category here. First, let’s take a look at a PDSC for the User entity:
Each entity is required to have a globally unique ID in the form of a URN, which can be treated as a typed GUID. The User entity has attributes including first name, last name, and LDAP, each mapping to an optional field in the User record.
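As a rough illustration of the shape described above, the sketch below writes a PDSC-style record for the User entity as a Python dict that mirrors the JSON layout of a Pegasus schema file. The namespace, the URN type name, and the `required_fields` helper are assumptions made for this example, not the actual LinkedIn model.

```python
import json

# Hypothetical PDSC-style record for the User entity. The namespace and the
# URN type name are illustrative assumptions, not the real model.
USER_ENTITY_SCHEMA = {
    "type": "record",
    "name": "User",
    "namespace": "com.example.metadata.entity",
    "doc": "A user entity identified by a URN.",
    "fields": [
        # The URN is the only required field: it is the globally unique ID.
        {"name": "urn", "type": "com.example.common.UserUrn"},
        # Attributes map to optional fields.
        {"name": "firstName", "type": "string", "optional": True},
        {"name": "lastName", "type": "string", "optional": True},
        {"name": "ldap", "type": "string", "optional": True},
    ],
}


def required_fields(schema):
    """Return the names of the fields that are not marked optional."""
    return [f["name"] for f in schema["fields"] if not f.get("optional", False)]


# Serializing the dict yields the JSON form a schema file would contain.
USER_ENTITY_JSON = json.dumps(USER_ENTITY_SCHEMA, indent=2)
```

Here `required_fields(USER_ENTITY_SCHEMA)` leaves only the URN, which matches the description of attributes as optional fields.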
Next up is the PDSC model for the OwnedBy relationship:
Each relationship model naturally contains the “source” and “destination” fields that point to the specific entity instances using their URNs. The model can optionally contain other attribute fields, such as “type” in this case. Here, we also introduce a custom property called “pairings” to restrict the relationship to specific pairs of source and destination URN types. In this case, the OwnedBy relationship can only be used to connect a Dataset to a User.
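To make the role of “pairings” concrete, here is a minimal sketch of a relationship model and a check that enforces it. The dict layout, the `urn:li:<type>:<id>` URN format, and the helper names are assumptions for illustration only.

```python
# Hypothetical PDSC-style record for the OwnedBy relationship, as a Python
# dict. The exact layout of "pairings" is an assumption for this sketch.
OWNED_BY_SCHEMA = {
    "type": "record",
    "name": "OwnedBy",
    "pairings": [{"source": "DatasetUrn", "destination": "UserUrn"}],
    "fields": [
        {"name": "source", "type": "Urn"},
        {"name": "destination", "type": "Urn"},
        {"name": "type", "type": "string", "optional": True},
    ],
}

# Assumed URN layout: "urn:li:<entityType>:<id>".
URN_TYPE_NAMES = {"dataset": "DatasetUrn", "user": "UserUrn", "group": "GroupUrn"}


def urn_type(urn):
    """Map a URN string to its URN type name, or raise ValueError."""
    parts = urn.split(":", 3)
    if len(parts) != 4 or parts[0] != "urn" or parts[2] not in URN_TYPE_NAMES:
        raise ValueError(f"unrecognized URN: {urn!r}")
    return URN_TYPE_NAMES[parts[2]]


def is_valid_pairing(schema, source_urn, destination_urn):
    """Check that the (source, destination) URN types are an allowed pair."""
    pair = {"source": urn_type(source_urn), "destination": urn_type(destination_urn)}
    return pair in schema["pairings"]
```

With this check, a Dataset-to-User edge passes, while a User-to-Dataset edge is rejected because that pair is not listed.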
Finally, you’ll find the model for the Ownership metadata aspect below. Here we chose to model the ownership as an array of records containing type and ldap fields. However, there are virtually no limitations when it comes to modeling a metadata aspect, as long as it’s a valid PDSC record. This makes it possible to satisfy the “metadata is also data” requirement stated previously.
After all the models have been created, the logical next question is how to connect them together to form the proposed ERD. We’ll defer that discussion to the Metadata Indexing section later in this post.
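The following sketch shows one way an Ownership aspect with an array of owner records could look, again as a Python dict mirroring a PDSC record. The field name `owners`, the nested record name `Owner`, and the validation helper are assumptions for this example.

```python
# Hypothetical PDSC-style record for the Ownership aspect: an array of
# records, each with "type" and "ldap" fields. Names are illustrative.
OWNERSHIP_ASPECT_SCHEMA = {
    "type": "record",
    "name": "Ownership",
    "fields": [
        {
            "name": "owners",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "Owner",
                    "fields": [
                        {"name": "type", "type": "string"},
                        {"name": "ldap", "type": "string"},
                    ],
                },
            },
        }
    ],
}


def is_valid_ownership(value):
    """Check that a value has the shape the schema above describes."""
    owners = value.get("owners")
    if not isinstance(owners, list):
        return False
    return all(
        isinstance(owner, dict)
        and set(owner) == {"type", "ldap"}
        and all(isinstance(v, str) for v in owner.values())
        for owner in owners
    )


EXAMPLE_OWNERSHIP = {
    "owners": [
        {"type": "DATA_OWNER", "ldap": "jdoe"},
        {"type": "PRODUCER", "ldap": "asmith"},
    ]
}
```

The owner type values shown are placeholders; the source does not list the actual set of ownership types.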
Metadata ingestion
DataHub provides two forms of metadata ingestion: either through direct API calls or a Kafka stream. The former is for metadata changes that require read-after-write consistency, whereas the latter is more suited for fact-oriented updates.
DataHub’s API is based on Rest.li, a scalable, strongly typed RESTful service architecture used extensively within LinkedIn. As Rest.li uses Pegasus as its interface definition, all the metadata models defined in the previous section can be used verbatim. Gone are the days where multiple levels of model conversions were needed from the API down to the storage—the API and models will always stay in sync.
For Kafka-based ingestion, metadata producers are expected to emit a standardized Metadata Change Event (MCE), which contains a list of proposed changes to specific metadata aspects keyed by the corresponding entity URN. While the schema for MCE is in Apache Avro, it is generated automatically from the Pegasus metadata models.
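A minimal sketch of what an MCE might carry, based only on the description above: an entity URN plus a list of proposed changes to specific aspects. The class names, field names, and the `fold_changes` helper are hypothetical; the real event schema is generated from the Pegasus models.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AspectChange:
    """A proposed new value for one metadata aspect (hypothetical shape)."""
    aspect_name: str
    value: Dict[str, Any]


@dataclass
class MetadataChangeEvent:
    """Proposed changes to aspects of one entity, keyed by its URN."""
    urn: str
    proposed_changes: List[AspectChange] = field(default_factory=list)


def fold_changes(events):
    """Apply events in order and return {urn: {aspect_name: latest value}}."""
    state: Dict[str, Dict[str, Dict[str, Any]]] = {}
    for event in events:
        aspects = state.setdefault(event.urn, {})
        for change in event.proposed_changes:
            aspects[change.aspect_name] = change.value
    return state
```

A producer would build one `MetadataChangeEvent` per entity it wants to update; a consumer can fold a stream of them into the latest value per aspect.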
Using the same metadata model for both the API and Kafka event schemas allows us to evolve the models easily without painstakingly maintaining the corresponding conversion logic. However, to achieve truly seamless schema evolution, we need to ensure that all schema changes are backward compatible. This is enforced at build time with added compatibility checking.
At LinkedIn, we tend to rely more heavily on the Kafka stream due to the loose coupling it affords between producers and consumers. On a daily basis, we’re receiving millions of MCEs from various producers, and the volume is expected to grow exponentially as we expand the scope of our metadata collection. To build the streaming metadata ingestion pipeline, we leveraged Apache Samza as our stream processing framework. The ingestion Samza job is purposely designed to be fast and simple to achieve high throughput. It simply converts the Avro data back to Pegasus and invokes the corresponding Rest.li API to complete the ingestion.
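To illustrate the idea of a build-time compatibility check, the sketch below compares two versions of a record schema under a deliberately simplified rule set. These rules and the function name are assumptions for this example; they are not the rules of the actual Pegasus or Avro compatibility checkers.

```python
def find_breaking_changes(old_schema, new_schema):
    """Return a list of backward-incompatible changes (simplified rules)."""
    old = {f["name"]: f for f in old_schema["fields"]}
    new = {f["name"]: f for f in new_schema["fields"]}
    problems = []
    for name, old_field in old.items():
        if name not in new:
            problems.append(f"field removed: {name}")
            continue
        new_field = new[name]
        if new_field["type"] != old_field["type"]:
            problems.append(f"type changed: {name}")
        if old_field.get("optional", False) and not new_field.get("optional", False):
            problems.append(f"optional field made required: {name}")
    for name, new_field in new.items():
        is_new_required = not new_field.get("optional", False) and "default" not in new_field
        if name not in old and is_new_required:
            problems.append(f"required field added without default: {name}")
    return problems


OLD_USER = {"fields": [
    {"name": "urn", "type": "string"},
    {"name": "firstName", "type": "string", "optional": True},
]}
# Adding an optional field is treated as compatible under these rules.
NEW_USER = {"fields": OLD_USER["fields"] + [
    {"name": "lastName", "type": "string", "optional": True},
]}
```

A build step could fail whenever the returned list is non-empty, which keeps incompatible model changes from reaching producers or consumers.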
Metadata serving
Once the metadata has been ingested and stored, it is important to serve the raw and derived metadata efficiently. DataHub is designed to support four types of commonly seen queries over large amounts of metadata:
- Document-oriented queries
- Graph-oriented queries
- Complex queries that involve joins
- Full-text search
To achieve this, DataHub needs to make use of multiple kinds of data systems, each specialized to scale and serve limited types of queries. For example, Espresso is LinkedIn’s NoSQL database that is particularly suited for document-oriented CRUD at scale. Similarly, Galene can index and serve web-scale full-text searches with ease. When it comes to non-trivial graph queries, it is not surprising that a specialized graph DB can perform orders of magnitude better than RDBMS-based implementations. However, it turns out that the graph structure is also a natural way to represent foreign key relationships, allowing complex join queries to be answered efficiently.
DataHub further abstracts the underlying data systems through a set of generic Data Access Objects (DAO), such as key-value DAO, query DAO, and search DAO. Data system-specific implementation of DAOs can then be swapped in and out easily without altering any business logic in DataHub. This will ultimately enable us to open source DataHub with reference implementations for popular open-source systems, while still taking full advantage of LinkedIn’s proprietary storage technologies.
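The DAO idea can be sketched as an abstract interface with swappable implementations. The interface below, its method names, and the in-memory implementation are assumptions for illustration; the source does not describe the actual DAO signatures.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple


class KeyValueDAO(ABC):
    """Storage-agnostic interface (hypothetical method names)."""

    @abstractmethod
    def put(self, urn: str, aspect: str, value: Dict[str, Any]) -> None:
        """Store the latest value of one aspect for one entity."""

    @abstractmethod
    def get(self, urn: str, aspect: str) -> Optional[Dict[str, Any]]:
        """Return the latest value of the aspect, or None if absent."""


class InMemoryKeyValueDAO(KeyValueDAO):
    """Stand-in implementation backed by a dict, for tests and demos."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def put(self, urn, aspect, value):
        self._rows[(urn, aspect)] = dict(value)

    def get(self, urn, aspect):
        row = self._rows.get((urn, aspect))
        return dict(row) if row is not None else None


def add_owner(dao: KeyValueDAO, dataset_urn: str, ldap: str, owner_type: str) -> None:
    """Business logic written only against the interface, not a backend."""
    ownership = dao.get(dataset_urn, "ownership") or {"owners": []}
    ownership["owners"] = ownership["owners"] + [{"type": owner_type, "ldap": ldap}]
    dao.put(dataset_urn, "ownership", ownership)
```

Because `add_owner` depends only on `KeyValueDAO`, a different backend can be substituted by passing another implementation, with no change to the business logic.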
Another key benefit of the DAO abstraction is standardized Change Data Capture (CDC). Regardless of the type of the underlying data storage system, any update operation through the key-value DAO will automatically emit a Metadata Audit Event (MAE). Each MAE contains the URN of the corresponding entity, as well as both the before and after images of a particular metadata aspect. This enables a lambda architecture where MAEs can be processed either in batches or as streams. Similar to MCE, the schema for MAE is also automatically generated from the metadata models.
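As a sketch of how an update could emit an MAE with before and after images, consider the small store below. The class names, the event fields, and the callback-based emitter are assumptions for this example; in the system described above the events would be published to a stream rather than passed to a Python callable.

```python
import copy
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class MetadataAuditEvent:
    """Before and after images of one aspect of one entity (hypothetical)."""
    urn: str
    aspect: str
    before: Optional[Dict[str, Any]]
    after: Dict[str, Any]


class AuditingKeyValueStore:
    """Key-value store that emits an audit event on every update."""

    def __init__(self, emit: Callable[[MetadataAuditEvent], None]) -> None:
        self._rows: Dict[Tuple[str, str], Dict[str, Any]] = {}
        self._emit = emit

    def put(self, urn: str, aspect: str, value: Dict[str, Any]) -> None:
        key = (urn, aspect)
        # Copy both images so later mutations cannot alter emitted events.
        before = copy.deepcopy(self._rows.get(key))
        self._rows[key] = copy.deepcopy(value)
        self._emit(MetadataAuditEvent(urn, aspect, before, copy.deepcopy(value)))
```

Passing `list.append` of a plain list as the emitter is enough to observe the events in a test; the first write for a key has `before` set to `None`.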
Metadata indexing
The last missing piece of the puzzle is the metadata indexing pipeline. This is the system that connects the metadata models together and creates corresponding indexes in the graph DB and search engine to facilitate efficient queries. This business logic is captured in the form of Index Builders and Graph Builders, which are executed as part of a Samza job that processes MAEs. Each builder registers its interest in specific metadata aspects with the job and is invoked with the corresponding MAE. The builder then returns a list of idempotent updates to be applied to the search index or graph DB.
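The registration and dispatch flow can be sketched as follows. Everything here is hypothetical: the job class, the dict-shaped MAE, the edge tuple, and the `urn:li:user:<ldap>` URN format are assumptions, and the graph is a plain Python set rather than a graph DB. The sketch only adds edges; handling removed owners would also need the before image.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source URN, relationship name, destination URN)
Builder = Callable[[Dict[str, Any]], List[Edge]]


class IndexingJob:
    """Dispatches each MAE to the builders registered for its aspect."""

    def __init__(self) -> None:
        self._builders: Dict[str, List[Builder]] = defaultdict(list)
        self.graph: Set[Edge] = set()

    def register(self, aspect: str, builder: Builder) -> None:
        self._builders[aspect].append(builder)

    def process(self, mae: Dict[str, Any]) -> None:
        for builder in self._builders.get(mae["aspect"], []):
            # Adding to a set is idempotent: replaying an MAE changes nothing.
            self.graph.update(builder(mae))


def ownership_graph_builder(mae: Dict[str, Any]) -> List[Edge]:
    """Turn the after image of an Ownership aspect into OwnedBy edges."""
    return [
        (mae["urn"], "OwnedBy", f"urn:li:user:{owner['ldap']}")
        for owner in mae["after"]["owners"]
    ]
```

Idempotent updates matter because a stream processor may deliver the same event more than once; applying the same edge upsert twice leaves the graph unchanged.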
The metadata indexing pipeline is also highly scalable, as it can be easily partitioned based on the entity URN of each MAE to support in-order processing for each entity.
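Partitioning by entity URN can be illustrated with a stable hash: every event for the same URN maps to the same partition, so a single consumer sees that entity's events in order. The function below is an assumed, simplified stand-in and is not the partitioner that Kafka or Samza actually uses.

```python
import hashlib


def partition_for(urn: str, num_partitions: int) -> int:
    """Map a URN to a partition using a hash that is stable across runs."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(urn.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

A cryptographic hash is used here only because Python's built-in `hash()` for strings is randomized per process, which would break the same-URN, same-partition guarantee across workers.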
Conclusion and looking forward
In this post, we introduced DataHub, the latest evolution in our metadata journey at LinkedIn. The project includes a Modular UI frontend and a Generalized Metadata Architecture backend.
DataHub has been running in production at LinkedIn for the past six months. It is visited every week by more than 1,500 employees, supporting search, discovery, and a variety of specific action workflows. LinkedIn’s metadata graph contains more than one million datasets, 23 data storage systems, 25k metrics, 500+ AI features, and most importantly all the LinkedIn employees who are the creators, consumers, and operators of this graph.
We’re continuing to improve DataHub by adding more interesting user stories and relevance algorithms to the product. We also plan to add native support for GraphQL and leverage Pegasus Domain Specific Language (PDL) to automate code generation in the near future. At the same time, we’re actively working on sharing this evolution of WhereHows with the open source community and will follow up with an announcement once DataHub is released publicly.
The effort of metadata collection naturally spans across multiple teams and organization boundaries. We’re thankful for many teams across Analytics Platform & Apps, Data Infrastructure, Artificial Intelligence, and Data Science that have collaborated closely with us to onboard their metadata to DataHub.
Special thanks to the stellar metadata engineering team for their tireless contributions to making DataHub a reality: Kapil Surlaker, Suja Viswesan, Tai Tran, and Pardhu Gunnam from the management team for their continued support and investment in this critical area; Chris White, Praveen Gujar, and Tushar Shanbhag for driving the product direction; An Ping and Selene Chew for designing a great UX for DataHub; and last but not least, Madhumita Mantri for shepherding the creation of this post.