Open sourcing DataHub: LinkedIn’s metadata search and discovery platform
February 18, 2020
Co-authors: Kerem Sahin, Mars Lan, and Shirshanka Das
Finding the right data quickly is critical for any company that relies on big data insights to make data-driven decisions. Not only does this impact the productivity of data users (including analysts, machine learning developers, data scientists, and data engineers), but it also has a direct impact on end products that rely on a quality machine learning (ML) pipeline. Additionally, the trend towards adopting or building ML platforms naturally raises the question: what is your method for internal discovery of ML features, models, metrics, datasets, etc.?
In this blog post, we will share the journey of open sourcing DataHub, our metadata search and discovery platform, starting with the project’s early days as WhereHows. LinkedIn maintains an in-house version of DataHub separate from the open source version. We will start by explaining why we need two separate development environments, followed by a discussion on the early approaches for open sourcing WhereHows, and a comparison of our internal (production) version of DataHub with the version on GitHub. We’ll also share details about our new automated solution for pushing and pulling open source updates to keep both repositories in sync. Finally, we’ll provide instructions on how to get started using the open source DataHub and briefly discuss its architecture.
WhereHows is now DataHub!
LinkedIn’s metadata team has previously introduced DataHub (successor of WhereHows), LinkedIn’s metadata search and discovery platform, and shared plans to open source it. Shortly after that announcement, we released an alpha version of DataHub and shared it with the community. Since then, we have continuously contributed to the repo and worked with interested users to add most requested features and resolve issues. Now, we are proud to announce the official release of DataHub on GitHub.
Open source approaches
WhereHows, LinkedIn’s original data discovery and lineage portal, started as an internal project; the metadata team open sourced it in 2016. From that time onwards, the team has always maintained two different codebases—one for open source, and the other for LinkedIn’s internal use—because not all product features developed for LinkedIn’s use cases were generally applicable to a broader audience. Also, WhereHows had some internal dependencies (infrastructure, libraries, etc.) which are not open sourced. WhereHows went through a lot of iterations and development cycles in the following years, which made keeping the two codebases in sync a big challenge. The metadata team tried different approaches over the years to keep internal and open source development in sync with each other.
First attempt: “Open source first”
Initially, we followed an "open source first" development model, where the main development took place in the open source repo and changes were pulled in for internal deployment. The problem with this approach is that code was always pushed to GitHub before it was fully validated internally: we would not discover production issues until the changes from the open source repo were pulled in and a new internal deployment took place. In the case of a bad deployment, it was also very hard to identify the culprit, because changes were pulled in batches.
Also, this model decreased the productivity of the team when developing new features that needed fast iterations, because it forced all changes to be pushed to the open source repo first and then brought into the internal repository. To reduce turnaround time, the necessary fix or change could be made first in the internal repository, but this became a huge pain point when it came to merging those changes back to the open source repo, because the two repositories had gotten out of sync.
This model is much easier to implement for generic frameworks, libraries, or infrastructure projects than it is for full-stack custom web applications. Also, this model is perfect for projects that start out as open source from day one, but WhereHows had started out as a completely internal web app. It was really difficult to cleanly abstract all internal dependencies, which is why we needed to keep an internal fork, but keeping an internal fork and developing primarily in open source did not quite work for us.
Second attempt: “Internal first”
As a second attempt, we switched to an “internal first” development model, where the main development takes place internally and changes are pushed to open source on a regular basis. Although this model is best suited for our use case, it has inherent challenges. Directly pushing all the diffs to the open source repo and then resolving merge conflicts later is an option, but it is time consuming. Developers would mostly avoid doing it with every code check-in; as a result, pushes would happen much less frequently, in batches, which increases the pain of resolving merge conflicts later.
Third time’s the charm!
The two failed attempts mentioned above left the WhereHows GitHub repository stale for a long time. The team continued to iterate on the product features and architecture, so LinkedIn’s internal version of WhereHows quickly outgrew the open source one. It even had a new name: DataHub. Learning from the previous failed attempts, the team decided to devise a scalable, long-term solution.
For any new open source project, LinkedIn’s open source team advises and supports a development model where building blocks/modules of the project are fully developed in open source. Versioned artifacts are deployed to a public repository and then brought back to LinkedIn’s internal artifactory using External Library Request (ELR). Following this development model is not only good for the open source community, but also results in a more modular, extensible, and pluggable architecture.
However, to achieve that state for a mature internal application like DataHub will take a significant amount of time. It also precludes the possibility of open sourcing a fully working implementation before all internal dependencies are completely abstracted out. Therefore, we’ve developed tooling that helps us make open source contributions faster and much less painful in the interim. This is a decision benefiting both the metadata team (the developer of DataHub) and the open source community. The following sections will discuss this new approach.
Automating open source contributions
The metadata team’s latest approach for open sourcing DataHub is to develop a tool that automatically syncs the internal codebase and the open source repository. High level features of this tooling include:
Syncing of LinkedIn code to/from open source, similar to rsync
License header generation, similar to Apache Rat
Auto-generation of open source commit logs from internal commit logs
Preventing internal changes that break open source build via dependency testing
The following subsections discuss these features and their more interesting challenges in detail.
Source code syncing
As opposed to the open source version of DataHub, which is a single GitHub repo, LinkedIn’s version of DataHub is a combination of multiple repos (known internally as multiproducts). DataHub’s frontend, metadata models library, metadata store backend service, and streaming jobs sit in different repositories within LinkedIn. However, for an easier experience for open source users, we have a single repository for the open source version of DataHub.
Figure 1: Syncing between LinkedIn DataHub repositories and single open source DataHub repository
To support automatic build, push, and pull workflows, our new tooling automatically creates a file-level mapping that corresponds to each source file. However, the tooling requires an initial configuration: users must provide a high-level mapping of modules, as described below.
Module-level mapping is a simple JSON whose keys are the target modules in the open source repository and whose values are the list of source modules in the LinkedIn repositories. Any target module in the open source repo can be fed by any number of source modules. To denote internal repo names in source modules, Bash style string interpolation is used. Using a module-level mapping file, the tooling creates a file-level mapping file by scanning all files in the related directories.
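As an illustration, a module-level mapping along these lines might look as follows. The module and repository names here are hypothetical, not the actual internal configuration; `${...}` is the Bash-style interpolation used to denote internal repo names:

```json
{
  "metadata-models": ["${metadata-models-mp}/models"],
  "gms": ["${gms-mp}/api", "${gms-mp}/impl"],
  "datahub-frontend": ["${frontend-mp}/app"]
}
```

Each key is a target module in the open source repo; each value lists the internal source modules that feed it, so `gms` above would be assembled from two internal modules.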
File-level mapping is auto-generated by the tooling; however, it can be manually updated by the user as well. It is a 1:1 mapping from the LinkedIn source file to the one in the open source repository. There are some rules associated with this auto-generation of file mapping:
In the case of multiple source modules for a target module in open source, there might be collisions, such as the same FQCN existing in more than one source module. As a collision resolution strategy, our tooling uses “last one wins” by default.
“null” means a source file is not part of the open source repository.
After every push to open source or pull from it, this mapping is automatically updated and a snapshot is taken. This is needed to figure out source code additions and removals after the last push/pull action.
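The rules above can be sketched roughly as follows. This is a simplified illustration of the idea, not the actual tooling; a source file that the configuration maps to “null” would simply be left out of the resulting mapping:

```python
def build_file_mapping(module_mapping, list_files):
    """Build the 1:1 file-level mapping from a module-level mapping.

    module_mapping: {target_module: [source_module, ...]}
    list_files(source_module): relative file paths inside that module.
    """
    file_mapping = {}   # LinkedIn source file -> open source target file
    seen_targets = {}   # target file -> source file currently mapped to it
    for target_module, source_modules in module_mapping.items():
        for source_module in source_modules:
            for rel_path in list_files(source_module):
                source_file = f"{source_module}/{rel_path}"
                target_file = f"{target_module}/{rel_path}"
                if target_file in seen_targets:
                    # Collision (e.g., the same FQCN in two source modules):
                    # "last one wins", so drop the earlier mapping.
                    file_mapping.pop(seen_targets[target_file])
                seen_targets[target_file] = source_file
                file_mapping[source_file] = target_file
    return file_mapping
```

With this shape, diffing two snapshots of `file_mapping` after a push or pull is enough to detect file additions and removals.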
Commit log generation
Commit logs for open source commits are also auto-generated by aggregating the commit logs of the internal repos. A generated log clearly indicates which versions of the source repositories are packaged in the commit and provides an aggregate of the individual commit summaries. Check this commit for a real example of a commit log generated by our tooling.
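The aggregation might be sketched as follows; repository names, versions, and the exact message layout are invented for illustration and are not the tooling’s real format:

```python
def generate_commit_log(internal_commits):
    """Aggregate internal commit logs into one open source commit message.

    internal_commits: {repo_name: (packaged_version, [commit_summary, ...])}
    """
    lines = ["Sync internal changes to open source", ""]
    # Record which version of each source repository is packaged.
    for repo, (version, _) in sorted(internal_commits.items()):
        lines.append(f"{repo} {version}")
    lines.append("")
    # Aggregate the individual internal commit summaries.
    for repo, (_, summaries) in sorted(internal_commits.items()):
        for summary in summaries:
            lines.append(f"* {repo}: {summary}")
    return "\n".join(lines)
```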
Dependency testing
LinkedIn has a dependency testing infrastructure that helps ensure that changes to an internal multiproduct do not break the build of dependent multiproducts. The open source DataHub repository is not a multiproduct and cannot be a direct dependency of any multiproduct, but with the help of a wrapper multiproduct—which pulls in the source code of open source DataHub—we can still utilize this dependency testing system. In this way, any change (that would possibly be open sourced later on) in any one of the multiproducts that feed the open source DataHub repository triggers a build event on the wrapper multiproduct. Therefore, any change which fails the build on the wrapper multiproduct fails the pre-commit tests of the original multiproduct and gets reverted.
This is a useful mechanism that helps prevent any internal commit that breaks the open source build and detects it at the time of commit creation. Without this, it would be pretty hard to figure out which internal commit had caused an open source repo build failure because we push batched internal changes to the open source DataHub repo.
Differences between open source DataHub and our production version
Up to this point, we’ve discussed our solution to sync two versions of DataHub repositories, but we still haven’t outlined the reasons why we need two different development flows in the first place. In this section, we’ll list out the differences between the public version of DataHub and the version that is in production on LinkedIn’s servers, as well as the rationale for these differences.
One source of divergence stems from the fact that our production version has dependencies to not-yet-open-sourced code, such as LinkedIn’s Offspring (LinkedIn’s internal dependency injection framework). Offspring is used extensively in the internal code base because it is the preferred method for dynamic configuration management. But, it is not open sourced; therefore, we needed to find open source alternatives to it for the open source DataHub.
There are other reasons, as well. As we build extensions to the metadata model for LinkedIn’s needs, those extensions will typically start out very specific to LinkedIn and may not apply directly to other environments. For example, we have very specific labels for member identifiers and other types of compliance metadata. So, we have currently excluded those extensions in the open source DataHub metadata model. As we engage with the community and understand their needs, we will work on open sourcing generalized versions of these extensions where appropriate.
Ease of use and easier adoption for the open source community also inspired some of the differences between the two DataHub versions. Differences in stream processing infrastructure are a good example of that. Although our internal version uses a managed stream processing infrastructure, we chose to use embedded (standalone) stream processing for the open source version because it avoids creating yet another infrastructure dependency.
Another example of a difference is having a single GMS (Generalized Metadata Store) in the open source implementation, rather than having multiple GMS. GMA (Generalized Metadata Architecture) is the name of the backend architecture for DataHub, and the GMS is the metadata store in the GMA context. GMA is a very flexible architecture and it allows you to distribute every data construct (such as datasets, users, etc.) into its own metadata store, or to keep multiple data constructs in a single metadata store, as long as the registry that holds the mapping from data construct to GMS is updated. For ease of use, we chose to have a single GMS instance that stores all of the different data constructs in the open source DataHub.
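Conceptually, the registry described above is just a mapping from data construct to the GMS responsible for it. The toy sketch below illustrates the idea; the endpoint URLs and construct names are assumptions, not DataHub’s actual registry format:

```python
# Hypothetical registry: each data construct routed to its own GMS
# (the internal, distributed arrangement).
DISTRIBUTED_REGISTRY = {
    "dataset": "http://dataset-gms:8080",
    "corp-user": "http://user-gms:8080",
}

# Open source DataHub collapses this to a single GMS for every construct.
SINGLE_GMS = "http://datahub-gms:8080"
SINGLE_REGISTRY = {construct: SINGLE_GMS for construct in DISTRIBUTED_REGISTRY}

def resolve_gms(registry, construct):
    """Route a metadata request to the GMS that stores this construct."""
    return registry[construct]
```

Because callers only consult the registry, switching between one GMS and many is purely a configuration change.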
The complete list of differences between the two implementations is listed in the table below.
| Product Features | LinkedIn DataHub | Open Source DataHub |
| --- | --- | --- |
| Supported Data Constructs | … | … |
| Supported Metadata Sources for Datasets | … | … |
| Stream Processing | Managed stream processing | Embedded (standalone) stream processing |
| Dependency Injection & Dynamic Configuration | Offspring | … |
| Build Tooling | … | … |
| CI/CD | … | TravisCI and Docker Hub |
| Metadata Stores | Distributed multiple GMS | Single GMS |
Microservices in Docker containers
Docker facilitates deployment and distribution of applications through containerization. Every service in open source DataHub, including infrastructure components like Kafka, Elasticsearch, Neo4j, and MySQL, has its own Docker image. For orchestration of the Docker containers, we use Docker Compose.
Figure 2: Open source DataHub architecture
You can see the high-level architecture of DataHub in the above picture. Aside from infrastructure components, it has four different Docker containers:
datahub-gms: Metadata store service
datahub-frontend: Play application, which serves DataHub frontend
datahub-mce-consumer: Kafka Streams application that consumes from Metadata Change Event (MCE) stream and updates metadata store
datahub-mae-consumer: Kafka Streams application that consumes from Metadata Audit Event (MAE) stream and builds search index and graph db
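The four services above, plus the infrastructure components, might be wired together in a docker-compose file along these lines. This is a hedged sketch: the image names, ports, and dependency edges are illustrative assumptions, not the repo’s actual compose file:

```yaml
version: "3"
services:
  mysql:
    image: mysql:5.7
  elasticsearch:
    image: elasticsearch:5.6
  neo4j:
    image: neo4j:3.5
  kafka:
    image: confluentinc/cp-kafka
  datahub-gms:            # metadata store service
    image: linkedin/datahub-gms
    depends_on: [mysql]
  datahub-frontend:       # Play application serving the UI
    image: linkedin/datahub-frontend
    ports: ["9001:9001"]
    depends_on: [datahub-gms]
  datahub-mce-consumer:   # consumes MCE stream, updates metadata store
    image: linkedin/datahub-mce-consumer
    depends_on: [kafka, datahub-gms]
  datahub-mae-consumer:   # consumes MAE stream, builds search index and graph db
    image: linkedin/datahub-mae-consumer
    depends_on: [kafka, elasticsearch, neo4j]
```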
The open source repo documentation and original DataHub blog post have more details about the functions of different services.
CI/CD in open source DataHub
The open source DataHub repo uses TravisCI for continuous integration and Docker Hub for continuous deployment. Both integrate well with GitHub and are easy to set up. Most open source infrastructure developed by the community or by private companies (e.g., Confluent) ships prebuilt Docker images on Docker Hub for easier adoption, and any image on Docker Hub can be used through a simple docker pull command.
With every commit to the DataHub open source repo, all Docker images are automatically built and deployed to Docker Hub with the “latest” tag. With some regex branch naming set up in Docker Hub, all tags in the open source repo are also released to Docker Hub under corresponding tag names.
Setting up DataHub is very simple and involves three easy steps:
Clone the open source repo and start all Docker containers using the provided docker-compose quickstart script.
Ingest the sample data provided in the repo using a command line tool, which is also provided.
View DataHub in your browser!
An actively monitored Gitter chat room is also set up for quick questions. Users can also create issues directly in the GitHub repo. Most importantly, we welcome and appreciate all feedback and contributions!
Future work
Currently, every infrastructure component and microservice for open source DataHub is built as a Docker container, and the whole system is orchestrated using docker-compose. Considering the popularity and wide adoption of Kubernetes, we would also like to provide a Kubernetes-based solution in the near future.
We also plan to provide an out-of-the-box solution to deploy DataHub to a public cloud service such as Azure, AWS, or Google Cloud. Given the recent announcement of LinkedIn’s migration to Azure, this will align well with the metadata team’s internal priorities.
Acknowledgments
Special thanks to Mars Lan and Shirshanka Das for their technical leadership and guidance on every aspect of the project, and to Chris Lee, Seyi Adebajo, and Ignacio Bona for their relentless contributions to open source DataHub.
Thanks to the management team for funding this project and making open source one of the top goals of the organization: Kapil Surlaker, Suja Viswesan, Tai Tran, and Pardhu Gunnam. Thanks to Madhumita Mantri for the help on shepherding the creation of this blog post.
Thanks to Szczepan Faber, Michael Kehoe, and Stephen Lynch for reviewing this blog post.
Last but not least, thanks to all early adopters of DataHub in the open source community who evaluated alpha releases of DataHub and helped us identify issues and improve the documentation.