Evolving LinkedIn’s analytics tech stack
Lessons from a large-scale data platform migration
December 7, 2021
Having recently transitioned LinkedIn’s analytics stack (including 1400+ datasets, 900+ data flows, and 2100+ users) to one based on open source big data technologies, we wanted to give an overview of the journey in a blog post. This move freed us from the limits imposed by third party proprietary (3PP) platforms and saved millions in licensing, support, and other operating expenses. Furthermore, the move gave us more control of our own destiny via the freedom to extend and enhance our components and platforms. Just as important, we leveraged this opportunity to re-envision our data lake/warehouse strategy.
While large-scale technology transitions are typically mired in complexity and delays, our tooling and approach allowed for execution that was ahead of schedule with zero negative production impact. We were also able to use this process as a driving force to improve the overall analytics ecosystem for our developers and users.
In this blog post, we will discuss:
- How we navigated and executed a large-scale data and user migration.
- How we used a technology transition as an opportunity to improve our data ecosystem.
- The learnings from our experience.
The legacy analytics tech stack
During LinkedIn’s early stages, we leveraged several 3PP data platforms to fuel rapid growth. While this would result in some limitations later on, it was far quicker at the time to piece together off-the-shelf products to meet our needs. The resulting architecture (Figure 1) of ETL to a data warehouse with reporting and analytics built atop is a standard pattern.
Figure 1: LinkedIn’s legacy analytics tech stack
Although this stack served us well for six years, it had the following disadvantages:
- Lack of freedom to evolve: Because of the closed nature of this system, we were limited in options for innovation. Furthermore, integration with internal and open source systems was a challenge.
- Difficulty in scaling: Data pipeline development was limited to a small central team due to the limits of Informatica/Appworx licenses. This increasingly became a bottleneck for LinkedIn’s rapid growth.
These disadvantages motivated us to develop a new data lake on Hadoop in parallel. However, we did not have a clear transition process and ended up maintaining both the old and new data warehouses simultaneously. Data was copied between the stacks, which resulted in double the maintenance cost and complexity, along with very confused consumers (Figure 2).
Figure 2: Maintaining redundant data warehouses led to unnecessary complexity
As a result, we planned and executed a migration of all datasets to the new stack.
Migration planning with dataset lineage and usage
Early on, we realized that planning for our massive migration load would be difficult without first deriving our dataset lineage and usage. This knowledge would enable us to:
- Plan the order of dataset migration—i.e., start from datasets without dependencies and work upwards towards their dependents.
- Identify zero or low usage datasets for workload reduction.
- Track the % of users on the new vs. old system, which is a key performance indicator (KPI).
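Ordering the migration this way amounts to a topological sort of the dataset dependency graph: dependency-free datasets go first, and each dataset migrates only after everything it reads from has moved. A minimal sketch of that ordering (the dataset names and graph are invented for illustration, not LinkedIn's actual lineage):

```python
from collections import deque

def migration_order(deps):
    """Return datasets in migration order: dependency-free datasets
    first, then their dependents (Kahn's topological sort)."""
    # deps maps each dataset to the upstream datasets it reads from.
    indegree = {d: len(ups) for d, ups in deps.items()}
    dependents = {d: [] for d in deps}
    for d, ups in deps.items():
        for u in ups:
            dependents[u].append(d)
    ready = deque(d for d, n in indegree.items() if n == 0)
    order = []
    while ready:
        d = ready.popleft()
        order.append(d)
        for nxt in dependents[d]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected in dataset lineage")
    return order

# Hypothetical lineage: facts depend on raw events and dimensions.
deps = {
    "raw_events": [],
    "dim_member": [],
    "fact_pageviews": ["raw_events", "dim_member"],
    "agg_daily": ["fact_pageviews"],
}
print(migration_order(deps))
```

A cycle check matters here: circular dataset dependencies do occur in real warehouses and must be broken manually before a staged migration can proceed.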
As valuable as dataset lineage could be, there was no out-of-the-box product/solution that supported our heterogeneous environment involving 3PP, Teradata (TD), and Hadoop. There was a project underway to build dataset lineage as part of DataHub, but this was a large effort covering all datasets, not just those in the warehouse, and thus couldn't be completed in the timelines required for this project. As a result, we took on the challenge of building the necessary tooling upfront to help plan and execute the migration.
First, we created a data product to provide the up/downstream dependencies of TD datasets. Second, we created data pipelines to extract the usage information of our datasets. To obtain TD metadata to support these efforts, we had no choice but to scrape TD logs, as the closed source system did not offer any other way. However, we had a more elegant solution to obtain Hadoop metadata. We added instrumentation to MapReduce and Trino to emit Kafka events with detailed dataset access metadata. The events were then ingested by a Gobblin job into HDFS and processed with a Hive script for downstream consumption and dashboards. This process is summarized in the diagram below (Figure 3).
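To make the instrumentation step concrete, here is a rough sketch of the kind of dataset-access event an engine could emit to Kafka. The field names and schema are assumptions for illustration; the post does not describe the actual event format:

```python
import json
import time

def build_access_event(dataset, user, engine, access_type):
    """Build a dataset-access event; in production this payload would
    be serialized and emitted to a Kafka topic by the query engine
    (e.g., from MapReduce or Trino instrumentation)."""
    return {
        "dataset": dataset,          # e.g., HDFS path or table name
        "user": user,                # who ran the job or query
        "engine": engine,            # e.g., "mapreduce" or "trino"
        "access_type": access_type,  # "read" or "write"
        "timestamp_ms": int(time.time() * 1000),
    }

event = build_access_event("/data/tracking/pageviews", "jdoe", "trino", "read")
print(json.dumps(event))
```

Downstream, events like this can be aggregated per dataset and per user, which is exactly the usage and lineage signal the migration planning relied on.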
Figure 3: Data pipeline for dataset usage extraction
As an interesting sidenote, the data pipeline above started as an intern project and ended as a cornerstone of the migration. At LinkedIn, we strive to give real-world, impactful projects to interns, and this is a great example of the outsized impact our interns can deliver!
With these tools, we extensively cataloged our datasets to aid our planning. These catalogs helped us plan major data model revisions, providing a holistic view to help us surface redundant datasets and conform fact/dimension tables. As a result, we consolidated 1424 datasets down to 450, effectively cutting ~70% from our migration workload. In addition, our data models had previously been heavily influenced by upstream OLTP sources, which were highly normalized and designed for row-level transactions. For our complex business analytics, however, these models translated to pervasive and expensive table joins. To address this, we denormalized and pre-joined our tables to streamline our analytics (Figure 4).
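As a toy illustration of the pre-joining step (the tables and columns are invented, not our actual schema): instead of forcing every analytics query to join a normalized dimension at read time, the pipeline folds the dimension attributes into the fact table once, producing a single wide table to scan:

```python
def prejoin(fact_rows, dim_rows, key):
    """Denormalize: merge dimension attributes into each fact row so
    downstream analytics scan one wide table instead of joining."""
    dim_by_key = {row[key]: row for row in dim_rows}
    joined = []
    for f in fact_rows:
        d = dim_by_key.get(f[key], {})
        # Copy dimension attributes (except the join key) into the fact row.
        joined.append({**f, **{k: v for k, v in d.items() if k != key}})
    return joined

facts = [{"member_id": 1, "pageviews": 7}, {"member_id": 2, "pageviews": 3}]
dims = [{"member_id": 1, "country": "US"}, {"member_id": 2, "country": "IN"}]
wide = prejoin(facts, dims, "member_id")
print(wide)
```

The trade-off is classic: the wide table costs more storage and must be rebuilt when dimensions change, but it removes the join from every downstream query.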
Figure 4: Pre-joining and denormalizing tables to streamline analytics
In summary, dataset lineage is invaluable for migrations, but may not be available out-of-the-box. A migration provides a great opportunity to build tools for proper bookkeeping, which has many benefits beyond the migration itself.
Migrating to a new data ecosystem
After planning, it was time to move all datasets to the new ecosystem. The design of this new ecosystem was heavily influenced by our migration away from TD, as it addressed pain points from our legacy TD stack.
- Democratization of data: The Hadoop ecosystem enabled data development and adoption by other teams at LinkedIn. In contrast, previously only a central team could build TD data pipelines due to license limits.
- Democratization of tech development with open source projects: We could now freely enhance all aspects of our tech stack with open source or custom-built projects. This enabled us to develop many innovations to handle data at scale (e.g., Coral, Data Sentinel, and Magnet).
- Unification of tech stack: Simultaneously running TD and Hadoop taught us the cost and complexity of maintaining redundant systems. In the new system, we unified the technologies and workflows used in data pipelines (Figure 5). This enabled us to focus our maintenance and enhancements in a single place, which greatly boosted our efficiency. Beyond our use case, many other LinkedIn applications had been converging onto Hadoop (e.g., machine learning, alerting, and experimentation), which allowed us to synergize and leverage the groundwork laid by them.
Figure 5: LinkedIn’s new business analytics tech stack
This tech stack has the following components:
- Unified Metrics Pipeline (UMP): A unified platform where developers provide ETL scripts to create data pipelines.
- Azkaban: A distributed workflow scheduler that manages jobs on Hadoop (including UMP).
- Dataset readers: Datasets are on HDFS and can be read in a variety of ways.
  - Dashboards and ad-hoc queries for business analytics
  - DALI (Data Access at LinkedIn) reader: An API for developers to read data without worrying about its storage medium, path, and format.
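The value of a DALI-style reader is the indirection: callers ask for a logical dataset name, and a catalog resolves the physical path and format behind the scenes. A toy sketch of that abstraction (the catalog structure and API here are invented, not DALI's actual interface):

```python
class DatasetReader:
    """Toy format-agnostic reader: callers request a logical dataset
    name; a catalog resolves its physical location and format."""

    def __init__(self, catalog):
        # catalog: name -> {"path": ..., "format": ..., "rows": ...}
        self._catalog = catalog

    def read(self, name):
        entry = self._catalog[name]
        # A real reader would dispatch on entry["format"] (Avro, ORC, ...)
        # and open entry["path"] on HDFS; here we return in-memory rows.
        return entry["rows"]

catalog = {
    "pageviews": {
        "path": "/data/pageviews",
        "format": "orc",
        "rows": [{"member_id": 1, "views": 7}],
    }
}
reader = DatasetReader(catalog)
print(reader.read("pageviews"))
```

Because consumers never hard-code paths or formats, the platform team can relocate or re-encode a dataset (e.g., Avro to ORC) without breaking readers.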
The migration also acted as a catalyst for us to re-evaluate and improve our data pipeline performance. First, we had to address the poor read performance of the Avro file format, which we inherited from our upstream components. Consequently, we migrated to ORC, and saw the read speed increase ~10-1000x, along with a ~25-50% improvement in compression ratio. Second, we migrated poorly performing Hive/Pig flows to Spark, which reduced their runtimes by ~80%. Finally, we tuned our jobs to ensure proper resource usage and performance.
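Much of the ORC read speedup comes from columnar layout: an analytics query touching two of fifty columns reads only those columns, whereas a row-oriented format like Avro must decode entire records. A simplified in-memory illustration of the difference in access pattern (toy data, not the actual file formats):

```python
def to_columnar(rows):
    """Pivot row-oriented records into a column store (dict of lists),
    mimicking how ORC lays out each column contiguously."""
    cols = {k: [] for k in rows[0]}
    for r in rows:
        for k, v in r.items():
            cols[k].append(v)
    return cols

def sum_column_row_store(rows, col):
    # Row layout: every full record is touched even for one column.
    return sum(r[col] for r in rows)

def sum_column_col_store(cols, col):
    # Columnar layout: only the requested column's values are scanned.
    return sum(cols[col])

rows = [{"member_id": i, "views": i % 5, "country": "US"} for i in range(100)]
cols = to_columnar(rows)
assert sum_column_row_store(rows, "views") == sum_column_col_store(cols, "views")
```

Columnar layout also explains the compression gains: values within one column are similar, so encodings like run-length and dictionary compression work far better than they do across heterogeneous row data.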
By building our platform on top of a tech stack over which we had full control, we empowered ourselves as engineers and defined our own innovative culture.
User migration and dataset deprecation automation
After the dataset migration, we still had to orchestrate the migration of 2100+ TD users and the deprecation of 1400+ TD datasets. If this process were done manually, it would be tedious, prone to human error, and very costly in terms of human resources. Automating the process could avoid these shortfalls, but would require us to create a service with its own share of complexities.
We proceeded with a middle path that used automation to help coordinate the task. A high-level architectural diagram of our solution is shown below (Figure 6). Our backend consisted of a MySQL operational datastore with data from our dataset lineage solution. We built an API service on the backend to coordinate the deprecation. The service identified candidates for deprecation, i.e., TD datasets with no dependency and low usage. Subsequently, the service sent emails to dataset users about the upcoming deprecation and notified SREs to lock, archive, and delete the dataset from TD after a grace period.
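The candidate-selection step above can be sketched as a simple filter over the lineage and usage metadata in the operational store. The field names, usage threshold, and grace period below are hypothetical, chosen only to illustrate the logic:

```python
GRACE_PERIOD_DAYS = 30  # hypothetical grace period before deletion

def deprecation_candidates(datasets, usage_threshold=5):
    """Pick TD datasets safe to deprecate: nothing downstream depends
    on them, and recent usage is at or below the threshold."""
    return [
        d["name"]
        for d in datasets
        if not d["downstream_deps"] and d["monthly_queries"] <= usage_threshold
    ]

datasets = [
    {"name": "legacy_report", "downstream_deps": [], "monthly_queries": 2},
    {"name": "core_fact", "downstream_deps": ["agg_daily"], "monthly_queries": 900},
    {"name": "stale_dim", "downstream_deps": [], "monthly_queries": 0},
]
print(deprecation_candidates(datasets))
```

In the actual service, each selected candidate would then trigger the notification emails and, after the grace period, the lock/archive/delete workflow handled by the SREs.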
Figure 6: High level diagram of user migration and dataset deprecation tool
As a result, the process was less tedious, easier to manage, and only cost a fraction of what we estimated for the manual route. The larger lesson we would point to here is to seize opportunities to save time, effort, and cost through automation, where feasible.
From our experience, technology migrations are colossal efforts that can benefit from an innovative approach. Below, we provide a summary of our learnings for other organizations facing a similar situation.
First, seek opportunities to improve your data ecosystem during technology transitions. Beyond migrating to Hadoop, we also took this chance to vastly update our data models and improve data pipeline performance and workflows. A deep dive into a specific, high profile use case can be found in this blog post.
Second, be wary of performance and feature gaps when migrating from traditional 3PP databases. In order to match the performance of TD, Hadoop required intensive engineering efforts on our part (e.g., ORC conversion and Spark shuffle optimizations). In addition, we had to implement or forgo TD features that we took for granted (e.g., row based access controls). We suggest investigating these gaps early on for a smoother tech transition.
Third, leverage and build automated solutions when possible. In our case, we built solutions to derive dataset lineage and usage, to communicate with downstream users, and to deprecate the datasets. The end result was more cost effective and less error prone, despite the upfront investment required. Furthermore, these solutions can still assist with data management after the migration, making them a good long-term investment to develop.
In the future, we look forward to even larger technological transformations at LinkedIn; for instance, we are moving our entire tech stack to Microsoft Azure, which is our largest and most ambitious migration yet. As such, we will continue to build upon our learnings and enjoy the challenges of the ever-evolving technological landscape.
Evolving the analytics tech stack was a colossal undertaking that required the concentrated effort of many parties. Specifically, it was made possible by contributions from the many teams within Big Data, Infrastructure, Site Reliability, and Technical Program Management. Also, thanks to the downstream consumers of the data who worked with us and held us accountable.