Data processing is central to creating many of the data products on linkedin.com as well as understanding the performance of our products and businesses. Our analytics platforms, data pipelines, and data applications allow data consumers at LinkedIn to understand the data and create insights that power both internal and external user experiences.
Data processing workflows must evolve rapidly in order to react to changes in the data they process. This is especially true at a company like LinkedIn, where both the products and the data behind them change rapidly over time.
Dali is a collection of libraries, services, and development tools united by the common goal of providing a logical data access layer for Hadoop and Spark.
Heterogeneous data sources and sinks are a reality in all modern enterprises. Ingesting and moving relevant data is an operational necessity to gain insights that provide deep competitive advantage. Historically, LinkedIn operated over 15 flows to load, transform, and transport data for consumption by numerous groups in our organization. This resulted in significant data quality, metadata management, development, and operational challenges. The limitations of implementing and operating numerous custom-designed pipelines were the genesis for building Gobblin.
Gobblin solves this problem by:
- Focusing on generality.
- Providing a pluggable set of data sources that can be extended as required.
- Providing end-to-end metrics that enable continuous data validation.
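To make the generality concrete: a Gobblin job is driven by a declarative configuration rather than custom pipeline code, and changing sources or sinks largely means changing properties. The fragment below sketches what a Kafka-to-HDFS ingestion job might look like; the property names are paraphrased from Gobblin's documented examples and should be treated as illustrative rather than exact.

```properties
# Illustrative Gobblin job configuration (property names approximate).
job.name=KafkaToHdfsIngest
job.group=tracking

# Pluggable source: swapping this class ingests from a different system.
source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
topic.whitelist=PageViewEvent

# Writer and publisher settings control the sink.
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=gobblin.publisher.BaseDataPublisher
```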
Since it was open sourced in February 2015, we have replaced the vast majority of our custom flows with Gobblin flows. Most recently, we retired Camus, our dedicated flow for ingesting Kafka data, in favor of Gobblin-Kafka. With this change, Gobblin now ingests more data than any other system at LinkedIn.
Over the last year, we have enhanced Gobblin to add the following features:
- Distcp in Gobblin: Enhanced, distcp-like functionality that is dataset aware and can provide features like flow isolation and prioritization across datasets.
- Enhanced configuration management that allows properties to be specified in a dataset- or cluster-overridable manner, adding flexibility to ingest and data movement.
- Exactly-once semantics: Enhanced Gobblin to provide exactly-once semantics in addition to the previously supported at-least-once semantics.
- Integration with systems like Yarn for continuous data processing.
We continue to see robust open source engagement and adoption with Gobblin and are committed to nurturing it.
Modern web-scale companies have outgrown traditional analytical products due to challenges in analyzing massive-scale, fast-moving datasets at real-time latencies. Before Pinot, the analytics products at LinkedIn were built on generic storage systems like Oracle (RDBMS) and MySQL, but these systems are not specialized for OLAP needs, and the data volume at LinkedIn was growing exponentially in both breadth and depth. These challenges, combined with widening needs across the company, required a single system that could be leveraged broadly, which was the impetus for building Pinot.
Pinot is a real-time distributed OLAP data store built at LinkedIn to deliver scalable, low-latency real-time analytics. It can ingest data from batch sources (such as Hadoop and flat files) as well as streaming sources (such as Kafka). Pinot is optimized for analytical use cases on immutable append-only data and offers data freshness on the order of a few seconds. Pinot is horizontally scalable, and in production it scales to hundreds of tables and machines and thousands of queries per second.
Pinot is the infrastructure that powers many member-facing and internal analytics products.
Key Features include:
- A column-oriented database with various compression schemes such as run-length and fixed-bit-length encoding
- Pluggable indexing technologies: sorted index, bitmap index, inverted index
- Ability to optimize the query execution plan based on query and segment metadata
- Near-real-time ingestion from Kafka and batch ingestion from Hadoop
- A SQL-like language that supports selection, aggregation, filtering, group-by, order-by, and distinct queries on fact data
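To make the query model concrete, the sketch below builds a PQL-style aggregation query and posts it to a Pinot broker's REST endpoint. The broker URL, table, and column names here are hypothetical, and the request shape follows Pinot's documented broker API; treat the details as an assumption rather than a verbatim recipe.

```python
import json
import urllib.request

# Hypothetical broker address; real deployments typically discover brokers
# via the Pinot controller rather than hard-coding one.
BROKER_URL = "http://pinot-broker:8099/query"

# A PQL-style query against a hypothetical "profileViews" fact table:
# aggregation with filtering and grouping.
query = (
    "SELECT count(*) FROM profileViews "
    "WHERE viewerIndustry = 'Software' "
    "GROUP BY viewerRegion TOP 10"
)

def run_query(pql, broker_url=BROKER_URL):
    """POST the query to a Pinot broker and return the parsed JSON response."""
    payload = json.dumps({"pql": pql}).encode("utf-8")
    request = urllib.request.Request(
        broker_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

In practice a client library would handle broker discovery and retries; the point here is only that a single short query expresses a filtered, grouped aggregation over a fact table.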
Pinot powers some of LinkedIn's most recognizable experiences, such as Who Viewed My Profile and Job and Publisher Analytics, among many others. In addition, Pinot powers LinkedIn's internal reporting platform, helping hundreds of analysts and product managers make data-driven decisions. In 2015, we open sourced Pinot, and we are committed to building a developer community around it.
Unified Metrics Pipeline (UMP) is a specification and a set of tools used to facilitate the creation of streamlined and consistent metrics data at LinkedIn. After onboarding to UMP, metrics are then available in our A/B testing platform (XLNT), reporting platforms (Raptor, EasyBI, Tesla), as well as for ad-hoc reporting. Metrics from UMP support over 300 data pipelines that process tracking, database dumps, and other data.
UMP is the basis for LinkedIn's reporting platform. UMP provides a common language, best practices, and a workflow for all teams producing metrics data. UMP metrics are created offline from data that resides in Hadoop or, in the future, on the fly from Kafka events.
Before the unified platform, reporting was mostly fragmented, siloed, and ad hoc. Multiple stakeholders came up with different ways to calculate the same metric, arriving at slightly different results. There was no way to share metrics across the organization and across reporting platforms, leading to duplicated effort. And as the pipelines grew more complex, reliability suffered.
Today, the Unified Metrics Platform serves as the single source of truth for all business metrics at LinkedIn by providing a centralized metrics processing pipeline (as a service), a metrics computation template, and a set of tools and processes to facilitate the metrics life cycle in a consistent, trustworthy, and streamlined way.
Making data-driven decisions through experimentation is an extremely important part of the culture at LinkedIn. It’s deeply ingrained in our development process and has always been a core part of LinkedIn’s DNA. We test everything from complete redesigns of our homepage to back-end relevance algorithm changes and even infrastructure upgrades. It’s how we innovate, grow, and evolve our products to best serve our members. It’s how we make our members happier, our business stronger, and our talent more productive.
We have built an internal end-to-end A/B testing platform, called XLNT, to quickly quantify the impact of any A/B test in a scientific and controlled manner across LinkedIn.com and our apps. XLNT not only allows for easy design and deployment of experiments, it also provides the automatic analysis that is crucial to popularizing A/B testing. The platform is generic and extensible, covering almost all domains, including mobile and email. Every day, over 300 experiments are run and thousands of metrics are computed to improve every aspect of LinkedIn.
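At the core of that automatic analysis is standard hypothesis testing: for each metric, the platform must decide whether the difference between treatment and control is statistically significant. The sketch below is a minimal two-proportion z-test of the textbook kind, shown purely for illustration; it is not XLNT's actual methodology.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates,
    the basic statistic behind a simple A/B comparison.

    conv_a/n_a: conversions and sample size in control,
    conv_b/n_b: conversions and sample size in treatment.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF (erf-based, no SciPy needed).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

A real experimentation platform layers much more on top: variance reduction, multiple-testing corrections, and guardrail metrics, but the per-metric decision ultimately rests on statistics like this one.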
LinkedIn’s experimentation efforts do not stop there. Scientists and engineers are constantly looking for rigorous ways to improve experimentation: for example, to better understand network influence in experiments, to measure long-term impact, and to automatically identify insights on why metrics move. These problems are extremely exciting and challenging at both the scientific and engineering levels, but more importantly they are key to the continued success of LinkedIn.
In modern data-driven businesses, the complexity that arises from fast-paced analytics, data mining, and ETL processes makes metadata increasingly important. Traditionally, however, metadata is stored and queried inside the system that generates it: databases like Oracle, Teradata, and Hive on Hadoop; NoSQL datastores like MongoDB and Cassandra; ETL systems like Informatica; BI systems like MicroStrategy; and scheduling systems like Oozie, Azkaban, and UC4. This siloing of metadata causes problems: each system has only a partial view of the end-to-end data pipeline and data storage organization, making it very hard for data producers, consumers, and other interested parties (e.g., legal compliance teams) to perform fast and accurate analysis of the entire data ecosystem for data provenance and compliance use cases.
WhereHows, a project of the LinkedIn Data team, aims to solve this problem by creating a central metadata repository for the processes, people, and knowledge around the most important element of any big data system: the data itself. The repository has captured the status of 50 thousand datasets (with more than 15 petabytes of storage footprint across multiple Hadoop, Teradata, and other clusters), 14 thousand comments, and 35 million job executions with related lineage information. Current challenges include the quest for a generic data model and expansion to NoSQL datastores and stream processing systems.