Distributed Data Systems

The Distributed Data Systems (DDS) team builds horizontally scalable data storage and streaming systems used to serve LinkedIn applications.  We have a diverse portfolio of technology solutions summarized below.

Storage

Espresso

Espresso is LinkedIn’s horizontally scalable document store for primary data such as member and company profiles, InMail, social gestures (likes, comments, shares), and various advertising data sets.  Features that differentiate Espresso from other NoSQL data stores include:

  • Transactionally consistent local secondary indexes
  • Support for hierarchical data models
  • Real-time change data capture for nearline processing
  • Support for Active/Active multi data center deployments
  • Transactional writes for closely interrelated (co-partitioned) data

The system serves over 1.3 million queries per second at peak.
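
Applications reach Espresso through its RESTful HTTP interface, in which documents are addressed hierarchically as /database/table/key. The sketch below shows a read in Java; the router hostname, port, database, and table names are placeholder assumptions, not real endpoints.

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;

  public class EspressoReadExample {
    public static void main(String[] args) throws Exception {
      // Hypothetical router endpoint; documents are addressed as /<database>/<table>/<key>.
      HttpClient http = HttpClient.newHttpClient();
      HttpRequest request = HttpRequest.newBuilder()
          .uri(URI.create("http://espresso-router.example.com:12345/MemberDB/Profiles/member-42"))
          .GET()
          .build();
      HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
      System.out.println(response.body());  // the stored profile document
    }
  }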

Voldemort

Voldemort is an eventually consistent key-value store patterned after Amazon’s Dynamo.  Data is partitioned and automatically replicated across multiple servers.  Voldemort provides tunable consistency, pluggable serialization (Protocol Buffers, Avro, Thrift, and Java serialization), and pluggable storage engines (BDB-JE, MySQL, RocksDB).

LinkedIn uses Voldemort for serving derived data such as recommended content produced offline on Hadoop.
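
The Voldemort Java client makes the tunable-consistency model concrete: reads return a Versioned value carrying a vector clock, and writes hand that version back so conflicting replicas can be reconciled. A minimal sketch, with placeholder bootstrap URL, store, and key names:

  import voldemort.client.ClientConfig;
  import voldemort.client.SocketStoreClientFactory;
  import voldemort.client.StoreClient;
  import voldemort.client.StoreClientFactory;
  import voldemort.versioning.Versioned;

  public class VoldemortExample {
    public static void main(String[] args) {
      // Bootstrap from any node; the client learns the cluster topology from it.
      StoreClientFactory factory = new SocketStoreClientFactory(
          new ClientConfig().setBootstrapUrls("tcp://voldemort.example.com:6666"));
      StoreClient<String, String> client = factory.getStoreClient("recommendations");

      // Reads return a Versioned wrapper carrying a vector clock
      // (null if the key is absent; this sketch assumes it exists).
      Versioned<String> value = client.get("member-42");
      value.setObject("updated-payload");

      // Writes pass the version back so replicas can detect and resolve conflicts.
      client.put("member-42", value);
    }
  }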

Venice

Venice is an asynchronous data serving platform that builds upon the lessons learned from operating Voldemort at scale. Venice specializes in serving derived data bulk-loaded from offline systems (such as Hadoop) as well as derived data streamed from nearline systems (such as Samza). Because these derived data use cases do not require strong consistency, read-your-writes semantics, transactions, or secondary indexing, Venice can be highly optimized for them and deliver a simpler, more efficient architecture than consistent, synchronous systems like Espresso and Oracle.
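
For flavor, the open source Venice thin client exposes an asynchronous key-value read API along the following lines; the store name and router URL are placeholders, and the exact class names may differ across venice-client versions.

  import com.linkedin.venice.client.store.AvroGenericStoreClient;
  import com.linkedin.venice.client.store.ClientConfig;
  import com.linkedin.venice.client.store.ClientFactory;

  public class VeniceReadExample {
    public static void main(String[] args) throws Exception {
      // Placeholder store name and router URL.
      try (AvroGenericStoreClient<String, Object> client =
               ClientFactory.getAndStartGenericAvroClient(
                   ClientConfig.defaultGenericClientConfig("member-recommendations")
                       .setVeniceURL("http://venice-router.example.com:7777"))) {
        // Reads are asynchronous; eventual consistency is fine for derived data.
        Object value = client.get("member-42").get();
        System.out.println(value);
      }
    }
  }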

Streams

Kafka

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.  Kafka topics can be partitioned for parallel consumption by multiple consumers.
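
A minimal Java producer sketch, with placeholder broker and topic names. Records that share a key hash to the same partition, which is what allows many consumers to read partitions in parallel while preserving per-key order.

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class KafkaProducerExample {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put("bootstrap.servers", "kafka.example.com:9092");
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

      try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        // Records with the same key always land in the same partition,
        // so per-key ordering is preserved in the commit log.
        producer.send(new ProducerRecord<>("page-views", "member-42", "{\"page\": \"/feed\"}"));
      }
    }
  }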

Kafka is a wildly successful Apache open source project that was developed at LinkedIn and is used throughout the industry by companies including Yahoo, Twitter, Netflix, Pinterest, Uber, Airbnb, and hundreds of others.  LinkedIn continues to be a leading contributor to the Apache code base, as well as maintaining an internal ecosystem surrounding Apache Kafka.

LinkedIn operates the largest known Kafka installation anywhere, sending approximately 1.5 trillion messages a day.

Datastream

Datastream is the next-generation version of Databus, LinkedIn's change-capture solution for consuming database updates. Datastream, along with Kafka (for message pub-sub) and Samza (for real-time stream processing), forms the core of the stream-processing pipeline within LinkedIn.

Samza

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

  • Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple callback-based “process message” API comparable to MapReduce (see the sketch after this list).
  • Managed state: Samza manages snapshotting and restoration of a stream processor’s state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition).
  • Fault tolerance: Whenever a machine in the cluster fails, Samza works with YARN to transparently migrate your tasks to another machine.
  • Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.
  • Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
  • Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
  • Processor isolation: Samza works with Apache YARN, which supports Hadoop’s security model, and resource isolation through Linux CGroups.
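
As a concrete illustration of the callback-based API described above, here is a minimal StreamTask sketch; the output system and topic names are placeholders.

  import org.apache.samza.system.IncomingMessageEnvelope;
  import org.apache.samza.system.OutgoingMessageEnvelope;
  import org.apache.samza.system.SystemStream;
  import org.apache.samza.task.MessageCollector;
  import org.apache.samza.task.StreamTask;
  import org.apache.samza.task.TaskCoordinator;

  public class PageViewFilterTask implements StreamTask {
    // Placeholder output: the "filtered-page-views" topic on the "kafka" system.
    private static final SystemStream OUTPUT = new SystemStream("kafka", "filtered-page-views");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
      // Samza invokes this callback once per message, in the order
      // messages were written to the input partition.
      String pageView = (String) envelope.getMessage();
      if (!pageView.contains("bot")) {
        collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getKey(), pageView));
      }
    }
  }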

Media

Ambry

Ambry is a distributed object store that supports storage of trillions of small immutable objects (50 KB to 100 KB) as well as billions of large objects. It was specifically designed to store and serve media objects at web companies. However, it can also be used as a general-purpose storage system for DB backups, search indexes, or business reports. The system has the following characteristics:

  • Highly available and horizontally scalable
  • Low latency and high throughput
  • Optimized for both small and large objects
  • Cost effective
  • Easy to use
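
Ambry's frontend speaks HTTP: a blob is uploaded with a POST and fetched later by the blob ID returned in the Location header. A minimal upload sketch in Java; the frontend address is a placeholder, and the x-ambry-* header names follow the open source REST API (treat them as assumptions for any given deployment).

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.nio.file.Files;
  import java.nio.file.Path;

  public class AmbryUploadExample {
    public static void main(String[] args) throws Exception {
      byte[] photo = Files.readAllBytes(Path.of("profile-photo.jpg"));

      HttpClient http = HttpClient.newHttpClient();
      // POST the blob to the frontend; header names are assumptions
      // based on the open source Ambry REST API.
      HttpRequest upload = HttpRequest.newBuilder()
          .uri(URI.create("http://ambry-frontend.example.com:1174/"))
          .header("x-ambry-blob-size", String.valueOf(photo.length))
          .header("x-ambry-service-id", "vector")
          .header("x-ambry-content-type", "image/jpeg")
          .POST(HttpRequest.BodyPublishers.ofByteArray(photo))
          .build();

      HttpResponse<Void> response = http.send(upload, HttpResponse.BodyHandlers.discarding());
      // The blob ID for later GETs comes back in the Location header.
      System.out.println(response.headers().firstValue("Location").orElse("no blob id"));
    }
  }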

Vector

Vector is LinkedIn’s new media processing infrastructure, with support for images, videos, and documents. Vector provides upload, processing (e.g. transcoding), metadata management, and serving for these media assets.

Media storage is entrusted to Ambry, LinkedIn’s scalable blob storage solution. The Vector asset manager exposes a comprehensive data model (backed by Espresso, LinkedIn’s horizontally scalable NoSQL document DB) with storage, retrieval, and basic indexing for media functionality. Vector initiates and carries out the necessary media processing, provides a plug-in infrastructure for processing modules, and handles dependencies and data pipelining.

Cloud Management

Helix

Apache Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Helix automates reassignment of resources in the face of node failure and recovery, cluster expansion, and reconfiguration.

Helix provides:

  • Automatic assignment of resources and partitions to nodes 
  • Node failure detection and recovery 
  • Dynamic addition of resources 
  • Dynamic addition of nodes to the cluster
  • Pluggable distributed state machine to manage the state of a resource via state transitions
  • Automatic load balancing and throttling of transitions
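
A minimal cluster-setup sketch with the Helix Java admin API, covering the resource and rebalancing concepts listed above; the ZooKeeper address, cluster, and resource names are placeholders.

  import org.apache.helix.manager.zk.ZKHelixAdmin;
  import org.apache.helix.model.StateModelDefinition;
  import org.apache.helix.tools.StateModelConfigGenerator;

  public class HelixSetupExample {
    public static void main(String[] args) {
      // Placeholder ZooKeeper address, cluster, and resource names.
      ZKHelixAdmin admin = new ZKHelixAdmin("zk.example.com:2181");
      admin.addCluster("MY_CLUSTER");

      // Register the MasterSlave state model that replicas transition through.
      admin.addStateModelDef("MY_CLUSTER", "MasterSlave",
          new StateModelDefinition(
              new StateModelConfigGenerator().generateConfigForMasterSlave()));

      // A resource with 6 partitions, managed in AUTO rebalancing mode.
      admin.addResource("MY_CLUSTER", "myDB", 6, "MasterSlave", "AUTO");

      // Ask Helix for an assignment with 3 replicas per partition; the
      // controller drives replicas through state transitions to reach it.
      admin.rebalance("MY_CLUSTER", "myDB", 3);
    }
  }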