Distributed Systems Articles

  • Incremental Data Capture for Oracle Databases at LinkedIn: Then and Now

    November 22, 2017

    Co-authors: Saurabh Goyal and Janardh Bantupalli

    In our previous blog post introducing Brooklin, we outlined the reasons why we created our own framework for near real-time incremental data capture from production. This framework feeds data to our larger data ingestion pipeline for the hundreds of nearline applications processing data that are distributed across...

  • Serving Top Comments in Professional Social Networks

    September 20, 2017

    Co-authors: Divye Kapoor, Zheng Li, and Pujita Mathur

    Introduction

    As a professional social network serving more than 500 million worldwide members, LinkedIn is the premier destination for professional conversations. We have a wide variety of posts that attract significant engagement, and some of these posts go viral. These posts attract likes and comments in...

  • Powering Helix’s Auto Rebalancer with Topology-Aware Partition Placement

    July 26, 2017

    Typical distributed data systems are clusters composed of a set of machines. If the dataset does not fit on a single machine, we usually shard the data into partitions, and each partition can have multiple replicas for fault tolerance. Partition management needs to ensure that replicas are distributed among machines as evenly as possible. More crucially, when a...

  • Building Venice: A Production Software Case Study

    April 4, 2017

    We build a lot of our own infrastructure systems here at LinkedIn. Many people have heard of Kafka, our distributed message buffer. We...

  • Building Venice with Apache Helix

    February 15, 2017

    Background

    Like many internet companies, LinkedIn has faced data growth challenges. Naturally, distributed storage systems became the...

  • Announcing Gobblin 0.7.0: Going Beyond Ingestion

    June 29, 2016

    About a year ago, we open sourced Gobblin, a universal data ingestion framework that aimed to solve data integration challenges faced...