Big Data Articles

  • Taking Charge of Tables: Introducing OpenHouse for Big Data Management

    July 19, 2023

    Co-authors: Sumedh Sakdeo, Lei Sun, Sushant Raikar, Stanislav Pak, and Abhishek Nath. Introduction: At LinkedIn, we build and operate an open source data lakehouse deployment to power Analytics and Machine Learning workloads. Leveraging data to drive decisions allows us to serve our members with better job insights, and connect the world’s professionals with each...

  • Solving the data integration variety problem at scale, with Gobblin

    February 24, 2021

    Co-authors: Chris Li, Kevin Lau, and Subbu Sanka. Editor’s Note: Recently, the Apache Software Foundation (ASF) announced Apache® Gobblin™ as a Top-Level Project (TLP). For more information, visit https://gobblin.apache.org/ and https://twitter.com/ApacheGobblin. Introduction: Our big data ecosystem is larger than 1 exabyte and growing, while ingesting and...

  • Venice Hybrid: Doing Lambda Better

    December 20, 2017

    Over the last two years at LinkedIn, I’ve been working on a distributed key-value database called “Venice.” Venice is designed to be a significant improvement to Voldemort Read-Only for serving derived data. In late 2016, Venice started serving production traffic for batch use cases that were very similar to the existing uses of Voldemort Read-Only. In the time...

  • Powering Helix’s Auto Rebalancer with Topology-Aware...

    July 26, 2017

    Editor's note: This blog has been updated. Typical distributed data systems are clusters composed of a set of machines. If the dataset...

  • Managing "Exploding" Big Data

    June 15, 2017

    What is the shape of your big data? While we do love to talk about the size of our big data—terabytes, petabytes, and beyond—perhaps...

  • Asynchronous Processing and Multithreading in Apache Samza,...

    January 6, 2017

    This post is the second in a series discussing asynchronous processing and multithreading in Apache Samza. In the previous post, we...