Gobblin Articles

  • Opal: Building a mutable dataset in data lake

    March 16, 2022

    Co-authors: Bhupendra Kumar Jain, Aditya Narain Gupta, Kuai Yu, and Hung Tran

    At LinkedIn, trusted data platforms and quality data pipelines are essential to meaningful business metrics and sound decision-making. Today, a considerable percentage of data at LinkedIn comes from online data stores. Whether the online data systems fall into SQL or NoSQL categories,...

  • Solving the data integration variety problem at scale, with Gobblin

    February 24, 2021

    Co-authors: Chris Li, Kevin Lau, and Subbu Sanka

    Editor’s Note: Recently, the Apache Software Foundation (ASF) announced Apache® Gobblin™ as a Top-Level Project (TLP). For more information, visit https://gobblin.apache.org/ and https://twitter.com/ApacheGobblin.

    Introduction: Our big data ecosystem is larger than 1 exabyte and growing, while ingesting and...

  • FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format

    January 6, 2021

    Co-authors: Zihan Li, Sudarshan Vasudevan, Lei Sun, and Shirshanka Das

    Data analytics and AI power many business-critical use cases at LinkedIn. We need to ingest data in a timely and reliable way from a variety of sources, including Kafka, Oracle, and Espresso, bringing it into our Hadoop data lake for subsequent processing by AI and data science pipelines. We...

  • Bridging batch and stream processing for the Recruiter...

    July 14, 2020

    Co-authors: Khai Tran and Steve Weiss

    Batch and streaming computations are often combined together in the Lambda architecture, but...

  • An inside look at LinkedIn’s data pipeline monitoring...

    October 30, 2019

    Co-authors: Krishnan Raman and Joey Salacup

    Editor’s note: This blog has been updated.

    Monitoring big data pipelines often equates to...

  • Gobblin Enters Apache Incubation

    January 17, 2018

    Gobblin is a distributed data integration framework that simplifies common aspects of big data integration, such as ingestion,...