Stream Processing Articles

  • Gobblin Enters Apache Incubation

    January 17, 2018

    Gobblin is a distributed data integration framework that simplifies common aspects of big data integration, such as ingestion, replication, organization, and lifecycle management, for both streaming and batch ecosystems. Gobblin has been gobbling big data with ease in the open source world since December 2014. Over the years, Gobblin has evolved at a tremendous...

  • Venice Hybrid: Doing Lambda Better

    December 20, 2017

    Over the last two years at LinkedIn, I’ve been working on a distributed key-value database called “Venice.” Venice is designed to be a significant improvement to Voldemort Read-Only for serving derived data. In late 2016, Venice started serving production traffic for batch use cases that were very similar to the existing uses of Voldemort Read-Only. In the time...

  • Incremental Data Capture for Oracle Databases at LinkedIn: Then and Now

    November 22, 2017

    Co-authors: Saurabh Goyal and Janardh Bantupalli

    In our previous blog post introducing Brooklin, we outlined the reasons why we created our own framework for near real-time incremental data capture from production. This framework feeds data to our larger data ingestion pipeline for the hundreds of nearline applications processing data that are distributed across...

  • Streaming Data Pipelines with Brooklin

    October 11, 2017

    Near-realtime (nearline) applications drive many of the critical services within LinkedIn, such as notifications, ad targeting, etc....

  • Test Strategy for Samza/Kafka Services

    April 27, 2017

    Over a decade ago, test strategies invested heavily in UI-driven tests. Backend and mid-tier services were tested using automated...

  • Asynchronous Processing and Multithreading in Apache Samza,...

    January 6, 2017

    This post is the second in a series discussing asynchronous processing and multithreading in Apache Samza. In the previous post, we...