Stream Processing Articles

  • Open Sourcing Brooklin: Near Real-Time Data Streaming at Scale

    July 16, 2019

    Brooklin - a distributed service for streaming data in near real-time and at scale - has been running in production at LinkedIn since 2016, powering thousands of data streams and over 2 trillion messages per day. Today, we are pleased to announce the open-sourcing of Brooklin and that the source code is available in our GitHub repo! Why Brooklin? At LinkedIn,...

  • Using Virtual Private Clusters for Testing Apache Samza

    June 20, 2019

    If Apache Kafka is the lifeblood of all nearline processing at LinkedIn, then Apache Samza is the beating heart pumping that blood around. Samza at LinkedIn is provided as a managed stream processing service where applications bring their logic (leveraging the wide variety of Samza APIs), while the service handles the hosting, managing, and operations of the...

  • Bridging Offline and Nearline Computations with Apache Calcite

    January 29, 2019

    The existing Lambda architecture: With the evolution of big data technologies over time, two classes of computations have been developed for processing large-scale datasets: batch and streaming. Batch computation was developed for processing historical data, and batch engines, like Apache Hadoop or Apache Spark, are often designed to provide correct and complete,...

  • Samza 1.0: Stream Processing at Massive Scale

    November 27, 2018

    We are pleased to announce today the release of Samza 1.0, a significant milestone in the history of the project. Apache Samza is a...

  • Unstructured Data Transfer in Rest.li

    November 2, 2018

    A few years ago, we announced Rest.li 2.x and a Protocol Upgrade Story. Today, we are excited to share another major milestone: the...

  • Gobblin Enters Apache Incubation

    January 17, 2018

    Gobblin is a distributed data integration framework that simplifies common aspects of big data integration, such as ingestion,...