Stream Processing Articles

  • FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format

    January 6, 2021

    Co-authors: Zihan Li, Sudarshan Vasudevan, Lei Sun, and Shirshanka Das

    Data analytics and AI power many business-critical use cases at LinkedIn. We need to ingest data in a timely and reliable way from a variety of sources, including Kafka, Oracle, and Espresso, bringing it into our Hadoop data lake for subsequent processing by AI and data science pipelines. We...

  • From Lambda to Lambda-less: Lessons learned

    December 1, 2020

    Co-authors: Xiang Zhang and Jingyu Zhu

    The Lambda architecture has become a popular architectural style that promises both speed and accuracy in data processing by using a hybrid approach of both batch processing and stream processing methods. But it also has some drawbacks, such as complexity and additional development/operational overheads. One of...

  • Building a better and faster Beam Samza runner

    October 1, 2020

    Co-authors: Yixing Zhang, Bingfeng Xia, Ke Wu, and Xinyu Liu

    Since the Beam Samza runner was developed at LinkedIn in 2018, usage has grown to 100+ Samza Beam jobs running in production. As our usage grew, we wanted to better understand how the Samza runner performs compared to other runners and identify areas of improvement. In general, for stream processing platforms,...

  • Bridging batch and stream processing for the Recruiter...

    July 14, 2020

    Co-authors: Khai Tran and Steve Weiss

    Batch and streaming computations are often combined together in the Lambda architecture, but...

  • Open sourcing Brooklin: Near real-time data streaming at...

    July 16, 2019

    Editor's note: This blog has been updated. Brooklin—a distributed service for streaming data in near real-time and at scale—has been...

  • Using virtual private clusters for testing Apache Samza

    June 20, 2019

    If Apache Kafka is the lifeblood of all nearline processing at LinkedIn, then Apache Samza is the beating heart pumping that blood...