Big Data Articles

  • Powering Helix’s Auto Rebalancer with Topology-Aware Partition Placement

    July 26, 2017

    Typical distributed data systems are clusters composed of a set of machines. If the dataset does not fit on a single machine, we usually shard the data into partitions, and each partition can have multiple replicas for fault tolerance. Partition management needs to ensure that replicas are distributed among machines as evenly as possible. More crucially, when a...

  • Managing "Exploding" Big Data

    June 15, 2017

    What is the shape of your big data? While we do love to talk about the size of our big data—terabytes, petabytes, and beyond—perhaps we are not paying due recognition to the shape of it. Big data comes in a variety of shapes. The Extract-Transform-Load (ETL) workflows are more or less stripe-shaped (left panel in the figure above) and produce an output of a...

  • Asynchronous Processing and Multithreading in Apache Samza, Part II: Experiments and Evaluation

    January 6, 2017

    This post is the second in a series discussing asynchronous processing and multithreading in Apache Samza. In the previous post, we explored the design and architecture of the new AsyncStreamTask API and the asynchronous event loop. In this post, we study the performance of this feature with benchmark Samza jobs. Some of the interesting...

  • Asynchronous Processing and Multithreading in Apache Samza,...

    January 4, 2017

    As part of the Apache Samza 0.11 release, we rebuilt Samza’s underlying event processing engine to use an asynchronous and parallel...

  • Stream Processing Hard Problems Part II: Data Access

    August 22, 2016

    This post is the second in a series of posts that discuss some of the hard problems in stream processing. In the previous post, we...

  • Announcing Gobblin 0.7.0: Going Beyond Ingestion

    June 29, 2016

    About a year ago, we open sourced Gobblin, a universal data ingestion framework that aimed to solve data integration challenges faced...