Co-authors: Chris Li, Kevin Lau, and Subbu Sanka
Editor’s Note: Recently, the Apache Software Foundation (ASF) announced Apache® Gobblin™ as a Top-Level Project (TLP). For more information, visit https://gobblin.apache.org/ and https://twitter.com/ApacheGobblin.
Introduction: Our big data ecosystem is larger than 1 exabyte and growing, while ingesting and...
Big Data Articles
-
- Topics: Gobblin, Big Data, Open Source
-
Over the last two years at LinkedIn, I’ve been working on a distributed key-value database called “Venice.” Venice is designed to be a significant improvement to Voldemort Read-Only for serving derived data. In late 2016, Venice started serving production traffic for batch use cases that were very similar to the existing uses of Voldemort Read-Only. In the time...
- Topics: Stream Processing, Big Data, Kafka, serving infrastructure
-
Editor's note: This blog has been updated. Typical distributed data systems are clusters composed of a set of machines. If the dataset does not fit on a single machine, we usually shard the data into partitions, and each partition can have multiple replicas for fault tolerance. Partition management needs to ensure that replicas are distributed among machines as...
- Topics: Apache Helix, Big Data, Distributed Systems, Open Source
-
What is the shape of your big data? While we do love to talk about the size of our big data—terabytes, petabytes, and beyond—perhaps...
- Topics: Big Data, machine learning, data science, Analytics
-
This post is the second in a series discussing asynchronous processing and multithreading in Apache Samza. In the previous post, we...
- Topics: Apache Samza, Stream Processing, Big Data, Kafka
-
As part of the Apache Samza 0.11 release, we rebuilt Samza’s underlying event processing engine to use an asynchronous and parallel...
- Topics: Apache Samza, Stream Processing, Big Data, Kafka