Powering Helix’s Auto Rebalancer with Topology-Aware Partition Placement

July 26, 2017

A typical distributed data system is a cluster composed of a set of machines. If the dataset does not fit on a single machine, we usually shard the data into partitions, and each partition can have multiple replicas for fault tolerance. Partition management needs to ensure that replicas are distributed among machines as evenly as possible. More crucially, when a node fails, it is important that the partitions hosted on that node are reallocated evenly among the remaining nodes. In general, when a machine joins or leaves the cluster, partitions need to be rebalanced across the new or remaining machines. The placement of partitions in a distributed data system is essential for the system's reliability and scalability.


Partition assignments are critical to a typical distributed data system. A partition’s replicas could be in different states. For example, in the above graph, each partition has three replicas; one of them is the Primary replica, while the other two are Backup replicas.

 

Apache Helix is a generic cluster management framework used for the automatic management of partitioned and replicated distributed systems hosted on a cluster of nodes. Helix automates the reassignment of partitions in the face of node failure and recovery, cluster expansion, and reconfiguration. For a detailed description of the architecture of the Apache Helix system and how it can be used to manage distributed systems, visit the website. You can also read about Helix’s concepts and design philosophy here.

Helix was originally developed within LinkedIn and is now widely used both within and outside LinkedIn to manage critical production services that require fault tolerance and expansion without any downtime. Within LinkedIn, Helix has been used to build and manage many of our distributed data systems and production services, such as: Espresso, LinkedIn's horizontally-scalable document store for primary data; Venice, our derived data serving platform for merging batch and streaming data sources; Ambry, LinkedIn's distributed object store; Gobblin, a unified ingestion framework; and our real-time analytics platforms like Pinot. Helix has also been adopted by several other companies, notably for Uber's cross-datacenter Kafka replicator, Instagram's IGDM, Box's Box Notes, Turn's high-performance key-value store, clustering in jBPM, and Pinterest's Terrapin.

As we discussed above, the critical function of Helix is to manage the resources and partitions of data systems. Helix internally employs a rebalancer workflow to determine how to allocate a resource's partitions to nodes as the cluster topology changes. To manage a set of resources broken into partitions and replicas as nodes become available or unavailable, the Helix rebalancer can be plugged with different strategies (which we call "rebalance strategies") to determine the partition-to-node mapping given the current cluster state. Helix's previous default rebalance strategy was a simple hash-based heuristic.

However, we are moving to data centers with single top-of-rack switches, which introduce a single point of failure: the loss of a switch effectively means the loss of all machines in that rack. More generally, when running a cluster of nodes across multiple racks or multiple availability zones, the Helix partition placement logic needs to understand the cluster topology in order to place replicas such that the loss of a rack or an availability zone does not result in data unavailability or a service disruption.

In addition to topology-aware partition placements, there are a few other requirements we believe are important to Helix’s partition placement logic.

  1. Even distribution of partitions. Partitions for a data resource should be evenly distributed among a cluster of machines, and ideally, replicas in the same state should be evenly distributed within these machines, too. An uneven distribution of partitions (and thus data) can result in unbalanced service load, e.g., some machines receiving more requests than others, which can lead to service disruption or lower utilization.

  2. Handle nodes with different processing capability. In our legacy default partition management logic, all machines within the same cluster are treated as having equal capacity. However, in real scenarios, machines in a cluster may not be homogeneous; some of them may be able to handle more traffic and data than others. The partition management logic needs to take into account these capacity differences when distributing the partitions.

  3. Minimized partition movements during topology changes. As nodes become available or unavailable, or during topology changes (adding a node, removing a node, adding a new rack, etc.), partitions may be reshuffled among the remaining nodes to keep resources load-balanced across the cluster. However, for a stateful distributed data service such as a data store, the cost of moving a replica to a different node can be high, since it usually involves bootstrapping data from other replicas or a remote backup. Therefore, the partition management logic needs to minimize these movements as much as possible.

In this post, we will discuss the topology-aware rebalance strategy we have recently introduced into Helix’s full-auto rebalancer, as well as how these features achieve the requirements we have just described.

Helix rebalancer

Helix's rebalancer is one of its most critical components. It is tasked with determining how to allocate resources to nodes as the cluster topology changes. To manage a set of resources broken into partitions and replicas as nodes become available or unavailable, the Helix rebalancer needs to determine the best partition-to-node mapping given the current cluster state, and to move partitions around nodes if needed to meet all the constraints while ensuring overall load distribution across the cluster.

The Helix rebalancer employs a rebalance workflow to compute the partition-node assignment and generate partition movement plans for a Helix-managed cluster. The graph below shows each individual task that will be executed during the rebalance workflow. Some of the important tasks include:

  • Gather current cluster state: This task collects all of the required cluster state, including instance liveness, current replica states, and system configurations.

  • Compute IdealState: After collecting the required cluster states, the rebalancer computes the new IdealState for each resource that satisfies all constraints. Helix’s rebalancer is designed to be able to plug in different rebalance strategies when computing the IdealState of the system.

  • Generate partition movement and state transition plan: In this step, the CurrentState is compared with the IdealState. If there is any difference, Helix will determine the partition movement, and thus the state transition plan, for each resource.

    • Compute difference: Once the IdealState is computed, Helix looks up the state machine and computes the transitions required to take the system from its current state to IdealState.

    • Apply constraints: Blindly issuing all the transitions computed in the previous step might violate constraints; for example, a master-to-slave transition on one node should happen before the counterpart slave-to-master transition on another node, to ensure there is at most one master at a time. There can also be priorities among different resources, as well as throttling set by the application. Helix computes a subset of the transitions such that executing them in any order will not violate the system constraints.

  • Generate and issue state transition requests: This is the last task in the workflow, where the state transition requests are sent to the participants. On receiving a request, the Helix client agent embedded in the participant invokes the appropriate transition handlers provided by the application. Once the transition completes, its success or failure is reflected in the CurrentState. The Helix controller reads the CurrentState and re-runs the rebalance workflow until the CurrentState converges with the IdealState. Once they are identical, the system is considered to be in a stable state. Having this simple goal of computing the IdealState and issuing appropriate transitions such that the CurrentState converges with the IdealState keeps the Helix controller generic.


Helix rebalance workflow
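The compute-difference step of the workflow can be sketched in a few lines. This is an illustrative Python sketch, not Helix's actual (Java) implementation; the function name, state names, and data layout are hypothetical.

```python
# Hypothetical sketch of the "compute difference" step: diff the CurrentState
# against the IdealState and list the transitions needed to converge.
# States are simplified to strings like "MASTER"/"SLAVE"/"OFFLINE".

def compute_transitions(current_state, ideal_state):
    """Return (partition, node, from_state, to_state) tuples for every
    replica whose current state differs from its ideal state."""
    transitions = []
    for partition, ideal_replicas in ideal_state.items():
        current_replicas = current_state.get(partition, {})
        for node, target in ideal_replicas.items():
            if current_replicas.get(node) != target:
                transitions.append((partition, node,
                                    current_replicas.get(node, "OFFLINE"),
                                    target))
    return transitions

# Toy example: node n1 dropped, so n2 should be promoted and n3 bootstrapped.
current = {"p0": {"n1": "OFFLINE", "n2": "SLAVE"}}
ideal   = {"p0": {"n2": "MASTER", "n3": "SLAVE"}}
print(compute_transitions(current, ideal))
# → [('p0', 'n2', 'SLAVE', 'MASTER'), ('p0', 'n3', 'OFFLINE', 'SLAVE')]
```

In the real workflow, the "apply constraints" task would then filter and order these raw transitions (e.g., master-to-slave before the counterpart slave-to-master) before anything is issued.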

 

Topology-aware partition placement

In a typical modern data center deployment model with single top-of-rack switches, the switches introduce a single point of failure where the loss of a switch effectively means the loss of all hosts in the rack. More generally, when running a cluster of nodes across multiple racks, or multiple availability zones, Helix needs to be aware of the physical configuration of a cluster (cluster topology) so that it can ensure that replicas from the same partition are spread across different racks or zones. This is to minimize the risk of losing all data replicas, and thus data unavailability or service disruption, in the case of a failure of a rack or availability zone.

Helix employs a CRUSH-based algorithm to determine partition placement. CRUSH is a scalable, pseudo-random data distribution function designed for distributed object-based storage systems; it efficiently maps data objects to storage devices without relying on a central directory. Because large systems are inherently dynamic, CRUSH is designed to facilitate the addition and removal of storage while minimizing unnecessary data movement. The algorithm accommodates a wide variety of data replication and reliability mechanisms and distributes data according to user-defined policies that enforce separation of replicas across availability zones.
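The flavor of CRUSH-style selection can be illustrated with weighted rendezvous hashing, which shares the key properties described above: each mapping is computed independently from a hash with no central directory, and removing a node only relocates the replicas that node held. This is a hedged Python sketch of the idea, not Helix's or CRUSH's actual implementation; all names are illustrative.

```python
import hashlib
import math

def draw(key, node):
    """Deterministic pseudo-random value in (0, 1) for a (key, node) pair."""
    h = int(hashlib.md5(f"{key}:{node}".encode()).hexdigest(), 16)
    return (h + 1) / (16**32 + 1)

def place_replicas(partition, nodes, weights, n_replicas):
    """Pick n_replicas nodes by 'longest straw': each node draws a weighted
    pseudo-random score, and the highest scores win. Higher-weight nodes win
    proportionally more often (weighted rendezvous hashing)."""
    score = lambda n: -weights[n] / math.log(draw(partition, n))
    return sorted(nodes, key=score, reverse=True)[:n_replicas]

# Toy demo: 10 equally weighted nodes, 3 replicas per partition.
nodes = [f"n{i}" for i in range(10)]
weights = {n: 1 for n in nodes}
chosen = place_replicas("p1", nodes, weights, 3)
```

Because each node's score depends only on the (partition, node) pair, removing a node leaves every other node's score unchanged, so only the removed node's replicas move — the minimal-movement property the post attributes to CRUSH.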

By using a CRUSH-based algorithm, Helix's new topology-aware rebalancing supports more flexible cluster topologies by structuring a cluster into a tree. The application can configure the Helix rebalancer to isolate replicas at any layer of the tree (called availability zones). For example, the cluster may be composed of hosts across a few rooms, where hosts in each room are in different racks, and hosts in each rack are divided into sub-racks. In this case, Helix will model your cluster as a tree with four levels, and it can be configured to isolate replicas from the same partition at any one of them: room, rack, sub-rack, or host. The topology-agnostic case is covered as well; it is simply the configuration that places the isolation zone at the host or instance level.
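The isolation rule itself amounts to "choose distinct fault zones first, then a node within each chosen zone." A hypothetical two-level Python sketch (Helix's real implementation walks the full configured topology tree; the names here are illustrative):

```python
import hashlib

def rank(key, item):
    """Deterministic score in [0, 1) used to rank candidates."""
    return int(hashlib.md5(f"{key}|{item}".encode()).hexdigest(), 16) / 16**32

def place_with_isolation(partition, topology, n_replicas):
    """Sketch of fault-zone isolation: topology maps zone -> list of nodes.
    Pick n_replicas distinct zones first, then one node inside each chosen
    zone, so no two replicas of the partition share a zone."""
    zones = sorted(topology, key=lambda z: rank(partition, z), reverse=True)
    placement = []
    for zone in zones[:n_replicas]:
        node = max(topology[zone], key=lambda n: rank(partition, n))
        placement.append((zone, node))
    return placement

# Toy demo: three racks of two hosts, two replicas per partition.
topology = {"rack1": ["a", "b"], "rack2": ["c", "d"], "rack3": ["e", "f"]}
placement = place_with_isolation("p0", topology, 2)
```

With this structure, losing an entire zone can take out at most one replica of any partition, which is exactly the guarantee the isolation zone is meant to provide.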

To support these features, Helix introduces a cluster-level topology definition and an instance-level domain configuration. The topology field defines the cluster tree structure, while the domain in each instance specifies that instance's position in the cluster tree. The topology describes the structure and levels of a cluster, for example, /room/rack/sub-rack/host, which tells Helix that this cluster has four logical levels. Along with the topology, an isolation zone is also specified. Then a fully configurable, hierarchical domain ID is configured for each instance, for example, /prod-dc-us-west/rack-id1/sub-rack-2/node-123.prod/001. Helix makes sure the partition placements are isolated at the specified isolation-zone level.

Topology and availability zone defined in clusterconfig

Sample domain id defined in instance’s config
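The configuration screenshots are not reproduced here, but putting the two pieces together, the settings described above might look roughly like the following sketch, expressed as Python dicts. The key names (TOPOLOGY, FAULT_ZONE_TYPE, DOMAIN) follow Helix's documented configuration fields, but the values are made up for illustration; check the Helix documentation for the exact format in your version.

```python
# Hedged sketch of the cluster- and instance-level settings described above.
# Key names follow Helix's documented config fields; values are illustrative.

cluster_config = {
    "TOPOLOGY": "/room/rack/sub-rack/host",  # four logical levels
    "FAULT_ZONE_TYPE": "rack",               # isolate replicas across racks
}

instance_config = {
    # This instance's position in the cluster tree, one value per level.
    "DOMAIN": "room=prod-dc-us-west,rack=rack-id1,"
              "sub-rack=sub-rack-2,host=node-123.prod",
}

# The fault zone type must name one of the topology's levels.
levels = cluster_config["TOPOLOGY"].strip("/").split("/")
assert cluster_config["FAULT_ZONE_TYPE"] in levels
```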

Partition distribution over homogeneous machines

We have performed some analysis on how the Helix full-auto rebalancer distributes replicas among a cluster of nodes. In our experiments, we were mostly interested in seeing how evenly the replicas get distributed. We created a cluster with 96 nodes, split among 6 racks (i.e., 16 nodes per rack). All nodes are equally weighted.

The following two graphs show how different numbers of replicas get distributed among the nodes by applying the CRUSH algorithm directly. The first graph contains a cluster with a total of 120K replicas, while the second graph shows results from distributing just 2,000 replicas to the same cluster.

Replica distribution for a cluster with a large number of partitions (120k) using CRUSH. The green horizontal line shows the optimal distribution.

Replica distribution for a cluster with a relatively small number of partitions (2,000) using CRUSH. The green horizontal line shows the optimal distribution.

Note the similar distribution of replicas across each rack when using a large number of partitions. The experiment results show that CRUSH can achieve relatively good distributions when the number of replicas in the cluster is large. On the other hand, when there is only a small number of replicas in a cluster, their distribution can vary considerably and may not be well balanced.

Multi-CRUSH strategy

To alleviate this uneven distribution of replicas when applying CRUSH directly, one improvement we have made is to apply the CRUSH algorithm multiple times. The first CRUSH pass generates the overall replica placement across all availability zones; a second round then recomputes and redistributes the partitions within each zone to make sure they are more evenly distributed. We refer to this as the Multi-CRUSH strategy.
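The two-pass idea can be sketched as follows. This is an illustrative simplification, not Helix's actual code: pass one does a plain pseudo-random, CRUSH-like assignment of partitions to zones, and pass two re-spreads each zone's partitions round-robin over its nodes so per-node counts within a zone differ by at most one.

```python
import hashlib
from collections import defaultdict

def score(key, item):
    """Deterministic pseudo-random score for ranking candidates."""
    return int(hashlib.md5(f"{key}#{item}".encode()).hexdigest(), 16)

def multi_crush(partitions, zones, nodes_by_zone):
    """Two-pass placement sketch: pseudo-random zone assignment, then an
    even re-spread of each zone's partitions over that zone's nodes."""
    by_zone = defaultdict(list)
    for p in partitions:                      # pass 1: zone-level placement
        zone = max(zones, key=lambda z: score(p, z))
        by_zone[zone].append(p)

    assignment = {}
    for zone, ps in by_zone.items():          # pass 2: even spread in zone
        nodes = nodes_by_zone[zone]
        for i, p in enumerate(sorted(ps, key=lambda q: score(q, zone))):
            assignment[p] = nodes[i % len(nodes)]
    return assignment

# Toy demo: 600 partitions over 6 zones of 4 nodes each.
zones = [f"z{i}" for i in range(6)]
nodes_by_zone = {z: [f"{z}-n{j}" for j in range(4)] for z in zones}
assignment = multi_crush([f"p{i}" for i in range(600)], zones, nodes_by_zone)
```

The round-robin second pass trades some placement stability for evenness; a production implementation would also work to keep partition movements minimal when nodes change.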

In addition, as a comparison, we also implemented a consistent hashing-based partition placement strategy. In such a strategy, Helix builds multiple layers of hashing rings for a cluster, with each layer of the rings representing the corresponding layer of the cluster’s topology tree.  For example, with a two-layer cluster tree (/rack/node), the rack-layer hash ring is used to determine which rack a given replica should be placed on. Then the second layer of hashing rings, one for each rack, determines the individual node to place this replica on.
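That layered design can be sketched with a small consistent-hash ring stacked per topology layer. This is hypothetical Python for illustration, not Helix's implementation; the class and function names are made up.

```python
import bisect
import hashlib

def ring_hash(item):
    """Map an arbitrary item to a point on the hash ring."""
    return int(hashlib.md5(str(item).encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative sketch)."""
    def __init__(self, members, vnodes=64):
        self.points = sorted((ring_hash(f"{m}/{v}"), m)
                             for m in members for v in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def lookup(self, key):
        """Return the member owning the first point at/after the key's hash."""
        i = bisect.bisect(self.keys, ring_hash(key)) % len(self.points)
        return self.points[i][1]

def place(partition, rack_ring, node_rings):
    """Two-layer lookup: the rack ring picks the rack, then that rack's
    own ring picks the individual node."""
    rack = rack_ring.lookup(partition)
    return rack, node_rings[rack].lookup(partition)

# Toy demo: three racks of three nodes each.
racks = {f"r{i}": [f"r{i}-n{j}" for j in range(3)] for i in range(3)}
rack_ring = HashRing(list(racks))
node_rings = {r: HashRing(ns) for r, ns in racks.items()}
chosen = place("p42", rack_ring, node_rings)
```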

The distribution of replicas produced by the Multi-CRUSH strategy is shown in the two graphs below. The first distributes 120,000 replicas over 96 nodes, while the second shows the distribution of 2,000 replicas over the same nodes. Comparing these results with the CRUSH strategy, we can see that replicas are distributed more evenly when the Multi-CRUSH strategy is used, both with a small number of replicas and with a large number.

The last graph in this section compares the relative standard deviation (RSD) of replica distributions for a range of replica counts under each of the three strategies outlined above. The results suggest that the CRUSH strategy generally achieves a better distribution than consistent hashing, while Multi-CRUSH achieves the most even distribution of the three.
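For reference, the evenness metric used here, relative standard deviation, is just the standard deviation of per-node replica counts normalized by the mean:

```python
import statistics

def rsd(counts):
    """Relative standard deviation of per-node replica counts, in percent:
    population stdev divided by the mean. Lower means a more even spread."""
    return 100.0 * statistics.pstdev(counts) / statistics.mean(counts)

print(rsd([10, 10, 10, 10]))           # → 0.0 (perfectly even)
print(round(rsd([8, 12, 10, 10]), 2))  # → 14.14
```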

Replica distribution for a cluster with a large number of partitions (120k) using Multi-CRUSH. The green horizontal line shows the optimal distribution.

Replica distribution for a cluster with a relatively small number of partitions (2,000) using Multi-CRUSH. The green horizontal line shows the optimal distribution.

 

Relative standard deviations of replica distribution for varying numbers of partitions in the cluster, under the three placement strategies.

 

Partition distribution over heterogeneous machines

In our second set of experiments, we again created a cluster of 6 racks with 16 nodes each (96 nodes total). This time, however, the nodes within each rack were weighted differently: the first 10 nodes were given a weight of 1, the next 3 nodes a weight of 2, and the remaining 3 nodes a weight of 3. With this setup, ideally the number of replicas assigned to a node with weight 2 should be double the number assigned to a node with weight 1.
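Under this weighting, the ideal per-node replica counts can be computed directly, which gives a baseline for reading the graphs that follow:

```python
# Ideal replica counts for the setup above: 6 racks, each with
# 10 nodes of weight 1, 3 nodes of weight 2, and 3 nodes of weight 3.
replicas = 120_000
weights_per_rack = [1] * 10 + [2] * 3 + [3] * 3
total_weight = 6 * sum(weights_per_rack)   # 6 * 25 = 150 weight units
per_unit = replicas / total_weight         # 800 replicas per weight unit
ideal = {w: per_unit * w for w in (1, 2, 3)}
print(ideal)  # → {1: 800.0, 2: 1600.0, 3: 2400.0}
```

So a weight-2 node should hold exactly twice the replicas of a weight-1 node (1,600 vs. 800), matching the ideal case described above.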

The following three graphs show the results of distributing 120K replicas to such a cluster using our three strategies: pure CRUSH, Multi-CRUSH, and consistent hashing. The results are similar to those in the last section. Overall, Multi-CRUSH achieves the most even distribution of replicas among the three strategies on a cluster of heterogeneous nodes.

Replica distribution for a cluster of heterogeneous nodes with 120k replicas using the CRUSH strategy.

 

Replica distribution for a cluster of heterogeneous nodes with 120k replicas using the consistent hashing strategy.

Replica distribution for a cluster of heterogeneous nodes with 120k replicas using the Multi-CRUSH strategy.

 

Additional Enhancements

In addition to the new topology-aware features we have discussed, there are a few other enhancements we have recently introduced to Helix's rebalancer, described briefly below. More details about these enhancements are discussed here.

  • Delayed Partition Movement. There are many cases where a node goes down temporarily: scheduled maintenance, a node crash, a long GC pause, an intermittent network partition, etc. In these scenarios, we would like to minimize partition movements as much as possible while maintaining the various constraints and availability needs of applications. With delayed partition movement, Helix's rebalancer can avoid reshuffling the replicas resident on a node during a short outage, while still giving applications the flexibility to maintain their constraints and availability requirements.

  • Partition Movement Throttling. Helix moves partitions whenever node availability changes or the load among existing nodes becomes imbalanced. This balancing procedure is usually transparent to the user and application. However, partition movements carry some overhead in terms of bandwidth and workload, both of which can impact application performance. This feature enables Helix to throttle partition movements and move only a configurable number of partitions at a time.

  • Partition Movement Priority. When partitions are being moved, certain types of partition movement should be treated as higher priority than others. For example, bringing up a primary (master) replica for a partition should always take higher priority than bootstrapping a backup (slave) replica. With this enhancement, Helix allows an application to specify a certain order in which Helix should move the partitions.

  • Maintain Availability During Rebalance. This enhancement ensures that Helix always tries to maintain a minimum number of active replicas when moving partitions across machines during the rebalancing process.

Future work

Partition management is critical to distributed data systems and it is non-trivial work to build such logic into these systems. Helix's rebalancer is tasked with determining the best partition-to-node mapping as the cluster topology changes, and with moving partitions around nodes if needed to meet all the constraints while ensuring overall load distribution across the cluster. We are continuously incorporating more features into Helix’s rebalancer, as well as improving its scalability and operability. Two interesting improvements we are currently working on are:

  • Helix’s rebalancer currently assumes that all data partitions from the same or different resources are equally sized. We will enhance it to better support systems with resources having partitions with different sizes (weights).

  • Helix’s rebalancer is not aware of the resource usage or load of individual machines when it places partitions onto them. We are going to add a resource monitoring layer to Helix so that machine load/responsiveness will be considered when Helix rebalances partitions.

Acknowledgements

Designing, developing, and launching Helix’s full-auto rebalancer has truly been a team effort. The engineers who were involved in this effort include Junkai Xue, Jiajun Wang, and Weihan Kong, and our SRE, Eric Blumenau. We would like to thank Tom Quiggle, Yun Sun, Yulin Li, Shuangyang Yang, Yan Yan, Gaojie Liu, and other dev and SRE engineers from our storage infrastructure teams for their invaluable input and feedback on Helix. In addition, we want to thank Kishore Gopalakrishna and other committers from the Apache Helix community. Finally, all of this work is not even possible without constant encouragement and support from our management, Eric Kim and Swee Lim, as well as the support provided by the managers of our partner storage teams: Ivo Dimitrov, Kumar Pasumarthy, Hardik Kheskani, Alok Dhariwal, and Siddharth Singh.
