
Improving Espresso availability with preemptive Helix-managed traffic shift

Co-authors: Gaurav Mishra, Song Lu, Antony Curtis, Shuangyang Yang

Espresso is LinkedIn's horizontally scalable, highly-available, and elastic data-as-a-service platform that serves nearly 95% of our online storage traffic. Given its position in our tech stack, an optimization in Espresso availability can have a major impact on our members’ experience.

In this blog, we explore an improvement in availability achieved by using a Helix CurrentStates-based Router. We’ll discuss its implementation, compare it with the legacy Router, and review the performance improvements we saw on our production clusters. With this change, leadership handoffs became 63% faster at the 99th percentile, cutting the corresponding leaderless (unavailable) time by the same amount. Consequently, read/write errors observed by our clients due to Espresso unavailability were reduced by 83% on average.

For reference, we’ve provided an explainer on Espresso’s major components and terminology at the end of this post. 

Leadership handoff: The cause of unavailability

Leadership handoff is the process of taking away the leader status from the current leader replica and assigning it to a follower replica. Helix initiates this transition for a replica labeled as leader when it determines that the replica can no longer serve traffic. This could happen when the Storage Node that hosts the replica has crashed, the Storage Node is unable to communicate with Helix, or the Storage Node is shutting down for a deployment, among other reasons. As there can be at most one leader replica per partition, the current leader replica has to transition to a follower before a follower replica can transition to leader. Unavailability occurs during this brief leaderless period when none of the replicas is labeled as leader.

The Router's routing table must be updated after a successful leadership handoff so that it can route requests to the new leader replica. From the point of view of the Router or client, the total leaderless duration is the sum of the following:

  • Leadership handoff duration: The duration between leader to follower transition on one Storage Node and follower to leader transition on another Storage Node
  • Routing table update duration: The duration between a successful leadership handoff and the Router getting the routing table update from Helix
Figure: Replica state transition flow

The shorter the duration during which there is no leader, the higher the availability. In this blog, we’ll talk about a new implementation of the Router that receives faster state transition updates from Helix and results in a significant reduction in routing table update duration.

Leadership handoff results in errors for both leaderOnly and leaderThenFollower requests. To understand these errors, we need to go a bit deeper into the replica state transition flow. Here is what happens when Helix initiates leadership handoff for a leader replica by sending a state transition request to Storage Node SN1:

  1. Storage Node SN1 performs the leader to follower state transition
  2. Router gets a callback for the leader to follower transition on Storage Node SN1 (not shown in the diagram above)
  3. Storage Node SN1 performs the follower to offline state transition
  4. Router gets a callback for the follower to offline transition on Storage Node SN1
  5. Storage Node SN2 performs the follower to leader state transition
  6. Router gets a callback for the follower to leader transition on Storage Node SN2

(For simplicity, we have shown all the steps sequentially. In reality, step 3 and step 5 can start in any order and run in parallel. Similarly, step 4 and step 6 can happen in any order.)

Errors for leaderOnly requests occur between step 1 and step 6, because no leader replica is available during this window. Errors for leaderThenFollower requests occur between step 3 and step 4: the replica has gone offline, but the Router still believes it is a follower. After step 4, the Router stops sending requests to the offline replica, so no further errors occur. As you can see, the Router is critical in the transition, and any optimization in the Router reduces the leaderless duration. Let's look at the Router in more detail.
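
To make the two error windows concrete, here is a minimal, hypothetical sketch of the per-request decision a Router makes. The RoutingTable interface and method names are illustrative and not Espresso's actual code; the point is that leaderOnly requests fail for the entire leaderless window, while leaderThenFollower requests fail only while the routing table still points at a replica that has already gone offline.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a Router-side lookup; not Espresso's actual implementation.
final class ReplicaSelector {

  enum RequestType { LEADER_ONLY, LEADER_THEN_FOLLOWER }

  // Minimal routing-table view assumed by this sketch.
  interface RoutingTable {
    Optional<String> leaderOf(String partition);   // Storage Node currently marked leader
    List<String> followersOf(String partition);    // Storage Nodes currently marked follower
  }

  // Picks a target Storage Node for a partition, or empty if the request must fail.
  static Optional<String> pickTarget(RoutingTable table, String partition, RequestType type) {
    Optional<String> leader = table.leaderOf(partition);
    if (leader.isPresent()) {
      return leader; // Normal case: a leader is known.
    }
    if (type == RequestType.LEADER_ONLY) {
      // Between step 1 and step 6 the table has no leader, so leaderOnly
      // requests error out for the whole leaderless window.
      return Optional.empty();
    }
    // leaderThenFollower: fall back to a replica the table still lists as follower.
    // Between step 3 and step 4 that replica may already be offline, which is
    // where the leaderThenFollower errors come from.
    List<String> followers = table.followersOf(partition);
    return followers.isEmpty() ? Optional.empty() : Optional.of(followers.get(0));
  }
}
```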

ExternalView-based Router

The Router gets an ExternalView update callback whenever the ExternalView is updated in ZooKeeper. Upon getting this callback, the Router reads the InstanceConfigs for all Storage Nodes from ZooKeeper. The Router then updates its routing table using both ExternalView and InstanceConfigs. The process of a state transition update using the ExternalView-based Router can be summarized as follows:

  1. Helix sends a state transition request to the Storage Node
  2. Storage Node performs the state transition
  3. Storage Node updates its CurrentState in ZooKeeper
  4. Helix controller gets CurrentState update callback from ZooKeeper
  5. Helix controller generates the ExternalView
  6. Helix controller writes the ExternalView to ZooKeeper
  7. Router gets ExternalView update callback from ZooKeeper
  8. Router reads InstanceConfigs from ZooKeeper
  9. Router processes the ExternalView and InstanceConfigs
  10. Router updates the routing table
Figure: State transition process with the ExternalView-based Router
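
The flow above can be approximated with Helix's listener API. The snippet below is an illustrative sketch rather than Espresso's Router code: on every ExternalView callback it re-reads all InstanceConfigs from ZooKeeper (step 8) before rebuilding the table, which is exactly the extra work that the CurrentStates-based design removes. The ExternalViewChangeListener and HelixDataAccessor classes are part of the Apache Helix API; the TableRebuilder hook is a hypothetical stand-in for the routing table update.

```java
import java.util.List;
import org.apache.helix.HelixDataAccessor;
import org.apache.helix.HelixManager;
import org.apache.helix.NotificationContext;
import org.apache.helix.api.listeners.ExternalViewChangeListener;
import org.apache.helix.model.ExternalView;
import org.apache.helix.model.InstanceConfig;

// Sketch of an ExternalView-driven routing table refresh (steps 7-10 above).
public class ExternalViewRouterUpdater implements ExternalViewChangeListener {

  /** Hypothetical hook that swaps in a new routing table (steps 9-10). */
  public interface TableRebuilder {
    void rebuild(List<ExternalView> externalViews, List<InstanceConfig> instanceConfigs);
  }

  private final HelixManager manager;
  private final TableRebuilder rebuilder;

  public ExternalViewRouterUpdater(HelixManager manager, TableRebuilder rebuilder) {
    this.manager = manager;
    this.rebuilder = rebuilder;
  }

  @Override
  public void onExternalViewChange(List<ExternalView> externalViews, NotificationContext context) {
    // Step 8: an extra ZooKeeper read for every Storage Node's InstanceConfig.
    HelixDataAccessor accessor = manager.getHelixDataAccessor();
    List<InstanceConfig> instanceConfigs =
        accessor.getChildValues(accessor.keyBuilder().instanceConfigs());

    // Steps 9-10: combine the partition-to-state maps with the instance configs
    // and swap in the new routing table.
    rebuilder.rebuild(externalViews, instanceConfigs);
  }
}
```

Such a listener would be registered on a spectator connection with HelixManager.addExternalViewChangeListener(...).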

CurrentStates-based Router

We observed that both the generation of the ExternalView by Helix and the reading of InstanceConfigs by the Router were time-consuming. Both steps can be avoided because the Router does not need the ExternalView itself; instead, it can update the routing table by reading the CurrentStates directly from ZooKeeper. This also reduces the number of network hops, further shortening the routing table update duration. The process of a state transition update using the CurrentStates-based Router can be summarized as follows:

  1. Helix sends a state transition request to the Storage Node
  2. Storage Node performs the state transition
  3. Storage Node updates its CurrentState in ZooKeeper
  4. Router gets CurrentState update callback from ZooKeeper
  5. Router updates the routing table
Figure: State transition process with the CurrentStates-based Router
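
Helix's spectator library can build a routing table directly from CurrentStates. The sketch below shows the general shape of such a Router using RoutingTableProvider with PropertyType.CURRENTSTATES; the cluster name, ZooKeeper address, resource, and state names are placeholders, and Espresso's production Router is considerably more involved.

```java
import java.util.List;
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.PropertyType;
import org.apache.helix.model.InstanceConfig;
import org.apache.helix.spectator.RoutingTableProvider;

// Minimal spectator that sources its routing table from per-instance CurrentStates
// instead of the aggregated ExternalView.
public class CurrentStatesRouterExample {
  public static void main(String[] args) throws Exception {
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "ESPRESSO_CLUSTER",          // placeholder cluster name
        "router-1",                  // placeholder spectator name
        InstanceType.SPECTATOR,
        "zk-host:2181");             // placeholder ZooKeeper address
    manager.connect();

    // Helix subscribes to CurrentState changes for the live instances and keeps the
    // routing table up to date, skipping ExternalView generation entirely.
    RoutingTableProvider routingTable =
        new RoutingTableProvider(manager, PropertyType.CURRENTSTATES);

    // Steps 4-5: lookups reflect CurrentState callbacks as soon as they are processed.
    List<InstanceConfig> leaders = routingTable.getInstances("myDB", "myDB_0", "LEADER");
    leaders.forEach(ic -> System.out.println("Leader for myDB_0: " + ic.getInstanceName()));

    routingTable.shutdown();
    manager.disconnect();
  }
}
```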

Design choices

We implemented the CurrentStates-based Router and compared its performance with the ExternalView-based Router on our test clusters by simulating a production environment. From the first implementation, we observed some significant improvements and identified a few areas for further fine-tuning. The following sections describe our observations and design modifications.

Faster updates for the majority of state transitions
With the CurrentStates Router, the routing table update duration was reduced by 45% at the 99th percentile. However, not all state transitions were faster for the CurrentStates Router. We observed that for 10% of cases, the ExternalView Router was faster.

There are three reasons behind this: the propagation delay within the ZooKeeper cluster, the way the Helix Controller generates the ExternalView, and the way the CurrentStates Router (through the Helix client) reads the CurrentStates from ZooKeeper. The Helix Controller, Routers, and Storage Nodes each connect to one of the servers in the ZooKeeper cluster to read and write CurrentStates and the ExternalView. The Helix Controller runs a periodic pipeline to generate the ExternalView. It may be connected to the same ZooKeeper server that is receiving an update from a Storage Node; in that case, the Helix Controller gets the update sooner because there is no propagation delay between ZooKeeper servers. Since the pipeline runs periodically, the Helix Controller generates the ExternalView with whatever information it has in that cycle and writes it to its ZooKeeper server. If the ExternalView Router is also connected to that same ZooKeeper server, it receives the ExternalView update as soon as the Helix Controller writes it.

Meanwhile, the CurrentStates Router can be connected to a ZooKeeper server that is the last to receive the update. Moreover, after receiving an update, the Router tries to read all other pending CurrentState updates from ZooKeeper before updating its routing table. For these reasons, there are a few cases where the CurrentStates Router ends up being slower.

Fewer write errors, but more read errors
As the average routing table update duration was reduced, we expected a significant reduction in read/write errors. We observed a 50% reduction in write errors, but a 250% increase in read errors.

Upon investigation, we found the root cause: the lack of a timely follower to offline callback (step 4 in the replica state transition flow) in the CurrentStates Router. As described in the previous section, the CurrentStates Router reads all pending state updates from ZooKeeper before updating the routing table. This makes the Router fast at picking up the final state update (the new leader), but in some instances the intermediate update is effectively missed.

Chart: State transition timing

In the chart above, the CurrentStates Router starts reading state updates from ZooKeeper as soon as ZooKeeper receives SN1's follower to offline (F->O) update. (The chart depicts the steps for one replica; with multiple replicas of different partitions, these steps happen simultaneously.) The Router then goes on reading other state updates, and by the time it finishes, it has received SN2's follower to leader (F->L) update as well. So the routing table is updated for both SN1's F->O update and SN2's F->L update in the same callback. The ExternalView Router, on the other hand, gets the ExternalView produced by the controller's periodic pipeline and, in most cases, receives SN1's F->O update earlier.

After receiving SN1's F->O update, the Routers stop sending requests to SN1. Because the CurrentStates Router remains unaware of SN1's F->O update for longer than the ExternalView Router, it keeps sending requests to the offline replica for a longer duration, resulting in a higher number of errors.

Getting the best of both worlds with hybrid approaches
To reduce read errors, we had to find a way to detect offline replicas earlier. We tried multiple approaches to overcome this problem:

  1. Use both CurrentStates and ExternalView to update the routing table: Whenever the Router gets a callback for either an ExternalView update or a CurrentStates update, it updates its routing table. This approach resulted in more callbacks to the Router, suffered from race conditions between state updates, and did not help much with error reduction.
  2. Use the CurrentStates-based Router and add a listener to liveInstances: The idea was that before sending a request to a Storage Node, we would check whether it is active. Even if the routing table says that the Storage Node hosts the leader or follower replica, the Router would not send the request to that Storage Node if it is not active, which we hoped would result in fewer errors. Unfortunately, Helix marks the Storage Node as inactive in its list of liveInstances only after the node has finished all of its graceful-shutdown processing. This took much longer than the overall leaderless duration, so this approach didn't help either.
  3. Use the CurrentStates-based Router and add a listener to instanceConfig: The idea was similar to the second approach. The advantage of this approach is that Helix marks the Storage Node as inactive in its instanceConfig immediately after it sends the shutdown request to the Storage Node. The instanceConfig listener in the Router gets a notification as soon as Helix writes the new instanceConfig to ZooKeeper, and the Router then reads the new instanceConfig from ZooKeeper. As a result, the Router stops sending requests to that Storage Node earlier than it would if it waited for the CurrentStates callback for the follower to offline transition. This approach gave better results both in routing table update duration and in read/write errors. The only drawback was the increased read traffic on ZooKeeper caused by the Router.
  4. Have Helix provide a callback to the Router for instanceConfig changes: Based on the results of the third approach, we asked the Helix team to provide a callback to the Router for instanceConfig changes in addition to the callbacks for CurrentStates changes (a minimal sketch of such a listener follows this list). Along with the new callback, they made the following changes to make the process more efficient:
    Cache refresh with a selective update: When a change happens, the cache is refreshed only for that change type (e.g., instanceConfig change, CurrentStates change, ExternalView change) instead of performing a full cache refresh.
    Improved change queuing and event handling: Previously, instanceConfig updates could get clogged behind a CurrentStates refresh, resulting in a long wait for the instanceConfig update callback to the Router. After the change, the Router callback fires as soon as the instanceConfig is updated.
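
Here is a minimal sketch of the instanceConfig listener idea behind the third and fourth approaches, using Helix's InstanceConfigChangeListener (registered via HelixManager.addInstanceConfigChangeListener). The disabled-instance set and the canRoute check are hypothetical details standing in for how Espresso's Router consults its routing state.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.helix.NotificationContext;
import org.apache.helix.api.listeners.InstanceConfigChangeListener;
import org.apache.helix.model.InstanceConfig;

// When Helix disables a Storage Node in its InstanceConfig (e.g., at the start of a
// graceful shutdown), stop routing to it without waiting for the follower to offline
// CurrentState callback.
public class InstanceConfigGate implements InstanceConfigChangeListener {
  private final Set<String> disabledInstances = ConcurrentHashMap.newKeySet();

  @Override
  public void onInstanceConfigChange(List<InstanceConfig> configs, NotificationContext context) {
    for (InstanceConfig config : configs) {
      if (config.getInstanceEnabled()) {
        disabledInstances.remove(config.getInstanceName());
      } else {
        // The node is disabled or shutting down: stop sending it traffic right away.
        disabledInstances.add(config.getInstanceName());
      }
    }
  }

  // Consulted by the Router just before dispatching a request to a Storage Node.
  public boolean canRoute(String instanceName) {
    return !disabledInstances.contains(instanceName);
  }
}
```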

Results

We compared the performance of the ExternalView Router and the CurrentStates Router during deployments in a production environment on multiple clusters with different sizes and kinds of workloads. Here we share the results for our largest cluster, which has a read-heavy workload; the results on the other clusters were similar.

Leaderless duration (seconds):

Graph: Improvements in leaderless duration

Read/write errors (10^-3/sec):

Graph: Improvements in read/write errors

Conclusion and future work

Moving from the ExternalView-based Router to the CurrentStates-based Router entailed a major architectural change in Espresso. The Router is a fundamental component in Espresso, and it took careful design considerations and rigorous testing to ensure that the change did not introduce any unwanted side effects. With the CurrentStates-based Router, we were able to significantly reduce one component contributing to the leaderless window: the routing table update duration.

Our future projects will focus on reducing the other component: leadership handoff duration. Discussions are underway for improvements in several areas including ZooKeeper propagation latency, Kafka event producer/consumer latency, and storage node state transition latency.

Acknowledgments

Many people have contributed to this project. I would like to thank Antony Curtis, Shuangyang Yang, Song Lu, and Dan Bahir for laying the foundation of this project. A special thanks to Song Lu for his continued guidance and invaluable contributions throughout the project. Thanks to Kiran Chand Kanakkassery for providing site support. Thanks to the members of the Helix team: Jiajun Wang, Kai Sun, Hunter Lee, Huizhi Lu, Junkai Xue, and Yi Wang. I would also like to thank Yun Sun, Zac Policzer, Junkai Xue, Jeba Singh Emmanuel, Heyun Jeong, and Jaren Anderson for reviewing this blog. Finally, a great thanks to the leadership team Alok Dhariwal, Kumar Pasumarthy, Lei Xia, and Ivo Dimitrov for their continuous support and guidance throughout this work.

Further background

Espresso architecture

Figure: Espresso architecture

Client: Application sending a request (read/write) to Espresso.

Router: Receives client requests, inspects the URI, forwards the request to the appropriate Storage Node(s) using its routing table, assembles a response, and sends it to the client.

Routing Table: The component in the Router that is consulted to decide which Storage Node an incoming request should be routed to.

Storage Node: Data is partitioned across multiple Storage Nodes. Each partition has 3 replicas.

Replica States: A replica on a Storage Node can be a leader, a follower, or offline.

  1. Leader replicas serve write requests and replicate them to follower replicas. For read requests, Espresso makes the best effort to serve them from the leader replica
  2. Follower replicas may receive read requests when the leader is slow or unavailable
  3. Offline replicas do not handle requests. Any request sent to the offline replica results in an error. A replica usually moves to offline state when the Storage Node goes offline or when the cluster rebalances

State transition: A state transition for a replica happens when it moves from one state to another.

Helix: Espresso uses Apache Helix as its cluster manager. It assigns partitions to Storage Nodes in accordance with these constraints: 

  1. Only one leader replica per partition (for consistency)
  2. Leader and follower replicas are assigned evenly across all Storage Nodes (for load balancing)
  3. No two replicas of the same partition may be located on the same node or rack (for fault-tolerance)
  4. Minimize/throttle replica movement during cluster expansion (to control impact on the serving nodes)
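
For illustration, the sketch below shows roughly how a partitioned, replicated resource is declared to Helix so that the controller can enforce constraints like these. The cluster and database names are placeholders, the built-in LeaderStandby state model stands in for Espresso's leader/follower/offline model, and Espresso's real cluster setup and rebalancing configuration are more involved.

```java
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState.RebalanceMode;

// Declares a database resource to Helix; assumes the cluster and state model
// definition already exist.
public class CreateResourceExample {
  public static void main(String[] args) {
    ZKHelixAdmin admin = new ZKHelixAdmin("zk-host:2181");  // placeholder ZooKeeper address

    // A database with 4 partitions, managed in full-auto mode so the Helix controller
    // decides replica placement and initiates state transitions.
    admin.addResource("ESPRESSO_CLUSTER", "myDB", 4,
        "LeaderStandby", RebalanceMode.FULL_AUTO.name());

    // 3 replicas per partition; Helix spreads leader and follower replicas across nodes.
    admin.rebalance("ESPRESSO_CLUSTER", "myDB", 3);

    admin.close();
  }
}
```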

ZooKeeper: Helix uses Apache ZooKeeper to persist the state of each replica in the cluster.

CurrentState: The state of each replica in ZooKeeper is called its CurrentState.

ExternalView: An aggregated view of the CurrentStates of all replicas, stored in ZooKeeper. For instance, if we have a database myDB with 4 partitions, each partition having 3 replicas, the ExternalView would look like this:

Figure: Sample ExternalView
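
As an illustration of what this aggregated view contains, the sketch below reads the ExternalView for myDB through the Helix API and prints each partition's map of Storage Node to replica state. It assumes an already-connected HelixManager, and the resource, node, and state names are illustrative.

```java
import java.util.Map;
import org.apache.helix.HelixDataAccessor;
import org.apache.helix.HelixManager;
import org.apache.helix.model.ExternalView;

// Prints the partition-to-replica-state mapping held in the ExternalView.
public class PrintExternalView {
  public static void print(HelixManager manager) {
    HelixDataAccessor accessor = manager.getHelixDataAccessor();
    ExternalView view = accessor.getProperty(accessor.keyBuilder().externalView("myDB"));

    for (String partition : view.getPartitionSet()) {       // e.g., myDB_0 .. myDB_3
      Map<String, String> stateMap = view.getStateMap(partition);
      // e.g., {storage-node-1=LEADER, storage-node-2=FOLLOWER, storage-node-3=FOLLOWER}
      System.out.println(partition + " -> " + stateMap);
    }
  }
}
```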

RequestType: Clients can send two types of requests to Espresso:

  1. LeaderOnly requests are served only if the leader replica is available. An error is returned to the client if the leader is not available.
  2. For LeaderThenFollower requests, Espresso makes the best effort to serve the request from the leader replica. A request is sent to a follower replica in case the leader is unavailable. An error is returned if both leader and follower are unavailable.

All write requests are leaderOnly requests. For read requests, clients have an option to specify whether the read request is leaderOnly (for high consistency) or leaderThenFollower (for high availability).

InstanceConfig: The configuration of each Storage Node is called its InstanceConfig. Helix updates it for any changes made to the Storage Node’s config. This information is persisted on ZooKeeper.

LiveInstance: A Storage Node is considered a liveInstance if it is online. Helix maintains a set of liveInstances that gets updated when Storage Nodes go online or offline.