
HTTP/2 in infrastructure: Ambry network stack refactoring

Co-authors: Ze Mao, Matt Wise, Casey Getz, Justin Lin, Ashish Singhai, and Rob Block

Introduction

Ambry is LinkedIn's scalable geo-distributed object store. Developed in-house and open sourced in 2016, Ambry stores tens of petabytes of data. 

At LinkedIn, Ambry is used to store objects like photos, videos, and resume uploads, as well as internal binary data. After years of production use, Ambry has proven to be stable and easy to operate. But there have been challenges along the way, the most recent being around the network bottleneck between the Ambry frontend and storage nodes. In this post, we will explain how we successfully adopted Netty-based HTTP/2 in our infrastructure service to solve the network bottleneck and examine the performance of the new stack.

Ambry architecture review

Before we discuss the network bottleneck, let's first review the current Ambry architecture. Ambry consists of frontends, storage nodes, and a cluster manager. The frontend nodes are stateless; they expose HTTP interfaces to clients and route requests to the storage tier. The storage nodes handle data persistence, compaction, and replication, while the cluster manager manages cluster membership.

[Figure: Ambry architecture overview]

Ambry provides a handle store API. Ambry clients send requests to an Ambry frontend chosen via a load balancing layer. On receiving a PUT request, the frontend generates a Blob ID, routes the request to storage servers for persistence, and then returns the Blob ID to the client. With this Blob ID, the client can request the data by sending a GET request. 

There are several benefits of this design:

  1. The frontend tier is stateless and can be expanded easily. Frontend hosts can therefore be cheaper, since they don't need special hardware to support multiple disks.

  2. Multiple features can be implemented in the frontends, including encryption, compression, signed URLs, ACLs, and so on.

  3. Storage node design is simplified, as it only needs to take care of data persistence and not business logic.

Here is how PUT and GET requests are processed inside the Ambry cluster. For PUT, (1) the client sends a request to the frontend, (2) the frontend chooses a partition, (3) generates the Blob ID, and then (4) sends parallel requests to a configured number of replicas on storage nodes. (5) Once a quorum of successful responses is received, (6) the frontend returns an HTTP response to the client, along with the Blob ID. See diagram below:

[Figure: PUT request flow]

For GET, (1) the client sends a request to the frontend, (2) the frontend determines the partition based on the Blob ID and then (3) sends requests to a configured number of replicas on storage nodes. (Note: with AdaptiveOperationTracker, we can defer sending the second request until the first one takes longer than the 95th percentile of observed latency.) As soon as one response is (4) returned successfully to the frontend, (5) the frontend returns the blob data to the client. See flow diagram below:

[Figure: GET request flow]
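To make the deferral idea concrete, here is a minimal, hypothetical sketch of how such a tracker could decide whether to fan out the second replica request. The class name and the naive percentile bookkeeping are illustrative only; they are not Ambry's actual AdaptiveOperationTracker.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: defer the second replica GET until the first request has been
// in flight longer than the observed 95th-percentile latency.
public class AdaptiveFanoutSketch {
  private final List<Long> observedLatenciesMs = new ArrayList<>();

  // Record the latency of each completed replica request.
  synchronized void record(long latencyMs) {
    observedLatenciesMs.add(latencyMs);
  }

  // Naive 95th-percentile computation, for illustration only.
  synchronized long p95Ms() {
    if (observedLatenciesMs.isEmpty()) {
      return 0; // no history yet: fall back to fanning out immediately
    }
    List<Long> sorted = new ArrayList<>(observedLatenciesMs);
    Collections.sort(sorted);
    return sorted.get((int) Math.ceil(0.95 * sorted.size()) - 1);
  }

  // Send the second replica request only once the first has been in flight longer than p95.
  boolean shouldSendSecondRequest(long firstRequestElapsedMs) {
    return firstRequestElapsedMs > p95Ms();
  }
}
```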

Prior to the HTTP/2 refactoring described in this post, the protocol between frontends and storage nodes was a custom protocol over raw TCP sockets, with a TCP client in the frontend and a TCP server in the storage node.

The client lives in the frontend's router library; it is a non-blocking, multi-connection client based on the Java NIO selector. Each client processes all of its requests and responses asynchronously on a single thread, and the number of clients can be scaled out. In practice, we create multiple router instances based on the number of CPU cores.

The server on the storage nodes is a TCP server built around a multi-threaded model (a simplified sketch follows the list below):

  • 1 Acceptor thread that handles new connections

  • N Processor threads that each have their own selector and read requests from sockets

  • M Handler threads that handle requests and produce responses back to the Processor threads for writing.
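Below is a simplified sketch of this acceptor/processor/handler hand-off. It uses plain threads and a queue where the real server uses NIO selectors and sockets; all names are illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative skeleton of the storage server threading model described above.
public class StorageServerThreadingSketch {
  private final BlockingQueue<String> requestQueue = new ArrayBlockingQueue<>(1024);

  void start(int numProcessors, int numHandlers) {
    // 1 acceptor thread: accepts new connections and assigns them to processors (elided).
    new Thread(() -> { /* accept connections, round-robin them across processors */ }, "acceptor").start();

    // N processor threads: each runs its own selector loop, reads requests from its
    // sockets, and enqueues them for the handler threads (selector loop elided).
    for (int i = 0; i < numProcessors; i++) {
      new Thread(() -> { /* selector loop: read bytes, parse request, requestQueue.offer(...) */ },
          "processor-" + i).start();
    }

    // M handler threads: do the actual work (disk reads/writes, replication) and hand
    // responses back to the processor that owns the socket, which writes them out.
    for (int i = 0; i < numHandlers; i++) {
      new Thread(() -> {
        try {
          while (!Thread.currentThread().isInterrupted()) {
            String request = requestQueue.take();
            // handle the request, then return the response to the owning processor (elided)
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }, "handler-" + i).start();
    }
  }
}
```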

In addition, each Ambry frontend holds a pool of live connections to all storage nodes. 

Network stack bottleneck

The design above worked well when there were hundreds of storage nodes and total storage capacity was less than 5 PB. However, with the growth of the cluster and SSL communication requirements, frontends became a bottleneck in our system. We encountered the following issues:

  • Ran out of file descriptors because more connections were required.

    • We solved this by changing the max allowed file descriptors from 200K to 500K. 

  • Ran out of connections.

    • With our old stack, multiplexing was not supported: a connection was fully occupied until its response was fully received. This was not an issue for plaintext connections; with SSL, however, connections were easily exhausted because SSL setup and encryption kept each connection busy longer. We mitigated this to some degree with 30 pre-warmed connections between each frontend and storage node, but the increased connection occupancy from SSL encryption/decryption continued to exhaust the connection pool, degrading latency during regular traffic and causing large spikes under heavy load.

  • Experienced poor SSL performance from the JDK's default implementation.

  • Needed network related performance tuning, especially after we enabled SSL.

    • We had high SSL handshake latency in the beginning and added SSL connection warmup to mitigate the handshake latency. 

    • We ran out of memory because SSL encryption and decryption needed an extra memory copy, and solved this by increasing JVM memory size.

    • Our SSLTransmission implementation issued small IOs, which impacted the overall performance. We batched these small IOs to address this issue. 

    • To compensate for SSL-induced latency, we added a thread pool for encryption and decryption to unblock the selector's polling thread.

We solved the file descriptor and tuning issues (the first and last above), but the others remained fundamental bottlenecks that prevented us from expanding Ambry's cluster size, supporting very large blobs, and providing optimal performance. Considering these issues, we decided to refactor the Ambry network stack between the frontends and storage nodes.

Ambry network stack refactor goals

For a single frontend, the fundamental issues could be summarized as:

  1. Each frontend needs to maintain tens of live connections to each storage node, which will eventually prevent us from adding more storage nodes.

  2. SSL encryption and decryption are time consuming and bottleneck the overall throughput. A high performance SSL library would be expected to save resources and reduce latency.

Therefore, for the next-generation Ambry network stack, the primary design goals were:

  1. Improved SSL performance.

  2. Connection multiplexing for cluster scalability and performance.

  3. A high performance single-client-multiple-servers implementation in the frontend. 

  4. A good network framework to reduce tuning time. 

Our technology choice: Netty-based HTTP/2
Netty was a natural candidate, as it’s a full-fledged framework and multiple products at LinkedIn (e.g., Venice, Espresso, and Rest.li) have used it for a long time. Netty also provides great performance due to its event-driven design and JNI-based SSLEngine, which met our goals well. 

In order to reduce total connections, we looked into connection multiplexing. Instead of extending our current protocol to support multiplexing, we explored HTTP/2. HTTP/2 is a major revision of the HTTP network protocol. It enables the use of connection multiplexing and focuses on performance—specifically, latency, network, and server resource usage, which are all desired by Ambry as an infrastructure service. In addition, HTTP/2 has a binary framing layer for data encapsulation and transfer. This allows us to encapsulate our previous customized protocol into an HTTP/2 data frame without extensive refactoring.

Netty started embedding an HTTP/2 implementation years ago, and the current release (Netty 4.1.54.Final) provides a robust HTTP/2 implementation that is easy to use.

Based on the above, we decided to use Netty-based HTTP/2 for our network stack refactoring. 

Design and implementation

In the new stack, we refactored both the frontend and storage node network implementations. 

On the storage nodes, we used Netty’s classic design pattern. We added handlers to the Netty pipeline to handle inbound messages. At the end of the inbound pipeline, we pass requests to our existing IO threads for data fetch or persistence. Responses are eventually passed back via Netty outbound pipelines. The implementation can be found on GitHub.
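As a rough illustration of that design, a server-side channel initializer can install an HTTP/2 frame codec plus a multiplex handler whose per-stream handler hands requests off to the IO threads. The handler names below are hypothetical; the actual pipeline setup lives in the Ambry repository.

```java
import io.netty.channel.ChannelHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.http2.Http2FrameCodecBuilder;
import io.netty.handler.codec.http2.Http2MultiplexHandler;

// Illustrative server pipeline: the frame codec speaks HTTP/2 on the wire, and the
// multiplex handler creates one child channel per inbound stream.
public class StorageNodePipelineSketch extends ChannelInitializer<SocketChannel> {
  @Override
  protected void initChannel(SocketChannel ch) {
    ch.pipeline().addLast(Http2FrameCodecBuilder.forServer().build());
    ch.pipeline().addLast(new Http2MultiplexHandler(new AmbryRequestStreamHandler()));
  }

  // Hypothetical per-stream handler: the real implementation decodes the embedded Ambry
  // request here and passes it to the existing IO threads for data fetch or persistence.
  @ChannelHandler.Sharable
  static final class AmbryRequestStreamHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
      ctx.fireChannelRead(msg); // hand the frame along; actual request handling elided
    }
  }
}
```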

On the frontend side, we created Http2NetworkClient and its related components to leverage Netty HTTP/2. Http2NetworkClient is a multi-connection, multiplexing, asynchronous client. It’s the core of this refactoring. The diagram below is a high level picture of it.

[Figure: Architecture of Http2NetworkClient]

Multi-connection
In our design, each frontend establishes a fixed number of connections to each storage node. These connections are managed by Http2MultiplexedChannelPool. The number of connections per storage node is configurable, and connections are selected in round-robin order.
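A minimal sketch of that selection policy is below; the class is illustrative and is not the actual Http2MultiplexedChannelPool.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

import io.netty.channel.Channel;

// Illustrative round-robin picker over a fixed set of pre-established HTTP/2 parent connections.
public class RoundRobinConnectionPicker {
  private final List<Channel> parentConnections;
  private final AtomicLong counter = new AtomicLong();

  public RoundRobinConnectionPicker(List<Channel> parentConnections) {
    this.parentConnections = parentConnections;
  }

  // Pick the next parent connection; HTTP/2 streams are then multiplexed onto it.
  public Channel next() {
    int index = (int) Math.floorMod(counter.getAndIncrement(), (long) parentConnections.size());
    return parentConnections.get(index);
  }
}
```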

Usually, a single connection from client to server is enough in the context of HTTP/2. However, Ambry is not a simple web app. As an infrastructure storage system, Ambry needs to provide:

  1. Availability: we need to avoid depending on a single connection, because re-establishing a failed connection is time consuming and impacts availability.

  2. Throughput: Ambry network throughput is huge. However, HTTP/2 runs over TCP, and a single TCP connection's throughput is fundamentally limited by TCP flow control, TCP buffers, and head-of-line blocking at the TCP layer. Therefore, we create multiple connections to load-balance the overall throughput and mitigate any delay caused by the TCP layer.

Below is an experiment we ran to compare throughput with different numbers of connections. (Experiment setup: the same moderate workload was sent to a single frontend connected to 300 storage nodes; all hosts have 10 Gbps NICs.)

Number of connections | Throughput (MB/s)
--- | ---
1 | 732
2 | 891
4 | 963
8 | 971

Multiplexing
When Http2NetworkClient sends a request, it acquires an HTTP/2 stream channel from Http2MultiplexedChannelPool, which manages the underlying multiplexing logic based on the Netty HTTP/2 implementation.

Asynchronous
Http2NetworkClient runs on a single asynchronous thread. This thread sends all requests into the Netty pipeline and returns completed responses to the original dispatchers. Thanks to Netty's event loop and Promise abstractions, most operations (connection acquire, stream acquire, writeAndFlush, and so on) complete asynchronously without blocking the thread.
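With Netty's HTTP/2 multiplexing API, acquiring a stream and writing to it can both complete via futures without blocking that thread. The sketch below is illustrative: the real client goes through Http2MultiplexedChannelPool and writes the encapsulated Ambry request rather than a bare headers frame.

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.handler.codec.http2.DefaultHttp2Headers;
import io.netty.handler.codec.http2.DefaultHttp2HeadersFrame;
import io.netty.handler.codec.http2.Http2StreamChannel;
import io.netty.handler.codec.http2.Http2StreamChannelBootstrap;

// Illustrative: open a new HTTP/2 stream on an existing parent connection (whose pipeline
// already has an HTTP/2 multiplex handler installed) and write to it asynchronously.
public final class StreamWriteSketch {
  static void sendOnNewStream(Channel parentConnection) {
    Http2StreamChannelBootstrap bootstrap = new Http2StreamChannelBootstrap(parentConnection)
        .handler(new ChannelInboundHandlerAdapter() {
          @Override
          public void channelRead(ChannelHandlerContext ctx, Object msg) {
            // Response frames for this stream arrive here; hand them back to the dispatcher.
            ctx.fireChannelRead(msg);
          }
        });

    bootstrap.open().addListener(future -> {
      if (future.isSuccess()) {
        Http2StreamChannel stream = (Http2StreamChannel) future.getNow();
        // Write without blocking; a real request would be followed by DATA frames.
        stream.writeAndFlush(new DefaultHttp2HeadersFrame(
            new DefaultHttp2Headers().method("POST").path("/"), false));
      }
    });
  }
}
```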

Customized protocol encapsulation
Remember that Ambry previously used a custom protocol for TCP communication. For the refactor, we reused the existing binary message formats and wrapped them in HTTP/2 data frames. This simplified the migration because we were able to reuse the request parsing logic. In addition, we:

  1. Changed the protocol’s data holder from ByteBuffer to Netty’s ByteBuf. This allowed us to migrate to Netty’s memory management and send requests/responses via the Netty pipeline. 

  2. Introduced AmbrySendToHttp2Adaptor to slice the protocol's data into HTTP/2 data frames (sketched below).
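Below is a rough sketch of that slicing, in the spirit of AmbrySendToHttp2Adaptor; the helper name is illustrative, and a real implementation also has to respect the stream's flow-control window.

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.http2.DefaultHttp2DataFrame;

// Illustrative: slice a serialized Ambry request (held in a Netty ByteBuf) into
// HTTP/2 DATA frames no larger than maxFrameSize; the last frame ends the stream.
final class DataFrameSlicerSketch {
  static void writeAsDataFrames(ChannelHandlerContext ctx, ByteBuf payload, int maxFrameSize) {
    while (payload.isReadable()) {
      int chunkSize = Math.min(maxFrameSize, payload.readableBytes());
      ByteBuf chunk = payload.readRetainedSlice(chunkSize); // a retained view, not a copy
      boolean endStream = !payload.isReadable();
      ctx.write(new DefaultHttp2DataFrame(chunk, endStream));
    }
    ctx.flush();
  }
}
```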

Leveraging open source
For the frontend's implementation, we leveraged open-source code from here, which provided a good starting point for implementing a multi-connection, multiplexing HTTP/2 client.

Rollout

To avoid cluster-level failure and data inconsistency issues, we used dual-protocol mode on storage nodes to smooth the rollout.

On the servers, we served old socket-based requests and new HTTP/2 requests on different ports, so storage nodes could handle both types of requests simultaneously. On the frontends, we canaried hosts by enabling HTTP/2-based requests and, when needed, rolled back defective changes by switching back to socket-based requests.

With this rollout plan, we gradually validated critical changes and successfully enabled the HTTP/2 stack in all of our production clusters.

Performance

Latency with regular load
In production hosts, we observed lower latency after switching to the HTTP/2 stack (red curve). Memory and CPU utilization were also reduced.

[Figure: Lower latency after switching to the HTTP/2 stack]
Stack | Memory (heap and off-heap) | CPU | Router-to-server latency (P95)
--- | --- | --- | ---
Socket SSL | 6.6 GB | 7-10% | 100 ms
HTTP/2 SSL | 4.9 GB | 4% | 38 ms

This improvement is expected. After all, one of the reasons we use Netty is its JNI-based SSLEngine, whose performance is much better than Java's default SSLEngine (see here for Norman Maurer's presentation on this topic). In this case, Netty's SSLEngine reduces SSL encryption/decryption cost, which lowers latency and hardware resource usage.

Latency with load test
We performed multiple tests with different loads (we sent traffic to frontends to trigger different GET rates between the frontends and storage nodes) to compare the performance of the socket stack and the HTTP/2 stack. The result is shown in Figure 1. With the HTTP/2 stack, latency is much lower than with the socket SSL implementation. The improvement comes from both Netty's efficient SSLEngine and HTTP/2 multiplexing.

[Figure: Frontend-to-storage-node GET latency]

Figure 1

Stability/Connection multiplexing
With the socket SSL stack, a big issue was connection exhaustion between the frontends and storage nodes. Once connections are exhausted, new requests have to wait for available connections from the connection pool. We increased the pool size multiple times, but that also required increasing the Java heap, which resulted in more GC work.

Connection exhaustion caused latency spikes in our production cluster. See Figure 2 and Figure 3 below. Figure 2 shows connection exhaustion occurrences. The blue curve in Figure 3 is the latency between the frontend and storage nodes. As you can see, latency spikes are highly correlated to connection exhaustion.

With HTTP/2 multiplexing, the number of connections is no longer a problem and we see significant improvement on latency spikes. See the red curve in Figure 3.

In the socket stack, 30 connections were used between each frontend and storage node. 

In the HTTP/2 stack, only 4 HTTP/2 connections are used.

[Figure: Connection exhaustion occurrences]

Figure 2

[Figure: Latency spikes correlated with connection exhaustion]

Figure 3

TCP buffer size tuning
At the TCP level, each connection has OS-level buffers for the sender to write into and the receiver to read from. The default buffer size is usually between 128 KB and 512 KB, depending on the operating system. However, Ambry blobs are big (4 MB) in some of our use cases, and HTTP/2 multiplexing shares TCP connections, so it's good to use a bigger TCP buffer size to avoid data getting stuck at the application level.

In production, we use a 4 MB TCP buffer size. 
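With Netty, the socket buffer sizes can be requested as channel options when the client bootstrap is built. The sketch below mirrors the 4 MB value we use in production; note that the OS may cap these values based on its own limits.

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

// Illustrative client bootstrap requesting 4 MB TCP send/receive buffers.
final class TcpBufferTuningSketch {
  static Bootstrap newBootstrap() {
    int bufferBytes = 4 * 1024 * 1024;
    return new Bootstrap()
        .group(new NioEventLoopGroup())
        .channel(NioSocketChannel.class)
        .option(ChannelOption.SO_SNDBUF, bufferBytes)   // OS-level send buffer hint
        .option(ChannelOption.SO_RCVBUF, bufferBytes);  // OS-level receive buffer hint
  }
}
```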

HTTP/2 tuning
We tuned two HTTP/2 parameters: initial window size in HTTP/2 stream flow control and HTTP/2 frame size. 

The default initial value for the flow-control window is 64 KB and the default frame size is 16 KB. In Ambry, blob sizes are in the MB range, so data may be split into multiple frames and require multiple round trips to transfer. To avoid multiple round trips for a single request, we increased the initial window size and max frame size to 4 MB.

However, when we made this change, it didn't seem to take effect: we still saw small frames. We looked into it and found another important setting we needed to address: WriteBufferHighWaterMark, a threshold that decides whether a channel is writable. The Netty HTTP/2 implementation effectively uses min(window size, WriteBufferHighWaterMark) to decide the actual frame size. After raising WriteBufferHighWaterMark as well, we finally saw large frames.
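A minimal sketch of these settings is below; the values mirror the 4 MB configuration described above, while the class and method names are illustrative.

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;
import io.netty.handler.codec.http2.Http2FrameCodec;
import io.netty.handler.codec.http2.Http2FrameCodecBuilder;
import io.netty.handler.codec.http2.Http2Settings;

// Illustrative: raise the HTTP/2 initial flow-control window and max frame size to 4 MB,
// and raise the write-buffer high water mark so the channel stays writable for large frames.
final class Http2TuningSketch {
  static final int FOUR_MB = 4 * 1024 * 1024;

  static Http2FrameCodec clientFrameCodec() {
    Http2Settings settings = new Http2Settings()
        .initialWindowSize(FOUR_MB)
        .maxFrameSize(FOUR_MB);
    return Http2FrameCodecBuilder.forClient().initialSettings(settings).build();
  }

  static Bootstrap applyWaterMark(Bootstrap bootstrap) {
    // Frames are effectively capped by min(window size, WriteBufferHighWaterMark),
    // so the high water mark must be raised along with the window size.
    return bootstrap.option(ChannelOption.WRITE_BUFFER_WATER_MARK,
        new WriteBufferWaterMark(FOUR_MB / 2, FOUR_MB));
  }
}
```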

Figure 4 is a performance comparison between the 64 KB initial window size and 4 MB initial window size. Both are with 4 MB WriteBufferHighWaterMark and 4 MB TCP buffer size.

[Figure: Performance comparison between 64 KB and 4 MB initial window sizes]

Figure 4

Netty best practices

The Netty best practices presentation by Norman Maurer points out many good practices for Netty-based development and may be useful for others who want to avoid common implementation pitfalls.

Summary

With HTTP/2, we successfully solved Ambry's network bottleneck between the frontends and storage nodes. The new stack saves CPU and memory and reduces latency. The benefits come from both Netty (efficient event loop, OpenSSL integration, and memory management) and HTTP/2 multiplexing. This allows us to scale Ambry to a much bigger cluster size and enables throughput-oriented (as opposed to latency-oriented) use cases. The HTTP/2 stack has now been fully deployed in our production clusters for months and has been operating stably.

Acknowledgements

This took an immense amount of work across the Ambry team. Specific shout-outs to Casey Getz and Justin Lin for their knowledge of and help with the Ambry socket stack and Netty; Matthew Wise, Ashish Singhai, and Rob Block for providing suggestions on the design and feedback on this blog post; and all Ambry team members: Matthew Wise, Rob Block, Casey Getz, Justin Lin, Yingyi Zhang, Sophie Guo, Ankur Agrawal, and Arun Sai for brainstorming and ideas.