Scaling storage in the datacenter with RoCE
May 2, 2022
Co-authors: Ishan Shah, Alasdair James King, and Zhenggen Xu
To support consistent growth in hyperscale environments, our infrastructure relies on a wide range of storage databases and data processing nodes across its fleet. These services reap performance benefits from the very low latency of NVMe (Non-Volatile Memory Express). Currently, many services still use in-node NVMe drives. Because the minimum size of NVMe drives keeps growing relative to the amount of data used per node, this results in substantial under-utilization of NVMe space (stranded capacity) and IOPS (input/output operations per second). It also makes it more challenging to scale storage independently of compute in our data centers, slowing our ability to rapidly grow our storage footprint. With cost per used capacity so far out of balance, we needed a way to scale flash storage independently of compute while retaining performance. This posed an important challenge for the team: create a reliable data center network that not only provides high throughput and low latency, but also keeps CPU and memory overhead to a minimum while supporting a shared pool of SSD storage that can be provisioned on demand with guaranteed network performance. With this service in place, our engineers now have the infrastructure to build low-latency, highly scalable services that deliver the best member and customer experience to date.
In this blog post, we’ll discuss how we solved this challenge by providing a flexible, scalable, performant and reliable block storage service using Remote Direct Memory Access (RDMA).
Why do we need Remote Direct Memory Access (RDMA) in the data center?
When building distributed storage clusters with improved data protection and fault tolerance, we needed to configure a lossless network that enables high-performance storage with latency low enough that customers see performance similar to locally attached storage. To solve this, we leverage RDMA (Remote Direct Memory Access) over Converged Ethernet version 2 (RoCEv2), a network protocol that enables RDMA over the commodity Ethernet fabrics commonly used in data center networks. RDMA allows remote access to another system's memory without interfering with CPU processing on that system. Packet processing is offloaded to the network card, avoiding unnecessary protocol translations and providing low latency and high bandwidth to both sender and receiver.
LinkedIn’s data center network is a multi-tier Clos switching architecture with a leaf and spine layer within the top fabric layer. Every access switch or Top of Rack (ToR) has uplinks to four fabrics, with ECMP (Equal Cost MultiPath) selecting the path for data traffic. Since 2017, SONiC has been the de facto switch operating system at LinkedIn. Starting from Microsoft's open source repository, our in-house network software development team has modified and extended SONiC to align with LinkedIn's infrastructure requirements. This blog will focus on our implementation, rather than the technology itself, but there are some very good resources available online to learn more about RDMA and RoCE technologies.
Prerequisites for implementation
Before we deep-dive into the implementation details, it’s important to outline the network requirements necessary to achieve a lossless Ethernet network.
- All applications (RoCEv2 and other TCP traffic) share the same links/ports.
- Trust L3 DSCP (Differentiated Services Code Point) markings to queue RoCEv2 traffic and regular TCP traffic.
- Only storage traffic leverages RoCEv2.
- Lossless buffer configuration for RoCEv2-mapped queues.
- Support DCQCN (Data Center Quantized Congestion Notification) for RoCEv2 traffic, including ECN (Explicit Congestion Notification) and CNP (Congestion Notification Packet).
LinkedIn’s implementation for RoCEv2 QoS
LinkedIn needed to implement a multi-protocol strategy to allow RDMA and TCP traffic to coexist. To achieve that, Quality of Service was set up on each of the participating switches. DSCP values are used to distinguish RoCEv2 traffic (marked with DSCP value 26) from regular TCP traffic (marked with DSCP value 0), with the RoCEv2 marking offloaded to server NICs (Network Interface Controllers). On the data center ToR switches, we classify the marked traffic as follows (TC = Traffic Class):
- RDMA traffic with DSCP 26 → TC 3 → Queue 3
- CNP traffic with DSCP 48 → TC 6 → Queue 6
- Regular traffic with DSCP 0 → TC 0 → Queue 0
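The classification above can be sketched as a simple lookup, shown below in Python. This is an illustrative model of the mapping, not SONiC's actual configuration or API.

```python
# DSCP -> traffic class -> egress queue, as described above.
# The dictionaries model the switch QoS maps; names are illustrative.
DSCP_TO_TC = {26: 3, 48: 6, 0: 0}   # RDMA, CNP, default
TC_TO_QUEUE = {3: 3, 6: 6, 0: 0}    # one-to-one in this setup

def classify(dscp: int) -> int:
    """Return the egress queue for a packet; unknown DSCPs fall back to queue 0."""
    tc = DSCP_TO_TC.get(dscp, 0)
    return TC_TO_QUEUE[tc]
```

In practice these maps are expressed as SONiC QoS configuration rather than code, but the lookup logic is the same.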
The SONiC switches are configured with lossless buffer pools for RDMA traffic (Queue 3) and lossy buffer pools for other traffic. To schedule RoCEv2 traffic and regular TCP traffic on the same port, we use a DWRR (Deficit Weighted Round Robin) scheduler with different weights for the queues at the egress port. See Fig 1 below.
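To illustrate how DWRR shares a port between queues, here is a toy scheduler: each queue earns a quantum proportional to its weight per round, and may dequeue packets only while its deficit counter covers the packet size. The weights and packet sizes are placeholders, not LinkedIn's production values.

```python
from collections import deque

class DwrrScheduler:
    """Toy Deficit Weighted Round Robin scheduler over egress queues."""

    def __init__(self, weights):
        # weights: {queue_id: quantum_in_bytes_per_round}
        self.queues = {q: deque() for q in weights}
        self.weights = weights
        self.deficits = {q: 0 for q in weights}

    def enqueue(self, q, pkt_bytes):
        self.queues[q].append(pkt_bytes)

    def round(self):
        """One scheduling round; returns the list of (queue, pkt_bytes) sent."""
        sent = []
        for q, quantum in self.weights.items():
            if not self.queues[q]:
                self.deficits[q] = 0  # idle queues do not bank credit
                continue
            self.deficits[q] += quantum
            while self.queues[q] and self.queues[q][0] <= self.deficits[q]:
                pkt = self.queues[q].popleft()
                self.deficits[q] -= pkt
                sent.append((q, pkt))
        return sent
```

With a higher weight on the RDMA queue, RDMA packets drain faster per round, while the lossy queue still makes steady progress rather than starving.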
Fig 1. SONiC buffer/scheduler configuration
To support DCQCN, the server NICs enable ECN for RoCEv2 traffic (ECN field set to “01” or “10”). On the SONiC networking devices, WRED-based ECN marking is applied to ECN-enabled traffic at the egress queue, where the WRED “Min threshold” and “Max threshold” define when packets are marked with ECN “11” instead of being dropped. The thresholds control how aggressively congested packets are marked (see Fig 2). On receiving the ECN-marked packets, the destination servers extract the flow information and reply with CNP packets to the source servers to rate limit the sending flow.
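The WRED-style marking decision can be sketched as follows: below the min threshold nothing is marked, above the max threshold every ECN-capable packet is marked, and in between the marking probability ramps up linearly. This is a generic model of WRED/ECN behavior; the threshold values are placeholders, not our production settings.

```python
import random

def ecn_mark_probability(avg_queue, min_th, max_th, max_prob=1.0):
    """Linear WRED-style marking probability for a given average queue depth."""
    if avg_queue <= min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_prob * (avg_queue - min_th) / (max_th - min_th)

def maybe_mark(ecn_bits: str, avg_queue: int, min_th: int, max_th: int) -> str:
    """Mark ECN-capable packets ('01'/'10') as congested ('11') instead of dropping."""
    if ecn_bits in ("01", "10"):
        if random.random() < ecn_mark_probability(avg_queue, min_th, max_th):
            return "11"
    return ecn_bits
```

Lowering the thresholds makes marking more aggressive, signaling congestion to senders earlier at the cost of reduced throughput.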
Fig 2. ECN marking on networking devices
For most online and offline IO-intensive applications, the DCQCN mechanism is sufficient to keep them adaptive to the available network bandwidth without drops. However, there is an additional option to leverage PFC (Priority Flow Control) to guarantee lossless networking. PFC is enabled for RDMA traffic (Priority 3) only.
Based on the buffer pool settings mentioned earlier, regular traffic lands in lossy buffers and can be dropped in the event of congestion.
For RDMA flows, if bursty traffic hits the switches and fills buffer queues past the XOFF thresholds, a PFC frame is sent to the sender to pause traffic for that priority/class. By doing this, and with PG (priority group) headroom reserved for in-flight traffic, lossless networking is guaranteed for RDMA traffic. When the buffer queues drain below the thresholds, the switch sends resume messages to the sender.
Conversely, if PFC frames are received on networking devices, the devices pause the corresponding queues (e.g., Queue 3 for RDMA) for a time period based on the configured timer.
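The pause/resume behavior is a hysteresis on buffer occupancy, which can be modeled as below. The XOFF/XON threshold values here are illustrative, not our switch configuration.

```python
class PfcPriorityGroup:
    """Toy model of PFC pause/resume hysteresis for one priority group."""

    def __init__(self, xoff_threshold, xon_threshold):
        assert xon_threshold < xoff_threshold  # hysteresis requires a gap
        self.xoff = xoff_threshold
        self.xon = xon_threshold
        self.paused = False

    def on_occupancy(self, occupied_bytes):
        """Return 'PAUSE', 'RESUME', or None for a new buffer occupancy reading."""
        if not self.paused and occupied_bytes >= self.xoff:
            self.paused = True
            return "PAUSE"      # tell the sender to stop this priority
        if self.paused and occupied_bytes <= self.xon:
            self.paused = False
            return "RESUME"     # buffer drained; sender may restart
        return None
```

The gap between XOFF and XON, plus the reserved PG headroom that absorbs in-flight packets after a pause is sent, is what makes the queue lossless rather than merely rate-limited.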
It’s important to detect and recover from any PFC deadlock, which can in some cases be caused by a PFC storm. For this reason, the PFC watchdog is enabled in SONiC at LinkedIn.
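The core watchdog idea can be sketched as: if a queue stays continuously paused longer than a detection interval, assume a PFC storm and stop honoring PFC on that queue; once the pause condition stays clear for a restoration interval, return to normal operation. This is a simplified model of the behavior; the intervals are placeholders, not SONiC's defaults.

```python
class PfcWatchdog:
    """Simplified PFC-watchdog state machine for a single queue."""

    def __init__(self, detect_ms=200, restore_ms=200):
        self.detect_ms = detect_ms
        self.restore_ms = restore_ms
        self.paused_for = 0
        self.clear_for = 0
        self.stormed = False

    def tick(self, queue_paused: bool, elapsed_ms: int) -> bool:
        """Advance the watchdog; return True while storm mitigation is active."""
        if queue_paused:
            self.paused_for += elapsed_ms
            self.clear_for = 0
            if not self.stormed and self.paused_for >= self.detect_ms:
                self.stormed = True   # deadlock suspected: stop honoring PFC
        else:
            self.clear_for += elapsed_ms
            self.paused_for = 0
            if self.stormed and self.clear_for >= self.restore_ms:
                self.stormed = False  # storm cleared: resume normal PFC
        return self.stormed
```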
In testing this approach, we configured a lab of multiple target and client nodes attached to a single switch. To induce congestion, a single client node ran a 100% read load with an IO size of 16 kibibytes against 6 target nodes. In parallel, TCP traffic was generated between the client nodes via iperf. The loads were run twice: once with RoCEv2 disabled, and again with RoCEv2 enabled.
On the switch, we can gather the counters for each of the QoS queues to see how the network reacted to the test.
Figure 3. Q0 (default queue) stats when RoCEv2 was disabled on the switch
The above figure shows each port's overall traffic and its Q0 traffic (the default queue, including TCP) for the test participants while RoCEv2 was disabled, with most of the traffic originating from port 49. The large peak at 22:45 is where packets were dropped, although the drops themselves are not visible on the graph given the volume of sent packets. During this time, ECN and PFC were disabled, so all traffic shared the link, adding congestion on the line.
Figure 4. RDMA and CNP queues (Q3 and Q6, respectively), showing the number of packets sent by each port
Next, with RoCEv2 enabled, we look at Q3 and Q6 for the RDMA and congestion notification packets, respectively. Under normal conditions, the two kinds of traffic coexist without impacting each other's flows. In the middle graph, the shape of the congestion-packet plot matches that of the RDMA traffic sent from port 49. When RoCEv2 is disabled, as in the bottom graph, we can see at 22:30 that all flows share the link again.
This translates into a storage test achieving a read load of up to 2.1 gigabytes per second, around 17 gigabits on a 25 gigabit link, while ensuring no packet drops in the RDMA queue. This is in addition to the TCP traffic generated by iperf, which occupied the remaining 6-7 gigabits per second of available bandwidth. With the setting disabled, however, throughput drops to 9-10 gigabits per second, along with dropped packets in the default queue due to congestion. In that case, the TCP traffic generated via iperf consumed 13-14 gigabits per second of bandwidth on the link.
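The unit conversion behind these figures is worth making explicit, since storage throughput is usually quoted in bytes per second while link capacity is quoted in bits per second:

```python
def gbytes_to_gbits(gb_per_s: float) -> float:
    """Convert gigabytes/s to gigabits/s (1 byte = 8 bits)."""
    return gb_per_s * 8

# 2.1 GB/s of RDMA reads is 2.1 * 8 = 16.8 Gbit/s, i.e. "around 17 gigabits"
rdma_gbits = gbytes_to_gbits(2.1)
```

On a 25 Gbit/s link, that leaves roughly 8 Gbit/s of headroom, consistent with the 6-7 Gbit/s of iperf TCP traffic observed alongside protocol overhead.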
Impact of RDMA in the data center
We have been using RoCEv2 in our production ToR switches (local to the rack) supporting flash storage for over 18 months now. It has enabled us to disaggregate NVMe flash and scale flash storage independently of compute, thereby improving data protection and fault tolerance. Over the next three years, this technology is projected to reduce our storage cost basis by 60% while increasing NVMe flash utilization to over 75%. It allows us to provide services with network disks that have the same latency and bandwidth expectations as locally attached disks, while Quality of Service within the network ensures that existing workloads sharing links with the new drives do not lose out due to congestion. As the project progresses, we are expanding RDMA support beyond the ToR and across the data center fabrics, allowing for the acceleration of storage and application traffic.
Special thanks to the stellar team for the tireless contributions: Vanessa Borcherding, Joe Lin, Eric Abad, Brent Jones, Arun Manohar, and James Ling. Many thanks to Shawn Zandi and Arun Manohar for reviewing this blog. This project would not have been possible without the continued support and investment from the leadership team in boosting productivity across all of infrastructure engineering: Shawn Zandi, Neil Pinto, and Michael Liberte.
A huge shout-out to all our partner teams who helped us in extending the technology to production including Microsoft and LinkedIn SONiC, Storage, and Database teams.