Project Altair

Shawn Zandi

Network Engineering at LinkedIn

March 21, 2016

LinkedIn's infrastructure needs to seamlessly handle petabytes of data on a daily basis. Our data centers and infrastructure have grown by 34 percent annually, and almost half a billion people in more than 200 countries and territories rely on us.

In order to reliably deliver our services to our members and customers, we have expanded our data center footprint over the past few years with new facilities in Virginia and Texas. This year we’ll have data centers in Singapore and Oregon come online. Today, we'd like to tell you about how we're changing our approach to build more sustainable, next-generation technology centers.

Traditionally, LinkedIn data center networks were designed around large chassis-based boxes purchased from well-known vendors. They relied on forklift upgrades to increase capacity once the next generation of hardware became available. Deploying and managing several generations of these designs, however, convinced us that enterprise data center design was not the solution to our massively scalable infrastructure needs. When it comes to design and architecture, size and scale really matter!

In order to support our distributed applications, we needed a highly responsive distributed networking infrastructure that could scale and grow horizontally without changing the fundamental architecture or interrupting the core of the network. This led to the creation of Project Altair, a massively scalable data center fabric.

Once we reached the conclusion that scaling up a fabric based on large chassis devices would not support our needs, our engineering team began looking at alternative architectures. We decided to implement a design that would instead scale out horizontally and rely on equal-cost multi-path routing (ECMP) to distribute traffic across the fabric while traversing the fewest silicon chipsets possible to improve I/O throughput and reduce latency.

The new data center fabric was based on the following principles:

Simple and minimalistic yet non-blocking IP fabric
Multiple parallel fabrics based on Clos network architecture
Merchant silicon with a minimal feature set
Distributed control plane with some centralized controls
Wide multi-path (ECMP)
Uniform chipset, bandwidth, and buffering
Low latency and small buffering requirements

From an application perspective, the LinkedIn network is simply an L3 fabric with no overlay. Since we run our own code, we never needed hypervisor-level or machine-level virtualization to abstract the infrastructure from the application, so we have no L2 or VM-based mobility requirements. These services also did not need to be emulated by encapsulation or tunneling. We are proud of what we do as well as what we don't do. We moved high availability from the infrastructure to the code and placed a single, non-redundant Top-of-Rack (ToR) access switch in every cabinet.

Adding to these hardware requirements, we also wanted to ensure that we natively support IPv6 on the fabric and run dual-stack IPv4 and IPv6 as externally facing protocols. This sets the stage for IPv6-only data centers in the future. We also did not want any middleboxes that would slow our applications down or add complexity to the mix.

Specifically, this means we don't have any load balancers in the fabric, or even on the edges of our network. Instead, we rely on anycast to load balance IP traffic by extending BGP to the host stack.

Examining the application and business systems we needed to support, we came up with the following requirements the design needed to accomplish:

1:1 oversubscription ratio (non-blocking fabric)
Use the minimum number of chipsets to carry east–west traffic
Ability to support 100,000 to 200,000 bare metal servers without adding an additional layer
Fabric should support up to 64 pods, with each pod consisting of 32 cabinets and each cabinet holding 96 dense bare metal compute units
Fabric to be limited to three-tier switching (five-stage Clos) for the whole data center to minimize the number of chipsets, lookups, and switching latency
Support host attachment at 10G, 25G, 50G and 100G Ethernet
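
As a quick sanity check, the pod, cabinet, and compute-unit figures above can be multiplied out to confirm that a fully built fabric lands inside the stated 100,000 to 200,000 server range. The sketch below uses only the numbers from the requirements list; the function and constant names are illustrative, not part of any LinkedIn tooling.

```python
# Figures taken from the requirements list above.
PODS = 64
CABINETS_PER_POD = 32
SERVERS_PER_CABINET = 96


def fabric_capacity(pods=PODS,
                    cabinets_per_pod=CABINETS_PER_POD,
                    servers_per_cabinet=SERVERS_PER_CABINET):
    """Return (total cabinets, total servers) for a fully built-out fabric."""
    cabinets = pods * cabinets_per_pod
    servers = cabinets * servers_per_cabinet
    return cabinets, servers


cabinets, servers = fabric_capacity()
print(f"{cabinets} cabinets, {servers} servers")  # 2048 cabinets, 196608 servers
```

With one ToR per cabinet, this also implies 2,048 leaf switches at full build-out.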

To support these requirements, we began with a true three-tier Clos switching architecture with leaf (ToR), spine, and fabric layers to build a data center network. Each cabinet switch or ToR has access to four fabrics and can select a desired path based on application requirement or network availability.

Parallel fabrics can be depicted as different colors with each spine switch providing a path to a particular fabric:

Based on physical topology, once a color is selected at the origin of an application flow, the subsequent packets of that flow stay on the same colored path and do not change channels. In other words, the fabric that a flow's packets travel (its color) is chosen at the origin and enforced by the first network element. Fabric selection can be random, based on a hashing algorithm, or deterministic, based on criteria specified by the control plane or the application.
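
The hash-based and deterministic selection modes described above can be illustrated with a small sketch. It is not the switch implementation: real hardware computes its own hash over packet header fields, and SHA-256 is used here only as a convenient stand-in. The color names other than blue, the five-tuple key, and the `override` parameter are assumptions made for illustration.

```python
import hashlib

# Four parallel fabrics; only "blue" is named in the text, the rest are hypothetical.
FABRIC_COLORS = ("blue", "green", "red", "yellow")


def select_fabric(src_ip, dst_ip, src_port, dst_port, protocol, override=None):
    """Pick the fabric (color) for a flow at its origin.

    If `override` is given, selection is deterministic (for example, set by
    the control plane or the application). Otherwise the flow's five-tuple
    is hashed, so every packet of the same flow maps to the same color.
    """
    if override is not None:
        if override not in FABRIC_COLORS:
            raise ValueError(f"unknown fabric color: {override}")
        return override
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(FABRIC_COLORS)
    return FABRIC_COLORS[index]


# Every packet of one flow gets the same color.
flow = ("2001:db8::1", "2001:db8::2", 49152, 443, "tcp")
assert select_fabric(*flow) == select_fabric(*flow)
```

Because the choice depends only on fields that are constant for the life of a flow, packets are not reordered across fabrics, while different flows spread across all four colors.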

For example, if leaf switch 1 in pod 1 uses its blue uplink to send a packet toward a destination through the blue spine, the packet will be delivered to the destination host via a blue spine in the destination pod. The color of the fabric won't change for a packet in transit. This ensures that five-stage switching is the maximum for any delivery within the fabric.
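
The five-stage bound can be made concrete by listing the switches a packet crosses. The sketch below assumes a conventional Clos layout in which traffic between two leaves in the same pod turns around at the pod's spine, which the text does not state explicitly. The device naming scheme is hypothetical, and `fabric-<color>` stands for whichever fabric switch in that color's plane carries the packet.

```python
def fabric_path(src_pod, src_leaf, dst_pod, dst_leaf, color):
    """Return the ordered list of switches a packet traverses on one color."""
    src = f"pod{src_pod}-leaf{src_leaf}"
    dst = f"pod{dst_pod}-leaf{dst_leaf}"
    if src == dst:
        # Both hosts sit under the same ToR: one switching stage.
        return [src]
    if src_pod == dst_pod:
        # Assumed: intra-pod traffic turns around at the pod's spine.
        return [src, f"pod{src_pod}-spine-{color}", dst]
    # Inter-pod traffic: leaf, spine, fabric, spine, leaf, all on one color.
    return [
        src,
        f"pod{src_pod}-spine-{color}",
        f"fabric-{color}",
        f"pod{dst_pod}-spine-{color}",
        dst,
    ]


path = fabric_path(1, 1, 7, 12, "blue")
print(" -> ".join(path))
# pod1-leaf1 -> pod1-spine-blue -> fabric-blue -> pod7-spine-blue -> pod7-leaf12
```

Since the color never changes in transit, no path is longer than five switches, regardless of which pods the endpoints are in.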

Rather than asking why we’re not using a chassis device, the right question would be: why use a chassis when it is not required? Each network device in the above illustration is a single-rack-unit (pizza box), 3.2 Tbps, single-chipset switch that can handle 32 ports of 100G or 64 ports of 50G with a switching latency under 400 ns.

Using a single-chipset switch provides faster switching performance, with a dedicated control plane per chipset that simplifies forwarding controls and troubleshooting of the network. The same switching SKU is used at all three tiers, from ToR to fabric switch, to simplify the network build as well as the switching software compatibility matrix.

Chassis switches introduce inherent complexity, with multiple chipsets sharing the same control plane software. Long software upgrade cycles and a slow boot process, in which the order of operations (line card boot-up, software initialization for different protocols) is usually non-deterministic and hard to control, add to the complexity of running a data center core based on multi-chipset, single-brain technology.

At the heart of “One Big Fabric,” which consists of multiple parallel fabrics, is the Falco open switch platform. Built by LinkedIn engineers, Falco operates the single-chipset switching architecture. Falco is powered by a distributed control plane based on BGP, and it handles the management plane tasks of audit, logging, telemetry, and tracking through a Kafka pipeline.

The new data center network design was a collective effort of many talented individuals across LinkedIn Production Engineering Operations (PEO). Special thanks to working group participants: Zaid Ali Kahn, Yuval Bachar, Thomas Cho, Shane Connor, Christopher King, Saikrishna Kotha, Prashanth Kumar, Michael Laursen, James Ling, Leigh Maddock, Sujatha Madhavan, Trevor Matthews, Navneet Nagori, William Orr, Brad Peterson, Kyle Reid, Jacob Rose, Chintan Shah, Nitin Sonawane, Andrew Stracner, Mike Svoboda, Russ White, Paul Zugnoni, program managers Vish Shetty and Fabio Parodi, and many others who spent countless hours reviewing, sketching, and erasing designs on the boards.
