
LinkedIn’s Approach to a Self-Defined Programmable Data Center

Co-authors: Shawn Zandi and Russ White

This post originally appeared in Network Computing.

 

Operating a large-scale, rapidly growing network requires a philosophical change in how you plan, deploy, and operate your infrastructure. At LinkedIn, as we scaled our data center network, it became evident that we needed to provision and build networks not only as quickly as possible, but also with the simplest, most minimalistic approaches possible, something that previously was not quite apparent to us. Adding a new component, a new feature, or a new service without any traffic loss or architectural change is challenging.

The three core principles that have guided our infrastructure design and our strategy are:

  • Openness: Use community-based tools where possible.

  • Independence: Refuse to develop a dependence on a single vendor or vendor-driven architecture (and hence avoid the inevitable forklift upgrades).

  • Simplicity: Focus on finding the most minimalistic, simple, and modular approaches to infrastructure engineering. Apply RFC1925 Rule 12 to our network and protocols literally—“perfection has been reached not when there is nothing left to add, but when there is nothing left to take away.”

To the three dimensions above, we recently added a new one: Programmability. Being able to modify the behavior of the data center fabric in near real time, without touching device configurations, allows us to tune the operation of the fabric to best fit application and business requirements. This allows our network operations and site reliability teams to focus on running the network, rather than on managing tools and configurations, and thereby unlocks further innovation. Programmability brings benefits such as being able to prioritize traffic distribution, load balancing, or security posture on demand with minimal effort, and it also increases agility and responsiveness in delivery.

To accomplish these goals, we are disaggregating our network, separating the hardware and software in a way that allows us to modify and manage the network without intrusive downtime, and moving to a software-driven network architecture.

Single SKU data center

In our recent blog post about Project Altair, we explained our move to a single SKU data center model, specifically based on the Falco open switch platform. We use one hardware design, a 3.2 Tbps pizza-box switch, as the building block for every tier of our leaf-and-spine topology, all operating on top of one unified software stack. LinkedIn's data center parallel fabrics can be depicted in different colors, with each leaf switch providing a path to a particular fabric, as seen below:

[Figure: Parallel fabrics]
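
To make the parallel-fabric idea concrete, here is a minimal sketch of how ToR uplinks can be spread across fabric planes. The plane count, ToR count, and uplink count are illustrative assumptions, not the actual numbers behind our fabric.

```python
# Sketch: map each ToR uplink onto one of several parallel fabric planes.
# All counts here are illustrative assumptions.
NUM_PLANES = 4
NUM_TORS = 32           # ToRs in one pod
UPLINKS_PER_TOR = 4     # one uplink per plane in this sketch

def cabling_plan():
    """Yield (tor, uplink_port, plane) for every uplink in one pod."""
    for tor in range(NUM_TORS):
        for port in range(UPLINKS_PER_TOR):
            plane = port % NUM_PLANES        # uplink N lands on plane N
            yield f"tor-{tor:02d}", f"uplink-{port}", f"plane-{plane}"

for tor, port, plane in cabling_plan():
    print(f"{tor} {port} -> {plane}")
```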

 

This single SKU data center enables us to move away from the complexity of large chassis-based boxes to a simple, repeatable module that can be added in quantity as we scale out. Building a simple fabric, however, does not remove the complexity entirely; it simply moves it to another location in the network. In our case, that location is the software stack, which rides on top of a few standard Linux distributions: a portable control plane that can run on hosts and routers anywhere in the fabric and lets us separate hardware scaling from software features. The hardware platform is grounded in Open19, which improves our rack integration through snap-on power and provides data speeds up to 2-3x faster than our current generation of hardware.


Top of Rack

  • Pigeon 1.0 hardware – 3.2T silicon
  • 10/25/50/100G server attachment
  • Each cabinet: 96 dense compute units

Leaf

  • Pigeon 1.0 hardware – 3.2T silicon
  • Non-blocking topology
  • 1:1 over-subscription to Spine

Spine

  • Pigeon 1.0 hardware – 3.2T silicon
  • Non-blocking topology
  • To serve 64 pods (each pod 32 ToR)
  • 100,000 bare-metal servers: approximately 1,550 compute nodes per pod (a rough sizing check follows below)
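
As a rough sanity check of the figures above (the servers-per-rack number below is derived from them rather than quoted anywhere), the arithmetic works out as follows:

```python
# Back-of-the-envelope check of the fabric sizing quoted above.
PODS = 64                   # the spine serves 64 pods
TORS_PER_POD = 32           # each pod has 32 ToRs
TARGET_SERVERS = 100_000    # bare-metal server target

servers_per_pod = TARGET_SERVERS / PODS             # ~1,562, quoted as ~1,550
servers_per_rack = servers_per_pod / TORS_PER_POD   # ~49 servers behind each ToR

print(f"servers per pod:  {servers_per_pod:.0f}")
print(f"servers per rack: {servers_per_rack:.0f}")
```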

On the other side of the disaggregation divide, features and controls are moved to the code, as opposed to complex, specialized hardware devices.

Self-defined networking

We are working on a new concept: how we would like networks to be built. As we connect systems and network elements, we would like the network to simply work.

Self-defined networking is a set of out-of-the-box features and functionality that enables a network element to initialize and build dynamically with no preplanned configuration or human intervention. Network elements discover and define their role and function in an automatic and self-driven fashion. Once a switch is wired (regardless of tier), it should start functioning not merely with minimal configuration, but with zero configuration.

The first step for a network element is to find its placement, role, and function in the network. It must discover where it sits in the topology before it can start pushing packets on the wire. While most protocols can negotiate adjacencies and carry policy through the network, location awareness is a crucial new capability that will allow our fabric to largely self-configure.

Traditionally, networks rely on out-of-band access for provisioning and basic setup: either a console, or an out-of-band Ethernet network used to grab a set of configurations from an external location via DHCP and TFTP. That configuration conveys the intent of the operator and provides some form of identity. It is usually prepared in advance by a set of scripts that assign addressing and fill unique values into a template so that the device can start functioning.
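
To illustrate the traditional model just described, here is a minimal sketch of template-driven provisioning; the template syntax and the values are generic placeholders rather than any particular vendor's configuration:

```python
# Sketch of traditional provisioning: per-device values are prepared ahead of
# time and filled into a template that the switch fetches at boot
# (for example, over DHCP and TFTP).
TEMPLATE = """hostname {hostname}
interface {uplink}
  ip address {ip}/31
"""

def render_config(hostname: str, uplink: str, ip: str) -> str:
    """Values are produced in advance by provisioning scripts."""
    return TEMPLATE.format(hostname=hostname, uplink=uplink, ip=ip)

print(render_config("tor-01", "uplink-0", "10.0.0.0"))
```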

On the other hand, a self-defined network, once wired, immediately programs its tables and starts forwarding. It does not require any pre-configuration or any static mapping arrangements. It discovers its adjacent neighbors, registers itself in the inventory system, updates a central repository with required information, and moves from a fully dynamic mode to a registered and deterministic mode.
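
A minimal sketch of that bring-up flow, assuming neighbor data learned from something like LLDP; the names, data shapes, and role heuristic are illustrative assumptions, not our actual implementation:

```python
# Illustrative sketch of self-defined bring-up: discover neighbors, infer a
# role from the wiring, register with the inventory, then run deterministically.
from dataclasses import dataclass, field

@dataclass
class SwitchState:
    hostname: str
    neighbors: dict = field(default_factory=dict)  # local port -> neighbor info
    role: str = "unknown"
    registered: bool = False

def infer_role(neighbors: dict) -> str:
    """Toy heuristic: any attached servers means ToR; only switches means leaf/spine."""
    kinds = {n["kind"] for n in neighbors.values()}
    return "leaf-or-spine" if kinds == {"switch"} else "tor"

def bring_up(sw: SwitchState, discovered: dict, inventory: dict) -> SwitchState:
    sw.neighbors = discovered            # e.g., learned via LLDP
    sw.role = infer_role(discovered)     # placement in the topology drives the role
    inventory[sw.hostname] = {"role": sw.role, "neighbors": discovered}
    sw.registered = True                 # registered: now in deterministic mode
    return sw

inventory = {}
learned = {"swp1": {"kind": "server"}, "swp49": {"kind": "switch"}}
print(bring_up(SwitchState("tor-01"), learned, inventory).role)  # -> tor
```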

In search of a better control plane for data center fabric

LinkedIn's data centers, just like any other hyperscale data centers, are a collection of servers and intermediate network devices connected via a series of point-to-point links forming a Clos topology. Currently, we utilize autoconfigured link-local addresses to establish a control plane that routes both IPv6 and IPv4; hence, no IP configuration needs to be prepared or planned for switch interfaces. We would like our control plane to support self-defined networking, so that devices start forwarding as soon as they are racked.
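
As a small illustration of what link-local addressing buys us (the interface name, multicast group, and port below are assumptions made for this sketch, not our production protocol), a hello can be sent on a link without any configured or planned interface addresses at all:

```python
# Sketch: send a "hello" on one link using only IPv6 link-local machinery,
# i.e., without any configured or planned interface addresses.
import socket

IFACE = "eth0"      # illustrative interface name
GROUP = "ff02::1"   # all-nodes, link-local multicast group
PORT = 54321        # arbitrary port chosen for this sketch

ifindex = socket.if_nametoindex(IFACE)
sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
# Scope the send to this one link; no global addressing is involved.
sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_MULTICAST_IF, ifindex)
sock.sendto(b"hello", (GROUP, PORT, 0, ifindex))
sock.close()
```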

The requirements for such a control plane are pretty simple and straightforward:

  • Fast, simple distributed control plane;

  • No tags, bells, or whistles (no hacks, no policy);

  • Auto-discover neighbors and build RIB;

  • Zero configuration;

  • Must use TLVs for future extensibility and backward compatibility (see the TLV sketch after this list);

  • Must carry MPLS labels (per node/interface).
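
As a sketch of the TLV requirement above (type numbers and payloads are made up for illustration), a simple type-length-value encoding lets a receiver skip types it does not understand, which is what keeps the format extensible and backward compatible:

```python
# Sketch of TLV encode/decode: a receiver can skip types it does not know,
# which keeps the protocol extensible and backward compatible.
import struct

def encode_tlv(tlv_type: int, value: bytes) -> bytes:
    # 1-byte type, 2-byte length, then the value itself.
    return struct.pack("!BH", tlv_type, len(value)) + value

def decode_tlvs(data: bytes):
    offset = 0
    while offset < len(data):
        tlv_type, length = struct.unpack_from("!BH", data, offset)
        offset += 3
        yield tlv_type, data[offset:offset + length]
        offset += length

# Illustrative type numbers only: 1 = node ID, 7 = per-node MPLS label.
msg = encode_tlv(1, b"node-42") + encode_tlv(7, struct.pack("!I", 100_017))
for tlv_type, value in decode_tlvs(msg):
    print(tlv_type, value)
```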

In addition to the above, if we consider the physical topology, or how the network is wired, as the intent and desired state of a self-defined network, we want to make sure that the applied (current) state is discovered and carried, so that we can detect any wiring or physical misconfiguration that does not follow the expected pattern.
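
One way to picture that check, as a sketch only with assumed data shapes: treat the cabling plan as the intended state, the discovered adjacencies as the applied state, and report every difference.

```python
# Sketch: compare intended wiring (the cabling plan) with the applied state
# (discovered adjacencies) and flag anything that does not match.
def wiring_diff(intended: dict, discovered: dict) -> list:
    """Both dicts map (device, port) -> remote device at the far end."""
    issues = []
    for link, expected in intended.items():
        observed = discovered.get(link)
        if observed is None:
            issues.append(f"{link}: expected {expected}, link not discovered")
        elif observed != expected:
            issues.append(f"{link}: expected {expected}, found {observed}")
    for link in discovered.keys() - intended.keys():
        issues.append(f"{link}: unexpected link to {discovered[link]}")
    return issues

intended = {("tor-01", "uplink-0"): "leaf-plane0-01"}
discovered = {("tor-01", "uplink-0"): "leaf-plane1-03"}  # miscabled uplink
print(wiring_diff(intended, discovered))
```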

We will be publishing IETF drafts in the near future outlining these basic concepts, so that the entire community can both enable, and be enabled by, this work.


For those interested in hearing further thoughts on this topic, co-author Shawn Zandi will be speaking about “LinkedIn’s Approach to the Programmable Data Center” at Interop ITX on Friday, May 19 at 9 a.m.