Project Falco: Decoupling Switching Hardware and Software

February 1, 2016

Three years ago we had a serious latency problem with applications inside our data centers. We were not scaling our network infrastructure to meet the demands of our applications: high speed, high availability, and fast deployments. We knew we needed greater control of features at the network layer, but we hit a roadblock in figuring out how.

The Production Engineering Operations (PEO) team found it very difficult to meet the demands of our applications while network routers and switches were beholden to commercial vendors, who control both the feature set and the pace of bug fixes. A year ago, we launched a project called Falco focused on decoupling network hardware and software. The Falco working group put 11,520 hours into developing our first network switch, Pigeon, which gives us this control. Pigeon will be deployed at a larger scale in our next-generation data center design in Oregon.

Pigeon is a 3.2Tbps switching platform that can be used as a leaf or spine switch. Pigeon is our first foray into active switch software development. We are not venturing into developing our own switch because we aspire to become experts in the switching and routing space, but because we want control of our destiny. We continue to be supportive of our commercial vendors and work with them in a decoupling model.

Journey into developing our switch platform

This all started when software engineering teams brought a problem to the PEO team’s attention. These teams were seeing high latency in their applications inside the data center. It was a challenging problem, and none of the traditional logging mechanisms revealed anything conclusive. Several network engineers worked tirelessly to solve it, eventually discovering that it was a microburst problem. A microburst occurs when packets arrive in rapid succession, driving a link to full line rate for a brief period and overflowing the packet buffers in the network stack. It is a hard problem to detect because the buffers are inside third-party merchant silicon chips and are not entirely exposed by commercial switch vendors.
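To make the microburst effect concrete, here is a minimal sketch of a single egress queue in Python. The tick length, drain rate, buffer size, and traffic patterns are all hypothetical numbers chosen for illustration; they are not measurements from our network or parameters of any particular chip. The sketch shows how two traffic patterns with the same average utilization can behave very differently once buffering is taken into account.

```python
def simulate_queue(arrivals, drain_per_tick, buffer_capacity):
    """Simulate one egress queue and return (peak_depth, dropped_bytes).

    arrivals: bytes arriving in each tick
    drain_per_tick: bytes the port can transmit per tick (its line rate)
    buffer_capacity: bytes the queue can hold before it starts dropping
    """
    depth = 0
    peak = 0
    dropped = 0
    for incoming in arrivals:
        depth += incoming
        if depth > buffer_capacity:
            # Anything that does not fit in the buffer is lost.
            dropped += depth - buffer_capacity
            depth = buffer_capacity
        peak = max(peak, depth)
        depth = max(0, depth - drain_per_tick)
    return peak, dropped


# Hypothetical values, chosen only to make the effect visible.
TICKS = 1000
DRAIN = 100    # bytes per tick
BUFFER = 500   # bytes

# Smooth traffic: 20 bytes on every tick.
smooth = [20] * TICKS

# Bursty traffic: the same total volume packed into 10 consecutive ticks.
bursty = [0] * TICKS
for t in range(100, 110):
    bursty[t] = 2000

# Averaged over the whole window, both patterns look identical.
avg_utilization = sum(bursty) / (DRAIN * TICKS)

smooth_peak, smooth_drops = simulate_queue(smooth, DRAIN, BUFFER)
burst_peak, burst_drops = simulate_queue(bursty, DRAIN, BUFFER)

print(f"average utilization: {avg_utilization:.0%}")
print(f"smooth: peak depth {smooth_peak} bytes, dropped {smooth_drops} bytes")
print(f"bursty: peak depth {burst_peak} bytes, dropped {burst_drops} bytes")
```

A counter polled once per window would report the same modest utilization for both patterns, which is why coarse-grained monitoring told us nothing conclusive. Only per-tick visibility into queue depth reveals the overflow.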

When we concluded that microbursts were the cause of the high latency seen by the applications, we tried very hard to find ways to predict these short-burst buffer overflows. But the more we looked at the problem, the harder it was to come up with a simple and elegant solution.

We then started to look at the problem from a different perspective. What if we had the ability to get all the telemetry data from the switch’s merchant silicon chip? The vendors we buy switches from do not expose the telemetry information, nor do they provide read/write access to the third-party merchant silicon. Our only avenue for solving the microburst problem was to rely on our switch vendors to provide solutions, and those solutions did not arrive quickly enough for our fast-paced environment.

We then started to look at the other challenges we faced with commercial vendor switches besides telemetry. These included:

Bugs in software that could not be addressed in a timely manner
Software features on switches that were not needed in our data center environment, along with the bugs those features brought with them
Lack of a Linux-based platform for automation tools, e.g. Chef, Puppet, CFEngine
Out-of-date monitoring and logging software that left us reliant on SNMP
High cost of scaling the software license and support

We looked at the challenges we encountered with vendor-based switches and asked ourselves, “What is the ideal switch we want to have, where we have more control of our destiny?” We came up with the following capabilities we wanted in our switching platforms:

Run our merchant silicon of choice on any hardware platform
Run some of the same infrastructure software and tools we use on our application servers on the switching platform, for example, telemetry, alerting, Kafka, logging, and security
Respond quickly to requirements and change
Advance DevOps operations such that switches are run like servers and share a single automation and operational platform
Limitless programmability options
Feature velocity
Faster, better innovation cycle
Greater control of hardware and software costs

When we started to look at the kind of switch we wanted to have, the microburst/buffer problem that got us thinking down this path turned out to be just a single dimension. Having control of programmability on switches opened up a world of possibilities for our ambition of a programmable data center.

In the last two years, we have been observing a shift toward hardware and software disaggregation. The term disaggregation was coined because the Original Design Manufacturer (ODM) market opened up its hardware to anybody who wished to buy the switches it manufactured, and was no longer exclusively manufacturing switches for name-brand commercial switch vendors. This meant that a content company operating at scale could buy a switch from an ODM supplier and put any software on it. It also meant that we could work directly with multiple merchant silicon chip vendors and have full access to programming the chipset (e.g. Trident, Trident II, Tomahawk).

Almost a year ago, we started to look at developing our own switching platform. Using the guiding principles mentioned above, we came up with our base architecture, depicted in the diagram below.

The application layer is where we focus on using server-based tools that already exist as part of LinkedIn’s infrastructure. LinkedIn tools are an array of infrastructure automation tools used to manage configuration and automation. Auto-Alerts is a monitoring and alerting client tied to Nurse, an auto-remediation platform. The application layer on the switch also lends itself to supporting the Kafka client. Kafka is a publish/subscribe messaging pipeline system that we use heavily for metrics. A telemetry client interfaces with the merchant silicon SDK to obtain advanced buffer statistics.
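As a rough illustration of how a telemetry client in the application layer might hand buffer statistics to a metrics pipeline, here is a minimal Python sketch. It is not Pigeon's actual client. The names collect_and_publish, read_buffer_stats, and publish, the topic name, and the record fields are all hypothetical: read_buffer_stats stands in for whatever call the merchant silicon SDK provides, and publish stands in for a Kafka producer's send method. Both are passed in as plain callables so the sketch runs without any SDK or broker.

```python
import json
import time


def collect_and_publish(read_buffer_stats, publish, switch_name,
                        topic="switch-buffer-stats", now=time.time):
    """Read per-port buffer depths once and publish one record per port.

    read_buffer_stats: callable returning {port_name: buffer_depth_bytes};
        a hypothetical stand-in for a merchant silicon SDK call.
    publish: callable taking (topic, payload_bytes); a hypothetical
        stand-in for a Kafka producer.
    Returns the number of records published.
    """
    stats = read_buffer_stats()
    timestamp = now()
    sent = 0
    for port, depth in sorted(stats.items()):
        record = {
            "switch": switch_name,
            "port": port,
            "buffer_depth_bytes": depth,
            "timestamp": timestamp,
        }
        publish(topic, json.dumps(record, sort_keys=True).encode("utf-8"))
        sent += 1
    return sent


# Demo with fake data and an in-memory sink instead of a real broker.
def fake_stats():
    return {"eth2": 0, "eth1": 4096}


outbox = []


def fake_publish(topic, payload):
    outbox.append((topic, payload))


count = collect_and_publish(fake_stats, fake_publish, "pigeon-lab-01",
                            now=lambda: 1454284800.0)
first = json.loads(outbox[0][1].decode("utf-8"))
print(count, first["port"], first["buffer_depth_bytes"])
```

Keeping the SDK read and the publish step behind plain callables means the same loop can be exercised in a lab with fake data before it is wired to real hardware and a real message pipeline.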

Pigeon takes flight to production!

At LinkedIn, we do canary releases of software, where we push code changes to a small number of hosts. The goal of the canary test is to ensure that the code changes are transparent and work in a real-world environment. It isn’t different when it comes to infrastructure. After three months of lab testing, we isolated an environment in production and put a number of switches into a canary test there, carrying live member-impacting traffic.

The architecture of the switch is based on the latest Tomahawk merchant silicon, which provides 3.2Tbps of switching capacity across 32 x 100G ports.

Why did we call it Pigeon? We use Birds of a Feather (BoF), a concept where engineers get together to solve problems and propose solutions. This inspired us to use different types of birds when naming our upcoming switch platforms.

Future work

We will continue to advance project Falco in 2016. We are also interested in switch platform vendors supporting the Open Network Install Environment (ONIE) and giving access to their ASIC and merchant silicon so we can run our software applications on their hardware platforms. Switch Abstraction Interface (SAI) support is also on our 2016 roadmap.

Acknowledgements

Pigeon is based on the efforts of the project Falco working group in PEO at LinkedIn. Special thanks to Shawn Zandi, Saikrishna Kotha, James Ling, Sujatha Madhavan, Navneet Nigori, Yuval Bachar, and program managers Vish Shetty and Fabio Parodi.
