Lessons Learned from LinkedIn’s Data Center Journey

February 1, 2018

Editor’s note: LinkedIn Engineering VP Sonu Nayyar recently gave a talk on the evolution of LinkedIn’s data center strategy at DCD Zettastructure Singapore and then again at DCD Converged Hong Kong. The LinkedIn Engineering Blog team has asked Sonu to share the summarized lessons learned from his talk.

I recently had the pleasure of speaking to groups of data center executives and strategists at the DCD Zettastructure conference in Singapore and at the DCD Converged conference in Hong Kong. Putting my thoughts together for these talks gave me a rare chance to step back and reflect on LinkedIn’s amazing infrastructure journey and our accomplishments over the past eight years.

Over the last several years at LinkedIn, our steep growth in members, sessions, and engagement has generated an exponential need for data storage, compute and network infrastructure performance, and data management at a scale that none of us had experienced previously. Using our guiding principles of craftsmanship and transformation as a foundation, we learned from our challenges and accomplished what seemed unimaginable to us at the time. To that end, I’d like to share some key lessons we learned.

“Site-Up” is priority #1

I joined LinkedIn in 2010, and on my first day at work, we had a site outage because our infrastructure could not keep up with the growth of traffic on our site. Not to be outdone, the next day, we had another outage! Coming into a new role, I was a bit shocked to learn that site issues were occurring frequently. Like any company experiencing significant growth, we faced the difficult task of balancing innovation and the deployment of new features against site reliability. At LinkedIn, one of our values is “members first,” so we knew we had to optimize for speed while growing our business, with our member experience remaining top of mind.

We chose to buckle down and rebuild team capabilities around our primary principle of always putting members first. Keeping our site up became the number one operating priority, and in the first few months, this meant reacting quickly and putting out fires. We then focused on two key areas:

Culture and team: We put serious effort into craftsmanship, not only by growing the technical skills of our team but also by hiring the best talent we could find. Most importantly, we instilled a culture of accountability, ownership, and transparency.
Tooling: We put a big emphasis on implementing a proactive set of tools that significantly improved infrastructure resiliency and capacity. We also added new dimensions to our monitoring and alerting so our Engineering & Operations teams could anticipate and react to issues much faster than before.

We tied everything we did to our member experience, and “site up 100% of the time” became a core part of our identity.

Infrastructure at scale

Two years later, we hit another inflection point. Our site continued to grow at a rapid rate, so we were constantly adding data center capacity. At that time, our data center was hosted with a retail data center provider, and it became challenging to scale on demand in a cost-effective manner.

During a quarterly business review with senior leadership, we were asked to explain what would happen if we suddenly had to scale our infrastructure for an unanticipated event, such as a new feature going viral. The terrifying answer: we would have a site capacity issue and fail to deliver on supporting our members. The outcome from that review, though it felt like a crisis at that time, gave us clear direction about the future of infrastructure at our company. In order to build a cost-effective and sustainable business, we needed to change our data center strategy. Instead of relying on third-party data center vendors, we needed to operate and manage our own data centers. It was a pivotal decision and created a new principle for us: “control our own destiny.”

Another cornerstone of our emerging strategy was multi-colo. The goal was to improve member experience and reliability by being able to serve our applications from multiple data center sites. This, in turn, would allow us to sustain multiple types of system failure without the site going down for our members. The combination of building our own data centers and architecting our application to serve traffic actively from disparate geographic locations gave us a solid foundation on which to build for the next stage of our growth.

Innovate for hyperscale

As we executed on our data center strategy of increasing reliability and improving the member experience, we continued to experience massive growth. We opened a new data center every year from 2013 to 2015, planning a new data center build even as the latest one was just coming online.

However, now that we had a stable foundation with our multi-colo approach, we realized that there might be other ways to keep up with our “hockey stick” growth in both members and traffic. Since our site was more reliable, we were finally able to switch from a culture of “site-up” to one of “taking intelligent risks” and exploring other ways to meet the needs of our scale. To this end, we decided to build our next data center for hyperscale: one that could scale to 10x what our existing facilities delivered.

Innovating for hyperscale was not just a technical change; we needed another cultural and organizational shift, too. This led to a different way of approaching our problems by focusing on end-to-end optimization rather than individual issues. We formed a cross-functional team to research and determine our next-generation design. This team was given four guiding principles: unlimited bandwidth, compute on demand, programmable data center, and scale cost-effectively.

The team had the autonomy to make decisions while working alongside the rest of the infrastructure teams. That research led to the design of Project Altair, a highly responsive data center fabric that could be scaled horizontally without changing the fundamental architecture of the network or interrupting its core during upgrades.

Another milestone in our innovation journey arrived in late 2016, when we opened our Oregon data center, implementing the innovative data center designs from Project Altair.

Finally, we were deploying the right infrastructure for our future needs: something completely different, more scalable, and more responsive.

By learning from the challenges we had encountered, we came up with a model that we were proud to share with the industry. We’re now in the process of upgrading the rest of our data centers to the new standard set by the Oregon data center.

Looking ahead

As exciting and unimaginable as the last eight years have been, those of us in the data center strategy community have a strong hunch that the next eight years are going to be even more dramatic.

To meet the increasing demands of our site, we are continuing to build on our solid foundation and to incorporate forward-looking ideas and innovations into our data centers. One of these is Open19, a new multi-company effort we recently launched to make data center hardware more interoperable and efficient. Other ideas we are exploring include OpenFabric, a better control plane for the data center fabric, and self-healing infrastructure. We are investing in software-driven infrastructure where we can leverage our vast telemetry and apply machine learning techniques to predict failures and remediate them automatically. As always, we plan to share our experiences and projects with the larger community, with the hope that others can benefit from our lessons learned.

We look forward to working on behalf of our more than 530 million members worldwide to ensure that their needs are not only met, but also anticipated by our data center strategy. When we get it right, we help bring LinkedIn closer to our mission to connect the world’s talent to opportunity. We can't wait to see what's next!
