Data Center Learnings: What Others Can Learn From Our Experiences

Neil Pinto

VP, Infrastructure Engineering @ LinkedIn | Skilled Technology Leader

December 8, 2015

Co-authors: Michael Yamaguchi, Kelly Shea, Zaid Ali, and Sai Sundar.

LinkedIn is growing at a rate of two new members per second, so our infrastructure and data center systems have to expand to match. Thousands of servers and virtual nodes around the globe make it possible for you to search for new connections, read Influencer articles, and participate in group discussions. But it wasn’t always like this. When LinkedIn started, we had approximately 20 servers hosted in our first data center in 2008. Today, we have four data centers around the globe, with at least two more coming online in 2016, as announced in our most recent data center blog post. Here’s a look at how we grew a multi-colo operation, and some of our learnings along the way.

From 2008 to 2012, we ran two colocation data centers in a traditional active-passive mode and faced a few key issues: only read scaling was possible, due to hardware underutilization in one data center; the failover process was time- and resource-consuming; maintenance had to be pre-planned; and scaling was bound by data center capabilities. The result was perpetual fire-fighting for the team. These challenges inspired the team to use off-the-shelf software to move to an active-active mode: true multi-colo. This project was completed in late 2013 and has given our operations and engineering teams more flexibility without compromising our user experience.

When we moved into the wholesale arena in 2013, our mantra became “Manage to the Load.” If we are only consuming 500 kW of power, then we should run only as much backend supporting infrastructure as that demand requires. By monitoring our power consumption, temperature, and humidity, we are able to work closely with our service provider to adjust to the conditions, thereby reducing energy costs.

We’re also on a quest to reach an annualized Power Usage Effectiveness (PUE) of 1.2 or below for our new data center coming online in late 2016. In our existing facilities, we’ve implemented basic cooling methodologies, including hot/cold aisle configuration, blanking panels, and aisle doors, which help us maintain an average annualized PUE of 1.55. While we consider this acceptable for the regions and current mechanical designs, we’re striving for 1.2. We’re investing in a Rear Door Heat Exchanger (RDHX) solution, which neutralizes the heat closer to the source, thereby reducing the energy we would otherwise spend transferring that heat to a cooling medium farther away.
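To put those PUE figures in perspective: PUE is the ratio of total facility power to the power delivered to IT equipment, so a PUE of 1.0 would mean zero overhead for cooling and power distribution. The sketch below is a rough illustration rather than a description of our actual facilities. It assumes the 500 kW figure mentioned earlier is the IT load, which is an assumption made purely for the arithmetic, and the function names are hypothetical.

```python
def total_facility_kw(it_load_kw: float, pue: float) -> float:
    """Total facility power implied by an IT load and a PUE value.

    PUE = total facility power / IT equipment power,
    so total = IT load * PUE.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_load_kw * pue


def overhead_kw(it_load_kw: float, pue: float) -> float:
    """Power spent on everything other than IT equipment (cooling, distribution)."""
    return total_facility_kw(it_load_kw, pue) - it_load_kw


if __name__ == "__main__":
    it_load_kw = 500.0  # ASSUMPTION: treats the 500 kW example as pure IT load

    current = overhead_kw(it_load_kw, 1.55)  # PUE of existing facilities
    target = overhead_kw(it_load_kw, 1.20)   # PUE goal for the new data center

    print(f"Overhead at PUE 1.55: {current:.0f} kW")
    print(f"Overhead at PUE 1.20: {target:.0f} kW")
    print(f"Overhead reduction:   {current - target:.0f} kW")
```

Under that assumption, moving from 1.55 to 1.2 cuts non-IT overhead from roughly 275 kW to roughly 100 kW for the same IT load, which is why the cooling design matters so much.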

Equally important to our data center strategy are our Points of Presence (POPs), which are small-scale data centers that cater to carrier and network interconnections and often house edge content such as CDNs. POP location is critical: there are only a handful of locations around the globe that allow for interconnection between carriers and networks. LinkedIn developed a machine learning algorithm for POP selection, called SCOUT, that uses member information to determine where to build the next set of POPs. Currently, we have 15 POPs around the globe, where members’ TCP connections are terminated as close to them as possible from a geographical and network hop perspective. To serve content as fast as possible to members, you usually have to be in large metro areas where interconnection and peering happen, but space, power, and cost are a challenge there. Globally distributed POPs allow us to decouple the compute nodes from edge content services so that we can scale large compute nodes at lower cost in places where power and space are the determining factors. With the support of LinkedIn’s global backbone, we are able to connect our data centers to POPs that manage the traffic between edge content and back-end services, helping us deliver data to our members with a page download improvement of more than 25%.
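One way to see why terminating TCP close to the member helps is a simple round-trip model. The sketch below is a simplified illustration, not a description of SCOUT or of our actual traffic path: it assumes a TCP handshake costs one round trip, a request and response cost one more, and the POP already holds an open connection to the data center over the backbone. All latency values and function names are hypothetical.

```python
def direct_ms(rtt_client_dc_ms: float) -> float:
    """Time to first response when the member connects straight to the data center.

    One round trip for the TCP handshake plus one for the request/response,
    both paid over the full client-to-data-center path.
    """
    return 2 * rtt_client_dc_ms


def via_pop_ms(rtt_client_pop_ms: float, rtt_pop_dc_ms: float) -> float:
    """Time to first response when TCP is terminated at a nearby POP.

    The handshake only travels to the POP. The request then crosses the
    client-to-POP leg and the POP-to-data-center leg once each.
    ASSUMPTION: the POP reuses an already-open backbone connection.
    """
    handshake = rtt_client_pop_ms
    request = rtt_client_pop_ms + rtt_pop_dc_ms
    return handshake + request


if __name__ == "__main__":
    # Hypothetical latencies, chosen only to make the arithmetic easy to follow.
    rtt_client_dc = 150.0   # member to distant data center
    rtt_client_pop = 20.0   # member to nearby POP
    rtt_pop_dc = 130.0      # POP to data center over the backbone

    print(f"Direct:  {direct_ms(rtt_client_dc):.0f} ms")
    print(f"Via POP: {via_pop_ms(rtt_client_pop, rtt_pop_dc):.0f} ms")
```

With these made-up numbers, the model gives 300 ms direct versus 170 ms through the POP. Real gains depend on TLS, connection reuse, and routing, none of which this sketch captures.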

We still have a long way to go in our data center evolution. In the next couple of years, we’ll be adding POPs and data centers, while continuing to lower our PUE, increase our speed, and provide a more resilient service to our members.

Topics: Data Management