Hiring SREs at LinkedIn
August 27, 2014
What are LinkedIn SREs?
Site Reliability Engineers at LinkedIn are responsible for architecting and operating a reliable, accessible site for our more than 313 million members. We operate over 450 independent services to bring you LinkedIn and are responsible for making sure these services are always available, worldwide, no matter what.
This is an awesome responsibility, and it’s one that requires a certain type of person. One with a unique set of skills, who…
(I don’t know where in the code you are, NullPointerException, but I will find you, and I will debug you.)
Well, maybe not quite that extreme. But it does need someone special. You not only have to have the skills to operate a site of our scale, but you also have to be able to read and write code to help debug problems and automate your work. When you have over 40,000 production servers to deal with, you can’t do anything one machine at a time. You’ve also got to be able to architect services that operate in an environment with multiple active datacenters, handle tens of thousands of queries per second, and stay up even if other services are down or slow. If that isn’t enough, you have to be able to talk to other engineers, management, your peers, and more – sometimes while you are in the middle of troubleshooting a stressful problem.
The process we use
Hiring these people is hard. Finding them is hard, and verifying that they have these skills in a reasonable amount of time is even harder. In this post, I’ll talk about the typical hiring process for SREs at LinkedIn. It all starts with some phone calls. First, one of our recruiters will call you to ask why you’re considering LinkedIn and what your availability is, and will then ask some general-knowledge systems administration questions. Assuming you pass those, we send you on to a code-focused phone screen with an existing SRE. During this phone screen, we’ll ask you to work through some coding problems that are mostly related to things you might do on the job, like log parsing or using a RESTful API.
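To give a flavor of what a log-parsing problem of this kind could look like, here is a minimal Python sketch that counts HTTP status codes in access-log lines. It is only an illustration, not an actual interview question: the log format (a Common Log Format-style line), the sample data, and the function name count_statuses are all assumptions made up for this example.

```python
import re
from collections import Counter

# Assumed (hypothetical) format: Common Log Format-style lines, where the
# quoted request is followed by a three-digit HTTP status code.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})\b'
)


def count_statuses(lines):
    """Return a Counter mapping status code (as a string) to its line count."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # lines that do not fit the assumed format are skipped
            counts[match.group("status")] += 1
    return counts


# Made-up sample data for illustration only.
SAMPLE = [
    '10.0.0.1 - - [27/Aug/2014:10:00:00 +0000] "GET /feed HTTP/1.1" 200 512',
    '10.0.0.2 - - [27/Aug/2014:10:00:01 +0000] "GET /profile HTTP/1.1" 500 0',
    '10.0.0.3 - - [27/Aug/2014:10:00:02 +0000] "POST /login HTTP/1.1" 200 128',
    "this line is not in the expected format",
]

if __name__ == "__main__":
    for status, count in sorted(count_statuses(SAMPLE).items()):
        print(status, count)
```

Skipping malformed lines rather than raising is a deliberate choice in this sketch: real logs are messy, and talking through what to do with the lines that don't parse is usually part of the conversation.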
Once we’ve determined you can code, you’ll do another phone screen with an SRE where we ask you about operational issues – things like how the Internet works, what you’d do in the shoes of a LinkedIn SRE to help scale the site, and how you would envision monitoring a site as complex as ours. Assuming that goes well, we have you come on site to meet the team.
How we make it interesting
Once you’re on site, the real fun starts. The visit usually lasts a few hours, but we pack a lot in. During your onsite, you’ll get to experience lunch at the office (each office with SREs has daily lunch provided) and meet one of the SRE managers, who will ask you to tell some war stories and share more about who you are and what you’re looking for from your next opportunity.
We can’t let you get away without assessing your technical skills, and in most interviews you’ll have 3 separate chances to show them while you’re on site. A big part of the interview involves a live troubleshooting exercise. This is as real as it gets as an SRE – you will be given a laptop connected to a running service, and will have to troubleshoot some problems with it. It’s just like being on call, only with a buddy there to help you if you get stuck. (If you’ve never been on call before, you’re in for a treat. Basically, being on call means you are responsible for your part of the LinkedIn infrastructure, 24/7, for a whole week! You have help, but you’re the first point of contact.) During the exercise, you can use your favorite search engine, look at man pages, and do anything else you’d normally do to troubleshoot.
One of the cool things about being an SRE at LinkedIn is that any time some problem happens, you are quite likely the first person in history to troubleshoot that exact issue. Given the complexity of our stack and the interactions between services, every time is like the first time. That’s why we have the live troubleshooting module – we want you to experience what this is like!
Besides live troubleshooting, another important part of being an SRE is triaging issues. Eventually, you’ll be on call (see above) and responsible for the operation of the site. When that happens, you might see several alerts all firing at the same time. We’ll ask you to prioritize a sample set of alerts and to talk about why you prioritized them in that way – and then we’ll ask you some troubleshooting questions. This activity is also designed to emulate what you really do on the job and to give you a taste of what it’d be like as a LinkedIn SRE. You won’t have our awesome tools like inGraphs, but the activity is designed so that you won’t need them.
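If it helps to picture the triage exercise, here is a small Python sketch of one possible way to order a set of simultaneous alerts. This is not the rubric used in the interview, and it has nothing to do with inGraphs; the Alert fields, the sample alerts, and the heuristic itself (user-facing problems first, then the largest blast radius) are assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # All fields are hypothetical; real alerts carry far more context.
    name: str
    user_facing: bool        # does the problem directly affect members?
    affected_services: int   # rough blast radius


def prioritize(alerts):
    """Order alerts: user-facing first, then by affected services, descending."""
    return sorted(
        alerts,
        key=lambda a: (not a.user_facing, -a.affected_services),
    )


# Made-up sample alerts for illustration only.
SAMPLE_ALERTS = [
    Alert("disk 80% full on one batch host", user_facing=False, affected_services=1),
    Alert("login error rate spiking", user_facing=True, affected_services=3),
    Alert("internal metrics pipeline lagging", user_facing=False, affected_services=12),
    Alert("profile page latency elevated", user_facing=True, affected_services=1),
]

if __name__ == "__main__":
    for alert in prioritize(SAMPLE_ALERTS):
        print(alert.name)
```

The interesting part of the exercise is the reasoning rather than the sort key: a different, well-argued ordering (for example, chasing the alert most likely to be the root cause of the others) can be just as valid.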
We usually end by having you get in front of a whiteboard and talk architecture with us. A huge part of the SRE role is to be involved from the beginning with architecture and design decisions for new LinkedIn services. So, as part of the interview process, we’ll ask you to walk us through a standard large-scale website architecture. We’ll talk about strong points and weak points. We’ll challenge you with failures of components. We’ll ask about how to handle fault-tolerance, geographic distribution, cache-busting, and all the other things you have to consider to operate a site at the scale that we do.
SREs at LinkedIn have to move quickly to solve problems, and we also move quickly after your interview. We’ll let you know within a few days (no weeks-long processes here), and then if all goes well, you’ll get to jump in and start helping us build opportunity!
Our engineering slogan is "Build Opportunity". It really encapsulates what we do here at LinkedIn. We create economic opportunity for every professional in the world.
Want to have LinkedIn as your next play?
If you think you can handle the demands of operating our site, and the job and process described here sound interesting and exciting, we’d love to have you join our team! Please go to this link and submit your LinkedIn profile and you’ll hear from us. We have positions in Mountain View, Sunnyvale, San Francisco, New York, and Bangalore – so no matter where you roam, you’re likely not far from a LinkedIn SRE!