Building the SRE Culture at LinkedIn
May 15, 2017
Being a Site Reliability Engineer (SRE) means having to talk about hard problems. Site outages, complex failure scenarios, and other technical emergencies are the things we have to be prepared to deal with every day. When we’re not dealing with problems, we’re discussing them. We regularly perform post-mortems and root cause analyses, and we generally dig into complex technical problems in an unflinching way.
Strangely, talking about culture in an SRE organization can sometimes be much harder. At LinkedIn, we often discuss how our culture is just as important as our products. Yet it’s much more difficult to define a blueprint for teams and companies to create the right culture. If there were easy-to-follow steps, we might not see as many issues in tech with things like diversity and inclusion. As it stands, however, many companies want to create a positive culture, but aren’t always sure how to embark on that process.
A post-mortem on our SRE culture
I certainly don’t claim to have a one-size-fits-all solution or a template for creating the right engineering culture. However, two of the engineers on my team recently told me they planned to share how they feel that the LinkedIn culture is unique—that they feel valued and supported regardless of their background. This has caused me to reflect on the culture of our SRE organization specifically, for the simple reason that I know that it hasn’t always been this way. In this post, I’d like to address some of the changes we’ve made over the years to instill a positive, inclusive culture, and discuss the activities we do on a daily basis to maintain it. While this isn’t a definitive guide, hopefully some of the ideas and experiences can be helpful for others looking to improve or change their corporate culture.
Early years: fighting fires
In the early days of the SRE team, we weren’t even called “SREs.” Our role was more of an amalgam of release management, firefighting, and traditional operations. Our focus was exclusively on getting things done, and we didn’t have a defined culture to speak of. As we’ve discussed before, the LinkedIn site was plagued with reliability issues as it endured hypergrowth, and it was all we could do to keep the figurative lights on—we didn’t stop to think about the culture we were creating, technical or otherwise.
When things finally came to a head, we decided that we needed to make some serious changes to our team in order to fix the problems with our product. We reorganized ourselves as the SRE team, tasked with the clear goal of keeping the site up and running at all times. To align with this mission, we decided to wholeheartedly embrace the values of craftsmanship and ownership across all of engineering. This meant feeling responsible for the site as if we were its owners, and viewing our work as a craft that requires thoughtful execution.
To a large extent, this overhaul was successful. We got the site to a more stable place and pivoted the role of operations to solve problems via software, rather than people and process.
Dealing with culture debt
As SREs, we are always thinking about things like resilience, efficiency, automation, and the overall availability of our member experience. Tackling these issues almost always means working with other SREs or other teams in the wider engineering org, so we need engineers who value the importance of collaborating with others.
Because our technical situation had been so dire during the hypergrowth period, we had come to value technical skills above all else in our hiring and management processes. Instead of considering whether candidates were people who would be great teammates in the long term, we put more weight on how their technical capabilities could help us in the short term. While this approach netted us some very talented engineers, its flaws gradually became apparent. Having people who weren’t good team players made collaborative work, an integral part of site reliability, much more difficult, and in some cases created a negative work environment. Eventually, the experience became painful enough for everyone involved that we realized we needed to make another change. This process was very similar to how technical debt tends to build up in a long-lived codebase. Over time, we made specific changes to the people, philosophy, and process of how we run SRE at LinkedIn in order to pay down this “cultural debt.”
In 2013, we invested a lot of effort in evolving and formalizing our SRE interview process. Part of this included explicitly looking for the collaborative spirit we wanted our engineers to display, in addition to our high technical bar. Gradually, this began to swell our ranks with more and more people who fit the culture we wanted to build, and who did not just bring the technical prowess we hoped to accrue. By the time we reached this level of maturity in our hiring process, we numbered about 100 people on the SRE team, a far cry from the handful of individuals we started out with in the early days. As our organization has grown, the ability to collaborate successfully has only become more directly tied to our technical work. In retrospect, not focusing on this quality in new hires only worked for a while because we were a much smaller organization.
As we have grown, this focus on collaboration has been naturally reinforced by the work we do every day. The evolution in culture meant that those who embraced these collaborative and empathetic qualities naturally rose through the ranks, while those who didn’t often chose to pursue opportunities elsewhere.
David Henke, who was head of Engineering and Operations at LinkedIn for many of our early years, began promoting the mindset of “attacking the problem, not the person.” Our daily work as SREs is to constantly identify and fix problems and bugs, so remembering that we’re all on the same team fighting against site outages fostered a culture of inclusion and equality. The mindset became that an outage wasn’t “my problem,” but “our problem,” and that we were all in it together to fix the issue.
Now, our SRE team is made up of hundreds of engineers in various geographic locations. Scaling culture alongside a team can be challenging, but a big part of what helps us is that our leadership is aligned regarding the environment we want to create. Everyone respects the importance of having a collaborative and inclusive culture, and so it’s a priority to maintain it. Part of the way we do this is by reinforcing our values in our daily stand-up meeting.
Every day, the SRE leaders, along with anyone else who wants to join, participate in a short meeting to go over the site reliability problems from the past 24 hours as well as the immediate preventative fixes we are implementing for each incident. As we discuss these topics, we make sure that we’re approaching the solutions not only from a sound technical perspective, but from a sound cultural perspective as well. For instance, if we see defensive behaviors, we remind people to attack the problem, and not the person. Or, if an outage was a result of a breakdown in communication, we take the time to re-emphasize that we’re all on the same team, and need to view each other that way.
I think a key part of these meetings is that culture is never made a separate point on its own—we always integrate it into the way we discuss the issues of the day (site outages, recurring bugs, etc.). The result is that doing your job properly from a technical perspective includes behaving in line with our cultural values as well; the two are heavily intertwined.
We don’t pretend to be perfect, and we realize we still have a lot of work to do, but hearing from some of my fellow SREs that they felt they were treated with respect and equality makes me feel like we are pointed in the right direction. Hopefully these examples (hiring for cultural fit, making cultural and technical values interconnected, and reinforcing these values on a daily basis) can help others create the culture they want to see in their own organizations.