The Makeup of Successful Geographically-Distributed SRE Teams: Part 1

Why geographically-distributed SRE teams?

In today’s hyper-connected technological world, businesses that grow globally need geographically widespread technical teams to support that reach. However, as development teams continue to ramp up, it quickly becomes infeasible for them to solve operational problems while also scaling the product. That’s where SREs, whose primary job is to develop software to solve operational problems, come in. Investing in geographically-distributed (GD) SRE teams is key to achieving the goal of scaling a business or product to a global audience.

Wide adoption of an idea comes about because of its advantages, so one could wonder: what are the advantages of having GD SRE teams? Which companies should think about trying this? What are the challenges of bootstrapping a GD team? What are the key elements of making GD teams work together successfully? This post will answer these questions and more, using the LinkedIn SRE team’s journey as a reference. Although some principles discussed here apply to all kinds of GD teams, this post is centered around GD SRE teams.

The idea of having GD SRE teams is alluring considering the nature of an SRE's work. No matter what time an issue arises, there's always an SRE team in some time zone in the world ready to respond during their normal business hours. A new geographic location for business also gives access to a new and broader talent pool to address the specific combination of skills (operations and software engineering) required for the job. That said, successfully bootstrapping a remote SRE team is challenging and is followed by a bigger challenge of coming up with and implementing a growth strategy that benefits both the exposure of the team and the careers of the individuals that form the team. The latter is specifically challenging for GD teams and needs to be considered for the long-term sustainability of the team.

This two-part series shares our journey through the challenges associated with bootstrapping GD teams and what worked well for us while facing those challenges. Within this series, we’ll also attempt to answer some key questions that anyone thinking of starting a remote team would have, and share strategies for growing these teams once established. The answers and strategies discussed here are based on experience: some on my own, as someone who began working in a newly formed remote team here at LinkedIn, and some on that of the leaders who were responsible for making the LinkedIn SRE team at the company’s Bangalore office (8000+ miles from the HQ) a success.

Motivations and deciding factors when starting a remote SRE team

Tap into worldwide talent: As the amount of work done by software increases, the number of people required to write and maintain that software grows. Every hiring manager is looking for the right talent for their team at any point in time, but the supply of local candidates for a job eventually runs out. Starting a remote team allows you to hire the best talent possible, no matter where HQ is located.

Local infrastructure and support: Various companies have bootstrapped remote “software development” teams in different countries for a variety of good reasons. These development teams sitting in different countries need infrastructure and platform support in order to be productive. There are several areas within the purview of SREs upon which developers depend. The deployment infrastructure, monitoring infrastructure, the data infrastructure, platform support for systems on which applications are run in production, and handling capacity at scale for various applications receiving different amounts of traffic are a few good examples.

Developers in remote teams will often find themselves needing help with one of these during their everyday development but will have to wait for help for 12 hours (in the best case). On top of that, communication over email takes a toll on productivity. This warrants bootstrapping SRE teams in those remote locations as well. Over the years at LinkedIn India, we’ve seen how having local SRE teams improves developer productivity. When developers in the Bangalore office of LinkedIn were asked, “Does having a local SRE team help you be more productive at your job? True/False. If true, how?” Gopal Venkatesan said, “Yes, quicker turnaround time for any issue/help/consultation,” and Kaushik Srinivasan noted, “True, because you can walk over to someone's desk and get someone to look at issues, rather than wake somebody up in the middle of the night.”

Continuous execution of time-critical projects: Even with good planning, every organization occasionally faces severely time-critical projects, more often than not for business reasons. At such times, having a GD team enables 24-hour execution and faster completion. As an example, we found that safely running hotfix deployments (with limited parallelization) across 120 decently sized clusters in four different data centers took about two and a half weeks with teams from Bangalore and Sunnyvale working together, as opposed to 4+ weeks when only one of those teams performed the task. Note that this requires a good handoff strategy (I will discuss this in detail in Part 2).
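As a back-of-the-envelope check on the figures above, the sketch below computes the speedup implied by the two durations. Treating “4+ weeks” as exactly 4 weeks is an assumption, so the result is a lower bound. The 2x ceiling assumes each site contributes one working shift per day. The function and constant names are illustrative only.

```python
def speedup(single_site_weeks: float, two_site_weeks: float) -> float:
    """Ratio of single-site duration to two-site duration."""
    if two_site_weeks <= 0:
        raise ValueError("two_site_weeks must be positive")
    return single_site_weeks / two_site_weeks


SINGLE_SITE_WEEKS = 4.0  # assumption: "4+ weeks" taken at its lower bound
TWO_SITE_WEEKS = 2.5     # "about two and a half weeks" from the text

observed = speedup(SINGLE_SITE_WEEKS, TWO_SITE_WEEKS)

# Assumption: two sites working back-to-back shifts can at best halve
# wall-clock time, so the ideal speedup is 2x.
efficiency = observed / 2.0

print(f"speedup: {observed:.2f}x ({efficiency:.0%} of the ideal 2x)")
```

Under these assumptions the two-site rollout was at least 1.6 times faster. The gap to the ideal 2x is roughly the cost of handoffs and coordination.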

Site-up support, employee satisfaction: This is a well-known benefit of having GD teams. With GD teams, it becomes possible to keep the site up without keeping people up beyond their normal business hours. This brings satisfaction to both sides: your pager doesn’t go off in the middle of the night, and that is a big advantage. In our experience, it has also improved MTTD and MTTR, as people working in business hours are usually active and wide awake.
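To make the business-hours idea concrete, here is a minimal sketch that reports which site is inside its working day at a given moment, and which hours no site covers. Everything in it is an assumption for illustration rather than a description of LinkedIn’s actual on-call setup: the fixed UTC offsets (which ignore daylight saving time), the 09:00 to 18:00 working window, and all function names.

```python
from datetime import datetime, timedelta, timezone

# Assumptions (not from the post): fixed UTC offsets that ignore daylight
# saving time, and a 09:00-18:00 local business day at both sites.
SITES = {
    "Bangalore": timezone(timedelta(hours=5, minutes=30)),
    "Sunnyvale": timezone(timedelta(hours=-8)),
}
BUSINESS_START, BUSINESS_END = 9, 18  # local hours, end exclusive


def sites_in_business_hours(now_utc: datetime) -> list[str]:
    """Return the sites whose local time is a weekday within business hours."""
    active = []
    for name, tz in SITES.items():
        local = now_utc.astimezone(tz)
        if local.weekday() < 5 and BUSINESS_START <= local.hour < BUSINESS_END:
            active.append(name)
    return active


def uncovered_utc_hours() -> list[int]:
    """UTC hours on a midweek day at which no site is in business hours."""
    day = datetime(2024, 1, 10, tzinfo=timezone.utc)  # a Wednesday
    return [
        hour
        for hour in range(24)
        if not sites_in_business_hours(day + timedelta(hours=hour))
    ]


if __name__ == "__main__":
    print(sites_in_business_hours(datetime(2024, 1, 10, 6, tzinfo=timezone.utc)))
    print(uncovered_utc_hours())
```

With these assumed windows, the check at 06:00 UTC returns only Bangalore, and the hours 2, 3, 13, 14, 15, and 16 UTC come back as uncovered. A handoff plan or a slightly longer on-call window would need to account for gaps like these.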

Diverse ideas: A positive side effect of having a GD team is that you have people from very diverse backgrounds with a wide variety of areas of expertise and experience. This by default helps make products in the org better, with more perspectives from design to execution.

These benefits answer the question, “Why have a remote SRE team?” If you notice, though, they also almost answer the question of when to have a remote SRE team. Here is a more direct answer to “when” in the words of someone who made this choice, Hardik Kheskani, Senior Manager, SRE, at LinkedIn’s Sunnyvale office:

“The timing is critical, as the organization has to have the culture and maturity to scale to a geographically-distributed structure. One key aspect to timing is when you have subject matter experts at the company headquarters who realize the benefits of distributing the knowledge and expertise for the greater good and better, longer term turnaround for the business beyond just having warm bodies for oncall. Secondly, there is a need for similar skill, talent, and cultural fit in the GD locations, as they are on the receiving end and need to be as committed (if not more) to making the program an all-around success.”

Tenets to focus on before initial setup

Now that you have decided that a GD team would help drive your organization’s progress, it is really important you think about a few things before even drafting a plan to make it happen.

Buy-in from the HQ leadership: It is important that the managers and senior managers buy into the idea of having a GD team, because it takes significant investment from them to make this work successfully. It’s not easy to take on the additional meetings on one’s calendar (many of which will fall outside business hours). They also take on additional direct reports and responsibility for their career paths, and they need to help the team through its initial hiccups (there are quite a few). Therefore, set up meetings to explain why you think it is a good idea to have a remote team. Remember, if executive leadership feels otherwise, it may be that it’s not the right time yet. Unless they are on board with the idea, the GD team venture cannot be successful.

Alignment on goals: This ties into the previous point, but it is necessary that the leadership also agrees on what they are trying to achieve by forming the remote team. The goals and expectations for the remote team must be set and aligned. Eventually one would like both the remote and the HQ teams to be running with the same expectations, but starting with smaller, well-planned goals makes this an iterative learning process and helps ensure that business needs stay on track. Set up quarterly goals for the team along the lines of ramping up the team on company processes, technology, stack, and product; regularly assigning project work to the team; including the team in the oncall rotation; and so on. Note that this timeline is your vision for the team, and leadership needs to agree on it.

Process goals and timelines: Once the vision is set, it is time to lay down the plan for bootstrapping the team. Note that the pillars of the plan are not technical goals but structure and communication goals. These goals should include hiring a manager/lead, hiring the first individual contributor (IC), and establishing frequent sync-ups between the managers/leads of remote and HQ teams. Create timelines for hitting these goals, as they can easily slip through the cracks.

Buy-in from HQ team: I mentioned buy-in from the leadership earlier, but it is equally important for the team at HQ to buy into the idea as well, because real coordination and sharing of work is the responsibility of the ICs on the team. It is very important to set expectations with the team on the ideal collaboration mechanisms (these will be discussed later in greater detail) because it takes effort from both sides to get this right. Bring the benefits of GD teams mentioned earlier to the table; they will help with getting everyone on board.

Tenets to focus on during initial setup

First hire: Now begins the execution phase of bootstrapping the remote team. The importance of the first hire cannot be stressed enough. Your first hire is someone who executes and propagates the vision remotely. Every subsequent hire follows their lead to understand the job, the expectations, the protocols, and many other aspects. Ideally this is someone who would go on to take a lead role or would eventually go on to manage the team. It’s important for this hire to focus on areas like building relationships and trust with peer teams, being a domain expert on the products being supported, and driving work which is collaborative in nature in order to stay engaged with HQ teams.

Talent: This is the most important pillar on which the success of the venture depends. Just as with your first hire, it is necessary that you hire the right talent with the right attitude. Collaboration is key in such settings, and any cross-team collaboration will require adjustments. These adjustments are not only related to time zones but also to working style and communication. That means being extremely patient, being receptive to feedback, and knowing that tone can sometimes be misjudged over indirect mediums. Setting expectations right from the start is ideal, so use interviews to do this. Ask about previous remote experience (that sure is a plus). If candidates have such experience, ask about the challenges they faced to gauge how they view the idea of GD teams and the people they worked with. Definitely ask about strong preferences regarding meetings and/or work hours, because dissatisfaction later in the game becomes a challenge. Talk about the various collaboration mechanisms established (this will be discussed in depth later), and ask for ideas that they have around this. Great ideas may come out of this exercise, and if nothing else, it makes for a healthy discussion.

Travel: Speaking from personal experience, putting faces to names (not on conference calls) helps change the dynamics of working relationships. It helps build trust and confidence and eases communication. Imagine responding to an email from someone you have spent 10+ days working with in the office versus a colleague you have never met in person. Similarly, witnessing someone’s work style helps you understand them better professionally. Make sure such visits involve social relationship building with your teammates/colleagues/clients as well, as opposed to just executing projects. The benefits of this need no description.

Evaluation/Directional retrospection: It is important to set up meetings to evaluate progress on the roadmap and to gather feedback if the course taken so far differs from the one that was planned. This helps with course correction: sometimes the feedback helps formulate better plans, and on other occasions it helps avoid losing further sight of the goal. If changes are necessary, the earlier they are identified, the better. The feedback will also show which things are working well and whether increasing focus on those would improve productivity or execution. Large-scope projects which span more than a quarter and are driven mostly in isolation are the ones which sometimes deviate from the original goal due to lack of communication, upcoming features, etc. In such cases, it’s usually better to break the project into phases, hold a retrospective with dev/SRE partners after each phase, and use it to improve the next one.

Autonomy versus dependency: Any task that comes up in the team can be classified as either dependent on resources from HQ or independent of them. It is always easier to assign independent tasks to resources in the non-HQ team, as there will be fewer roadblocks to getting these done on time. Even so, it is important to assign dependent tasks to non-HQ resources early on so as to build and establish methods of collaboration, trust, and camaraderie. The more these teams interact and work with each other in the beginning, the better it will be for the team’s growth and expertise. Eventually this leads to task assignment that is mostly independent of location and based instead on expertise, which results in speedy achievement of goals.

Summary

As shared in Part One of this two-part series, there are many important aspects to consider when thinking of developing GD SRE teams. From getting leadership buy-in to hiring the right talent, it’s a multi-step process, but hopefully this post provides guidance if you’re thinking of undertaking this challenge. Please stay tuned for Part Two, where we’ll discuss scaling the teams after they have been successfully formed, the challenges that come with GD teams, and how those challenges can be converted to advantages or at least faced head-on.