The Makeup of Successful Geographically-Distributed SRE Teams: Part 2
March 27, 2018
In part one of this series, we discussed some of the key principles to consider when developing geographically distributed (GD) SRE teams. As in the first article, we’re using the journey of LinkedIn’s SRE team as the point of reference for the topics discussed here in part two. In this post, we’ll discuss growth planning, the challenges associated with being part of a remote team, and some of the unexpected advantages geographically distributed SRE teams can offer.
Tenets to focus on when planning growth
Quality of work: This deserves a special mention as it is at the core of sustaining a successful GD team, yet it is one of the most challenging principles to achieve. Technically capable engineers need continuous learning to keep themselves motivated, so driving technically challenging work from both HQ and non-HQ teams is necessary. Automation and coding tasks that require collaboration with other teams should not be restricted to HQ. Moreover, it is important to drive high-visibility, high-business-impact work from remote teams, both to enhance the careers of remote engineers and to prove the viability of running a remote team.
It is always better to involve remote teams from the early stages of building a product so as to make the most of the team; this, in turn, ensures the satisfaction of the remote team. There are plenty of examples in the industry where remote teams are used for oncall and maintenance only. This demotivates engineers, and it is not possible to attract and retain top talent with this kind of work.
Oncall: Timing is critical for this one. As mentioned in part one, oncall is an obvious benefit of having GD teams, so there is always an eagerness to start oncall from the remote location early, especially some form of 12x7 oncall (12 hours a day, 7 days a week). It helps people sleep when necessary and come back with fresh minds to work during business hours. However, it is necessary to be mindful about starting oncall support from the remote site and not rush into it. Give the team enough time to ramp up and learn the details of the product. Otherwise, oncall will become mostly about picking up a phone and calling the person with expertise. Have enough people on the team to make sure oncall rotations aren’t too frequent, as that can take a toll.
The alternative to this is limited-hours oncall, for example 8x5 shifts (8 hours a day, 5 days a week), if the need is urgent. This system can also prove to be a good starting point for any team, as it lets them ease into a full-fledged oncall rotation. Teams will find oncall is a good mechanism to learn things on the fly, so delaying it too long isn't ideal.
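To make the difference between the 12x7 and 8x5 models concrete, here is a minimal Python sketch of how a paging system might decide which site holds the pager at a given moment. It is an illustration only, not a description of LinkedIn's tooling: the function names and the shift start hour (03:00 UTC) are hypothetical, and the remote window is assumed not to wrap past midnight UTC.

```python
from datetime import datetime, timezone

# Hypothetical values for illustration; real boundaries depend on office locations.
REMOTE_SHIFT_START_UTC = 3   # remote shift begins at 03:00 UTC (assumption)
FULL_SHIFT_HOURS = 12        # 12x7 model: 12 hours a day, every day
LIMITED_SHIFT_HOURS = 8      # 8x5 model: 8 hours a day, weekdays only


def remote_is_on_call(now: datetime, limited: bool = False) -> bool:
    """Return True if the remote site holds the pager at `now`."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    now = now.astimezone(timezone.utc)
    hours = LIMITED_SHIFT_HOURS if limited else FULL_SHIFT_HOURS
    # Assumes the window does not wrap past midnight UTC (start + hours <= 24).
    in_window = REMOTE_SHIFT_START_UTC <= now.hour < REMOTE_SHIFT_START_UTC + hours
    if limited:
        # weekday(): Monday is 0 and Friday is 4, so weekends fall back to HQ.
        return in_window and now.weekday() < 5
    return in_window


def site_on_call(now: datetime, limited: bool = False) -> str:
    """Name the site that should receive a page at `now`."""
    return "remote" if remote_is_on_call(now, limited) else "hq"
```

With these assumed boundaries, a page at 12:00 UTC on a weekday goes to the remote site under the 12x7 model but falls back to HQ under the 8x5 model, and every weekend page stays with HQ under 8x5. That smaller surface is what lets a new team ease into a full rotation.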
Travel: In part one, I mentioned the importance of face time with non-local teammates. As the team grows, this remains important. It helps for remote engineers to travel and spend time working with teams at HQ. Beyond that, something that may prove beneficial with a larger remote team is for people from HQ to travel to remote offices. This lets all the remote team members receive knowledge transfers at the same time and keep up with changes going on at HQ.
Consider a scenario in which your team needs to work on a project which relies on a lot of coordination with an individual or two from HQ. If most tasks of the project are assigned to individuals from the remote team, it could be beneficial to have that individual from HQ travel to the remote location and spend time with the team to get the project going. This is a more efficient use of resources and helps foster great working relationships.
Mindful growth plan: Building on the success of the initial setup requires a strong growth plan, both for onboarding more of the org's products for support by the remote team and for hiring more engineers onto the remote team. Be mindful not to get carried away with this. First, do not add too many people too fast, as this may result in infrequent and inconsistent work for the team members. At the same time, do not onboard support for too many products too quickly. It may seem exciting to the team at the beginning, but it may also result in only partial support for the affected products for a significant duration of time, as complete product ramp-up could take a long time. Should this occur, it could affect the remote team's productivity and their ability to contribute quality work.
Org/Team structures: Different support models and org structures work for different teams. Attempting to copy org structures from HQ to remote teams may not always be the right step forward. Careful contemplation and retrospection at various steps in the process of building the remote team will help determine which engineers and teams should support which products. Start with a model that seems right and evaluate if it works. If it does not, try a different model. For example, even if having dedicated teams for different products makes sense at HQ, having one remote team support more than one product could work better at the remote site.
When starting out, remote teams spend less time than HQ teams dealing with clients, escalations, and other unplanned work, and therefore have more time available. This time should be used to create well-crafted engineering solutions that help all SREs with their day-to-day operational work. This helps the remote team learn the operational challenges and helps the HQ team deal with them more easily. As mentioned earlier, having a retrospective every few months (or at any regular interval that suits you) and tracking resource utilization can help make these calls; just keep in mind that following the same org structures in both worlds is not necessary.
Finding the “right” talent both at HQ and remotely: Talent is a major driver of success, so it’s crucial to keep the bar high. Keep in mind that the “right” talent means someone with the right technical ability who is also a cultural fit for the organization. Someone with the technical capability but lacking cultural fit could disorient the team and cause discomfort to other team members. However scarce the kind of resources you are looking for, making a compromise on talent is too risky and may disrupt the harmony of an established working team. Once the remote team is established, make sure the candidates (no matter the office location) are aware of the fact that there needs to be active collaboration across all teams, no matter the geographical distance, and accept the associated challenges.
Challenges facing a remote team
How to work around them and, in some cases, turn them into advantages
Lack of local developer support (especially for infrastructure teams): As mentioned earlier, a lot of organizations start developer support in geographically distributed offices, but GD dev teams interact heavily with the company’s infrastructure for their everyday work. The infrastructure (monitoring, data, platform/cloud infra, etc.), which is set up early on in the company’s journey, is mostly located at HQ, so infrastructure SREs in remote locations face the challenge of having developers only at HQ and hence in a different time zone. This presents a few challenges:
Constant reliance on email/documentation for day-to-day information: For remote SREs, it becomes really important to keep a constant grasp on emails and documentation. A lot of information, including ongoing challenges, JIRA tickets describing various issues and ways to fix them, and the status of long-running projects, is passed on in emails or in wikis, so these come in handy while debugging oncall issues. Many changes happening in the system, and the reasons behind them, are also communicated in emails and then documented. Staying on top of these and relaying the information in them to the local team is a challenge, so email and documentation cannot be the only source of communication between the teams. There are a few ways to help with this potential problem:
Handoffs: As helpful as emails are, they still lack a sense of urgency. There needs to be something quick and targeted that only buzzes during urgent issues. This is where “Handoffs” come in. They contain all relevant information about the day’s oncall challenges to keep the offshore team up to speed. They also contain a detailed description of tasks/escalations that need action from the offshore team. On the collaborative instant messaging tools we use, we have specific channels for oncalls, which only buzz the handoff information described above to the relevant engineers. Similarly, there are other channels for specific projects that need to stay running 24/7. People responsible for the projects subscribe to those channels so they are buzzed on those updates.
Dedicated time to read email: It helps to set up dedicated time on the calendar to read emails so that everyone can get caught up and can contribute suggestions about changes in the system being discussed over email. Letting the number of unread emails increase too much causes one to lose sight of some important changes only to later realize that they could have made a suggestion during the discussion earlier on to help the cause.
Time-suitable weekly syncs: Every team has their weekly meetings where topics like burning issues, top priorities, progress on current objectives, etc. are discussed. It is important to have members from both sites attend this, so the meeting should be at a time suitable to both teams. This usually involves compromises from both sides and the hours can be switched from time to time to ensure the compromise is fair.
Increased debugging time: There are several incidents that are escalated to the SRE teams which require in-depth debugging. The SREs will likely spend significant time diving into the code to determine the problem, as they don’t deal with the code day in and day out and therefore won’t be as familiar with it. With the SRE teams in HQ, this can be solved simply by walking over to the developer handling the code and having a quick chat to get the answer. It might take the remote SRE three hours to find that same answer if there is no local development team. There are some benefits to this setup, though:
More frequent code deep dives: Remote SREs develop a habit of diving deep into code and staying close to it, which is always an advantage when it comes to contributing to the code.
Helping align with change in SRE culture: There has been a recent drive in the industry to bring SREs closer to the code. SREs contributing to the code helps make the product better in terms of scalability and operational ease. Frequent deep dives for debugging will make SREs more comfortable when interacting with code.
Difficult to participate in the design stage of features and products: An SRE can make significant contributions to the design of a product when it comes to the scalability and operability of the product. As an SRE, it is also important to be aware of the design of the product you’re responsible for, especially for infrastructure SREs, as that comes in handy when giving advice to clients and colleagues regarding infrastructure usage and debugging site-up issues. As the developers handling design are typically in HQ, though, this becomes a challenge because of time zone differences. There are a couple of ways to work around this issue:
During the initial days of remote setup, encourage the SREs to read design docs and wikis. It is important that, as remote SREs read them, they ask as many questions as required over email to understand the design inside and out. Once they understand it, it is important that they provide comments on things that seem concerning or that can be improved.
As time progresses and the remote team is well established, it is a good idea to schedule separate design reviews between the developers who came up with the design and the remote SREs to streamline the process and reduce the back-and-forth over email.
Dependent vs. independent projects: A constant question in the world of remote teams is how to choose between projects that can be done in isolation and projects that involve collaboration with teams at HQ. Sticking exclusively to one or the other has its downsides. Constantly assigning isolated work to remote SREs could make them lose touch with ongoing developments and changes in the product, whereas constantly choosing collaborative work can create too many follow-ups and blockers for them to make progress. Instead, having a good balance of both in every major planning cycle is a good middle ground. This allows team members to work on independent tasks when they are blocked on a collaborative task that needs input from someone in a different office location.
Hidden advantages of being on a remote team
Comparatively more time available to contribute to craftsmanship: Remote SREs have been known to have more creative time at hand during non-oncall weeks compared to HQ SRE teams. There are two factors that contribute to this:
Infrequent escalations: Until the number of engineers at HQ and at the remote office is equal, there will always be more people using a particular product at HQ. This means there will be more client queries and escalations for HQ teams to deal with.
Non-peak traffic: For several internet companies, most of their user base (during the initial years) lies in time zones close to the HQ. This results in peak traffic hours being the same as HQ business hours and not during remote business hours. Therefore, issues related to capacity arise in HQ office hours more often than they do during remote business hours.
This extra time can be utilized in multiple ways:
Solving recurring operational problems via engineering: The extra available time can be used to identify recurring issues that require manual labor and to track down solutions for them. These issues can be addressed using automation and other engineering alternatives. This helps the teams in both HQ and non-HQ offices spend less time on manual tasks and thus frees up more time for engineering initiatives.
Cross-team collaborative efforts which need engagement and time: This free time offers a unique opportunity to bring people from different teams together and gather data on the most common problems across the organization, particularly those that need input from everyone to solve. If done right, this can provide significant exposure and motivation to the non-HQ team and can drive a lot of value for the company.
Establishing an identity for the non-HQ team: If there is one thing that comes forth in the success story of the SRE team at LinkedIn Bangalore, it’s the leadership team’s effort to establish an identity for the team, which is that of an innovation center. With the advantages mentioned above, the team recognized the opportunity to establish an identity and drive value for the company at the same time. While innovation was already being driven from the team here, a more concrete effort was made by establishing an innovation team, which consists of members from various teams coming together as a virtual team to brainstorm ideas that could affect organization-wide productivity.
Since its creation, the team has selected and implemented a full-fledged solution to reduce MTTR and MTTD for certain site issues and is now on its way to implementing another solution to address a different problem. This effort was recognized by all of the SRE leadership and was well appreciated across the SRE organization, advancing the image of the Bangalore team as an innovation center. Efforts like these help motivate the remote team, establish an organization-wide presence for the team, make better use of remote resources, and drive value for the organization.
The program continues to gain traction, as different members have since joined and contributed to the program while others have temporarily stepped out to fulfill pressing team objectives. This is a testament to the fact that this program is flexible in the way it is built and can benefit from and utilize all resources from the remote office.
Make mistakes, learn, and keep improving
In conclusion, it should be stated that things didn’t always fall into place for the team here; mistakes were made, lessons were learned, and those learnings resulted in corrective actions along with protocols to make sure mistakes aren’t repeated. The things mentioned above are what worked well for us and are shared in the hope that they will help others by providing guidance at the beginning and during times of turbulence. Again, different things work for different organizations. Do share with us if you read something you don’t agree with or something that did or didn’t work for you. We would like to hear alternatives to these ideas and new ideas to make remote SRE teams work better, because these teams aren’t going anywhere anytime soon.
The successful growth of the geographically distributed SRE team at LinkedIn is a proud achievement of all its talented engineers and an incredible leadership team. I would like to extend a special thanks to Hardik Kheskani and Amit Balode, who guided me through writing this post with their valuable reviews and content.