Getting to Know Todd Palino

Clark Haskins

Turning SRE Best Practices into Code

January 23, 2018

LinkedIn wouldn't be the company it is today without the engineers who built it. We have no shortage of talented individuals in technical roles across the company. They are the ones who create, build, and maintain our platform, tools, and features—as well as write posts for this blog. In this series, we feature some of the people and personalities that make LinkedIn great.

Todd Palino is a Senior Staff Site Reliability Engineer working on the Data Infrastructure Streaming team. This team is responsible for the streaming infrastructure that drives the heart of LinkedIn, moving data between the back office and the myriad components that make up the frontend. Todd’s focus has been on Apache Kafka, the publish-subscribe messaging system of choice for many large companies, which moves data for everything at LinkedIn, from application metrics and search to the feed and advanced machine learning.

Prior to joining LinkedIn in December 2013, Todd was a systems engineer at VeriSign for over ten years. There, he was responsible for managing hardware and operating system standards, including developing build and management systems. Before that, he worked at AOL, where he created a system that made it impossible for spammers to use AOL to harass the rest of the internet. He received his bachelor’s degree in computer science from The George Washington University in 1997.

What are some of the coolest projects that you and your team have been working on?
Because Apache Kafka was originally developed at LinkedIn, we have a very strong Kafka development team. Both the Kafka development and SRE teams are very focused on open source work. As part of the SRE team, I took time to develop Burrow, which is an advanced tool for monitoring Kafka consumers. This solved a problem that LinkedIn had with how to monitor our applications, but because I was able to release it as an open source project, it’s also become a staple for many organizations outside of LinkedIn. We’ve recently completed a rewrite of the project as the 1.0 release to make it even easier for the community to engage with and improve it, something that benefits both the community and LinkedIn.

Burrow is one example of our projects that focus on the operability of Kafka, with others like Kafka Monitor and Cruise Control getting a significant response as well. I’m also working on releasing some significant performance improvements to Kafka Mirror Maker, which is used for replicating data from one cluster to another.

What other projects are you involved in outside of Kafka?
Of late, I’ve been shifting my focus to look at problems that affect SRE in general and how to improve the quality of life for all of our engineers. For example, I’ve been heavily involved in the revamp of our processes for incident management, with the goal of making them more consistent and allowing us to move more quickly from problem to mitigation to resolution. I’m also starting to look at the use of another technology that LinkedIn is well-versed in, machine learning, to vastly improve site operations.

What made you first want to be a site reliability engineer?
Ever since high school I’ve been working in some aspect of operations—managing school computer systems in both high school and college, and working as a systems administrator after that. My work always included an element of creating tools to make my job easier, so SRE interested me as soon as I was introduced to the concept. It seemed to be a better description of the work I was doing, and it focused more on the thing I enjoyed best: strategic and proactive work to automate tasks.

What is the most challenging part of your job?
We have an excellent engineering team across the board, and the Streaming team in particular has a strong focus on operability and stability. Unfortunately, this means that we don’t have a lot of “easy” problems to work with anymore. Most of our problems end up requiring a significant investment of resources, in terms of both development and SRE, to solve.

Compared to other places you've worked, how do you like working at LinkedIn?
LinkedIn encourages us to take risks, and this is quite different from my previous positions. Everyone is constantly on the lookout for a new project or an area where we can pick up a little more performance with a change. And while we certainly don’t want to take the site down, mistakes and failures are not penalized. As a team, we are always learning and improving.

We’re also encouraged to engage with the larger community outside of LinkedIn, and this has resulted in a transformation of my career since I started here. I have had the opportunity to speak at numerous meetups and conferences on both Apache Kafka and on the practice of SRE in general, and I’ve even co-authored Kafka: The Definitive Guide.

What are your favorite things to do when you’re not at the office?
Especially with my travel schedule for work, most of my non-work hours are dedicated to my family. My daughters, wife, and I all love musical theater (Hamilton is a permanent fixture in our house right now) and all things Disney. If we’re not at Walt Disney World or on a cruise, we’re planning our next trip.

On my own, I love to go out for a run. I’m on a bit of a break right now, having overdone it a little with eight marathons (and numerous other races) in the last five years, but I’m looking forward to getting out for some shorter runs and races as the weather warms up.
