From Good to World-Class: What Makes Software Engineers Excel at their Craft

This post was originally published on the LinkedIn publisher platform.

There are three criteria we use in performance reviews for engineers at LinkedIn: leadership, execution, and craftsmanship.

The first two are pretty obvious: we’re all used to evaluating how well we get people moving in the same direction and how successfully we ship and get s**t done. But “craftsmanship” is far trickier.

I have had many discussions with people on my team recently about the concept, which we treat not just as a component of performance but as part of our culture. Some of the same questions have surfaced repeatedly: What does it mean to excel at our craft? Can we distill it down to a few key principles and software attributes? Can we turn those into practical considerations as we write the code that helps hundreds of millions of professionals be better at what they do every day? Can we help the fresh-out-of-college engineer joining our team understand what it means to be a good craftsman?

To answer these questions, I developed a framework built around seven fundamental dimensions of software craftsmanship, each of which I consider equally important for building world-class software and teams. They are: code quality, scalability, availability, security, simplicity, performance, and operability.

It all starts with the code we write. We should always assume that someone else will take over our code in the future. We then need to put ourselves in their shoes and ask whether our code is clear and modular enough that they will easily understand what it is trying to do. An organization should make this easier by providing clear code quality guidelines for every supported language; these make writing quality code standard practice.

I’ll always remember the wow factor I experienced the first time I opened a Mac Pro back in 2008. Everything inside was so well organized and laid out. The design was clean, and even the internal wiring was so impeccably lined up that I could see at a glance where each component was: the graphics card, the RAM, the motherboard. We should think of code as the inside of that Mac Pro. Our users don’t see it, but it should still be one of the most beautiful and elegant things you’ve seen, because that makes it easier to maintain and extend.
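To make this concrete, here is a small, purely hypothetical sketch of the same logic written two ways; the function names and data shapes are made up, but the difference in how quickly the next engineer can follow it is the point.

```python
# Hypothetical example: the same logic written twice.

# Harder to follow: one terse function mixes filtering and formatting.
def proc(ms):
    r = []
    for m in ms:
        if m.get("connection_count", 0) > 500:
            r.append(m["name"].strip().title())
    return ", ".join(r)

# Clearer and more modular: small, named steps with an explicit threshold.
MIN_CONNECTIONS = 500

def is_well_connected(member: dict) -> bool:
    """A member counts as well connected above a minimum connection count."""
    return member.get("connection_count", 0) > MIN_CONNECTIONS

def display_name(member: dict) -> str:
    return member["name"].strip().title()

def well_connected_names(members: list[dict]) -> str:
    """Comma-separated display names of all well-connected members."""
    return ", ".join(display_name(m) for m in members if is_well_connected(m))
```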

Scalability can be thought of as the ability of a software system to remain functional when load spikes. If the load on the system spikes the day after a holiday break (which is usually the case at LinkedIn), our products and services should be able to handle it. Building extensible and scalable systems has always been a primary concern for LinkedIn, which is why we have continuously invested in massively scalable components such as Kafka, a high-throughput messaging system that routes more than 800 billion messages per day at LinkedIn.
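As a feel for what publishing to a system like Kafka looks like, here is a minimal sketch using the open-source kafka-python client; the broker address, topic name, and event payload are all made up for the example.

```python
# Minimal sketch with the open-source kafka-python client.
# The broker address, topic, and event payload are illustrative only.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a page-view event; downstream consumers can scale independently
# of producers, which is a big part of how such systems absorb load spikes.
producer.send("page-view-events", {"member_id": 12345, "page": "/feed"})
producer.flush()
```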

Scalability is a broader concern than handling traffic spikes, however. We must also think of it as a function of the organization: building reusable components that other teams and use cases can leverage contributes to scalability too.

As LinkedIn grew in popularity, so did the need of our members and customers to analyze data at scale and in real time, for example to understand how an ad campaign or a given set of job postings performs across hundreds of profile dimensions such as Industry, Function, or Geography. We needed to do rollups and drill-downs across billions of rows and hundreds of dimensions. Instead of building a custom ad hoc solution, we decided to build an online analytics service to serve the wide range of products and services that LinkedIn offers. This is how Pinot, a real-time distributed OLAP datastore, was born. The ROI has been phenomenal: Pinot now supports all online analytics capabilities across LinkedIn’s products and services.
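To give a sense of the kind of rollup described above, here is a sketch using the open-source pinotdb Python client; the broker host, table, and column names are hypothetical.

```python
# Sketch of a rollup query against Pinot via the open-source pinotdb client.
# The broker host/port, table, and column names are hypothetical.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cursor = conn.cursor()

# Roll up ad-campaign impressions by two profile dimensions.
cursor.execute("""
    SELECT industry, job_function, SUM(impressions) AS total_impressions
    FROM ad_campaign_events
    WHERE campaign_id = 42
    GROUP BY industry, job_function
    ORDER BY total_impressions DESC
    LIMIT 20
""")

for row in cursor:
    print(row)
```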

Our systems cannot go down. Ever. We have customers all over the world who rely on our service being available 24/7. If a single failure causes a service interruption, we’ve failed. Losing a node should not cause a cluster to fail, and catastrophic failures such as data center-level outages should not interrupt service either. High availability is such an important concern for LinkedIn that we have built Helix, a cluster management framework with high availability built in. Helix automates the reassignment of resources when a node fails, when a node recovers, or when a cluster is expanded or reconfigured, thereby preventing interruptions of service when parts of the system are in an error state or are down for maintenance.
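To make the idea of automated reassignment concrete, here is a toy sketch (it does not use the Helix API) of the kind of partition rebalancing a cluster manager performs when a node drops out.

```python
# Toy illustration only; this is not the Helix API.
# It shows the kind of reassignment a cluster manager automates on node failure.
from itertools import cycle

def reassign_partitions(assignment: dict[int, str], live_nodes: list[str]) -> dict[int, str]:
    """Move partitions owned by dead nodes onto surviving nodes, round-robin."""
    targets = cycle(sorted(live_nodes))
    return {
        partition: owner if owner in live_nodes else next(targets)
        for partition, owner in assignment.items()
    }

# Six partitions spread across three nodes; node "b" fails.
before = {0: "a", 1: "b", 2: "c", 3: "a", 4: "b", 5: "c"}
after = reassign_partitions(before, live_nodes=["a", "c"])
# Partitions 1 and 4 get new owners; the others are untouched, so the
# cluster keeps serving while the failed node is repaired.
```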

The Internet is a very unsafe place, and we owe it to our users to make our products and services as secure as possible. We all need to write software that does not introduce security vulnerabilities or jeopardize our users’ data. Correcting potential vulnerabilities early in the software development lifecycle significantly reduces risk and is far more cost-effective than thinking about security after the fact and having to release frequent patches. Security needs to be built in. I’d love to give you examples of what we do to keep our site and data secure, and of how every engineer at LinkedIn incorporates security requirements into the code they write, but that information is too sensitive to share in a public post.

Just as we should assume that our code will be maintained and extended by another engineer in the future, the systems we design should be as simple as possible for the task at hand. After all, other folks will have to maintain and operate them. There’s a quote often attributed to Albert Einstein that I like a lot: “You should make things as simple as possible, but not simpler.”

If we make things more complicated than they need to be by over-engineering systems, we create friction and impair the team’s ability to iterate on and enhance the software. This is a balancing act, for sure, since we do need to build software that solves the complex problems at hand. But building something more complex than it needs to be introduces downstream friction for the folks who operate it and, ultimately, for your users. One of the reasons we’ve been able to continuously enhance some of the most complex systems we’ve built at LinkedIn, such as the infrastructure we use to run machine learning algorithms at scale, or our auction-based ad system serving millions of queries per second, is our continued focus on keeping these systems as simple as possible.

A slow page is a useless page. At LinkedIn, we’ve proven through experimentation that degradations in performance have a serious impact on our member experience and the health of our ecosystem, and can hurt key top-line metrics such as signups, engagement, and revenue (not that we were expecting different findings, but we like to test everything through our experimentation platform to know the magnitude of the effects). Every single page or application should be instrumented for performance measurement. Tracking and fixing performance degradations as we introduce more features, products, and services is a very healthy habit, and every engineer must own performance as a key aspect of the code they write.

This is why, leveraging our RUM (Real User Monitoring) framework, every engineer at LinkedIn can (and must) instrument the pages they build for real-time performance tracking. Aggregate page latencies are automatically displayed in a user-friendly dashboard that lets you see how fast a given page or application loads and renders, and slice, dice, and drill through performance measurements across devices (web, iOS, Android, …) or geographies.
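The RUM specifics are internal, so as a generic illustration only, here is a small sketch of the underlying habit: wrap what you serve in a timer and report the latency somewhere a dashboard can read it. The decorator and the recording callback are made up for the example.

```python
# Generic illustration of instrumenting for latency; this is not the RUM API.
import time
from functools import wraps

def timed(page_name: str, record):
    """Wrap a handler and report its latency in milliseconds via `record`."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                record(page_name, elapsed_ms)
        return wrapper
    return decorator

# In real life `record` would ship the measurement to a metrics pipeline;
# printing keeps the sketch self-contained.
@timed("feed", record=lambda page, ms: print(f"{page}: {ms:.1f} ms"))
def render_feed(member_id: int) -> str:
    return f"<html>feed for member {member_id}</html>"
```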

Operability is the ability to keep a software system in a safe and reliable functioning condition. We have several ways of measuring this at LinkedIn. One is the number of Site Reliability Engineers (SREs) needed to operate our software relative to the size of the fleet: is it one, 10, or 100 SREs per thousand nodes?

Other metrics we use include how long it takes to bring an instance to life. If a service takes 15 minutes to boot, it is not very operable. If a service generates millions of meaningless exceptions per hour and clogs the logs, it’s not very operable either. To enhance operability, the best folks to engage are the ones who operate the systems every day and every night.

Whether you’re a full DevOps shop or you partner with an SRE or sysadmin team to operate your software, it’s always good practice to ask operationally minded folks what would make a given service more operable. The benefits are numerous: optimally operable systems are far easier to recover when something goes wrong, and the folks operating them can spend more of their valuable time building new things that create further leverage instead of fighting to keep the system alive and serving.
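As one small, hypothetical example of the metrics above, startup time is easy to measure directly; the health-check URL and timeout below are made up.

```python
# Toy sketch: how long does an instance take to become ready to serve?
# The health-check URL and timeout are hypothetical.
import time
import urllib.request

def seconds_until_ready(health_url: str, timeout_s: float = 900.0) -> float:
    """Poll a health endpoint and return how long the instance took to come up."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            with urllib.request.urlopen(health_url, timeout=2) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except OSError:
            pass  # not up yet, keep polling
        time.sleep(1)
    raise TimeoutError(f"{health_url} not ready within {timeout_s:.0f}s")

# Example: seconds_until_ready("http://localhost:8080/health")
```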

Reflecting on my experience at LinkedIn through years of hypergrowth has shown me that these seven pillars are the most vital to software craftsmanship. Understanding them is critical if we want to build faster, better, and more stable software. Perhaps the most important thing when it comes to craftsmanship, however, and the thing that unifies all of these concerns, is to care. If you’ve read this far, you most probably do. I would love to hear what you think of these seven pillars and whether you would add anything to the list.