Rebuilding messaging: How we designed our new system

Tyler Grant

June 15, 2020

Co-authors: Tyler Grant, Armen Hamstra, Cliff Snyder

Over the last five years, the number of messages sent on LinkedIn has quadrupled. Whether it’s to connect with former colleagues or discuss new opportunities, messaging is core to how our members connect on the platform. As LinkedIn grew, our original email-based messaging system expanded and evolved to not only support this increasing volume of messages, but also onboard new features. With each new feature, the complexity of our system began to snowball, slowing down developer experience and speed—what previously took a week to implement began to take months. It was clear a new architecture was needed, but the path to get there was hazy.

Re-architecting our messaging platform was a major initiative to unlock a better experience for members with greater flexibility and innovation for our engineering and product teams. We’re excited to share how we did it with a new blog series focused on both the cultural and technical aspects of this rebuild. This series will touch on key learnings throughout the process, including common issues for growing companies transitioning from a focus on feature development to engineering at scale beyond a single data center. To kick off the “Rebuilding messaging” series, we’ll be focusing on the design phase in this post.

Where we started: An email-like system

Let’s take it back to 2013: our original messaging product resembled email rather than the chat experience we have today, and Inbox was a monolith running in a single data center with “One Big (Oracle) Database.” There was a simplicity to this architecture that made it easy to reason about. For example, if the database experienced an outage, it was easy to answer the question, "Which members were impacted by this outage?" (Answer: ALL of them.) At the same time, however, we began spinning up additional data centers and migrating toward a NoSQL distributed data storage solution. All of this led to one of the first major changes to Inbox: the introduction of a sharded architecture.

The migration was long and fraught, and the resulting architecture brought a new set of complexities. The idea was to split the database into N shards (starting with N=2) and distribute members' respective inboxes roughly evenly between them. A service called Personal Data Routing (PDR) was introduced to act as the source of truth for which shard contained a given member's data. As we built out more data centers, we put a replication strategy in place so that each shard would have one replica in another data center. Replication was bi-directional, and switching between replicas could be done with a single command. This arrangement had its benefits: if a particular data center was having issues, we could fairly quickly mitigate the impact by switching to the other replica, and there was opportunity for performance improvement by virtue of being able to move the data closer to the member.
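
To make the routing idea concrete, here is a minimal sketch of a PDR-style lookup with switchable replicas. It is an illustration only: the class names, the data center names, and the modulo placement rule are assumptions, not details of the actual Personal Data Routing service.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """One database shard with a replica in each of two data centers."""
    name: str
    replicas: tuple[str, str]  # hypothetical data center names
    active: int = 0            # index of the replica currently serving traffic

    def switch_replica(self) -> str:
        """Switch to the other replica and return its data center."""
        self.active = 1 - self.active
        return self.replicas[self.active]


@dataclass
class PersonalDataRouting:
    """Source of truth for which shard holds a given member's inbox."""
    shards: list[Shard]
    assignments: dict[int, int] = field(default_factory=dict)

    def shard_for(self, member_id: int) -> Shard:
        # Assumption: a new member is placed by modulo to spread inboxes
        # roughly evenly. The result is stored, so later lookups read the
        # recorded assignment instead of recomputing it.
        index = self.assignments.setdefault(member_id, member_id % len(self.shards))
        return self.shards[index]

    def data_center_for(self, member_id: int) -> str:
        shard = self.shard_for(member_id)
        return shard.replicas[shard.active]


pdr = PersonalDataRouting([
    Shard("shard-0", ("dc-east", "dc-west")),
    Shard("shard-1", ("dc-west", "dc-east")),
])
print(pdr.data_center_for(42))  # dc-east
pdr.shards[0].switch_replica()  # mitigate an incident in dc-east
print(pdr.data_center_for(42))  # dc-west
```

Because callers only ever ask the router where a member's data lives, switching a replica changes one field and no client code.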

The next major change to come to Inbox was a product redesign in 2016, introducing many of the features we have today, including group conversations, threaded conversations, emojis, and press-enter-to-send. The system was rooted in its email architecture, but was trying to provide a chat-like experience. It resembled a hybrid between the two, but did neither particularly well. Nonetheless, it was a step forward in product evolution and led to architectural enhancements that made the product look and feel more responsive.

[Figure: diagram displaying the old messaging system]

Our old system created a copy of each message for every participant in a conversation

Fast forward a few years and many iterations. “Inbox” was renamed “Messaging,” and incremental feature additions and enhancements made it a completely different product than it was before. Yet it was still built atop a codebase and data architecture more than a decade old, designed for one-on-one, email-like conversations, which made even small changes to the codebase prohibitively difficult.

Requirements for a new system

Defining requirements for a new architecture is hard. Technical blockers are frequent, and coming up with requirements must take into consideration both company size and culture. Competing pressures to develop new features along the way are also inevitable. A successful project must have requirements that balance the long-term vision with an ability to execute in a reasonable amount of time.

At a high level, the new messaging platform had to scale with not only the growing volume of messages being sent, but also the addition of new developers and features. We made sure our initial set of requirements satisfied both of these major issues as well as various additional pain points that had cropped up over the years.

Product requirements

Deliver messages grouped in conversations accurately, privately, and quickly.
Provide a rich message format.
Search across both conversations and messages.
Be easy to use as a client.
Allow for custom decision-making hooks about whether to send a message.
Allow for custom tracking and logic after a message has been sent.
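
The last two requirements can be pictured as a pair of extension points around the send path. The sketch below is hypothetical: `MessageSender`, the hook signatures, and the example hooks are invented for illustration and do not describe the platform's real API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Message:
    sender: str
    conversation_id: str
    body: str


PreSendHook = Callable[[Message], bool]   # return False to veto the send
PostSendHook = Callable[[Message], None]  # tracking or follow-up logic


class MessageSender:
    def __init__(self) -> None:
        self._pre_send: list[PreSendHook] = []
        self._post_send: list[PostSendHook] = []
        self.delivered: list[Message] = []

    def add_pre_send_hook(self, hook: PreSendHook) -> None:
        self._pre_send.append(hook)

    def add_post_send_hook(self, hook: PostSendHook) -> None:
        self._post_send.append(hook)

    def send(self, message: Message) -> bool:
        # Every partner-owned rule must approve before delivery.
        if not all(hook(message) for hook in self._pre_send):
            return False
        self.delivered.append(message)
        # Post-send hooks run only for messages that were delivered.
        for hook in self._post_send:
            hook(message)
        return True


sender = MessageSender()
sender.add_pre_send_hook(lambda m: bool(m.body.strip()))  # reject empty bodies
sent_log: list[str] = []
sender.add_post_send_hook(lambda m: sent_log.append(m.conversation_id))

print(sender.send(Message("ana", "conv-1", "hello")))  # True
print(sender.send(Message("ana", "conv-1", "   ")))    # False
```

The point of the shape is that partner teams own the hooks while the core send path stays unaware of their rules.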

Engineering requirements

Ensure that the system is available 99.99% of the time regardless of maintenance or migration efforts.
Keep costs for storage and delivery as low as possible.
Be globally available across multiple data centers.
Provide low latency access regardless of the mailbox size.
Be modular, such that individual functions can iterate at their own pace.
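
For a sense of how tight the 99.99% target is, the arithmetic below converts an availability percentage into a downtime budget. The helper name is made up for this example; the numbers follow directly from the percentage.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Minutes of allowed unavailability over the given period."""
    unavailable_fraction = 1 - availability_percent / 100
    return period_days * 24 * 60 * unavailable_fraction


print(round(downtime_budget_minutes(99.99, 365), 2))  # 52.56
print(round(downtime_budget_minutes(99.99, 30), 2))   # 4.32
```

In other words, the whole system gets under an hour of unavailability per year, and that budget has to cover maintenance and migration work as well as incidents.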

[Figure: diagram showing the new centralized messaging system]

Our new system created a single centralized copy of each message in a conversation, regardless of the number of participants
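
The difference between the two captions can be shown with a toy model. This is only a sketch of the storage pattern the captions describe; the function names and in-memory dictionaries are assumptions and say nothing about the real schemas.

```python
from collections import defaultdict


def store_per_participant(mailboxes, participants, message):
    """Old pattern: every participant's mailbox gets its own copy."""
    for member in participants:
        mailboxes[member].append(message)


def store_centralized(conversations, conversation_id, message):
    """New pattern: one copy per message, keyed by conversation."""
    conversations[conversation_id].append(message)


participants = ["ana", "ben", "chen", "dee"]
old_mailboxes = defaultdict(list)
new_conversations = defaultdict(list)

for text in ["hi", "lunch?", "sure"]:
    store_per_participant(old_mailboxes, participants, text)
    store_centralized(new_conversations, "conv-1", text)

old_rows = sum(len(rows) for rows in old_mailboxes.values())
new_rows = sum(len(rows) for rows in new_conversations.values())
print(old_rows, new_rows)  # 12 3
```

In this model the old pattern stores participants times messages, while the centralized pattern stores one row per message no matter how large the group is.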

Key requirements
What might not be obvious from the above list is the critical role of custom business logic during the migration process. The existing system had a significant amount of code that was only required for a single use-case. We ended up creating nearly 60 separate converters for custom pieces of business logic that had been built into the old system—this took six months of reinterpreting highly specialized code.

The end result of all this custom business logic was that messaging became a tangled mess. The code was owned by the messaging team, but the logic and reasoning behind it were owned by separate teams around LinkedIn. The logic then became more difficult to change because another team was responsible for all decisions around a particular feature. Developers started to be overly careful and eventually afraid of making changes. It began to take months to make what should have been a simple change, such as adding a new use case.

From a more technical standpoint, this complexity ended up causing problems as we attempted to scale across multiple data centers. For example, the database contained a shard key that could not represent new types of participants in a thread. In addition, multiple places across the codebase had to handle older conversations with the same thread ID as a special case lookup. It also became very difficult to change even small things in the business logic, let alone something as critical as the primary key to a message or thread.

In response to these database scaling issues, an early important requirement for our new architecture was to split aspects of data persistence across multiple services so that each service could manage its own tables and scale independently. While beneficial in the long run, there were some tradeoffs. For example, it took more time and more developers to build these services, and it was no longer possible to handle most writes in a single database transaction. We had to embrace distributed databases and distributed development even more fully than we had before.

Our approach

The approach we took handled technical, organizational, and historical issues. Beyond solving immediate technical issues, we wanted to address long-term problems in the production environment. Software evolves quickly and is molded to fit an organization, and this was readily evident in the Messaging code. Our engineering culture is extremely collaborative, meaning that people don’t work in silos, and the growing needs of our business necessitated special case products. Ultimately, we had to find a way to separate basic messaging functions from special-case rules determined by partner teams.

After we examined our own system and a few outside messaging products, we found ourselves at a crossroads: Do we iterate on what we have or do we create something completely new?

The answer was not clear-cut. We concluded that an iterative approach would likely have left compromises in our system that would still be present today, so we chose to create something new, focusing on the following benefits:

Faster iteration on interfaces and stored data without production constraints
Ease in enforcement for fundamental architecture changes, such as dependency chains and primary keys, from an organizational standpoint

To kick off work on our new system, we had to set the high-level vision and a consistent set of best practices. A large system redesign requires developers to be able to make as many independent decisions as possible without compromising quality, and we needed to ensure that work was done in parallel as much as possible.

To handle all of this, we followed these distinct steps:

Write a high-level architecture document that laid out major entities and services
Find leads for each of the major services and divide the work among their teams
Decide on a set of design principles through a joint discussion with the entire team
Empower leads to work independently and in parallel

Design guidelines
An architecture will only last as long as the teammates maintain good design principles. Once the team strays, it’s easy for a future tech lead to come in and request a rewrite. Good architecture lasts significantly longer if the team working on it is aligned on philosophy. Going beyond the initial design, the team that owns the architecture in the long term must also be well-versed in these principles so that they can maintain a high standard. One of the common problems with large projects stems from short-term thinking when structuring the team. It’s critical to build the team for the long term if you want the resulting system to last—this is why we set up a joint design principles team meeting across stakeholders from the start (the results of which are included below).

Before outlining actual design principles, the messaging team was tasked with collectively answering two questions:

Why should we have design principles?
What does it mean to have design principles?

The resulting answers were then divided into four major categories.

1. Efficient research and discovery

“Understandable architecture”
“Easy to figure out what the service does.”
“Services should do one or two things well.”
“Don’t repeat things. Less is more.”
“Maintain consistency by following similar conventions.”
“Do what you say in regard to both systems and people.”

2. Distributed decision making

“Help us make decisions.”
“It’s the thing that avoids meetings!”

3. Faster development velocity

“Allow us to easily make changes.”
“Let us add more developers to the project.”
“Deploy quickly.”

4. Easier operations and maintenance

“Clearly defined service criteria and latency metrics.”
“Let us scale the system as traffic grows.”

In sketching out categories of improvement, it was clear that tackling these four buckets would greatly improve every stage of the development lifecycle.

First, before making any changes, a developer must figure out what’s going on in the system. Good design principles create systems that do not require specialized knowledge across a large number of different components.

Second, once equipped with this context, a developer must decide what changes to make. If there is only one architect to be consulted for every decision, that one person becomes overwhelmed and creates a bottleneck. Good design principles allow for independence across the entire development team.

Third, as a product becomes more successful, people will request more features. Good design principles allow more developers to implement features without compromising overall agility.

Finally, with a successful feature launch comes more traffic. If a system cannot scale for all users, developers will have to spend time fixing it so that it can handle the load of the business. Good design principles allow systems to scale without compromising performance.

Priorities
Design principles should rarely conflict. However, if there are contradictions, we had a simple set of priorities to guide the Messaging team. The priorities are listed in descending order:

Correctness. The system must function as advertised and not make mistakes. Mistakes include data corruption, data loss, delivery failures, and user interface issues.
Build it right. The architecture must scale gracefully with traffic, features, and developers.
Make it fast. The operations performed by the system must have low latency and high throughput when asynchronous.

Messaging design principles
Here is the set of design principles for Messaging, separated into four categories. These rules are set up to help achieve the goals outlined under “Motivation” and are subservient to the values in “Priorities” if the two were ever to conflict.

General

Single source of truth. There is only one source of truth for data, code, or knowledge. Any assumptions about how another system behaves fall under knowledge. Avoid designs that make implicit assumptions about how other systems work; rely only on the explicit contract.
Specialization. Every service, component, and function does one thing and does it well. It exists because nothing else provides the same functionality and that functionality can be described in one or two sentences.
Ownership. Every piece of code and every service has at least one clearly defined owner. Owners are the designated experts and have final say in what happens in their codebase.
Limited scope. Each service, class, or function should reference things in terms that it understands. It is impossible to control clients, so avoid coupling any client knowledge to what is being built. Never design a system that forces a service to understand more than it can control.
Accurate names. Name services, components, classes, and functions with a common vocabulary. This will help developers understand what something does at a glance. It also helps developers learn a new area of code more quickly. Name things for the underlying concepts they represent and do not hardcode things that may change.
Documentation is always outdated. Assume that documentation is out of date as soon as it is written. Prefer self-documenting code with concise and accurate names. Use documentation to convey the author's intent and the reason for why the code is written in a particular way. Convey knowledge that will be useful for the next developer when they try to understand the system that was built.
Automate. Do not waste time repeating manual tasks.
Murphy's Law. “Anything that can go wrong, will go wrong.” Assume everything will fail and design with failure in mind. This is especially true with network calls on mobile clients, but applies more generally to every level of engineering.

Interfaces

Concise interfaces. An interface should not have more than six parameters. More than six parameters means that the interface is trying to do too much all at once.
Orthogonal parameters. Parameters to interfaces should not have invalid combinations that allow for contradictions. The user should not be able to make mistakes with a given set of input parameters. Each parameter should have a job that is independent of the other parameters.
Immutability where possible. Treat objects as immutable where possible and strive to avoid in/out parameters.
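
As one possible reading of the three interface rules, the sketch below models a hypothetical conversation query. The type and field names are invented for this example. A single optional enum replaces what could have been two contradictory boolean flags, and the frozen dataclass makes the request immutable.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class ReadState(Enum):
    READ = "read"
    UNREAD = "unread"


@dataclass(frozen=True)
class ConversationQuery:
    """A small, immutable request with independent parameters."""
    member_id: int
    # One field instead of `only_read` and `only_unread` booleans, so a
    # contradictory combination cannot be expressed. None means "any".
    read_state: Optional[ReadState] = None
    limit: int = 20


query = ConversationQuery(member_id=42)
# Derive a modified copy rather than mutating the original in place.
unread_query = replace(query, read_state=ReadState.UNREAD)

print(query.read_state, unread_query.read_state)  # None ReadState.UNREAD
```

Attempting `query.limit = 5` raises `dataclasses.FrozenInstanceError`, which is how the immutability rule is enforced here rather than left to convention.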

Services

Isolation. Each service is the owner of its database and no other service accesses that database. This allows developers to quickly iterate without the need to coordinate between services. Downstream clients that depend on the database are inherently asynchronous and can be changed independently.
Security on day one. Build in security from the beginning.
Error reporting. All errors should be logged and reported. Logs should not contain personally identifiable information, but should otherwise be helpful and accurate.
Monitoring. If it is not monitored, it is probably broken.
Efficiency. Ensure that people do not wait for machines. Prefer consistent low latency for online systems and high throughput for offline or asynchronous systems.
Service level agreements. Services should adhere to the interface contract and a well-defined latency limit. Clients should not have to verify that a service is producing the correct output.
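
The error reporting rule asks for logs that are helpful but free of personally identifiable information. A minimal sketch using Python's standard `logging` module follows. The logger name and the email-only pattern are assumptions, and a real redaction policy would need to cover far more than email addresses.

```python
import logging
import re

# Assumption: emails stand in for PII here; real systems need broader rules.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace anything that looks like an email address."""
    return EMAIL_PATTERN.sub("[redacted-email]", text)


class RedactingFilter(logging.Filter):
    """Scrub the fully formatted message before any handler sees it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # arguments are already merged into msg
        return True       # keep the record; only its content changes


logger = logging.getLogger("messaging.delivery")  # hypothetical logger name
logger.addFilter(RedactingFilter())
logger.error("delivery failed for %s in conversation %s", "ana@example.com", "conv-1")
```

The logged line keeps the conversation identifier, which is useful for debugging, while the address is replaced with a placeholder.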

Coding

No hidden side effects. Do not write functions that contain hidden side effects.
No long functions. Functions longer than 50 lines are probably trying to do too much. Refactor them into smaller pieces that have well-defined scopes.
Localized knowledge. Group related functions in classes and related business logic in services. This makes it easier for developers to discover and diagnose issues.
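
To illustrate the first coding rule, compare a function that quietly mutates its argument with two that keep reading and changing separate. The conversation dictionary and function names are hypothetical.

```python
def mark_read_hidden(conversation: dict) -> int:
    """Anti-pattern: looks like a read, but also resets the unread count."""
    count = conversation["unread"]
    conversation["unread"] = 0  # hidden side effect on the caller's data
    return count


def unread_count(conversation: dict) -> int:
    """Pure read: never modifies its input."""
    return conversation["unread"]


def with_all_read(conversation: dict) -> dict:
    """Explicit change: returns a new dict and leaves the input untouched."""
    return {**conversation, "unread": 0}


original = {"id": "conv-1", "unread": 3}
updated = with_all_read(original)
print(unread_count(original), unread_count(updated))  # 3 0
```

Each of the last two functions can be described in one sentence, which also fits the specialization principle listed earlier.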

Lessons learned

Every large-scale rebuild initiative will have bumps along the way even with thorough planning. We ended up making several pivots that fell into the following categories:

Migration strategy. Due to a data migration catch-22, we had to pivot our initial strategy to maintain and preserve the integrity of our member data. It was not possible to migrate only new messages because they depended on and modified thread-level information which was not yet certified as correct.
Technical choices. We realized that we needed a way to guarantee asynchronous processing beyond just a fire-and-forget approach. Some long running operations needed to always complete successfully to provide consistency across different views of the data.
Team structure. Even though the traditional manager hierarchy did not change, the engineering ownership shifted. We pulled in engineers from our partner teams in a joint push to accomplish this task and ownership shifted from a manager team structure to one more based on individual engineering knowledge.
Project organization. As the project evolved, we moved from specific service owners to larger technical areas and finally to project tracks with individual leads. We were able to maintain a high craftsmanship bar throughout this process due to the design principles mentioned earlier. One note of which we are particularly proud: We saw 24 engineers become leads for each of their tracks and execute efficiently for six months.
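
The "Technical choices" lesson contrasts fire-and-forget with processing that must eventually complete. The sketch below shows one generic way to express that idea: a task stays queued until it succeeds or exhausts its attempts. It is not the mechanism the team built. The class name and the in-memory queue are assumptions, and a production version would need durable storage.

```python
from collections import deque
from typing import Callable

Task = Callable[[], None]


class RetryingTaskQueue:
    """Re-queue failed tasks instead of dropping them."""

    def __init__(self, max_attempts: int = 5) -> None:
        self._pending: deque = deque()  # holds (task, attempts_so_far) pairs
        self.max_attempts = max_attempts
        self.dead_letters: list[Task] = []  # tasks that never succeeded

    def submit(self, task: Task) -> None:
        self._pending.append((task, 0))

    def drain(self) -> None:
        while self._pending:
            task, attempts = self._pending.popleft()
            try:
                task()
            except Exception:
                if attempts + 1 < self.max_attempts:
                    self._pending.append((task, attempts + 1))
                else:
                    # Surface the failure for follow-up rather than lose it.
                    self.dead_letters.append(task)


calls = {"count": 0}


def flaky_index_update() -> None:
    """Hypothetical long-running operation that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("downstream unavailable")


queue = RetryingTaskQueue()
queue.submit(flaky_index_update)
queue.drain()
print(calls["count"], len(queue.dead_letters))  # 3 0
```

Because a task may run more than once, operations submitted to a queue like this should be safe to repeat.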

If not for the groundwork we had set early on, it’s doubtful that we could have pulled this off and smoothly cut over to full production traffic. Our design principles and team culture allowed us to scale with the size of the project as well as pivot and adjust to the problems at hand without finding ourselves stuck on specific decisions or specific people.

Overall, we were able to apply technical and organizational lessons to create a foundation for the next messaging platform. At times, the road was neither easy nor obvious, but our design principles acted as our north star. Investing early on in our goals and team organization not only helped us push through the more challenging aspects of the project, but also paid dividends in the end. If there is a single takeaway from our journey, it is that the project succeeds when the entire team is aligned to follow best practices and empowered to make good decisions.
