Rebuilding messaging: How we built for extensibility
August 20, 2020
In the previous blog posts of our “Rebuilding messaging” series, we shared the process of how we designed the system from high-level product and engineering requirements, and how we bootstrapped the data. In this post, we’ll explore why we made extensibility a core aspect of our messaging platform, what that meant for our partner teams, and how we got it done.
Investing in platform extensibility is a significant undertaking. It adds complexity, and good design takes time spent analyzing and gathering input from the internal teams at LinkedIn that rely on the messaging platform. Therefore, we needed to evaluate the problems we intended to solve and how we could balance the needs of the platform with the needs of our internal stakeholders.
What we were solving for
- Preserving the quality of the member's inbox: Messaging on LinkedIn has always been a core function for members building their networks by having conversations with their connections. However, we also need to account for other use cases, such as recruiters contacting potential candidates or vendors reaching out to prospects. Therefore, we needed to implement business rules to make sure our members receive only the most relevant messages in their inbox.
- Protecting member privacy: LinkedIn has always held a core value of protecting members' data and privacy. We've implemented strict compliance rules for persisted data that ensure we purge members' data when they close their accounts. Furthermore, messages are considered highly confidential, so we implement additional security protocols for teams within LinkedIn that handle this data. This means that for partner teams to integrate with the messaging platform, we require a security review and additional access controls before enabling a plugin to access message data.
- Service ownership and flexible business logic: One of the major challenges with the old messaging platform was the ownership of the business logic, which was implemented directly in the same code base as the platform. When our partner teams at LinkedIn needed to implement a new use case, they would typically modify the platform code and database schema. Over the course of 17 years of service, these use cases became a tangled web of interactions, making it extremely difficult to maintain the platform or add new functionality.
Principles for the new platform
We started with a set of principles and then developed a framework that partner teams could leverage for new use cases without needing changes directly on the new messaging platform.
- No business logic in the platform: The first principle was that the platform should not contain any business logic. The new messaging platform only needs to know what is required for the storage and delivery of messages and conversations. As far as it is concerned, there are no special types of messages or participants. All are treated equally.
- Clean implementation for basic use cases: A plugin should need only minimal effort to support its use case. Requiring our partners to implement a lot of boilerplate code for each plugin would create an unnecessary burden, reducing the platform's overall value to our partners, our business, and our members.
- Safe execution: A critical factor of the design was to ensure that a failure in one plugin would affect neither the ability of other plugins to function correctly nor the platform's ability to deliver messages. We designed the system to isolate plugin failures, such as excess latency, plugin exceptions, or logic bugs, from the rest of the messaging platform.
Building extensibility into the platform
How a message is sent
To understand the design choices, it is necessary to first describe the data model of the message and the lifecycle of message creation. All services at LinkedIn are RESTful services and the messaging platform is no different. The messaging platform's interface covers the conversation, its participants, and the messages in that conversation. A client of the messaging platform is any service that calls the platform to send or receive messages. For example, LinkedIn.com and our Recruiting and Sales applications are all different clients of the messaging platform.
A message is sent either by starting a new conversation or by replying to an existing one. For new conversations, clients must first call the messaging platform to create a conversation, listing the participants in the request. Then, for either a new or existing conversation, clients call the platform to create a new message, specifying both the author and the message content in the request.
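The two-step flow above can be sketched with a small in-memory stand-in for the platform. This is an illustration only: the names `MessagingPlatform`, `create_conversation`, `create_message`, and `get_conversation` are hypothetical and do not reflect the platform's actual REST interface.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Message:
    author: str
    content: str


@dataclass
class Conversation:
    participants: List[str]
    messages: List[Message] = field(default_factory=list)


class MessagingPlatform:
    """Hypothetical in-memory stand-in; the real platform is a RESTful service."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._conversations: Dict[int, Conversation] = {}

    def create_conversation(self, participants: List[str]) -> int:
        # Step 1 (new conversations only): record the participants.
        conversation_id = next(self._ids)
        self._conversations[conversation_id] = Conversation(list(participants))
        return conversation_id

    def create_message(self, conversation_id: int, author: str, content: str) -> Message:
        # Step 2 (new or existing conversation): store the message.
        conversation = self._conversations[conversation_id]
        message = Message(author, content)
        conversation.messages.append(message)
        return message

    def get_conversation(self, conversation_id: int) -> Conversation:
        return self._conversations[conversation_id]


platform = MessagingPlatform()
cid = platform.create_conversation(["alice", "bob"])
platform.create_message(cid, author="alice", content="Hi Bob!")
# A reply reuses the existing conversation; no second create_conversation call.
platform.create_message(cid, author="bob", content="Hello Alice!")
```

Note that nothing in this sketch distinguishes one kind of participant or message from another, which mirrors the principle that the platform itself carries no business logic.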
The platform is only concerned with storage and delivery of the message. It doesn't know about any types of participants or rules concerning particular use cases. So, in order to allow our partners to implement the business logic for their use cases, we introduced the concept of "plugins." We define a plugin as a service that implements a pre-defined interface that the messaging platform specifies.
We identified the key points in the lifecycle of conversation and message creation, and the specific actions that a partner may need to take at each point, such as canceling the message or performing custom logic in their service as a side effect. We then defined an API for plugins to implement at each of these points in the lifecycle. Partner teams implement the specific plugin API for the lifecycle event they are interested in and then register it with the messaging platform. Once that is done, the partner team's service will receive the callbacks necessary for their use case (see diagram below).
Plugin metadata and callbacks API
In addition to lifecycle callbacks, partner teams can add their own metadata to be attached to a conversation or message entity. The messaging platform provides storage for this metadata, but never looks into the content or schema of the metadata. The key points in the conversation and message creation lifecycle are as follows: conversationPreCreate, conversationPostCreate, messagePreCreate, messagePostCreate, messagePreDeliver, messagePostDeliver.
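As a rough sketch of how registration against these lifecycle points might look, the following uses the six event names listed above. Everything else (`LifecycleEvent`, `PluginRegistry`, `register`, `registered_for`, and the shape of the callback) is a hypothetical illustration, not the platform's actual API.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class LifecycleEvent(Enum):
    # Event names taken from the lifecycle points listed above.
    CONVERSATION_PRE_CREATE = "conversationPreCreate"
    CONVERSATION_POST_CREATE = "conversationPostCreate"
    MESSAGE_PRE_CREATE = "messagePreCreate"
    MESSAGE_POST_CREATE = "messagePostCreate"
    MESSAGE_PRE_DELIVER = "messagePreDeliver"
    MESSAGE_POST_DELIVER = "messagePostDeliver"


# Assumption: a callback takes a request dict and returns a response dict.
Callback = Callable[[dict], dict]


class PluginRegistry:
    """Hypothetical registry mapping each lifecycle event to its plugins."""

    def __init__(self) -> None:
        self._callbacks: Dict[LifecycleEvent, List[Tuple[str, Callback]]] = {}

    def register(self, plugin_name: str, event: LifecycleEvent, callback: Callback) -> None:
        # A plugin registers only for the events it cares about.
        self._callbacks.setdefault(event, []).append((plugin_name, callback))

    def registered_for(self, event: LifecycleEvent) -> List[str]:
        return [name for name, _ in self._callbacks.get(event, [])]


registry = PluginRegistry()
registry.register(
    "invitation",
    LifecycleEvent.CONVERSATION_PRE_CREATE,
    lambda request: {"action": "allow"},
)
```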
For every creation of a conversation or message, the platform will call all of the registered plugins at each phase of the lifecycle. Since a plugin is usually only interested in its own use case, it will take action only when it detects its own metadata. Otherwise, it will acknowledge that it is taking no action. The API for each event in the lifecycle defines a contract of what the plugin can and cannot do at that point. For instance, during pre-create or pre-deliver, the API allows the plugin to cancel the action entirely or only for some participants. The key feature of the callbacks API is that the messaging platform controls what data a plugin receives and what data can be modified. All plugins are treated equally, and external teams never need to modify the platform for their use case.
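A minimal sketch of this contract for a pre-deliver callback might look like the following. The response type, the `blocklist` plugin, and its metadata layout are all invented for illustration; the sketch only assumes what the text states: plugins act on their own metadata, may cancel entirely or for some participants, and see only what the platform hands them.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class PreDeliverResponse:
    # Hypothetical response shape: cancel everything, or only for some recipients.
    cancel_all: bool = False
    cancel_for: FrozenSet[str] = frozenset()


NO_ACTION = PreDeliverResponse()


def blocklist_plugin(request: dict) -> PreDeliverResponse:
    """Hypothetical plugin: acts only when its own metadata key is present."""
    meta = request["metadata"].get("blocklist")
    if meta is None:
        return NO_ACTION  # acknowledge, but take no action
    return PreDeliverResponse(cancel_for=frozenset(meta["blocked_recipients"]))


def run_pre_deliver(
    plugins: List[Callable[[dict], PreDeliverResponse]],
    recipients: List[str],
    metadata: Dict[str, dict],
) -> List[str]:
    """Call every plugin and return the recipients that still get the message."""
    remaining = list(recipients)
    for plugin in plugins:
        # The platform decides what a plugin sees: here, a copy it cannot
        # use to mutate the platform's own state.
        request = {"recipients": tuple(remaining), "metadata": copy.deepcopy(metadata)}
        response = plugin(request)
        if response.cancel_all:
            return []
        remaining = [r for r in remaining if r not in response.cancel_for]
    return remaining
```

Because the plugin only returns a response object, the platform stays in charge of applying it, which is what lets it treat all plugins the same way.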
This approach gives the platform a powerful extension mechanism that lets simple use cases spin up quickly and safely. If a callback throws an error or takes too long to respond, the platform can either ignore the failure or fail only that message. In this way, the plugin team has ownership of, and responsibility for, the correct functioning of their use case, and the system ensures that no single plugin affects the platform's overall operation.
Example of a plugin use case
Invitations on LinkedIn are an interesting case of applying special logic during message creation. When a member invites another member to connect on LinkedIn, the inviter can type in a custom message to the invitee. This note is not immediately sent as a message, though, because we don't want to fill members' inboxes with unaccepted invitation requests. Instead, the invitation service keeps the message in its own database until the invitee either accepts the invitation or responds with a message. Since the core messaging business rules don't allow a message to a non-connection, the invitation service needs to implement a plugin to enable the new message creation in this scenario.
The Invitations team began this process by defining the "invitation" metadata, which references the invitation. The plugin implements the conversationPreCreate callback, and when it receives a request with the invitation metadata in the message, it checks that the invitation is still valid and that the message satisfies the other business rules. If these rules are verified, the plugin permits the conversation to be created.
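A simplified sketch of such a callback is shown below. It is not the Invitations team's implementation: the `Invitation` record, the in-memory `INVITATIONS` store, the metadata field `invitation_id`, the string return values, and the participant check standing in for "other business rules" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Invitation:
    inviter: str
    invitee: str
    valid: bool


# Hypothetical stand-in for the invitation service's own database.
INVITATIONS: Dict[str, Invitation] = {
    "inv-1": Invitation(inviter="alice", invitee="bob", valid=True),
    "inv-2": Invitation(inviter="alice", invitee="carol", valid=False),
}


def invitation_conversation_pre_create(request: dict) -> str:
    """Return "no_action", "allow", or "cancel" (hypothetical response values)."""
    meta = request["metadata"].get("invitation")
    if meta is None:
        # Not an invitation use case: leave the decision to other rules.
        return "no_action"

    invitation = INVITATIONS.get(meta["invitation_id"])
    if invitation is None or not invitation.valid:
        return "cancel"

    # Assumed example of an additional business rule: the conversation must be
    # exactly between the inviter and the invitee.
    if set(request["participants"]) != {invitation.inviter, invitation.invitee}:
        return "cancel"

    return "allow"
```

All of the invitation-specific knowledge lives in the plugin; the platform only needs to know how to call the callback and interpret its response.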
As you might expect, this description simplifies the actual business rules needed to cover all the edge cases for a feature like this. During the development of this plugin, there were several iterations on the rules, and we realized that we even needed some additional metadata. However, the benefit of separating the business logic from the platform was that the Invitations team was able to rapidly iterate on the implementation of their use case independently, without support from the messaging team.
How it was designed to scale
While designing extensibility for the new messaging platform, the team considered scaling both in terms of service capacity and development productivity. We wanted to be able to rapidly introduce new product capabilities for messaging, yet maintain a fast and reliable message delivery platform. It was critical in our view that new use cases don't impact the overall performance. To safeguard the platform performance, we instituted strict latency requirements on all plugins. If a callback takes too long, it is treated as a failed call. Depending on the configuration for the callback, this failure can be ignored or cause the message to fail. We also implemented processes that enable us to test new plugins, or changes to existing plugins, in a staging environment first, followed by a rollout to a limited audience in production.
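The latency rule and the per-callback failure configuration can be sketched as follows. This is an assumption-laden illustration using a thread pool and a deadline; the names `OnFailure` and `call_with_deadline`, the example plugins, and the timeout values are hypothetical, and the real platform calls plugins as remote services rather than local functions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from enum import Enum
from typing import Callable


class OnFailure(Enum):
    IGNORE = "ignore"              # carry on as if the plugin took no action
    FAIL_MESSAGE = "fail_message"  # fail only this message


_executor = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(
    callback: Callable[[dict], object],
    request: dict,
    timeout_s: float,
    on_failure: OnFailure,
) -> bool:
    """Return True if processing of the message may continue, False if it must fail."""
    future = _executor.submit(callback, request)
    try:
        future.result(timeout=timeout_s)
        return True
    except Exception:
        # A timeout and an exception raised by the plugin are handled alike:
        # both count as a failed call. The worker thread of a timed-out call
        # is not killed in this sketch; its result is simply discarded.
        return on_failure is OnFailure.IGNORE


def fast_plugin(request: dict) -> dict:
    return {"action": "none"}


def slow_plugin(request: dict) -> dict:
    time.sleep(1.0)  # exceeds the deadline used below
    return {"action": "none"}


def broken_plugin(request: dict) -> dict:
    raise RuntimeError("plugin bug")


# A slow or broken plugin fails at most the one message it was called for.
may_continue = call_with_deadline(slow_plugin, {}, timeout_s=0.2, on_failure=OnFailure.IGNORE)
```

Keeping the decision inside the platform-side wrapper means a misbehaving plugin never gets to choose how its own failure is handled.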
A key component of designing the API is defining the contract for every possible response from the plugin. For instance, we wanted to allow messagePreCreate plugins to modify the message metadata. The metadata is modeled as a map, where the key is the plugin and the value is essentially a JSON tree whose schema is known only to the plugin. We had to decide what it means when a plugin returns modified metadata in a map that doesn't have all of the original keys. Should we delete the missing keys? Should we throw an error or ignore the plugin results?
In our first pass at designing the API, we decided that missing keys should indicate that the plugin wanted to delete that metadata from the message. We thought this made sense because we wanted to allow deletes as well as adds and updates. We learned pretty quickly that this approach burdens plugin developers with unnecessary boilerplate, because every plugin always has to return a copy of all the metadata that was passed in, and that it increased the chances that a plugin could impact the platform's ability to deliver messages. We decided to redefine the contract so that plugins can only add or update metadata; if a plugin doesn't need to update any data, it can simply leave that data out of the response. This design created a more easily understandable system that reduced the chance of inadvertent mistakes.
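The difference between the two contracts can be illustrated with a short sketch. The function names and sample metadata are hypothetical, and the sketch assumes updates replace a key's whole value rather than merging inside the JSON tree.

```python
from typing import Any, Dict

# Metadata map: key is the plugin, value is a JSON-like tree owned by that plugin.
Metadata = Dict[str, Any]


def apply_first_pass(original: Metadata, returned: Metadata) -> Metadata:
    """First-pass contract: the response replaces everything, so missing keys are deleted."""
    return dict(returned)


def apply_add_or_update(original: Metadata, returned: Metadata) -> Metadata:
    """Revised contract: returned keys are added or updated; missing keys are kept."""
    merged = dict(original)
    merged.update(returned)
    return merged


original = {
    "invitation": {"invitation_id": "inv-1"},
    "other_plugin": {"flag": True},
}
# A plugin that only cares about its own key returns just that key.
returned = {"invitation": {"invitation_id": "inv-1", "checked": True}}

first_pass_result = apply_first_pass(original, returned)
revised_result = apply_add_or_update(original, returned)
```

Under the first-pass contract, the plugin above would have silently deleted `other_plugin`'s metadata simply by not echoing it back; under the revised contract, an empty response is a safe no-op.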
The choice to implement extensibility in a platform service is a commitment. It first requires a detailed understanding of the use cases that need to be supported, and then requires thinking through not only how the system will work as intended, but also how developers will understand and interact with your system. Establishing some principles at the outset will help guide the design and hopefully minimize the need to do major refactoring later. If you do choose to prioritize extensibility, it is always a good idea to incorporate both a simple use case and a more complicated use case for the pilot. While the simple use case will help reveal whether any of your choices are creating a fragile or incomprehensible system, you will learn a lot about your choices by trying to build out the more complicated use case.