Search Federation Architecture at LinkedIn

Yi Shen

Software Engineer at LinkedIn

March 14, 2018

Co-authors: Yi Shen, Claire Liu, and Ali Mohamed

Introduction: A brief history of Search Federation at LinkedIn

Almost every part of LinkedIn contains data that needs to be discoverable by our members or customers. Use cases range from a member looking up a news article posted by someone in their network to a recruiter looking for candidates on the platform. One of the primary mechanisms for discovering this content is search.

Accordingly, at LinkedIn, we have many different search engines for a variety of use cases, such as People Search (members) and Job Search (posted job opportunities). The Search Federation team’s mission is to help our members and customers find resources by searching across these engines. We call these use case-focused engines “search backends.”

We also provide a family of mid-tier services designed to answer search and search typeahead queries. These services typically follow the federation pattern, in which the user’s query is first expanded through query understanding, spell checking, and other techniques. This expansion is then used to write a query specific to each search backend. Finally, the documents returned from the backends are combined in the way that is most useful to the user, through a process called “blending.”
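To make the pattern concrete, here is a minimal Python sketch of the expand, fan out, and blend steps. It is an illustration under stated assumptions, not our production code: the names `Doc`, `expand_query`, `federate`, the stub backends, the per-backend weights, and the `name:query` rewrite are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Doc:
    backend: str  # which search backend returned the document
    doc_id: str
    score: float  # backend-local relevance score (assumed comparable once weighted)


# A backend is modeled as a function from a backend-specific query to documents.
Backend = Callable[[str], List[Doc]]


def expand_query(raw_query: str) -> str:
    """Stand-in for query understanding and spell checking: normalizes only."""
    return " ".join(raw_query.lower().split())


def federate(raw_query: str, backends: Dict[str, Backend],
             weights: Dict[str, float], limit: int = 10) -> List[Doc]:
    expanded = expand_query(raw_query)
    results: List[Doc] = []
    for name, search in backends.items():
        # Rewrite the expanded query for this backend (hypothetical prefix syntax).
        results.extend(search(f"{name}:{expanded}"))
    # Blend: order all documents by weighted score, highest first.
    results.sort(key=lambda d: d.score * weights.get(d.backend, 1.0), reverse=True)
    return results[:limit]


# Stub backends returning fixed documents, for illustration only.
def people_backend(query: str) -> List[Doc]:
    return [Doc("people", "p1", 0.9), Doc("people", "p2", 0.4)]


def jobs_backend(query: str) -> List[Doc]:
    return [Doc("jobs", "j1", 0.8)]


blended = federate(
    "  Machine   Learning ",
    backends={"people": people_backend, "jobs": jobs_backend},
    weights={"people": 1.0, "jobs": 1.5},
)
```

In this toy setup the blending weights stand in for personalization: a member who is more likely to be looking for jobs would get a higher weight for the jobs backend, which moves job results ahead of people results for the same query.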

Collectively, this process makes up Search Federation at LinkedIn. Search Federation provides us with a way to personalize ambiguous queries. For instance, when a user searches for "machine learning" on LinkedIn, they could mean to search for people with machine learning skills, jobs requiring machine learning skills, or content about the topic.

Search Federation mid-tier building blocks
Historically, our Search Federation architecture has consisted of three mid-tier services.

The federated-search-rest service was created in 2012 to provide a Rest.li API for the Search endpoint and served as a Search Federation mid-tier for the blended search experience. Additionally, federated-search-rest calls the search-decoration service to decorate the result with unindexed information.

The typeahead-rest service was created in Q1 2014 to provide a Rest.li API for the typeahead endpoint and to serve as a typeahead federation mid-tier for the blended typeahead experience.

The seas-federated-search service was created in Q2 2014 to be the new federator used by all external services to access search. During the Galene migrations, all calls from federated-search-rest to the legacy backends were moved over to seas-federated-search. In this blog post, we use "new federation mid-tier" to refer to the seas-federated-search service.

As with any service, product requirements have changed over the years, the number of callers has grown, and a growing set of use cases has left behind a good amount of legacy code. At the same time, developers modifying the Search Federation architecture for different verticals (110+ code contributors and 800+ commits in 2017, for instance) have increasingly demanded ease of use and rapid iteration.

In early 2015, the team identified three main challenges in our legacy federation architecture that were impeding our ability to iterate quickly and threatened search stability as our scale increased:

Leverage between search and typeahead - Similar federation logic in these closely related mid-tiers was written with different design patterns. This meant that adding a new feature always involved changing the stacks of multiple services and making separate deployments.
Code isolation - The legacy design was based on inheritance so that verticals serving different use cases were inevitably coupled. On top of that, it wasn’t easy to extend for new vertical use cases in a clean way, and complicated if-else clauses were added in many shared components.
Test and deployment - Low test coverage in 2015, combined with the fact that no Search Federation service supported daily deployment, made it difficult to reliably deploy changes without regressions.

Roadmap: Building a bridge to the future

To address the above problems, the Search Federation team began implementing a redesign of the federation service with the following strategies: services consolidation, code isolation, and improving our overall test coverage.

Services consolidation:

We merged the legacy search federation mid-tier and the legacy typeahead federation mid-tier into a new federation mid-tier to make it easier to share logic between search and typeahead.
Access to all backends now goes through this new federation mid-tier, the same place where query understanding, spell checking, and data fetching happen.

Code isolation:

We used a workflow framework to compose use-case-specific workflows, which provides code isolation among verticals. No if-else clauses are needed, and new use cases can be onboarded easily by reusing common workflows.
Modularize code per vertical use case.
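The idea of composing a workflow per use case, rather than branching on the vertical inside shared code, can be sketched as follows. This is a simplified Python illustration and not the SeaS framework itself: `compose`, the step functions, the `WORKFLOWS` registry, and the filter strings are hypothetical names chosen for the example.

```python
from typing import Callable, Dict

Context = Dict[str, object]
Step = Callable[[Context], Context]


def normalize_query(ctx: Context) -> Context:
    """Common step that any use case can reuse."""
    return {**ctx, "query": str(ctx["query"]).strip().lower()}


def add_people_filter(ctx: Context) -> Context:
    """Step owned by the (hypothetical) people search vertical."""
    return {**ctx, "filters": ["type:member"]}


def add_job_filter(ctx: Context) -> Context:
    """Step owned by the (hypothetical) job search vertical."""
    return {**ctx, "filters": ["type:job"]}


def compose(*steps: Step) -> Step:
    """Chain steps into one workflow; each step receives the previous output."""
    def workflow(ctx: Context) -> Context:
        for step in steps:
            ctx = step(ctx)
        return ctx
    return workflow


# Each use case gets its own workflow. Onboarding a new use case means adding
# an entry here, not adding an if-else branch to a shared component.
WORKFLOWS: Dict[str, Step] = {
    "people_search": compose(normalize_query, add_people_filter),
    "job_search": compose(normalize_query, add_job_filter),
}


def run(use_case: str, query: str) -> Context:
    return WORKFLOWS[use_case]({"query": query})
```

Calling `run("job_search", "  Data Engineer ")` executes only the job search steps, so a change to `add_people_filter` cannot affect that result.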

Overall test coverage improvement:

Add more unit tests to increase code coverage.
Add more integration tests to enhance PCx test.
Add dark canary hosts.

Additionally, this had to be accomplished while accommodating the search needs of a rapidly-growing global membership, which was dramatically increasing the overall amount and number of types of content that needed to be searchable on LinkedIn. These issues needed to be addressed without disrupting the ~700 million searches that LinkedIn handles every single day.

Progress so far

The Search Federation team kicked off the Search Federation Re-architecture Project by building the SeaS workflow framework. We then rolled out the migration in stages: individual workflow migration, typeahead federation consolidation, and Search Federation consolidation.

Workflow migration (2016)
Starting in Q4 2015 and working through Q2 2016, we successfully migrated all query processing to the new federation mid-tier and all 40+ search use cases to SeaS workflows. Access to all Galene backends was routed through this new federation mid-tier.

Search Federation architecture at the end of 2016

Typeahead federation consolidation
Through the remainder of 2016, we made significant progress on typeahead consolidation: we extracted typeahead-federator (the core typeahead fanout and blending logic) from the legacy typeahead federation mid-tier and integrated it into the new federation mid-tier. After that, typeahead-rest served solely as a proxy/adapter for API calls. The subsequent ramp of federated and blended typeahead requests to 100% was completed in Q1 2017 without any impact on relevance or operational metrics.

Search Federation architecture at Q1/2017

The typeahead consolidation helped reduce the QPS of the new federation mid-tier by half, from 16K to 8K. There were 0 GCNs caused by this migration, and no noticeable latency change.

Search Federation consolidation
Besides typeahead federation consolidation, another big missing piece of the re-architecture was to move search-federator (search fanout and blending logic) from the legacy search federation mid-tier to the new search federation mid-tier. There were a few challenges for this move, such as:

There were a large number of deprecated Federated Search Rest.li API parameters and half-ramped features owned by different verticals. To clean them up, collaboration and coordination across multiple teams were critical.
A good amount of undocumented legacy code was present in search-federator, which added to the difficulty of the migration. For example, to determine query intent and select which search verticals to fan out to, the legacy search federation had a built-in intent prediction system, Dreamweaver (created in 2013). Dreamweaver also depended on an older version of Lucene, which we were in the process of deprecating.
Partner teams were actively onboarding and testing new features during the migration, and those changes also had to be kept in sync in the new federation mid-tier.

Based on the understanding of the challenges and the experience gained from typeahead federation consolidation, the team allocated a third of its resources over the past three quarters to complete this significant re-architecture. Here are some highlights from this consolidation:

Documentation: We chalked out a migration plan and documented it in an internal design document that was shared with the rest of the engineering and product teams.
Lining up with the ReMix roadmap: ReMix is the next-gen, developer-friendly workflow framework that supports asynchronous and synchronous task execution. The operator is the minimum logical unit in ReMix; operators compose workflows and are executed by the ReMix engine at runtime. So that ReMix could integrate seamlessly with the current architecture and have a baseline with sufficient proof of performance and correctness, the Search Federation migration was implemented in both SeaS and ReMix versions, with both workflows sharing the same set of ReMix operators that fulfill the core functionality.
Leveraging the parity check framework: Correctness of results is a pain point for any migration. A parity check library was initiated by the Federation Infra team and the Relevance Infra Tools Engineering team. We embedded this framework in the migration to compare results from the legacy and migrated flows before we started A/B testing. It provides early detection of obvious issues before the system is exposed to end users.
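A parity check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the internal parity check library: `parity_report`, the `Flow` type, and the assumption that each flow returns an ordered list of document IDs are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# A flow maps a query to an ordered list of result IDs (an assumption of this sketch).
Flow = Callable[[str], List[str]]
Mismatch = Tuple[str, List[str], List[str]]


def parity_report(queries: Iterable[str], legacy: Flow,
                  migrated: Flow) -> Tuple[float, List[Mismatch]]:
    """Run both flows on every query; return the identical-result rate and the diffs."""
    mismatches: List[Mismatch] = []
    total = 0
    for query in queries:
        total += 1
        old, new = legacy(query), migrated(query)
        if old != new:  # order matters: a reordering counts as a difference
            mismatches.append((query, old, new))
    rate = 1.0 if total == 0 else (total - len(mismatches)) / total
    return rate, mismatches


# Toy flows: the migrated flow reorders results for one query only.
def legacy_flow(query: str) -> List[str]:
    return [f"{query}-1", f"{query}-2"]


def migrated_flow(query: str) -> List[str]:
    results = legacy_flow(query)
    return results[::-1] if query == "c" else results


rate, diffs = parity_report(["a", "b", "c", "d"], legacy_flow, migrated_flow)
```

The mismatch list is what makes the check useful in practice: each differing query can be inspected and attributed to an expected cause (such as an upgraded dependency) or flagged as a regression before any member sees the new flow.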

Search Federation architecture at Q4/2017

As we write this update in the middle of Q4 2017, we’ve made the following progress on consolidating our Search Federation architecture:

Completed all the code changes for the new architecture, which is ramped to 5% of members.
Passed correctness/parity test: 95% of the primary results are identical, and the other 5% difference is due to query tagging and spell-check platform upgrades.
No impacts on system stability: no GCNs occurred during the entire consolidation process.
No negative impacts on relevance or operational metrics were reported during this time.
Consolidating these systems alone led to a 3% improvement in P90 latency.

Future Work

Search Federation consolidation will continue ramping until it is completed in Q1/2018. After that, collapsing the Typeahead Rest.li endpoint and the Search Rest.li endpoint into the new search federation mid-tier will bring us to the state shown below.

Future Search Federation architecture

Beyond this, there is more to be done in the future to support the demand from verticals for more agility:

ReMix workflow migration: All SeaS workflows will be replaced by ReMix workflows, which makes it easier for Search Federation developers to develop, iterate, debug, test, maintain, and operate federation use cases.
Better vertical isolation: As a monolithic service, seas-federated-search hosts many vertical use cases. A misbehaving change in one vertical may block the production release for all the others, increase latency for other verticals, or even clog the whole search service. The Search Federation team is working on a better way to isolate vertical use cases to mitigate this problem.

Takeaways

The work we’ve done over the last two years to re-architect our Search Federation infrastructure has provided the following advantages for LinkedIn:

Common logic centralization and reusability: Both search and typeahead federation logic are gathered into the new federation mid-tier, which means sharable federation components (e.g., data fetchers, query understanding, and spell check) can be reused easily, increasing leverage and future scale.
Ease of use: This simplified architecture lowers the learning cost for customers who are looking to take advantage of our search federation stack. Additionally, they no longer need to modify/deploy multiple services in order to onboard a new feature.
Code isolation: Using the workflow composition mechanism in the new federation mid-tier improves the extensibility of customized workflows per vertical use case.
Rapid iteration: We have increased the test coverage to 87%, and most Search Federation services now support daily deployment. This allows for faster iteration and improves the overall long-term stability of the system.

Acknowledgements

Many thanks to the teams and individuals involved in the migration for their constant help. We’d like to call out the following teams: Federation Infrastructure, Flagship Search, Relevance Infra Tools Engineering, Search Relevance and Foundation, People Search Relevance, Job Search, Job Search Relevance, Content Search Relevance, and Search SRE.
