Hosted Search: LinkedIn Search as a managed service
December 8, 2022
Search functionality is a core part of most data-driven products and is used widely at LinkedIn. We have long provided a central platform for search; however, it was not fully managed in the sense that application teams needed to own and operate the corresponding resources. As data needs grow and an increasing number of products want to integrate search, we discovered a need for a fully managed, self-service platform to completely democratize search for all of our product teams. In this post, we will talk about Hosted Search, our new search solution that allows product developers to integrate search functionality with minimal onboarding and no maintenance or operational overhead, allowing them to instead focus on their products and what they care about most: creating value for our members.
Background: LinkedIn’s legacy search
LinkedIn’s legacy search solution is built on top of the SeaS (Search as a Service) and Galene libraries. The SeaS library predominantly contains the logic related to the lifecycle of services and their interactions across a search cluster, while the Galene library contains low-level inverted-index logic, notably the logic for retrieval and for managing live updates to the indexed data. A search use case set up using this solution is referred to as a SeaS vertical, and LinkedIn application teams have been leveraging it to set up search verticals since 2014.
To create a SeaS vertical, the application team sets up a SeaS-’xyz’ multiproduct (MP). At LinkedIn, a multiproduct refers to an independently releasable entity and consists of a set of deployables or libraries that are developed and released as a unit (read more here). The MP is first created through a template where platform artifacts, such as SeaS, Galene, etc., are wired in. Subsequently, the application owners customize indexing, retrieval, scoring, or ranking capabilities (based on the desired business requirements) in the form of various configurations and plugins that are added to the MP as custom artifacts. The custom artifacts are assembled with the platform artifacts into per-use-case deployable WARs (Web application ARchives) at build time. This is illustrated in Figure 1. Additionally, in SeaS verticals, the logic for data transformation (i.e., the transformation of offline and nearline source of truth (SOT) datasets before they are indexed) is also defined in the vertical MP.
Figure 1: A SeaS-’xyz’ multiproduct
Despite what the name SeaS suggests, the legacy SeaS solution is not a hosted/managed solution. Typically, a SeaS vertical is owned and operated by the corresponding application team and their SRE counterparts. From a site-up, operations, and monitoring point of view, a SeaS vertical is quite complex, as several stateful services exist along the stack with several stateful interactions among them. The fact that application-specific logic (plugins, data transformation, etc.) is tightly coupled with the platform code in a SeaS MP makes development, operations, and incident management difficult, as the boundaries between the two become indistinguishable at times. Additionally, the level of search domain expertise and effort required to set up and maintain or operate a SeaS vertical is rather high, and the associated cost is not justified for the application teams.
One infrastructure need that was left unaddressed for many years was support for global (i.e., across all partitions) search in Espresso tables (Espresso is LinkedIn's online, distributed, fault-tolerant document store that currently serves as the source of truth for many applications). Years ago, the idea of leveraging SeaS for this purpose was explored, but it was not deemed scalable from an effort and operations perspective, and therefore was not pursued.
Hosted-Search has been developed as a hosted, cloud-based search solution where setup, maintenance, and operations efforts are offloaded from the application teams. When using Hosted-Search, application teams develop their custom logic in the form of custom artifacts, which are dynamically pulled into search resources that are set up, maintained, and operated by the Hosted-Search engineering team. A clear boundary between data transformation and indexing is also introduced, which simplifies development, operations, and incident management. In the rest of this post, we will dig deeper into various facets of this solution.
The overall architecture of the Hosted-Search ecosystem is shown in Figure 2.
In Hosted-Search, the unit of service is called a tenant-index (TI) and Nuage (LinkedIn’s centralized UI for managing infrastructure and platform resources) is utilized as the customer-facing portal for onboarding and monitoring across all TIs.
The cloud of Hosted-Search resources comprises several HS-Clusters, each serving one or more TIs. For each TI, a corresponding remote endpoint is configured, where the application sends its queries and receives results. All aspects of the Hosted-Search ecosystem are controlled and orchestrated by a centralized HS-Controller that interacts with several other services across LinkedIn, including platform and cloud foundations as well as monitoring services. The HS-Controller manages the allocation of HS-Clusters in the Hosted-Search cloud and leverages Azkaban to trigger the creation of offline indexes and their deployment to the corresponding online instances. Furthermore, it coordinates data transformations (using frameworks like Apache Samza) on nearline SOTs (such as Apache Kafka) and live updates, and the transportation of the resulting events to search clusters.
Figure 2: Overall architecture of Hosted-Search
As also shown in Figure 2, a workflow management engine is used within the HS-Controller. For all aspects of deployments, tests/verifications, scaling, and operations, automated workflows are defined and are executed and managed by the engine.
For each Hosted-Search TI, similar to a SeaS vertical, application-specific plugins and configurations can be defined to customize indexing, retrieval, scoring, or ranking capabilities. The corresponding custom artifacts are committed to an MP referred to as the TI’s artifacts-MP, which does not include any deployable WARs. In fact, in Hosted-Search, only platform deployable WARs exist; per-use-case (i.e., per-tenant-index) WARs do not. TIs’ artifacts are dynamically fetched and bound by platform deployables at run time via layer cake (LC), a machinery that has been developed for this purpose. Since TIs’ artifacts-MPs do not include deployable WARs, unlike SeaS verticals, owners bear no responsibility for site-up, operations, and maintenance.
Layer cake machinery
Layer cake is an unconventional machinery that has been developed to dynamically fetch and bind TIs’ artifacts. In doing so, LC utilizes private, per-tenant class loaders. An overview of how layer cake works is depicted in Figure 3.
Figure 3: Overview of how layer cake works
In each TI’s artifacts-MP, the LC Dependency Resolver (LC-DR) is used, a plugin that creates a Resolved Dependency Specification (RDS) for the modules. When artifacts are published to LinkedIn’s artifacts repository, their corresponding RDS files are also published. In Hosted-Search’s LC-aware services, the Dependency Manager (DM) fetches the RDS files corresponding to the targeted TIs, and subsequently fetches the listed dependencies from the artifacts repository. For each TI, DM constructs a classpath with the dependencies of the TI, and then creates a private class loader for the TI. Each TI’s artifacts are only loadable by the corresponding private class loader and are not visible to those of the other TIs. The overall hybrid logic is rearranged and refactored so that the platform-centric logic and the tenant-centric logic are carved apart. The system class loader is used within the platform-centric logic and the private class loaders are used within the tenant-centric logic. With layer cake, multi-tenancy also becomes possible in Hosted-Search, where multiple TIs can be served on an HS-Cluster in an isolated fashion. Note that the isolation is only at the code/class level, and it is still possible for one misbehaved TI to negatively impact other cohosted TIs (for example, by hogging system resources). Extensive testing and validation, including a blue-green deployment strategy, ensures that such issues are caught before they impact production resources.
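To make the class-loading scheme concrete, here is a minimal Java sketch of per-tenant class loader isolation in the spirit of layer cake. The class and method names (`TenantClassLoaderRegistry`, `register`, `loadPlugin`) are illustrative, not LinkedIn's actual API, and RDS-based dependency resolution is abstracted away as a list of already-fetched jar paths.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: one private class loader per tenant-index,
// parented on the platform (system) class loader.
public class TenantClassLoaderRegistry {
    private final Map<String, URLClassLoader> loadersByTenant = new HashMap<>();

    // Build a private class loader for a tenant from its resolved artifact jars.
    // Tenant code can see platform classes via the parent, but tenants cannot
    // see each other's classes.
    public void register(String tenantId, List<Path> artifactJars) {
        URL[] urls = artifactJars.stream()
                .map(p -> {
                    try {
                        return p.toUri().toURL();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                })
                .toArray(URL[]::new);
        loadersByTenant.put(tenantId,
                new URLClassLoader("tenant-" + tenantId, urls,
                        ClassLoader.getSystemClassLoader()));
    }

    // Load a tenant plugin class only through that tenant's private loader.
    public Class<?> loadPlugin(String tenantId, String className)
            throws ClassNotFoundException {
        URLClassLoader loader = loadersByTenant.get(tenantId);
        if (loader == null) {
            throw new IllegalArgumentException("Unknown tenant: " + tenantId);
        }
        return Class.forName(className, true, loader);
    }
}
```

Because each tenant's loader delegates only to the platform class loader, tenant artifacts can reference platform classes but never each other's, which is the code-level isolation described above.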
Clear boundary of data transformation vs. indexing
In SeaS verticals, data transformations are materialized in tight coupling with the platform-side [indexing] logic. Data transformations are defined within the vertical MP as plugins to one of the deployables in the stack. In Hosted-Search, unlike SeaS verticals, a clear boundary between data transformation and indexing exists. This is illustrated in Figure 4. In association with each TI, a nearline pipeline and an offline workflow for data transformations may be defined. In the nearline world, a nearline pipeline may be used to transform events from nearline SOTs and/or streams carrying live updates. They are then consumed by HS core consumers (see Figure 4) and are subsequently indexed. The pipeline might utilize caches for the purpose of data joins. In addition, calls to external REST resources might be made from the pipeline. The nearline pipeline is co-owned, managed, and operated by the Hosted-Search team. In the offline world, an offline workflow may be used to transform and join data from offline SOTs to create a flat view of the data before it is indexed by HS core logic.
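As an illustration of the kind of logic that now lives on the data-transformation side of the boundary, the following hypothetical Java sketch flattens a nearline SOT event into an indexable document, using a pre-populated cache for a data join. All class and field names here are invented for the example; a real pipeline would run inside a stream-processing framework such as Apache Samza.

```java
import java.util.Map;

// Hypothetical nearline transformation step: flatten a raw source-of-truth
// event into an indexable document before it reaches the HS core consumers.
public class ProfileEventTransformer {
    // Cache used for data joins, so the transform does not call the SOT per event.
    private final Map<String, String> companyNameCache;

    public ProfileEventTransformer(Map<String, String> companyNameCache) {
        this.companyNameCache = companyNameCache;
    }

    // Transform a raw SOT event into a flat document ready for indexing.
    public Map<String, String> transform(Map<String, String> sotEvent) {
        String companyId = sotEvent.get("companyId");
        // Join against cached reference data.
        String companyName = companyNameCache.getOrDefault(companyId, "unknown");
        return Map.of(
                "memberId", sotEvent.get("memberId"),
                "headline", sotEvent.getOrDefault("headline", "").toLowerCase(),
                "companyName", companyName);
    }
}
```

Because this logic lives entirely in the transformation pipeline, an incident here can be diagnosed and fixed without touching the indexing side, which is exactly the simplification the boundary buys.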
Figure 4: Clear Boundary of Data-Transformation vs. Indexing in Hosted-Search
The clear boundary of data transformation versus indexing simplifies development, operations, and incident management in Hosted-Search when compared with legacy SeaS verticals.
As mentioned earlier, for all aspects of deployments, tests/verifications, scaling, and operations in Hosted-Search, automated workflows are defined and are executed and managed by the HS-Controller’s workflow engine. Figure 5 depicts a simplified view of the structure of one commonly used operational workflow, CLUSTER_BLUE_GREEN_DEPLOY. Note that the figure is meant to showcase the level of complexity that goes into such workflows.
Figure 5: Structure of CLUSTER_BLUE_GREEN_DEPLOY Workflow
A workflow comprises several processes that are executed towards achieving a certain operational objective. Certain groupings of processes are defined as process groups. In Figure 5, processes and process groups are shown in green and blue, respectively.
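The process/process-group structure described above can be sketched as a small composable data structure. This is an illustrative toy, not the actual HS-Controller workflow engine; all names are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy workflow: an ordered list of named processes, where a process group
// is itself a workflow nested as a single step.
public class Workflow {
    public interface Process {
        String name();
        boolean run();
    }

    private final List<Process> steps = new ArrayList<>();
    private final List<String> executed = new ArrayList<>();

    // Add a single process.
    public Workflow step(String name, Supplier<Boolean> body) {
        steps.add(new Process() {
            public String name() { return name; }
            public boolean run() { return body.get(); }
        });
        return this;
    }

    // Add a process group: it succeeds only if every member process succeeds.
    public Workflow group(String groupName, Workflow sub) {
        steps.add(new Process() {
            public String name() { return groupName; }
            public boolean run() { return sub.execute(); }
        });
        return this;
    }

    // Run steps in order; stop at the first failure so operators
    // (or compensating steps) can take over.
    public boolean execute() {
        for (Process p : steps) {
            executed.add(p.name());
            if (!p.run()) {
                return false;
            }
        }
        return true;
    }

    public List<String> executedSteps() { return executed; }
}
```

A blue-green style deployment could then be composed as a `deploy-green` step, a `verify-green` group of checks, and a `swap-traffic` step, mirroring the nesting shown in Figure 5.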
Using automated workflows for all aspects of deployments, tests/verifications, scaling, and operations significantly reduces the level of human engagement and supervision needed for these affairs. Furthermore, with automated workflows, more advanced and sophisticated operational methodologies can be developed and adopted, resulting in improved overall stability, availability, and health guarantees. Below are some such advanced methodologies in Hosted-Search:
Blue-Green deployments provide a high level of protection against regressions due to code changes or data holes over time.
Canape is a streamlined methodology for testing custom artifacts; it provides a low-cost way to detect regressions before the higher-cost Blue-Green deployments are attempted.
AI-Safeguarding allows efficient test and verification of AI models before they are ramped to active production resources.
Application teams can control and manage all affairs of their use cases (onboarding, ramp, data refresh and compliance, deployments, custom artifacts versioning, testing, AI model verifications and ramps, quotas, etc.) through a Nuage portal in a self-serve fashion.
Leveraged for Global Secondary Indexes (GSI) in Espresso
For many years, global secondary indexes (GSI) and search across all partitions were not supported in Espresso tables, and leveraging legacy SeaS to achieve this was deemed unscalable from an effort and operations perspective, as a SeaS vertical had to be set up for each Espresso table. With Hosted-Search’s streamlined onboardings and operations, this has become possible. Hosted-Search is now leveraged to provide GSI for Espresso tables. Hosted-Search builds indexes from data stored in an Espresso table, and GSI queries are routed to Hosted-Search from the Espresso router. In response, the keys of documents that match the query are returned to the router. The router then fetches the corresponding documents from the Espresso storage nodes (from all partitions) and returns them to the client. Figure 6 illustrates this.
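The two-step query path described above (keys from the search index, then full documents from the storage partitions) can be sketched in Java as follows. The interfaces, method names, and hash-based partitioning are assumptions made for illustration, not Espresso's actual internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative router for a global secondary index:
// step 1 asks the search index for matching keys,
// step 2 reads each full document from its home partition.
public class GsiRouter {
    public interface SearchIndex {
        List<String> matchingKeys(String query);
    }

    public interface StoragePartition {
        Map<String, String> get(String key);
    }

    private final SearchIndex index;
    private final List<StoragePartition> partitions;

    public GsiRouter(SearchIndex index, List<StoragePartition> partitions) {
        this.index = index;
        this.partitions = partitions;
    }

    // Pick the partition owning a key (hash-based here, for the sketch).
    private StoragePartition partitionFor(String key) {
        return partitions.get(Math.floorMod(key.hashCode(), partitions.size()));
    }

    public List<Map<String, String>> search(String query) {
        List<Map<String, String>> docs = new ArrayList<>();
        // Step 1: the index answers the global query with document keys only.
        for (String key : index.matchingKeys(query)) {
            // Step 2: fetch the authoritative document from storage.
            Map<String, String> doc = partitionFor(key).get(key);
            if (doc != null) {
                docs.add(doc);
            }
        }
        return docs;
    }
}
```

Returning only keys from the index keeps the search side small and makes the source-of-truth storage, not the index, the authority for document contents.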
Figure 6: Hosted-Search is leveraged for GSI in Espresso tables
Adoption so far - What’s ahead?
Based on the complexity of the search features used, Hosted-Search use cases are categorized into two groups: basic and complex. Basic use cases, for the most part, index data as-is, except for some basic indexing features that can be customized (for example, tokenization of text fields). Complex use cases, however, use a wide range of complex indexing and searching features (complex data transformations, AI ranking models, etc.).
In Hosted-Search’s first year of availability, we onboarded more use cases than the legacy SeaS solution accumulated during its entire lifetime (~40 verticals). At the time of this writing, Hosted-Search is serving 70 use cases and is going strong in making it easier for application teams to add search functionality and in providing enhanced capabilities to LinkedIn members. Although the majority of the onboarded use cases are categorized as basic so far, the number of complex use cases is growing fast. We are also in the process of migrating all legacy SeaS verticals onto Hosted-Search.
The high cost of setting up and maintaining SeaS verticals has always been a limiting factor for application teams. Hosted-Search allows application teams to integrate search functionality with minimal onboarding and no maintenance or operational overhead. With Hosted-Search, application teams can focus their attention and resources on the application side, iterating on search-side improvements with higher velocity. Additionally, with Hosted-Search’s automated operations, all aspects of data freshness, isolation, integrity, security, and compliance are addressed in a systematic fashion, with a lower chance of missing customer requirements. The end result is a significantly improved experience for LinkedIn members.
I am thankful to the many engineers within the Search Infrastructure team who contributed to Hosted-Search. I would like to thank Gaurav Maheshwari, Viral Shah, Paul Chesnais, Deepak Manoharan, and Bowen Zhou, who personally helped me. I would also like to thank Dave Kolm, Vibhaakar Sharma, and Brent Miller for their leadership and support.