JARVIS: Helping LinkedIn Navigate its Source Code
August 22, 2017
Codesearch is vital for any tech organization that operates at a large scale, and LinkedIn depends heavily on it. Engineers at LinkedIn use codesearch frequently to see how certain APIs are being used, how configurations of certain services look, how some of the classes/methods/schema/etc. are defined, and more. Codesearch helps our engineers navigate through source code without needing the code to be checked out on their local machine.
But besides the above capabilities, we wanted our codesearch tool to be able to do even more. We wanted to have a system that understands the LinkedIn codebase so that it can empower our engineers to find what they are looking for with fewer searches and in less time. We also wanted to give our engineers the same look and feel that they get from an Integrated Development Environment (IDE).
In summary, we were looking for an intelligent search system to manage LinkedIn’s source code which would:
Find matches for queries very quickly, helping our users to design narrower queries;
Have relevance logic that ranks important results higher;
Support a rich query language that could solve different use cases;
Resolve references if corresponding definitions were available in the corpus;
Support autocomplete to show matching entities to users (e.g., without them having to type complete text);
Be integratable with our IDE and Command Line Interface (CLI).
To meet these needs, we designed JARVIS, a search system based on client-service architecture, in 2016. JARVIS fits all of our requirements listed above and works with many clients, including web, IDE, and CLI. In this blog post, we’ll discuss how we have implemented our backend search service, some of the challenges we’ve faced, and how we have solved them.
Like any other search service, codesearch also needs to accomplish three important tasks: indexing, querying, and relevance. We used Galene as the platform to build JARVIS because it helps us manage search clusters and gives us control over indexing, retrieval, and relevance. Galene fits nicely with solving our three major tasks of indexing, querying, and relevance.
When we started building our services, we used to do metadata extraction on the same machine that crawled our repositories to fetch source code. Doing this was very slow and it was prohibiting us from adding any more complicated metadata extraction processes. This slowness was also hampering our ability to re-create the dump when we needed to rebuild the base index. To overcome this limitation, we moved the metadata extraction process to Hadoop. We were able to scale our metadata extraction for a lot of new use cases very easily. We parallelized the metadata extraction on Hadoop by splitting whole data into chunks and assigning one chunk to a mapper. This helped the whole base index build flow to be completed in less than 2.5 hours.
We do document analysis to essentially extract metadata like:
SuperClasses (both Interface and Class)
We resolve references (usages) for which definitions are available in our source code; resolved references are shown as a link. Users can click these links to go to the reference’s definition. This is why line numbers for definitions are very important—so that a user can be taken to exactly where a reference is defined. For example, the UI for code search in browser looks like this:
Clicking on the link FullWordSourceCodeAnalyzer in line 19 opens the corresponding file and goes directly to the line where it is defined. See the image below (line 14 is highlighted):
We do this metadata extraction to support search use cases like:
Finding files where a class/method is defined
Finding classes which extend/implement a class/interface
Finding files which use a class/method
Searches based on packages and imports
Having search support over metadata helps our users narrow down the results very quickly.
For creating links for references (class usage or function usage) that users can click on to go to the definitions, we must first know whether clicking on a link will land on a definition or not. We create links only for those references which we know are resolved. By resolved, we mean we have the source code that defines that reference and hence, if you click on the link, we can open that file. As a result, we don’t create such navigational links for references that are defined in some third-party library. Currently, we do reference resolution for Java or Java-like files. By “Java-like,” we mean instances where usage patterns are the same as in Java, e.g., Avro, Pegasus, and Java class usages in Spring files.
Reference resolution is a very complicated process, as it requires whole state of the world. We need the complete linkage graph for resolution. One could maintain a system which contains the metadata for definitions that can be queried. This type of system could be used at query time while serving results to users, but we would need to call this new system to check if any of the function/class usages in the file that is being returned could be resolved. Maintaining such a stateful system would be a nightmare and could also potentially impact the latencies. So, we decided to take the Hadoop route again. We have a Hadoop flow that works on the definitions in Java, Avro, Pegasus, and Scala files and on usages in Java, Avro, Pegasus, and Spring files. It does some map-reduce magic against these two data sets and produces the information about all resolved class-usages and function-usages in a file. Because a resolved reference packs all the information it needs to arrive at the definition, at runtime, a document/file in our index contains all the information that it needs for decoration.
Our Hadoop workflow for metadata extraction, reference resolution, and preparing the document that goes in the index is given below:
Creating an index
Indexing is the most important part of building a search service because it sets up the foundation of building a search system. For JARVIS, we created a Galene index that has a Lucene index at its core. Indexing necessarily means building a document of fields and marking a subset of fields as inverted index fields and, similarly, a subset of fields as forward index fields. Fields that are part of the inverted index field set are the ones against which a search can be performed, while forward index fields are used for relevance and decoration of the document that gets returned for a query. Fields in an inverted index can be thought of as an index for a term, containing all of the documents where that term occurs. We use this “index” to find matching documents for query terms. Forward index fields are kept as-is; each document that gets returned as a match for an input query contains value for forward index fields, which we use in scoring for ranking and for decorating in frontend.
Fields that are part of an inverted index get tokenized. The tokenization process involves creating terms out of a sequence of characters by using a suitable analyzer. A term does not necessarily mean a word, and based on the need, one can create multiple terms from one word as well. For example, a user should be able to:
Search matches for any keyword as substring.
Search matches for a keyword in a given case (as given in the query) inside content of the file.
Search matches for full word—basically if they want to avoid proper substring matches within a word.
Search matches for a phrase inside the content of a file; phrase matches should not span across multiple lines.
Omit a few characters (from beginning or from end or from in-between) from the search keyword if needed. Basically, *x OR x* OR x*y type of query, where x and y are a continuous sequence of non-whitespace characters.
To support searches for the keywords as described above, we wrote our own analyzers, which take the content of a field and produce tokens out of that. We have a total of three types of tokenizers in use currently:
Trigram Analyzer: Every continuous sequence of three non-whitespace characters is a token.
Unigram Analyzer: Every non-whitespace character is a token.
Substring Analyzer: Breaks the sentence at word boundaries (typically non-alphanumeric characters) and then produces all substrings as tokens.
Trigram and Unigram Analyzers also manipulate positional offset of tokens intelligently, which we leverage while searching a document from the index. This helps us do phrase query and proximity query.
Now let’s concentrate on how we go about our querying and relevance. Before we go deep into that, here is a diagram outlining the various pieces involved.
Querying is another important part of a search system. It’s very important for our users to be able to express their requirements by typing in the search box. We support various types of searches:
By default, we do a substring search;
But if a user has a requirement to find something as a whole word, they can use ^(start of a word) and $(end of a word);
We also support phrase searches, which require bookending the keywords with double quotes(“”);
We also support * queries, which can be used to search keywords by skipping a few characters in between. For example, hello*world means “find those files which contain hello followed by world, and there could be zero or more characters in between”.
We ensure that phrase matches happen in one line by manipulating token positions during indexing when a line changes.
For * queries (sometimes also referred to as “proximity queries”), we also ensure single line matches; again, this is possible because token position for the next token change by a predefined number, and while creating a proximity query, we ensure that our proximity is less than that number. Currently, for an x*y type of query, we are able to find matches with a skip of up to 20 characters in between. But this is a configurable number, and we can easily increase or decrease it if we see a need.
Boolean query syntax
We use Boolean query syntax, which supports OR, AND, and NOT as the basic operators. We also have support for context, which can be built by opening and closing a bracket. Few characters have an important meaning when building a Boolean query, but these include colon [:] and open/close bracket [(/)]. But a user might want to search for these special characters as well; to make such a query possible, we also support an escape character function. One can escape a character by using backslash [\]. A user can escape any character—it doesn’t necessarily need to be a special character. Our backend query-processing engine cleans up the query before forming a Galene query for retrieval; for this reason, escaping doesn’t do any harm.
We have exposed various constructs to give the user control, which they can use to restrict searches.
|filename:Analyzer.java||Search for a file whose name contains Analyzer.java|
|filename:Analyzer.java AND package:com.mycompany||Search for a file whose name contains substring Analyzer.java and package contains substring com.mycompany|
|filename:Analyzer.java AND (package:com.mycompany OR package:com.yourcompany)||This example shows how you can use parentheses to build context. This query can be read out as: search for files whose name contains Analyzer.java and package contains either com.mycompany or com.yourcompany|
|filename:^analyzer.java$||Search for a file whose name is analyzer.java|
|code:^analyzer\$||Search for a file whose content has a keyword starting with analyzer$|
There are many filters available, such as import, superclass, method, functionusage, path, code, case, etc.
Case is a special filter our users use to do case-sensitive search. This ensures that a given keyword is matching exactly in the same case as given by user.
A raw query means that the user has not provided a filter for a keyword. In this case, we try to match the provided keyword in at least one of the fields. One exception here is a multiword phrase query—these queries are supported only in the code/content of the file. When doing query processing, if we see there are such phrases given without any filter, we fill scope the search for that phrase in the code alone. A single-word phrase is treated as a standard word; hence, it will be searched in all scope and will be deemed as a match if the keyword is matching in at least one scope.
|Raw Query||Corresponding Boolean Expansion|
|Hello World||(Hello matching in at least one field) AND (World matching in at least one field)|
|Hello||Hello matching in at least one field|
|“Hello”||Same as above|
|“Hello World”||Search for phrase “Hello World” in code|
We support typeahead/autocomplete on all fields except for code and case. Typeahead helps the user select the keyword that they are looking for without having to type the entire thing. Our typeahead is context-aware—it takes care of the query entered so far by the user.
|filename:analyze||Typeahead results shown will be filenames starting with analyze|
|package:^com.mycompany$ AND filename:analyze||Typeahead results shown will be filenames starting with analyze and that are in package com.mycompany|
|package:^com.mycompany$ OR filename:analyze||Same as first query; typeahead results shown will be filenames starting with analyze|
Relevance is a very important piece for any search system, and our codesearch is no exception. It is very important to show files at the top that users are most likely to open. Relevance for us involves assigning a score to every retrieved document, ranking them based on this score, and then returning the top K results.
Score assignment is the most critical part for our relevance. We have multiple types of features, and the final score is a linear combination of scores for each feature.
Some of the high level features are:
Match Info Score
Query Interpretation Score
File Size Score
This is a very important component of our score. It helps us keep the final result as topical as we can. A match in each field incurs a boost, and the amount of boost a match receives depends on the importance of the field. We have different weights for different fields.
We also have a Hadoop flow to assign scores based on the number of inward and outward edges. We compute this score at both a file- and a project-level to determine its importance. For a Java file, we take the imports and construct the dependency graph from this file to all the import class files. Similarly, for a project, a dependency graph is constructed based on the dependency information in its build files. Hadoop flow uses PageRank implementation from Jung to compute these scores.
Above flow is run on these graphs in two ways. First, it is run with the edges in the dependency graph from a dependant node to its dependency node. The scores computed in this phase are called “authority scores,” where nodes having higher scores means that they are source files/libraries, etc. For example, a source file would be at a higher score compared to its test file and, similarly, a library would be at a higher score compared to its dependencies.
Next we invert the edges in the dependency graph, so that the edges are from dependency to its dependent node. The scores computed in this phase are called “hub scores.”
The intuition behind computing hub scores is that nodes (files/projects) where the integration happens (that is, they refer a lot of files/libraries) will have a higher score compared to others.
So, we end up with four different scores for each file:
Hub score for a file
Hub score for the project where this file resides
Authority score for a file
Authority score for the project where this file resides
We take a weighted sum of these scores, with a higher weight for authority scores, to compute the final importance score for a file.
Interpreted Query Score
We try to interpret the query, and if there is a match for query interpretation in the retrieved documents, we try to boost its rank by increasing its score.
File Size Score
This we use to demote files that are too big or too small.
Having a tool like JARVIS gives a big operational boost to an engineering organization. Today, code search is being used by Devs, SREs, and Ops alike—it has become the defacto way of finding code at LinkedIn.
JARVIS is able to scale horizontally with little friction, which helped us onboard various types of code repositories with ease. It also supports nearline ingestion, meaning that after the code is committed, it will be available for searching in nearline fashion.
Building JARVIS at LinkedIn gave us a taste of problems which are not commonly seen in other search systems. For example, tokenization for us is completely different and much more sophisticated than many other search systems, because the expectation from the query here is vastly different. It also gave us the opportunity to showcase how Galene can empower such a search system.
JARVIS is the result of collaborative efforts from Bangalore Search, Bangalore Tools, and the Bangalore Search SRE teams. We would like to thank Chandramouli Mahadevan, and Abhijit Belapurkar for their guidance and support; Manoj K. Sure, Shubham Agarwal, Manoj A. Bode, Akhil Thatipamula, Sachin Hosmani, and Mansi Gupta and all the interns for their contribution toward search backend and frontend; Prince Valluri and Naman Jain for building the data pipeline; Binish Rathnapalan and Gaurav Gupta for helping us do timely release; and Sanjay Singh for QA support, which helped us make our code more robust. Lastly, we would like to thank Galene team as well, this would not have been possible without their platform support.