Machine Learning

Dynamic Machine Translation in the LinkedIn Feed

Co-authors: Angelika Clayton and Bing Zhao


The need for economic opportunity is global, and that is represented by the fact that more than half of LinkedIn’s active members live outside of the U.S. Engagement across language barriers and borders comes with a certain set of challenges—one of which is providing a way for members to communicate in their native language. In fact, translation of member posts has been one of our most requested features, and now it's finally here.

Dynamic (immediate) translation in the feed has been a tiger team effort from the get-go: a team of passionate localization evangelists and hungry engineers took on the challenge of realizing an opportunity that relied heavily on collaboration across different teams. We began with a small prototype to prove and test the concept, and ramped it to a very small segment of our membership. Once the concept proved successful, we used that experience to develop a more scalable solution that incorporates more languages. There are three central components that we had to incorporate: language detection, machine translation (MT), and the feed experience.


Language detection and tagging

We separated the process of content language detection from the actual translation to improve the member experience with international content in the feed. Separating the detection step from translation allowed us to build a base for flexible, efficient dynamic translation, to expand support to more content types, and to generate data for use by the relevance and analytics teams.

Language detection is a near-real-time application that processes high volumes of member-generated content distributed across multiple Espresso stores. We needed access to every database change without impacting online queries, rather than consuming directly from the databases. For this reason, we chose Brooklin, used at LinkedIn as a change data capture service, to stream change events from Espresso. Our language detection application consumes the change stream, which contains an event for each write performed on the content databases.
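As a rough illustration, the sketch below shows a Samza task consuming such a change stream. The topic names and the layout of the change event (a map with "commentary" and "contentUrn" fields) are assumptions made for illustration, not LinkedIn's actual schema.

```java
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class ContentChangeCaptureTask implements StreamTask {

  // Hypothetical downstream topic carrying text that still needs a language tag.
  private static final SystemStream UNTAGGED_CONTENT =
      new SystemStream("kafka", "feed-content-untagged");

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Each envelope wraps one write (insert/update) captured from an Espresso content store.
    Map<String, Object> changeEvent = (Map<String, Object>) envelope.getMessage();
    Object text = changeEvent.get("commentary");   // assumed field holding the post text

    if (text != null) {
      // Re-emit just the text, keyed by the content URN, for the filtering and detection stages.
      collector.send(new OutgoingMessageEnvelope(
          UNTAGGED_CONTENT, changeEvent.get("contentUrn"), text));
    }
  }
}
```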


To improve language detection quality, the data extracted by Samza jobs goes through filtering and cleansing (for example, mentions and hashtags are excluded from the language detection process).
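A minimal version of that cleansing step might look like the following; the patterns are simplified stand-ins for the production filtering rules.

```java
import java.util.regex.Pattern;

/** Simplified cleanser: strip mentions and hashtags before language detection. */
public final class DetectionTextCleanser {

  private static final Pattern MENTION = Pattern.compile("@[\\w.\\-]+");
  private static final Pattern HASHTAG = Pattern.compile("#\\w+");

  private DetectionTextCleanser() {}

  public static String cleanse(String raw) {
    // Mentions and hashtags carry little language signal and can skew detection.
    String cleaned = MENTION.matcher(raw).replaceAll(" ");
    cleaned = HASHTAG.matcher(cleaned).replaceAll(" ");
    // Collapse the whitespace left behind by the removals.
    return cleaned.replaceAll("\\s+", " ").trim();
  }
}
```

For example, cleanse("Great week at #LinkedIn with @jane.doe!") would hand only "Great week at with !" to the detector.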

Filtered data is forwarded via the LinkedIn GaaP Service (Gateway-as-a-Service) to the Microsoft Text Analytics API, an Azure Cognitive Service that can detect up to 120 languages. The data is tagged with language detection results, i.e., locale ID and confidence score, and is available for processing by other applications. 
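For illustration, a direct call to the Text Analytics language detection endpoint (the v3.0 /languages route) looks roughly like the sketch below, using the Java 11 HttpClient. The resource host and subscription key are placeholders, and in our pipeline this request is routed through GaaP rather than issued directly against Azure.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LanguageDetectionClient {

  // Placeholder endpoint; the real traffic flows through the GaaP gateway.
  private static final String ENDPOINT =
      "https://<your-resource>.cognitiveservices.azure.com/text/analytics/v3.0/languages";

  private final HttpClient http = HttpClient.newHttpClient();
  private final String subscriptionKey;

  public LanguageDetectionClient(String subscriptionKey) {
    this.subscriptionKey = subscriptionKey;
  }

  public String detect(String contentId, String text) throws Exception {
    // Text Analytics accepts a batch of documents; we send a single one here.
    String body = String.format(
        "{\"documents\":[{\"id\":\"%s\",\"text\":\"%s\"}]}",
        contentId, text.replace("\"", "\\\""));

    HttpRequest request = HttpRequest.newBuilder(URI.create(ENDPOINT))
        .header("Ocp-Apim-Subscription-Key", subscriptionKey)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    // Each document in the response carries a detectedLanguage with an
    // ISO 639-1 code and a confidence score, which becomes the language tag.
    HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
    return response.body();
  }
}
```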

In the content language detection and tagging process, we utilize multiple open source frameworks, services, and tools originally developed by LinkedIn, such as Kafka, Samza, and Rest.li.

Feed experience

The initial small-scale prototype on short-form member posts involved the implementation of a “See Translation” button whenever the language of the post, detected through a separate network call to the Microsoft Translator API (another Azure Cognitive Service), did not match the member’s interface language. When clicked, the button would display the text translated into the member’s interface language. The prototype was a proof of concept for internal ramping and a very limited external ramp, as a learning and evaluation exercise. 

The prototype was very successful in that member feedback was positive both in terms of the value of the feature itself and of the quality of the translated content. The prototype also allowed us to identify several areas that needed to be improved before we ramped to all members and all feed content:

  • Locale detection: When the prototype was released, our service was making dual calls to Microsoft, one for language detection and one for translation, which was fine for a prototype, but too slow to scale the experience. It also meant that we did not retain the locale of unique content for statistical analysis.

  • Locale comparison: This logic did not exist in the prototype. Now, we take the inferred locale set asynchronously by language detection and compare it with the member's interface locale. We no longer need to request this from Microsoft, as we were doing for the prototype, which significantly reduces the number of calls made. We now only render the “See translation” button if those locales differ, which makes for a much more intuitive member experience (a simplified sketch of this check follows this list).

  • Other content types: The prototype only worked on original posts; the new model also renders the functionality on root shares, viral shares, and re-shares of organic updates.
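The comparison itself is straightforward. The sketch below shows one way it could look; it ignores detection confidence and the set of supported target languages, and comparing languages rather than full locales is an assumption made here (it avoids offering translation of en_US content to an en_GB member).

```java
import java.util.Locale;

/** Simplified render-time check for whether to show the "See translation" button. */
public final class SeeTranslationDecider {

  private SeeTranslationDecider() {}

  public static boolean shouldShowSeeTranslation(Locale inferredContentLocale,
                                                 Locale memberInterfaceLocale) {
    if (inferredContentLocale == null) {
      // Detection is asynchronous; with no tag yet, don't offer a translation.
      return false;
    }
    // Offer translation only when the content language differs from the viewer's UI language.
    return !inferredContentLocale.getLanguage()
        .equalsIgnoreCase(memberInterfaceLocale.getLanguage());
  }
}
```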

Our current design is split into two main flows: Translation Render and Translation Trigger. 

Translation Render flow:

[Diagram: Translation Render flow]

Translation Trigger flow:

[Diagram: Translation Trigger flow]

Polyglot-Online

The Polyglot-Online mid-tier service uses GaaP to safely send encrypted text snippets to the Translator Text API. An additional advantage of this framework is the ability to customize the translation models for a specific domain (like our feed) and to integrate logic for filtering translation outputs based on system confidence scores. The API supports more than 60 languages in any translation direction, all of which we can leverage once the source language locale of a piece of content has been detected. For this feed feature, we selectively translate source text into 24 target languages, matching the member interface locales supported by LinkedIn.
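For illustration, a direct Translator Text v3.0 request for a single snippet might look like the sketch below. In Polyglot-Online this traffic flows through GaaP with encryption, and the subscription key here is a placeholder.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TranslationClient {

  private static final String ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate";

  private final HttpClient http = HttpClient.newHttpClient();
  private final String subscriptionKey;

  public TranslationClient(String subscriptionKey) {
    this.subscriptionKey = subscriptionKey;
  }

  public String translate(String text, String sourceLocale, String targetLocale) throws Exception {
    // The source language is already known from the detection stage, so we pass it explicitly.
    String url = String.format("%s?api-version=3.0&from=%s&to=%s",
        ENDPOINT, sourceLocale, targetLocale);
    String body = String.format("[{\"Text\":\"%s\"}]", text.replace("\"", "\\\""));

    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .header("Ocp-Apim-Subscription-Key", subscriptionKey)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    // The response is a JSON array with one "translations" entry per requested target language.
    return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
  }
}
```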

This translation service also includes logic for protecting entities such as hashtags and name mentions from being distorted in translation, as well as integrated filters that block irrelevant or unprofessional content, and advertisements, from being translated on the LinkedIn platform. We also use an in-memory encrypted cache to reduce latency; with its lightweight maintenance and better cost-to-serve than centralized solutions, combined with the Java Play framework at LinkedIn, the service easily supported several thousand QPS during our prototype ramp.
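A sketch of such a per-instance cache is below. Caffeine is used purely as a stand-in for whatever cache implementation runs in production, the size and expiry settings are illustrative, and the encryption of cached entries described above is omitted.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.time.Duration;
import java.util.function.Function;

/** Sketch of an in-memory translation cache keyed by (text, source locale, target locale). */
public class TranslationCache {

  record CacheKey(String text, String sourceLocale, String targetLocale) {}

  private final Cache<CacheKey, String> cache = Caffeine.newBuilder()
      .maximumSize(100_000)                    // bound memory per service instance
      .expireAfterWrite(Duration.ofHours(6))   // translations are cheap to recompute later
      .build();

  public String getOrTranslate(String text, String source, String target,
                               Function<String, String> translator) {
    CacheKey key = new CacheKey(text, source, target);
    // Only call the Translator API on a cache miss.
    return cache.get(key, k -> translator.apply(text));
  }
}
```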

Acknowledgements 

Many thanks to Weizhi (Sam) Meng and Chang Liu for great coding and ownership, to David Snider for initiating the project, and to Annie Lin for writing GaaP scripts.

We also want to thank Ian Fox for his work with Azure, Pradeepta Dash for engineering support for the feed, Atul Purohit for guidance with the feed API implementation, Jeremy Kao for guidance with web, Samish Kolli for client-side support, Nathan Hibner for his many contributions in tweaking the model, and Chao Zhang for the expert answers about overall backend functionality.

Additionally, we want to recognize our helpful friends at Microsoft: Ashish Makadia, Assaf Israel, and Brian Smith from the Text Analytics team, and Chris Wendt and Arul Menezes from the Translator team.

Finally, a huge thank you to Francis Tsang and Tetyana Bruevich for their endless support.

We hope our members enjoy this new feature!