Combining LinkedIn’s Content Filtering and Microsoft Cognitive Services to Keep Inappropriate Content Off Our Sites
July 30, 2018
In a previous blog post, we described how a combination of human and machine intelligence keeps LinkedIn’s feed professionally relevant. In this post, we’ll go into more detail about how we have enabled two-way integration between Content Moderator, a Microsoft Cognitive Service running on Azure, and LinkedIn's Universal Content Filtering (UCF) platform to continue this critical work of keeping our feed relevant and appropriate.
Microsoft Content Moderator uses term-based filtering and machine-assisted detection to uncover potential profanity and text that may be deemed inappropriate depending on context. Content Moderator helps Microsoft’s internal services as well as Cognitive Services subscribers detect any inappropriate user-generated content on sites, apps and services.
Content Moderator’s machine-assisted scanning covers text, images, and videos. LinkedIn’s UCF platform uses LinkedIn’s in-house knowledge base and capabilities to classify images, text, and videos along similar categories, and it operates as the single source of truth for detecting inappropriate content on LinkedIn. Despite their similar purpose, the two tools have unique components that are extremely beneficial to us when combined. First, Content Moderator’s classifiers are trained on content not previously seen on the LinkedIn feed, which lets us increase the volume of inappropriate content we can successfully classify. In other words, by combining LinkedIn and Content Moderator classifiers, we hope to improve both recall (i.e., the share of poor-quality content that we catch) and precision (i.e., the share of flagged content that is actually poor quality, which keeps false positives low).
The integration also helps create a center of excellence for all classifiers across both the Content Moderator and UCF stacks. In the future, Cognitive Services customers will benefit from LinkedIn’s classifiers, which are trained on high-quality, human-labeled images produced by LinkedIn's human and machine solution.
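To make the two metrics concrete, here is a minimal sketch of how recall and precision can be computed for a set of flagged items. The item IDs and both sets of flags are made-up toy data, not LinkedIn measurements, and the function name is our own for illustration.

```python
def precision_recall(flagged, truly_inappropriate):
    """Return (precision, recall) for a collection of flagged item IDs."""
    flagged = set(flagged)
    truly_inappropriate = set(truly_inappropriate)
    true_positives = len(flagged & truly_inappropriate)
    # Precision: of everything we flagged, how much was really inappropriate?
    precision = true_positives / len(flagged) if flagged else 1.0
    # Recall: of everything really inappropriate, how much did we flag?
    recall = (
        true_positives / len(truly_inappropriate) if truly_inappropriate else 1.0
    )
    return precision, recall


# Toy data (hypothetical): items 1-4 are truly inappropriate.
bad = {1, 2, 3, 4}
single_flags = {1, 2, 9}        # flags from one classifier; item 9 is a false positive
combined_flags = {1, 2, 3, 9}   # flags after a second classifier also catches item 3

single_metrics = precision_recall(single_flags, bad)
combined_metrics = precision_recall(combined_flags, bad)
print(single_metrics, combined_metrics)
```

In this toy case the second classifier catches an item the first one missed without adding a false positive, so both numbers go up. Whether that holds in practice depends on how the two sets of results are combined.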
Finally, as we deepen our integrations, LinkedIn’s engineers will also have an opportunity to learn the Microsoft stack and actively contribute to it, resulting in the cross-pollination of ideas.
How It Works
It was apparent to us that these systems could benefit from each other. Therefore, we have created a bi-directional bridge between UCF and Content Moderator.
In the inbound direction, where Content Moderator’s capabilities flow into LinkedIn, an API call is made to Content Moderator every time an image is posted on LinkedIn’s home feed. We will soon add a similar call when text is posted on the home feed. To preserve member privacy, the request sent to Content Moderator contains only the publicly available URL of the image whose quality needs to be verified. Content Moderator then returns its classification results to the caller inside LinkedIn’s network for further processing.
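As a rough illustration of such a URL-only request, the sketch below builds (but does not send) a call shaped like the public Content Moderator image evaluation operation. The path, header, and body field names reflect our understanding of the public API and should be checked against the current reference. The endpoint host, key placeholder, and image URL are hypothetical, and this is not a description of LinkedIn's internal caller.

```python
import json
from urllib.parse import urlparse

# Assumed path of the public image "Evaluate" operation; verify before use.
EVALUATE_PATH = "/contentmoderator/moderate/v1.0/ProcessImage/Evaluate"


def build_evaluate_request(endpoint, subscription_key, image_url):
    """Build a request description that carries nothing but the image URL."""
    parsed = urlparse(image_url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("expected a publicly reachable https image URL")
    # The body holds only the URL: no member ID, post text, or other metadata.
    body = {"DataRepresentation": "URL", "Value": image_url}
    return {
        "method": "POST",
        "url": endpoint.rstrip("/") + EVALUATE_PATH,
        "headers": {
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": subscription_key,
        },
        "body": json.dumps(body),
    }


request = build_evaluate_request(
    "https://example.cognitiveservices.azure.com",   # hypothetical endpoint
    "<subscription-key>",                            # placeholder, not a real key
    "https://media.example.com/images/12345.jpg",    # hypothetical public URL
)
print(request["url"])
```

Keeping request construction separate from sending makes it easy to assert in a unit test that nothing beyond the URL leaves the network.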
In the outbound direction, LinkedIn has begun contributing image and text classifiers for integration into Content Moderator so that they can become new capabilities for internal and external customers as part of Microsoft Cognitive Services. As we learn more from the first wave of integrations, we will contribute additional classifiers and possibly open these classifiers up to non-LinkedIn clients.
In the process of this integration, we had to take care of several requirements.
First, of course, was to ensure that the two platforms could talk to each other. The following diagram shows how we have connected the Content Moderator service with the UCF service.
As shown in the diagram above, when content gets created on LinkedIn, we trigger classification within LinkedIn while at the same time triggering a classification request with Content Moderator via a different service that coordinates this system call. Once a response arrives from Content Moderator, the two answers are posted through our Kafka stream processing framework to result in the final classification of the content.
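The joining step can be sketched as follows. This is a simplified, in-memory stand-in for the stream processing job, not our production code: the class name, the source labels, and the rule "flag the content if either score reaches a threshold" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

SOURCES = {"ucf", "content_moderator"}  # hypothetical source labels


@dataclass
class VerdictJoiner:
    """Hold partial results per content ID until both classifiers have answered."""

    threshold: float = 0.5                      # assumed decision threshold
    pending: dict = field(default_factory=dict)

    def on_result(self, content_id, source, score):
        """Record one score; return the final verdict once both have arrived."""
        if source not in SOURCES:
            raise ValueError(f"unknown source: {source}")
        scores = self.pending.setdefault(content_id, {})
        scores[source] = score
        if scores.keys() >= SOURCES:
            del self.pending[content_id]
            return {
                "content_id": content_id,
                # Assumed rule: inappropriate if either classifier is confident.
                "inappropriate": max(scores.values()) >= self.threshold,
                "scores": scores,
            }
        return None  # still waiting for the other classifier


joiner = VerdictJoiner()
first = joiner.on_result("post-1", "ucf", 0.2)                 # None, still waiting
final = joiner.on_result("post-1", "content_moderator", 0.9)   # both present
print(first, final)
```

A real stream job would also need timeouts for responses that never arrive and durable state instead of a dictionary. Both are omitted here.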
The next step was to incorporate LinkedIn’s classifiers into the Content Moderator service. The LinkedIn Relevance team primarily uses TensorFlow to create deep learning models. However, Cognitive Services uses the Microsoft Cognitive Toolkit (CNTK) as its backbone for classification. The two libraries use different model specification formats, so a conversion from TensorFlow to CNTK was necessary. Fortunately, this turned out to be possible using the Model Management Deep Neural Network (MMdnn) library, which allows easy conversion between different deep learning libraries.
While converting these models from TensorFlow to CNTK, we needed to be confident that model behavior remained unchanged after conversion. In other words, we needed to ensure that the same image, whether fed to the TensorFlow-formatted model or the CNTK one, resulted in the same model scores and labels. To ensure this, we spent most of our time on testing.
When working with images (or any data, for that matter), you have to ensure that preprocessing steps such as image resizing and normalization are equivalent between the two model formats, in addition to the model behavior itself. Otherwise, minor differences in data preparation can lead to a significant score disparity between the two libraries. For this first integration, we tested manually by comparing the scores output by the TensorFlow codebase and the CNTK codebase and making sure that they matched.
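A score comparison of this kind could be automated along the following lines. This is a minimal sketch rather than the procedure we actually ran: the function name, the label names, the sample scores, and the absolute tolerance of 1e-4 are all assumptions, and the two dictionaries stand in for scores exported from each codebase for the same test images.

```python
import math


def find_score_mismatches(reference_scores, converted_scores, abs_tol=1e-4):
    """Compare per-image label scores from two model formats.

    Both arguments map image ID -> {label: score}. Returns a list of
    (image_id, reason) tuples for every image that disagrees.
    """
    mismatches = []
    for image_id, ref in reference_scores.items():
        conv = converted_scores.get(image_id)
        if not conv:
            mismatches.append((image_id, "missing from converted output"))
            continue
        # The top-scoring label must agree before scores are compared.
        ref_label = max(ref, key=ref.get)
        conv_label = max(conv, key=conv.get)
        if ref_label != conv_label:
            mismatches.append((image_id, f"label {ref_label} != {conv_label}"))
            continue
        for label, ref_score in ref.items():
            conv_score = conv.get(label, float("nan"))
            if not math.isclose(ref_score, conv_score, rel_tol=0.0, abs_tol=abs_tol):
                mismatches.append(
                    (image_id, f"{label}: {ref_score} vs {conv_score}")
                )
                break  # one report per image is enough
    return mismatches


# Hypothetical scores for the same two test images.
tensorflow_scores = {
    "img1": {"ok": 0.10, "inappropriate": 0.90},
    "img2": {"ok": 0.80, "inappropriate": 0.20},
}
cntk_scores = {
    "img1": {"ok": 0.10002, "inappropriate": 0.89998},  # within tolerance
    "img2": {"ok": 0.70, "inappropriate": 0.30},        # drifted after conversion
}

mismatches = find_score_mismatches(tensorflow_scores, cntk_scores)
print(mismatches)
```

A drift like the one on img2 is typically the kind of symptom that a preprocessing difference, such as a different resize method or normalization constant, would produce.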
The final step was to ensure that the serving-time classification scores matched those we saw while running our classifiers offline on the same test images. While we confirmed this through manual verification, we plan to make sure that the two libraries integrate more seamlessly by writing common libraries for preprocessing, unit testing, and benchmarking the behavior of classifiers.
As follow-up work, we will need to ensure that, as we scale up our integrations and onboard more classifiers onto Content Moderator, we continue to meet classification latency requirements for the LinkedIn site, keep member data safe, and provide seamless cross-platform diagnostics, tracking, and telemetry.
There are efforts already underway in several of these areas. We look forward to writing more blog posts in the months to come about the corresponding work.
The work described here is part of a multi-team collaborative effort involving several organizations across LinkedIn and Microsoft. We would like to acknowledge the following leaders and their teams, who power the UCF and Content Moderator services.
At LinkedIn, this includes Trust Engineering team members Chandramouli Mahadevan, Arumay Das, Srividya Krishnamurthy, and Sachin Kakkar; Trust Product team members Karthik Viswanathan, Anindita Gupta, and Madhu Gupta; Tim Jurka from the Feed AI team; David Max from Content Ingestion; and Vivek Nandakumar from Trust and Safety.