Enhancing Content Review: Proactively addressing threats with AutoML

Co-Authors: Shubham Agarwal and Rishi Gupta

At LinkedIn, we work every day to deliver a safe and trusted experience for our members and customers. A key part of this work is our content abuse defense systems, which function behind the scenes to help detect and remove policy-violating content from the platform, while helping surface professional, relevant, and meaningful content that adds value to the member experience. We rigorously invest in enhancing these defense systems, which are a foundational pillar to maintaining member trust and safety and delivering a positive member experience.

This blog post delves into the AutoML framework for LinkedIn’s content abuse detection platform and its role in improving and fortifying content moderation systems at LinkedIn. We use AutoML to continuously re-train our existing models, decreasing the time required from months to a matter of days, and to reduce the time needed to develop new baseline models. This enables us to take a proactive stance against emerging and adversarial threats.

Need for proactive and continual learning

Content moderation defenses need to be updated proactively and continually to stay ahead of the evolving threats on the platform. Here, we highlight three major factors that lead to the need for proactive detection and continual learning: 

  1. Data drift: This refers to the subtle yet consequential changes in the nature and characteristics of content posted on the LinkedIn platform over time. As conversations evolve, trends shift and the content stream experiences gradual transformations in the types of media and subjects of content posted. The capacity to identify these evolving content patterns is integral to maintaining an effective content moderation system.

  2. Global events: Global events affect people, the economy, and businesses across the world and have the power to swiftly reshape the digital discourse landscape online. These events trigger a surge in discussions, diverse viewpoints, and, often, misinformation. During such times, our content moderation systems need to adapt to meet the moment at the same speed as the conversations are unfolding.

  3. Adversarial threats: LinkedIn, as a hub for professional networking, faces a challenging reality: Some people engage in fraudulent and deceptive practices (e.g., creating fake profiles, impersonating other members, or running scams). To stay ahead of emerging threats, we need to be able to regularly update our models and systems.

What is AutoML?

Automated Machine Learning (AutoML) refers to a framework or platform that automates the entire machine learning process. It was originally conceived as a way to democratize machine learning for non-ML specialists. Beyond that initial use case, AutoML has matured into a valuable productivity tool for seasoned machine learning professionals.

AutoML for content abuse detection at LinkedIn

Figure 1: Demonstrating high-level steps of the AutoML framework

While building content moderation classifiers to detect policy-violating content, we observed that the most significant performance improvements often didn't arise from radically different algorithms or groundbreaking innovations. Instead, they stemmed from a series of repetitive yet critical steps: re-training on continuously expanding and recent data; learning from past mistakes (false positives and negatives); experimenting with different model architectures and hyperparameters; and fine-tuning our models. 

These steps do require an ML engineer’s expertise and experience, and an understanding of the nuances of each phase. But at the same time, we also realized that several aspects of model development could be standardized and automated, significantly reducing the need for extensive human intervention and improving developers’ productivity. Leveraging AutoML, we transformed what used to be a lengthy and intricate process into one which is both streamlined and efficient. AutoML uncovered huge potential to accelerate model development, boost accuracy, and reduce human involvement. After implementing AutoML, we saw the average time required for developing new baseline models and continuously re-training existing ones shrink from two months to less than a week. 

Advantages 

  1. Efficiency and throughput: AutoML takes on repetitive, redundant, and time-consuming tasks, such as data and feature processing, model selection, and hyperparameter tuning, freeing up valuable developer time as a result. In this evolving content moderation landscape, the ability to allocate our human resources to innovative and forward-thinking endeavors allows us to adapt and lead, ensuring our content moderation systems remain robust, effective, and timely.

  2. Standardization and consistency: AutoML pipelines enable standardization and consistency in model development and the deployment process, making the classifiers more reliable and reproducible. Moreover, the automation reduces potential human errors, such as misconfiguring parameters or inadvertently introducing biases that can impact performance or fairness of the models. The standardization of pipelines serves as a safeguard, ensuring that the benefits of automation are realized without compromising the integrity of ML applications.

  3. Exploration of multiple approaches: Content moderation classifier development often benefits from testing various model architectures, hyperparameters, and preprocessing techniques. AutoML systematically explores a multitude of possible solutions. It doesn't settle for the first solution, but rigorously experiments with different configurations, architectures, and hyperparameters, which leads to the discovery of optimal combinations that can significantly boost accuracy.

  4. Continual learning: AutoML facilitates continual learning against new and emerging threats. It enables models to stay updated by automatically retraining on incrementally larger and more recent data with a pre-defined periodicity. This adaptability is crucial in maintaining accuracy over time.
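To make the idea of retraining on "incrementally larger and more recent data with a pre-defined periodicity" concrete, here is a minimal sketch of an expanding-window schedule. It is an illustration only, not the framework's actual scheduler; the function name `retraining_windows`, its parameters, and the dates are all hypothetical.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def retraining_windows(
    first_data_day: date, first_cutoff: date, period_days: int, runs: int
) -> Iterator[Tuple[date, date]]:
    """Yield (start, cutoff) training windows.

    The start stays fixed while the cutoff advances by `period_days` per run,
    so each retraining run sees a larger and more recent dataset.
    """
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    for run in range(runs):
        yield first_data_day, first_cutoff + timedelta(days=run * period_days)


# Hypothetical example: weekly retraining, three runs.
windows = list(
    retraining_windows(date(2023, 1, 1), date(2023, 6, 1), period_days=7, runs=3)
)
for start, cutoff in windows:
    print(f"train on data from {start} through {cutoff}")
```

A sliding window (where the start also advances) would be an equally plausible choice when older data stops being representative; the post itself only describes the expanding variant.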

Data preparation and feature transformation

Figure 2: Data preprocessing steps and feature transformation steps automated by AutoML framework

While feature engineering used to be the province of ML engineers alone, the AutoML framework is adept at looking for common patterns and automating feature engineering as much as possible. In content moderation classifier development, there are Data ETL (Extract, Transform, Load) pipelines that collect data from various sources and store it in offline locations like a data lake or HDFS. The data undergoes extensive pre-processing, including noise reduction, dimensionality reduction, and feature engineering, to create a high-quality training dataset for classifier training. Most of these steps are automated using the AutoML framework, saving data scientists’ time and reducing the risk of errors. 
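The shape of such an automated preprocessing pipeline might look like the sketch below: a list of small, reusable steps applied in order to raw records. This is only an illustration under simple assumptions; the record fields (`text`, `label`), the step names, and the two derived features are hypothetical and are not taken from the actual framework.

```python
import re
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Step = Callable[[List[Record]], List[Record]]


def drop_noise(records: List[Record]) -> List[Record]:
    """Noise reduction: normalize whitespace, drop empty and duplicate texts."""
    seen = set()
    kept = []
    for record in records:
        text = " ".join(str(record.get("text", "")).split())
        if not text or text in seen:
            continue
        seen.add(text)
        kept.append({**record, "text": text})
    return kept


def add_features(records: List[Record]) -> List[Record]:
    """Feature transformation: derive simple numeric features from the text."""
    out = []
    for record in records:
        tokens = re.findall(r"\w+", record["text"].lower())
        out.append(
            {**record, "n_tokens": len(tokens), "n_urls": record["text"].count("http")}
        )
    return out


def run_pipeline(records: List[Record], steps: List[Step]) -> List[Record]:
    """Apply each step in order; every step takes and returns a list of records."""
    for step in steps:
        records = step(records)
    return records


# Hypothetical raw data: one near-duplicate and one empty record.
raw = [
    {"text": "Great  opportunity, click http://example.com", "label": 1},
    {"text": "Great opportunity, click http://example.com", "label": 1},
    {"text": "   ", "label": 0},
    {"text": "Congrats on the new role!", "label": 0},
]
dataset = run_pipeline(raw, [drop_noise, add_features])
```

Keeping every step behind the same signature is what makes the pipeline easy to standardize: new steps can be added to the list without changing the code that runs it.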

Model training and selection

Figure 3: This illustration summarizes how the AutoML framework automates the model training, development, and deployment steps 

The AutoML framework trains classifiers, experimenting with multiple model architectures in parallel. It performs a systematic search over a range of hyperparameters, optimization approaches, and models, saving data scientists the effort of trying different algorithms manually. 

The AutoML framework also offers an automated process for model evaluation and selection. By taking specified evaluation metrics as input, it systematically assesses multiple trained models, identifying the top-performing model for production deployment.
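A simplified sketch of the search-and-select loop described above follows. It assumes two toy rule-based "architectures" in place of real classifiers, and it runs the candidates sequentially rather than in parallel. All names here (`SEARCH_SPACE`, `length_rule`, `url_rule`, `search_and_select`) are hypothetical.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute binary precision, recall, and F1 from labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy stand-ins for model architectures: each builds a predictor from hyperparameters.
def length_rule(min_len: int) -> Callable[[Dict[str, int]], int]:
    return lambda x: int(x["n_tokens"] >= min_len)


def url_rule(min_urls: int) -> Callable[[Dict[str, int]], int]:
    return lambda x: int(x["n_urls"] >= min_urls)


# Hypothetical search space: architecture name -> (builder, hyperparameter grid).
SEARCH_SPACE = {
    "length_rule": (length_rule, {"min_len": [3, 6, 9]}),
    "url_rule": (url_rule, {"min_urls": [1, 2]}),
}


def search_and_select(space, x_val, y_val, metric: str) -> Tuple[float, str, Dict[str, Any]]:
    """Evaluate every (architecture, hyperparameters) pair; return the best by `metric`."""
    results = []
    for name, (build, grid) in space.items():
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            params = dict(zip(keys, values))
            model = build(**params)
            scores = precision_recall_f1(y_val, [model(x) for x in x_val])
            results.append((scores[metric], name, params))
    return max(results, key=lambda result: result[0])


x_val = [
    {"n_tokens": 6, "n_urls": 1},
    {"n_tokens": 5, "n_urls": 0},
    {"n_tokens": 12, "n_urls": 2},
    {"n_tokens": 10, "n_urls": 0},
]
y_val = [1, 0, 1, 0]
best = search_and_select(SEARCH_SPACE, x_val, y_val, metric="f1")
```

An exhaustive grid is used here only because it is the simplest search to read; the post does not say which search strategy the framework uses.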

The framework also automates several other critical steps leading up to deployment. These include the generation of comprehensive reports that evaluate the new model across various key ML metrics. The reports facilitate a detailed comparison with any existing baseline models, aiding in the decision-making process about whether to update the models in production. The framework is also capable of automatically setting the operating threshold, ensuring that the model operates optimally based on specific operational requirements in production settings, like running at a specific precision or recall.
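One plausible way to set an operating threshold automatically is to scan the candidate thresholds on a validation set and keep the one with the highest recall that still meets a target precision. The sketch below assumes exactly that rule; the function name `threshold_for_precision` and the sample scores are hypothetical, and the framework's actual procedure may differ.

```python
from typing import Optional, Sequence, Tuple


def threshold_for_precision(
    scores: Sequence[float], labels: Sequence[int], target_precision: float
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for the highest-recall threshold
    whose precision is at least `target_precision`, or None if none qualifies.

    An item is predicted positive when its score is >= threshold.
    """
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Evaluate only after every item sharing this score has been counted.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos if total_pos else 0.0
        if precision >= target_precision:
            # Later candidates have lower thresholds, so recall never decreases.
            best = (score, precision, recall)
    return best


# Hypothetical validation scores and ground-truth labels (1 = policy-violating).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 1, 0]
operating_point = threshold_for_precision(scores, labels, target_precision=0.75)
```

Precision is not monotonic in the threshold, which is why the loop checks every candidate rather than stopping at the first one that falls below the target. A recall-targeted variant would follow the same pattern with the roles of the two metrics swapped.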

Model deployment

The AutoML framework extends its automation capabilities to include the critical phase of model deployment. This process integrates the offline training pipelines with the content moderation production infrastructure. One of the key functions of the framework is publishing the newly trained model to the model artifactory, so that the production machines can access it seamlessly. Moreover, the AutoML framework helps ensure that the model adheres to the contract specifying the expected input and output parameters within the production system. This increases the likelihood that the deployed model functions as intended in the production environment, minimizing potential discrepancies or operational hiccups.
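A contract check of this kind can be pictured as a smoke test run before publishing: feed the model a sample input and confirm that both the input and the output match the declared fields and types. The sketch below is a hypothetical illustration; the contract layout, the field names, and the function `contract_violations` are assumptions, not the production contract format.

```python
from typing import Any, Callable, Dict, List

# Hypothetical contract: field name -> expected Python type.
CONTRACT = {
    "inputs": {"text": str, "n_urls": int},
    "outputs": {"score": float, "label": int},
}


def contract_violations(
    predict: Callable[[Dict[str, Any]], Dict[str, Any]],
    sample: Dict[str, Any],
    contract: Dict[str, Dict[str, type]],
) -> List[str]:
    """Return a list of human-readable contract violations (empty if compliant)."""
    problems = []
    for name, expected in contract["inputs"].items():
        if name not in sample:
            problems.append(f"missing input '{name}'")
        elif not isinstance(sample[name], expected):
            problems.append(f"input '{name}' is not {expected.__name__}")
    if problems:
        return problems  # Do not call the model with a malformed sample.

    output = predict(sample)
    for name, expected in contract["outputs"].items():
        if name not in output:
            problems.append(f"missing output '{name}'")
        elif not isinstance(output[name], expected):
            problems.append(f"output '{name}' is not {expected.__name__}")
    score = output.get("score")
    if isinstance(score, float) and not 0.0 <= score <= 1.0:
        problems.append("output 'score' is outside [0, 1]")
    return problems


# Toy models standing in for a newly trained classifier.
def good_model(features: Dict[str, Any]) -> Dict[str, Any]:
    return {"score": 0.91, "label": 1}


def bad_model(features: Dict[str, Any]) -> Dict[str, Any]:
    return {"score": "high"}


sample = {"text": "Click here to claim your prize", "n_urls": 1}
good_report = contract_violations(good_model, sample, CONTRACT)
bad_report = contract_violations(bad_model, sample, CONTRACT)
```

In this sketch a model would be published only when the report comes back empty; `bad_model` is rejected because its score has the wrong type and its label is missing.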

Challenges

  • Scale: LinkedIn’s AutoML system was built to be widely used across the engineering teams at LinkedIn. One of the major challenges in building such a framework was to streamline data ingestion pipelines to make them scalable across different content sources, such as text, multimedia, and ones that combine both. We also designed AutoML to support the addition of new algorithms to different components such as data-preprocessing, hyperparameter tuning, and metric computation.

  • Optimization: The framework needed to support quick experimentation, large datasets, and multiple modeling architectures in parallel. We put considerable effort into optimizing build time, runtime, and memory usage so that developers’ productivity would not be impacted while using the AutoML framework.

  • Usability: With all the components and parameters associated with AutoML, we wanted to ensure that the framework provided an optimal trade-off between ease-of-use and exposing the configuration parameters to developers with different kinds of expertise in ML.
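One common way to let teams add new algorithms to components such as preprocessing, hyperparameter tuning, and metric computation is a small plug-in registry. The sketch below shows that pattern under our own assumptions; the post does not describe how the framework implements extensibility, and `REGISTRY`, `register`, and `resolve` are hypothetical names.

```python
from typing import Callable, Dict, List

# Hypothetical registry: component type -> {algorithm name -> implementation}.
REGISTRY: Dict[str, Dict[str, Callable]] = {
    "preprocessing": {},
    "tuning": {},
    "metrics": {},
}


def register(component: str, name: str) -> Callable[[Callable], Callable]:
    """Decorator that adds a function to the registry under a component type."""
    if component not in REGISTRY:
        raise KeyError(f"unknown component '{component}'")

    def decorator(fn: Callable) -> Callable:
        if name in REGISTRY[component]:
            raise ValueError(f"'{name}' is already registered for '{component}'")
        REGISTRY[component][name] = fn
        return fn

    return decorator


def resolve(component: str, name: str) -> Callable:
    """Look up an implementation by component type and name, e.g. from a config file."""
    return REGISTRY[component][name]


@register("metrics", "accuracy")
def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


@register("preprocessing", "lowercase")
def lowercase(texts: List[str]) -> List[str]:
    return [text.lower() for text in texts]


# A pipeline configuration can now refer to algorithms by name.
metric_fn = resolve("metrics", "accuracy")
value = metric_fn([1, 0, 1], [1, 1, 1])
```

Referring to algorithms by name also helps with the usability trade-off noted above: most users pick from registered defaults, while advanced users register their own implementations.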

Future work

  • Speed and efficiency: Scaling AutoML to all the content moderation classifiers (including Multimodal and Multi-Task Learning based models) at LinkedIn is the primary objective of building the platform. By adopting this technology at a larger scale, we anticipate a substantial rise in the concurrent execution of modeling experiments, leading to a manifold increase in the demand for GPUs and other computational resources. We are working to improve the efficiency of the system so that it can scale with these growing requirements and minimize the turnaround time for workflow completion.

  • Generative AI: Generative AI has the potential to improve the quality of datasets (for instance, by reducing label noise) as well as to generate synthetic datasets for model training. We are exploring these types of solutions, which can eventually help improve the accuracy of our classifiers.

  • AI governance: Proper AI governance helps us design and deploy content moderation AI systems in a way that continues to be safe, fair, and transparent. To build additional trust in our content moderation defenses among our members and other stakeholders, we plan to integrate different fairness assessment solutions as part of the AutoML framework.

Acknowledgements

We would like to thank our colleagues Dhanraj Shetty, Bharat Jain, James Verbus, Suhit Sinha, Sumit Srivastava, Akshay Pandya, and Prateek Bansal, who worked on this project. The project demanded contributions from various other members of the team to make it a big success. Big thanks to Praveen Hegde, Shah Alam, Sakshi Verma, Tushar Deo, Abhishek Maiti, Shivansh Mundra, Abhijit KP, Nivedita Rufus, and Akshat Mathur for making significant contributions. 

Many thanks to Grace Tang, Daniel Olmedilla, and our management Vipin Gupta and Smit Marvaniya for their thought leadership; to Jidnya Shah for valuable inputs; and to our TPMs Shreya Mukhopadhyay and Abraham C for their support, guidance, and resources.