Using deep learning to detect abusive sequences of member activity
September 2, 2021
Co-authors: James Verbus and Beibei Wang
The Anti-Abuse AI Team at LinkedIn creates, deploys, and maintains models that detect and prevent many types of abuse, including the creation of fake accounts, member profile scraping, automated spam, and account takeovers. As we prevent abuse using machine learning, there are several challenges we can face:
Maximizing signal: traditional features engineered by humans may not fully leverage the available signal in rich member activity patterns
Adversarial behavior: attackers are quick to adapt and evolve to evade anti-abuse defenses
Many attack surfaces: there are many dynamically-changing, heterogeneous parts of the site to protect
To address these challenges, we have productionalized a deep learning model that operates directly on raw sequences of member activity, allowing us to scalably leverage more of the available signal hidden in the data and stop adversarial attacks more effectively. Our first production use case of this model was the detection of logged-in accounts scraping member profile data.
Scraping is not always bad. Search engines are expressly authorized to scrape in order to collect and index information throughout the internet. Scraping becomes nefarious when it is done without permission. Unauthorized scraping refers to the automated collection of data from LinkedIn without the permission of LinkedIn or our members. One strategy that unauthorized scrapers use to collect data is to automate the behavior of accounts logged in to LinkedIn. Logged-in scraping can be performed by real accounts (e.g., members using browser extensions to automate their behavior) or fake accounts (accounts that do not correspond to a real person and are created by bad actors to scale their ability to scrape data). To learn more about unauthorized scraping and the multi-faceted approach LinkedIn is taking to prevent it, see this blog post.
In this post, we first provide a technical overview of our activity sequence modeling technique. Then, we describe the above challenges in more detail and outline how activity sequence modeling provides a solution.
Activity sequence data
As a member visits LinkedIn, the member’s web browser makes many requests to LinkedIn’s servers; every request includes a path identifying the part of the site the member’s browser intends to access. In order to directly and scalably leverage the rich signal from member activity patterns, we created a standardized dataset capturing the sequence of member requests to LinkedIn.
Figure 1 below shows a visualization of profile scraping activity by a logged-in member. An example of a member activity sequence is shown in the bottom half of this figure. It corresponds to the first burst of profile views by this member. This mock example illustrates how member requests to LinkedIn are arranged in a sequence that includes information about the type of request, the order of requests, and the timing between requests. This sequence can be thought of as a “sentence” that describes the member’s activity on LinkedIn.
Figure 1. Mock visualization of a member activity sequence. The y-axis represents the distinct profiles viewed—each unique profile viewed is assigned a new identifier. The x-axis is time. This particular member viewed five bursts of approximately 20 profiles each, while occasionally revisiting the same profiles they had viewed previously (visible in the plot when the member returns to a previously visited distinct profile identifier). The member activity sequence is shown in the bottom part of the figure for the first burst of 25 profile views by this member.
We use an automated process to translate the specific path in each request a member makes into a standardized model vocabulary:
Standardize request paths: Translate each specific request path into a standardized token that indicates the type of request (e.g., profile view, search, login). For example, the path linkedin.com/in/jamesverbus/ corresponds to a profile view. This is done in an automated way that does not require human curation.
Encode as integers based upon frequency: Map the standardized request paths to integers based upon the frequency of that request path across all members. This gives the model information about how common a given type of request is. The resulting integer array is the activity sequence that is fed into the deep learning algorithm.
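The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: the post says the real standardization is automated and needs no human curation, so the hand-written PATTERNS table, the token names, the "other" fallback, and the choice to reserve 0 for padding or unknown tokens are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical path rules standing in for the automated standardization step.
PATTERNS = [
    (re.compile(r"^/in/[^/]+/?$"), "profile_view"),
    (re.compile(r"^/search/"), "search"),
    (re.compile(r"^/login/?$"), "login"),
]


def standardize(path):
    """Map a specific request path to a standardized token."""
    for pattern, token in PATTERNS:
        if pattern.search(path):
            return token
    return "other"


def build_vocabulary(token_sequences):
    """Assign integer ids by descending frequency across all members.

    Id 1 is the most common token; 0 is left free for padding/unknown
    (an assumption of this sketch). Ties are broken alphabetically.
    """
    counts = Counter(tok for seq in token_sequences for tok in seq)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: rank for rank, (tok, _) in enumerate(ranked, start=1)}


def encode(tokens, vocab, unknown_id=0):
    """Turn a list of standardized tokens into the integer activity sequence."""
    return [vocab.get(tok, unknown_id) for tok in tokens]


# Made-up request paths for two members (illustration only).
member_a = ["/in/jamesverbus/", "/search/results/?q=ml", "/in/example-member/", "/login"]
member_b = ["/in/a/", "/in/b/", "/in/c/"]

tokens_a = [standardize(p) for p in member_a]
tokens_b = [standardize(p) for p in member_b]
vocab = build_vocabulary([tokens_a, tokens_b])
encoded_a = encode(tokens_a, vocab)
```

Because the vocabulary is ranked by frequency, a small integer always means a common request type, which is the property the color coding in Figures 2 and 3 relies on.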
The resulting encoded activity sequence can be visualized to great effect as shown in Figure 2 and Figure 3. Figure 2 shows the first 200 requests (20 requests per row) for a member that was not using abusive automation. Figure 3 similarly shows the first 200 requests for a member that was using abusive automation.
Figure 2. Encoded activity sequence showing the first 200 requests (20 requests per row) made by a member that was not using abusive automation. The figure is read left to right, top to bottom. The first request is on the top left and the 200th request is on the bottom right. The color coding indicates how common each request path was across all members.
Figure 3. Encoded activity sequence showing the first 200 requests (20 requests per row) made by a member that was using automation to scrape profile data. The figure is read left to right, top to bottom. The first request is on the top left and the 200th request is on the bottom right. The color coding indicates how common each request path was across all members.
The difference between the normal member (Figure 2) and the abusive scraper (Figure 3) is easily visible to the human eye. The scraper’s activity is more homogeneous, while the normal member’s activity is more heterogeneous. It is difficult for bad actors using automation to simulate the subtle patterns of requests created by normal, healthy, organic member behavior on the site.
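To make the layout and the homogeneity contrast concrete, here is a small sketch. It arranges an encoded sequence into rows of 20 requests, as in Figures 2 and 3, and computes the Shannon entropy of the token distribution as one rough proxy for how homogeneous a sequence is. The entropy measure and both example sequences are assumptions for illustration; the post does not say the model uses entropy, and entropy ignores the ordering and timing that the model exploits.

```python
import math
from collections import Counter


def to_grid(encoded, width=20):
    """Split an encoded sequence into rows of `width` requests.

    Rows are read left to right, top to bottom, matching the figures.
    """
    return [encoded[i:i + width] for i in range(0, len(encoded), width)]


def token_entropy(encoded):
    """Shannon entropy in bits of the token distribution.

    Lower values mean the sequence is dominated by fewer request types.
    """
    counts = Counter(encoded)
    total = len(encoded)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up encoded sequences of 200 requests each (illustration only).
scraper_like = [1] * 190 + [2] * 10
organic_like = [1, 4, 2, 7, 3, 1, 9, 5, 2, 6] * 20

scraper_grid = to_grid(scraper_like)
scraper_entropy = token_entropy(scraper_like)
organic_entropy = token_entropy(organic_like)
```

In this toy data the scraper-like sequence has much lower entropy than the organic-like one, which mirrors the visual difference between the two figures.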
Activity sequence model
We use natural language processing (NLP) techniques to classify these sequences. A classic NLP use case is classifying the sentiment of a sequence of words, such as a movie or product review. In our case, member requests representing user actions replace words as the tokens comprising our sequences. Instead of classifying the sentiment of a sequence of words as positive or negative, we classify a sequence of member requests as abusive or not abusive.
We use a supervised long short-term memory (LSTM) deep learning model to produce abuse scores from the encoded request path sequence. We leverage the type, order, and frequency of particular request paths using the encoding visualized in Figures 2 and 3. After some preprocessing of this request path sequence data, we concatenate the sequence of time differences (Δt) between consecutive requests so that we can leverage timing information. Figure 4 shows a conceptual diagram of the model architecture.
Figure 4. A conceptual diagram of the model architecture.
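The forward pass of such a model can be sketched with the standard library alone. This is an untrained toy with random weights, meant only to show how token ids and time differences can flow through an LSTM into a score between 0 and 1. The embedding layer, the layer sizes, the log scaling of time differences, and the names TinyLSTMScorer and time_deltas are all assumptions; the post does not describe these details, and a real model would be built and trained with a deep learning framework.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def time_deltas(timestamps):
    """Seconds between consecutive requests; the first request gets 0.0."""
    if not timestamps:
        return []
    return [0.0] + [b - a for a, b in zip(timestamps, timestamps[1:])]


class TinyLSTMScorer:
    """Untrained sketch: embedding + time feature -> LSTM -> sigmoid score."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, vocab_size, embed_dim=4, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # Row 0 is reserved for padding/unknown tokens (assumption).
        self.embedding = rand_matrix(rng, vocab_size + 1, embed_dim)
        # Each gate sees [token embedding, time feature, previous hidden state].
        input_dim = embed_dim + 1 + hidden_dim
        self.weights = {g: rand_matrix(rng, hidden_dim, input_dim) for g in self.GATES}
        self.biases = {g: [0.0] * hidden_dim for g in self.GATES}
        self.out_weights = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]

    def score(self, token_ids, deltas):
        h = [0.0] * self.hidden_dim  # hidden state
        c = [0.0] * self.hidden_dim  # cell state
        for tok, dt in zip(token_ids, deltas):
            # Concatenate the token embedding with a log-scaled time difference.
            x = self.embedding[tok] + [math.log1p(dt)] + h
            pre = {g: [a + b for a, b in zip(matvec(self.weights[g], x), self.biases[g])]
                   for g in self.GATES}
            i_gate = [sigmoid(v) for v in pre["input"]]
            f_gate = [sigmoid(v) for v in pre["forget"]]
            o_gate = [sigmoid(v) for v in pre["output"]]
            cand = [math.tanh(v) for v in pre["candidate"]]
            c = [f * cj + i * g for f, cj, i, g in zip(f_gate, c, i_gate, cand)]
            h = [o * math.tanh(cj) for o, cj in zip(o_gate, c)]
        logit = sum(w * hj for w, hj in zip(self.out_weights, h))
        return sigmoid(logit)


# Made-up encoded requests and timestamps in seconds (illustration only).
tokens = [1, 3, 1, 2]
stamps = [0.0, 2.0, 3.0, 60.0]
deltas = time_deltas(stamps)
abuse_score = TinyLSTMScorer(vocab_size=3).score(tokens, deltas)
```

Since the weights are random, the score here carries no meaning; training against labels is what would turn it into an abuse score.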
The training labels are chosen based upon the type of abuse we aim to detect. For our first production use case, we trained the model to detect logged-in accounts scraping profile data. The ground truth labels used to train the activity sequence model are generated by another model: an unsupervised outlier-detection model based upon our open source isolation-forest library (see this LinkedIn Engineering Blog post for more information on isolation forests).
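The idea of deriving labels from an outlier detector can be illustrated with a toy isolation forest written from scratch. This is not the API of LinkedIn's isolation-forest library; it is a simplified Python sketch of the general technique. The two per-account features, the made-up data, and the 0.7 score threshold are hypothetical choices for the example.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, rng, depth, max_depth):
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:  # simplification: stop when the chosen feature is constant
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}


def path_length(point, node, depth=0):
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if point[node["feature"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)


def outlier_scores(points, n_trees=100, sample_size=256, seed=0):
    """Return a score in (0, 1] per point; higher means easier to isolate.

    Assumes at least two points.
    """
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(points, size), rng, 0, max_depth)
             for _ in range(n_trees)]
    norm = n_trees * c_factor(size)
    return [2.0 ** (-sum(path_length(p, t) for t in trees) / norm) for p in points]


# Hypothetical features per account: (profile views per hour, distinct request types).
accounts = [(10 + (i % 5), 5 + (i % 3)) for i in range(50)] + [(500, 1)]
scores = outlier_scores(accounts)
THRESHOLD = 0.7  # hypothetical cutoff for turning scores into training labels
labels = [1 if s >= THRESHOLD else 0 for s in scores]
```

The resulting 0/1 labels would then play the role of ground truth when training a supervised sequence model, with the usual caveat that labels produced by another model inherit that model's mistakes.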
When the activity sequence model detects a member as scraping, we give them information on how to correct this behavior.
Our approach has many advantages over traditional machine learning models that use handcrafted features, particularly with respect to the anti-abuse modeling challenges outlined at the beginning of this post.
Maximizing signal
There are two types of limitations on the amount of usable signal in an anti-abuse model:
Imperfectly exploiting the information that is available in the data: Traditional machine learning models use a limited set of handcrafted features that are tuned for a narrow type of behavior. Handcrafted features are lossy due to the use of aggregations and summary statistics. Moreover, handcrafted features do not scale.
Fundamentally limited information available in the data: As anti-abuse defenses improve for individual accounts, attackers scale their abuse horizontally across many accounts; each account does a small amount of abuse. Additionally, there is often little signal to identify a fake account before it becomes actively abusive, which is in tension with the need to catch abusers as early as possible to limit their damaging impact.
Our activity sequence modeling approach uses deep learning to leverage subtle signals associated with the ordering and timing of member requests. This helps address the first limitation above, because the model operates directly on the activity sequence data; we do not lose information via a limited set of lossy handcrafted features. By leveraging more of the available signal using deep learning instead of manual feature engineering, we also ameliorate the second limitation. We are able to catch abuse earlier, because we are using a technique that allows us to exploit more of the behavioral signal that is fundamentally available in the data.
Addressing adversarial behavior
Bad actors are often quick to adapt and evolve in sophisticated ways. This means we need to build anti-abuse models that are robust to adversarial behavior.
Traditional machine learning models rely on brittle, handcrafted features that can be easy for a bad actor to reverse engineer via feedback from models that take action. Our activity sequence modeling approach is resilient to adversarial behavior, because, as illustrated in Figures 2 and 3, it is very difficult for an abusive automator to craft their traffic in a way that perfectly simulates the organic request patterns of a legitimate member.
Scaling to many attack surfaces
We detect abusive behavior across a diverse set of LinkedIn products with multifaceted attack surfaces. This is a heterogeneous, dynamic environment that requires the use of a scalable, generalizable modeling strategy that supports easy retraining as the infrastructure changes underneath our models.
Traditional machine learning models rely on features crafted to capture specific behavior on each particular attack surface. Often, an AI engineer must craft new features and create a new model to defend against a new type of abusive behavior. In contrast, our activity sequence modeling approach leverages a single, universal tracking event as input data, requires no manual feature engineering, and uses the same model architecture regardless of abuse type.
Our activity sequence modeling approach enables us to use the same input data and model architecture for many use cases; the AI engineer just needs to curate a new set of labels and train the model. This provides us with a common modeling strategy that can be reused across many attack surfaces to save developer time.
LinkedIn’s activity sequence modeling methodology leverages NLP deep learning techniques to classify the sequence of activities a member performs as abusive or not abusive. Our first production use case of this technology was the detection of accounts using automation to scrape member profile data, which is a violation of a member’s privacy expectations and is against LinkedIn’s terms of service.
Our activity sequence modeling technology helps to address several unique challenges in the anti-abuse domain: maximizing our use of the available signal to detect abuse, preventing adversarial attackers from circumventing our defenses, and providing a modeling approach that is generalizable and scalable to many attack surfaces.
Special thanks to Ting Chen, Jenelle Bray, Ram Swaminathan, and Romer Rosales for their support of this project. Thank you to Ishan Sinha, Margot Kimura, and Shreyas Nangalia for their product support. Thank you to Milinda Lakkam, David Christle, and Adam Jacoby for their helpful technical input.