Defending Against Abuse at LinkedIn’s Scale

Sahil Handa

Engineering Leadership at Databricks

December 6, 2018

LinkedIn is committed to building a safe, trusted, and professional environment. Building the infrastructure to detect and mitigate abuse at LinkedIn’s scale brings a number of interesting challenges that often require us to think outside the box and craft new ways of integrating defenses throughout our product. In this post, we’ll provide a high-level overview of how we approach these challenges when building the infrastructure to help stop abuse on the world’s largest professional network.

Currently, LinkedIn has more than 590 million members in over 200 countries and territories. Professionals are signing up to join LinkedIn at a rate of more than two new members per second. While the majority of these sign-ups are from legitimate users, we also block a large number of attempts to register fake accounts.

In a previous blog post, we described the funnel of defenses we have to detect and take down fake accounts. In summary, while we prevent a large majority of fake accounts from being created at registration, we sometimes don’t have enough information at that point to determine if accounts are fake. For this reason, we have other, downstream models to catch smaller batches of fakes. As we designed these models, we knew that we needed to build scalable systems with the ability to process, classify, and act efficiently before malicious actors were able to impact other members on our platform.

Currently, our abuse detection systems process over 4 million transactions per second to make decisions in real time using machine-learned models and rules based on statistical methods. A number of challenges make this particularly hard to do, especially at the scale at which LinkedIn operates. These include:

Abuse scoring latency
Downstream latency and service errors
Velocity and frequency at scale
Abuse protection by default

Below, we’ll briefly talk about each of these challenges, along with our high-level approach to overcome them.

Latency

A basic tenet of our protection engineering approach is to stay invisible to product verticals like feed, profile, and search, and to keep the impact on LinkedIn members to a minimum. This means that we have to detect and filter abuse without adversely impacting any product flow. To accomplish this, we try to do most of our heavy processing nearline or offline, but there are a number of instances where we must check for abuse synchronously.

As a result, building very low-latency services is an operational necessity for our infrastructure. We deliberately follow a “performance first” approach when designing critical infrastructure pieces. This has allowed us to build RESTful services that can process 10,000 QPS on a single 24-core machine while also serving 99% of the traffic in under 1 ms. The following are some design and implementation choices we experimented with to achieve the best possible performance:

Parallel processing versus sequential processing
Using primitive versus non-primitive collections
Using thread local objects for processing instead of creating new objects
Avoiding unnecessary serialization and deserialization
Optimistic locking to avoid lock contention under high-volume traffic
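
To make the last item in the list above concrete, here is a minimal Python sketch of optimistic locking. It is an illustration under stated assumptions, not LinkedIn's implementation: the class name `OptimisticCounter` and its methods are hypothetical, and because Python has no user-level compare-and-swap, a very short lock stands in for one. The point is that the work itself runs without holding a lock, and a writer only retries if another writer committed in the meantime.

```python
import threading
from typing import Callable


class OptimisticCounter:
    """Integer cell updated with optimistic concurrency control (hypothetical example)."""

    def __init__(self, initial: int = 0) -> None:
        # (value, version) lives in one tuple so a reader always sees a
        # consistent pair with a single attribute read.
        self._state = (initial, 0)
        # Stand-in for a hardware compare-and-swap: this lock guards only
        # the brief compare-and-commit step, never the computation itself.
        self._commit_lock = threading.Lock()

    @property
    def value(self) -> int:
        return self._state[0]

    def _try_commit(self, expected_version: int, new_value: int) -> bool:
        with self._commit_lock:
            if self._state[1] != expected_version:
                return False  # another writer committed first
            self._state = (new_value, expected_version + 1)
            return True

    def update(self, fn: Callable[[int], int]) -> int:
        """Apply fn to the current value, retrying until the commit succeeds."""
        while True:
            value, version = self._state  # snapshot taken without locking
            new_value = fn(value)         # the expensive part runs lock-free
            if self._try_commit(version, new_value):
                return new_value
            # Lost the race: loop to re-read the state and try again.
```

Calling `counter.update(lambda v: v + 1)` from many threads never loses an increment. The trade-off is wasted recomputation when contention is high, which is why this style tends to suit workloads where conflicting writes to the same record are rare.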

Since some of our systems process transactions in excess of 1M QPS, we have also invested in understanding and tuning performance for the 99.99th percentile of our traffic. We will describe our performance-driven development approach in more detail in a future blog post.
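
A quick calculation shows why such a far tail is worth tuning: at 1M QPS, the 0.01% of requests that fall beyond the 99.99th percentile still amounts to roughly 100 requests every second. The sketch below works through that arithmetic and computes a nearest-rank percentile over latency samples. The function names are hypothetical, and nearest-rank is only one of several common percentile definitions.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # round() absorbs floating-point fuzz (e.g. 9999.000000000002) before ceil().
    rank = math.ceil(round(p * len(ordered) / 100, 9))
    return ordered[rank - 1]


def requests_beyond(qps: float, p: float) -> float:
    """Requests per second that fall outside the p-th percentile."""
    return qps * (100 - p) / 100
```

With 10,000 latency samples, `percentile(samples, 99.99)` selects the 9,999th smallest value, so only the single slowest sample lies beyond it. Measuring this tail reliably therefore needs far more samples than measuring the median.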

Downstream latency and errors

Our abuse scoring flows require a lot of knowledge about the member or transaction that we are classifying as good or bad. Such knowledge is acquired by querying a fleet of downstream systems and databases. A failure or high latency in any of these downstream systems can mean that we can’t meet the service level objective (SLO) of our abuse scoring system.

When these errors happen, we have three strategies that we choose from on a case-by-case basis by weighing risk of abuse against friction introduced to good members:

Fail the original user transaction. This puts abuse scoring in a critical path for that transaction.
Re-run the scoring asynchronously. This allows us to run the full scoring suite again after unblocking the user transaction first.
Gracefully degrade scoring by using default values. This makes the abuse defense slightly weaker for that transaction.
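
The three strategies above can be sketched as a per-flow policy. This is a simplified illustration, not the production design: the feature names, the default values, the `toy_model` scoring function, and the 0.5 threshold are all hypothetical, and a real system would also enforce timeouts on the downstream calls themselves.

```python
import enum
import queue
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class FailurePolicy(enum.Enum):
    FAIL_TRANSACTION = "fail_transaction"            # scoring is in the critical path
    RESCORE_ASYNC = "rescore_async"                  # unblock the user, score later
    DEGRADE_WITH_DEFAULTS = "degrade_with_defaults"  # score with weaker inputs


class ScoringUnavailable(Exception):
    """Raised when the user transaction must fail because scoring could not run."""


@dataclass
class Decision:
    allowed: bool
    score: Optional[float]
    degraded: bool = False
    rescore_queued: bool = False


# Hypothetical neutral defaults used when a downstream lookup fails.
DEFAULT_FEATURES = {"recent_reports": 0.0, "account_age_days": 365.0}


def toy_model(features: Dict[str, float]) -> float:
    """Hypothetical stand-in for a real abuse model; returns a risk score in [0, 1]."""
    risk = 0.3 * features["recent_reports"]
    if features["account_age_days"] < 7:
        risk += 0.2
    return min(1.0, risk)


def score_transaction(txn_id: str,
                      fetchers: Dict[str, Callable[[str], float]],
                      policy: FailurePolicy,
                      rescore_queue: "queue.Queue[str]",
                      threshold: float = 0.5) -> Decision:
    features: Dict[str, float] = {}
    failed = []
    for name, fetch in fetchers.items():
        try:
            features[name] = fetch(txn_id)
        except (TimeoutError, ConnectionError):
            failed.append(name)
    if failed:
        if policy is FailurePolicy.FAIL_TRANSACTION:
            raise ScoringUnavailable(f"missing features: {failed}")
        if policy is FailurePolicy.RESCORE_ASYNC:
            rescore_queue.put(txn_id)  # full scoring runs again out of band
            return Decision(allowed=True, score=None, rescore_queued=True)
        for name in failed:  # DEGRADE_WITH_DEFAULTS
            features[name] = DEFAULT_FEATURES[name]
    score = toy_model(features)
    return Decision(allowed=score < threshold, score=score, degraded=bool(failed))
```

Making the policy an explicit per-flow setting keeps the trade-off visible: a high-risk flow can fail closed, while a flow where friction costs more than the risk can degrade or defer scoring instead.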

Velocity and frequency at scale

A common and basic approach to fighting some forms of abuse is based on the velocity or frequency of a certain action. This approach is widely deployed across the industry but is hard and expensive to build at scale.

Counting the number of HTTP requests a user makes requires a massive amount of compute power and storage. This adds up quickly when you take into account the amount of user-generated traffic, the types of actions, and the number of entities we keep track of. At LinkedIn, we maintain hundreds of different counters for actions taken by any entity, like a user ID or an IP address. At any given time, we have more than:

5B transient counter records
200K QPS writes
3M QPS reads

Sometimes these counters are used for setting up raw thresholds for certain types of actions, but more often they’re used for building comprehensive abuse models like the account takeover (ATO) detection model described here.
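
As a rough illustration of what one such counter does, the following single-process sketch keeps a bucketed sliding-window count per (entity, action) pair and drops records once they age out of the window. It is an assumption-laden toy: the class name, key format, and bucket sizes are hypothetical, and a deployment at the volumes listed above would need a distributed, replicated store rather than an in-memory dictionary.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (entity, action), e.g. ("ip:203.0.113.7", "login")


class VelocityCounter:
    """Bucketed sliding-window counter (hypothetical, not thread-safe)."""

    def __init__(self, window_seconds: int, bucket_seconds: int = 1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if window_seconds % bucket_seconds != 0:
            raise ValueError("window must be a multiple of the bucket size")
        self._bucket_seconds = bucket_seconds
        self._num_buckets = window_seconds // bucket_seconds
        self._clock = clock
        # key -> deque of [bucket_index, count], oldest bucket first
        self._buckets: Dict[Key, Deque[List[int]]] = {}

    def _current_index(self) -> int:
        return int(self._clock() // self._bucket_seconds)

    def _live_buckets(self, key: Key, index: int) -> Optional[Deque[List[int]]]:
        """Drop expired buckets; forget the key entirely once nothing is left."""
        buckets = self._buckets.get(key)
        if buckets is None:
            return None
        oldest_kept = index - self._num_buckets + 1
        while buckets and buckets[0][0] < oldest_kept:
            buckets.popleft()
        if not buckets:
            del self._buckets[key]  # counters are transient records
            return None
        return buckets

    def increment(self, entity: str, action: str, amount: int = 1) -> int:
        """Record an action and return the count within the current window."""
        key = (entity, action)
        index = self._current_index()
        buckets = self._live_buckets(key, index)
        if buckets is None:
            buckets = self._buckets[key] = deque()
        if buckets and buckets[-1][0] == index:
            buckets[-1][1] += amount
        else:
            buckets.append([index, amount])
        return sum(count for _, count in buckets)

    def count(self, entity: str, action: str) -> int:
        buckets = self._live_buckets((entity, action), self._current_index())
        return sum(count for _, count in buckets) if buckets else 0
```

A raw threshold then becomes a comparison such as `counter.increment(ip, "login") > limit`, and the same counts can be fed to a model as features. Coarser buckets use less memory per key at the cost of a less precise window boundary.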

Abuse protection by default

Abuse is hard to plan for, as malicious actors are constantly trying to figure out how to monetize different aspects of our platform. As a result, we needed to build flexible abuse prevention infrastructure into all possible entry points by default. Protections could then be quickly set up to combat an attack and add friction to any flow if necessary. This is challenging at a large company like LinkedIn, however, because thousands of developers are simultaneously creating new experiences and product flows across our site all the time. To solve this, we built a suite of integrations that give us:

Unified protections at our edge layer: This helps us take action against abuse on any endpoint. To make fast decisions, we use only the data in the body of the request. Blocking transactions at the edge layer also keeps unnecessary traffic from reaching downstream services.
Unified content filtering: This allows us to easily scale our defenses as LinkedIn launches new types of content for our members. We provide a single library that the different services at LinkedIn can integrate with to enable a comprehensive suite of protections against things like spam, malware, inappropriate content, and malicious URLs.
Custom integrations for high-risk endpoints: This allows us to build highly customized models against known attack vectors on various endpoints, including registration, login, content creation, and payments.
Self-serve protections: This helps us scale the impact of our central abuse team to the hundreds of services that exist at LinkedIn. Self-serve protections like throttling can be configured and set up by SRE teams to protect services from unexpectedly high volumes of traffic.
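
To give a sense of what a self-serve throttle might look like, here is a small token-bucket sketch driven by a configuration mapping. Token bucket is a common throttling technique chosen here for illustration; the post does not say which algorithm LinkedIn uses. The endpoint name, the configuration keys, and the `build_throttles` helper are all hypothetical.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Token-bucket throttle (hypothetical example, not thread-safe)."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so short bursts are allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend tokens if enough remain."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Hypothetical self-serve configuration: one entry per protected endpoint.
THROTTLE_CONFIG = {
    "/example/send": {"rate_per_second": 2.0, "capacity": 2.0},
}


def build_throttles(config: Dict[str, Dict[str, float]],
                    clock: Callable[[], float] = time.monotonic) -> Dict[str, TokenBucket]:
    """Turn declarative settings into one throttle per endpoint."""
    return {
        endpoint: TokenBucket(settings["rate_per_second"], settings["capacity"], clock)
        for endpoint, settings in config.items()
    }
```

Keeping the limits in plain configuration is what makes the protection self-serve: a service owner changes two numbers rather than writing throttling code.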

Given LinkedIn's scale, we often face interesting challenges in continuously protecting our members. We hope the high-level ways we think about and address these problems are useful to others in the industry if and when they encounter similar issues.

Acknowledgements

Fighting against abuse is a constant multi-team effort. Above all else, tight collaboration between a number of different business functions is key. I want to thank the Abuse Prevention Infrastructure, Abuse Research and Response, Abuse Relevance, Trust and Safety, Trust Analytics, and House Security teams for their continued efforts in keeping our site safe.

I also want to thank Jenelle Bray, Carlos Faham, Tzu-Han Jan, Xianheng (Shane) Ma, Stephen Lynch, Anne Trapasso, MK Juric, Cory Scott, and Nikolai Avteniev for their help in writing this blog post.
