The Evolution of Enforcing our Professional Community Policies at Scale

LinkedIn is always working hard to make sure that its platform is a safe and trusted place for its members. We've been on a journey to strengthen our platform against abuse by continuously improving our account restriction systems. This helps us ensure that our policies are followed and that our community can keep growing.

In a previous blog post, we talked about how we built our anti-abuse platform using CASAL. This powerful system is our first line of defense against bad actors and adversarial attacks. In this blog post, we'll go deeper into how we manage account restrictions. We'll talk about the changes we've made over the years to keep up with LinkedIn's growth and scale our infrastructure quickly. We'll also share how we manage account restrictions at scale while maintaining our policies and improving the member experience.

Identifying the malicious intent

Identifying malicious intent is at the core of our commitment to ensuring a safe and secure environment for LinkedIn members. As we detailed in our previous blog post, our anti-abuse platform is equipped with a formidable arsenal of tools, including advanced Machine Learning (ML) models, rule-based systems, human review processes, and more. This multi-faceted approach enables us to meticulously evaluate user intent, determining whether it veers into malicious territory or not. When malicious intent is detected, we are swift to respond, employing a range of measures such as imposing challenges to verify authenticity, and in certain cases, restricting a member’s access to the LinkedIn platform. These proactive measures are vital in safeguarding the integrity of our community.

Our approach to restrictions is multi-faceted, mirroring the diverse nature of threats encountered in the ecosystem. We recognize that not all restrictions are created equal. Thus, we have developed a variety of restriction types, each tailored to address specific behaviors and risks posed by bad actors. These measures serve as a strategic defense aimed at minimizing malicious activity's impact and preserving our genuine members' LinkedIn experience.

Figure 1: LinkedIn CASAL & restriction management integration

Evolution of restrictions enforcement

Over the course of LinkedIn’s journey, our commitment to maintaining a secure and seamless experience for our ever-growing member base has led to the evolution of our member restrictions system. As the platform expanded to include over one billion members worldwide, we recognized the need to continually adapt and enhance our mechanism to keep pace with this organic growth. The following sections will take you on a detailed journey spanning years of engineering innovation and thousands of hours of dedicated efforts.

First Generation

In the earlier stages of our development, simplicity was the guiding principle that shaped our approach to member restrictions. At the heart of this system was a reliance on a relational database, Oracle, which served as the repository for all member restrictions data. When we detected that a member’s intent veered into abusive territory, we set the process of imposing restrictions in motion. The fundamental operation was clear: a corresponding record was meticulously created with each new restriction imposed on a member. These records held vital metadata linked to the restriction, including essential timestamps.

Figure 2: Relational database schema

We adopted a pragmatic and scalable approach by distributing member restrictions across different Oracle tables. This ensured a systematic isolation of different restriction types based on their underlying principles and behavior. These strategic distributions allowed us to leverage the inherent power of relational databases to their fullest potential. Moreover, we dedicated substantial engineering efforts to developing the requisite CRUD (Create, Read, Update, Delete) workflows, ensuring that the lifecycle of these restrictions was managed with precision and efficiency.
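To make this concrete, the sketch below shows roughly how one of these restriction rows and its read path might have looked. It is illustrative only: the table name, column names, and Java types are assumptions, not our actual Oracle schema or production code.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.Instant;
import java.util.Optional;

// Hypothetical shape of a single restriction row in one of the per-type Oracle tables.
record MemberRestriction(long memberId, String restrictionType, Instant createdAt, Instant expiresAt) {}

final class MemberRestrictionDao {
  private final Connection connection;

  MemberRestrictionDao(Connection connection) {
    this.connection = connection;
  }

  // The "Read" in the CRUD workflow: fetch the active restriction (if any) for a member.
  Optional<MemberRestriction> findActive(long memberId) throws Exception {
    String sql = "SELECT member_id, restriction_type, created_at, expires_at "
        + "FROM member_restriction WHERE member_id = ? AND expires_at > SYSTIMESTAMP";
    try (PreparedStatement ps = connection.prepareStatement(sql)) {
      ps.setLong(1, memberId);
      try (ResultSet rs = ps.executeQuery()) {
        if (!rs.next()) {
          return Optional.empty();
        }
        return Optional.of(new MemberRestriction(
            rs.getLong("member_id"),
            rs.getString("restriction_type"),
            rs.getTimestamp("created_at").toInstant(),
            rs.getTimestamp("expires_at").toInstant()));
      }
    }
  }
}
```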

As LinkedIn matured and embarked on the transition from a monolithic architecture to a dynamic microservices paradigm through our in-house framework Multiproduct, our commitment to enforcing member restrictions took on even greater significance. We recognized that the scope of enforcement needed to transcend product boundaries, extending across all LinkedIn offerings from the ubiquitous LinkedIn Feed to the specialized LinkedIn Talent Solutions and many more. This unwavering commitment stemmed from our core policies, which mandated strict enforcement. Our mission was clear: to safeguard the platform from infiltrations of malicious actors, ensuring that the experience of our valued members remained uncompromised and of the highest quality.

Server-side cache-aside

As our journey through the evolution of restriction enforcement continued, we encountered new challenges arising from the platform’s remarkable growth. The ever-increasing volume of requests directed at the application responsible for serving restrictions data over Rest.li demanded an innovative solution. Enter server-side caching, an improvement over the earlier approach.

Figure 3. Server-side cache for restrictions data

Our approach included implementing predefined TTL (Time-To-Live) settings tailored to different restriction types. To kickstart this phase, we embraced the cache-aside algorithm. Here’s how it worked: When an incoming request sought member restriction data, we initiated a quick check within the application’s in-memory cache. If the cache held the requested data (a cache hit), we could swiftly return it, eliminating the need for a database query. On the flip side, if the cache lacked the data (a cache miss), we retrieved it from the database while concurrently initiating an asynchronous update to refresh the cache.
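The following minimal sketch illustrates that cache-aside flow. It is not our production Rest.li service code; the class names, the store interface, and the single TTL value are assumptions (in practice the TTL varied by restriction type).

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

final class RestrictionCacheAside {
  // Cached restriction list plus the wall-clock time at which the entry expires.
  private record Entry(List<String> restrictions, long expiresAtMillis) {}

  private final ConcurrentHashMap<Long, Entry> cache = new ConcurrentHashMap<>();
  private final RestrictionStore store; // hypothetical wrapper around the Oracle tables
  private final Duration ttl;           // per-restriction-type TTL in the real system

  RestrictionCacheAside(RestrictionStore store, Duration ttl) {
    this.store = store;
    this.ttl = ttl;
  }

  List<String> getRestrictions(long memberId) {
    long now = System.currentTimeMillis();
    Entry entry = cache.get(memberId);
    if (entry != null && entry.expiresAtMillis() > now) {
      return entry.restrictions(); // cache hit: no database query needed
    }
    List<String> fromDb = store.loadRestrictions(memberId); // cache miss: read from Oracle
    // Refresh the in-memory cache asynchronously so the caller is not blocked on the write.
    CompletableFuture.runAsync(() -> cache.put(memberId, new Entry(fromDb, now + ttl.toMillis())));
    return fromDb;
  }

  interface RestrictionStore {
    List<String> loadRestrictions(long memberId);
  }
}
```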

Shortcomings: While this server-side cache wasn’t distributed in the traditional sense, it delivered tangible improvements in latency, especially when multiple member requests happened to hit the same application host. Our vigilant team closely monitored the size of the in-memory cache, ensuring that our JVM memory had sufficient capacity. This tedious memory management was essential, given that our application juggled a myriad of operations beyond the realm of serving restrictions. Yet, as our systems grew and their intricacies of restriction enforcement deepened, it became apparent that further enhancements were needed. Enter client-side caching, a solution better suited for scenarios where cache-hit ratios held a higher promise, beckoning us toward even greater performance gains. 

Client-side cache

With server-side caching providing a temporary boost in our system’s performance, we acknowledged its inherent limitations and set our sights on the next phase of evolution. The natural progression was the implementation of client-side caching strategies, a pivotal move that would redefine our approach to restriction enforcement. Our ambition extended to all our upstream applications, encompassing LinkedIn Feed, LTS, etc. Our mission was to develop a versatile client-side library capable of initiating and maintaining caches directly on the client-side application hosts, closely aligning with the principles of our tried-and-tested server-side cache-aside algorithms.

Figure 4. Client side cache with server side cache for restrictions data

Shortcomings: The introduction of dual layers, encompassing server-side and client-side caching, resulted in significant enhancements, particularly in scenarios with high cache-hit rates. While our caching initiatives successfully reduced latency and improved system responsiveness, they also brought forth new challenges that emphasized the importance of maintaining cache consistency. Periodic latency spikes became a concern, primarily due to the absence of a distributed cache strategy in our setup and to records missing from the caches. Our cache framework primarily operated in isolation at the application host level, a configuration that posed specific challenges related to consistency and performance.

In light of these challenges, we had to think proactively, refining our cache strategies further and fortifying our restriction enforcement to pave the way for solutions capable of navigating these complexities.

Full refresh-ahead cache

In our quest for even more efficient restriction enforcement, we ventured into the realm of the full refresh-ahead cache, a concept designed to enhance our system’s performance further. The premise was intriguing - each client application host would diligently store all restriction data in its in-memory cache. This architecture was meticulously structured, with each client-side application host tasked with reading all the restriction records from our servers during every restart or boot-up. The objective was to ensure that every single member restriction record resided in the cache of each application host.

The advantages of this approach were clear. We witnessed a remarkable improvement in latencies, primarily because all member restriction data was readily available on the client side, sparing the need for network calls. We implemented a polling mechanism within our library to maintain cache freshness, enabling regular checks for newer member restriction records and keeping the client-side cache impeccably up-to-date. 
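In rough terms, the full refresh-ahead pattern looked like the sketch below: read everything at boot, then poll for anything newer than the latest timestamp seen. The feed interface, record shape, and polling interval are illustrative assumptions, not our actual library.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class FullRefreshAheadCache {
  private final Map<Long, List<String>> allRestrictions = new ConcurrentHashMap<>();
  private final RestrictionFeed feed;          // hypothetical read API over the restriction store
  private volatile Instant highWatermark = Instant.EPOCH;

  FullRefreshAheadCache(RestrictionFeed feed) {
    this.feed = feed;
  }

  // Executed on every restart or boot-up: pull the complete restriction dataset into memory.
  void bootstrap() {
    applyAll(feed.readAll());
  }

  // A periodic poll keeps the cache fresh without repeating the full read.
  void startPolling(ScheduledExecutorService scheduler, long intervalSeconds) {
    scheduler.scheduleAtFixedRate(
        () -> applyAll(feed.readCreatedAfter(highWatermark)),
        intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
  }

  // Serving path: a pure in-memory lookup, no network call.
  boolean isRestricted(long memberId) {
    return allRestrictions.containsKey(memberId);
  }

  private void applyAll(List<RestrictionRecord> records) {
    for (RestrictionRecord record : records) {
      allRestrictions.put(record.memberId(), record.restrictions());
      if (record.createdAt().isAfter(highWatermark)) {
        highWatermark = record.createdAt();
      }
    }
  }

  record RestrictionRecord(long memberId, List<String> restrictions, Instant createdAt) {}

  interface RestrictionFeed {
    List<RestrictionRecord> readAll();
    List<RestrictionRecord> readCreatedAfter(Instant since);
  }
}
```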

Shortcomings: However, as with any mechanism, there were trade-offs. The full refresh-ahead cache approach posed its own set of challenges. Each client was burdened with maintaining a substantial in-memory footprint to accommodate the entire member restriction data. While the data size itself was manageable, the demands of client-side memory were significant. Moreover, this architecture introduced a burst of network traffic during application restarts or the deployment of new changesets, potentially straining our infrastructure.

Additionally, as the memory on the client side did not persist, performing a full refresh-ahead cache became a resource-intensive and time-consuming operation. Each host was required to undertake the same process, exerting substantial strain on our underlying Oracle database. This strain manifested in various ways, from performance bottlenecks and increased latencies to elevated CPU usage, imposing substantial operational overhead on our teams. The challenges persisted despite our efforts to fine-tune the Oracle database, including adding indexes and optimizing SQL queries.

Furthermore, the architecture’s Achilles’ heel lay in cache inconsistencies. Maintaining perfect synchronization between the cache and the server was a demanding task requiring precise data fetching and storage. Any failures, even after limited retries, could result in some records being missed, often due to factors like network failures or unexpected errors. These inconsistencies were a cause of concern, hinting at potential gaps in our ability to enforce all restrictions consistently.

Bloom-Filter 

In our relentless pursuit of optimization, we ventured into the realm of Bloom-Filters (BF). BF proved to be a transformative addition to our toolkit, offering a novel approach to storing and serving restrictions at scale. Unlike conventional caching mechanisms, BF offered a unique advantage: they allowed us to determine, with remarkable efficiency, whether a given member restriction was present in the filter or not.

The brilliance of BF lies in their ability to represent large datasets of restrictions succinctly. Rather than storing the complete restriction dataset, we harnessed the power of BF to encode these restrictions in a compact and highly efficient manner. This meant that, unlike traditional caching, which could quickly consume substantial in-memory space, we could conserve valuable resources while maintaining rapid access to restriction information. The essence of a BF is its probabilistic nature; it excels in rapidly answering queries about the potential presence of an element, albeit with a small possibility of producing false positives. Capitalizing on this probabilistic quality, we could swiftly and accurately assess whether a member’s restriction was among the encoded data. We could proceed with the required restriction action when the BF signaled a positive match. This approach streamlined our system and allowed us to maintain a lean memory footprint, an essential consideration as our platform continued to scale to accommodate the needs of millions of LinkedIn members worldwide.
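As an illustration of the idea (using Guava's BloomFilter rather than our actual implementation, whose sizing and false-positive targets we are not detailing here):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Illustrative only: a Bloom filter over restricted member IDs, built with Guava.
public final class RestrictedMemberFilter {
  private final BloomFilter<Long> filter;

  public RestrictedMemberFilter(long expectedRestrictedMembers, double falsePositiveRate) {
    this.filter = BloomFilter.create(Funnels.longFunnel(), expectedRestrictedMembers, falsePositiveRate);
  }

  public void add(long memberId) {
    filter.put(memberId);
  }

  // True means "possibly restricted" (a small false-positive chance); false is definitive.
  public boolean mightBeRestricted(long memberId) {
    return filter.mightContain(memberId);
  }
}
```

A negative answer is definitive and lets the request proceed with no further work, while a positive match is where the restriction action is taken; that asymmetry is why the rare false positive was an acceptable trade-off for the memory saved.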

Shortcomings: The adoption of BF marked an enhancement in our pursuit of scalability. Instead of merely enlarging our memory capacity to accommodate an ever-expanding list of restrictions, we embraced an ingenious method that allowed us to efficiently manage and access these restrictions without the burden of excessive memory requirements. BF's probabilistic nature introduced a trade-off: it provided incredibly fast queries but, in rare instances, might return a false positive. However, this was a minor concession for our use case, where rapid identification of restrictive conditions was paramount. BF became the linchpin of our scalable restriction enforcement strategy, delivering swift and precise results while optimizing resource allocation, making it a crucial addition to our evolving arsenal of tools and techniques.

Second Generation

LinkedIn’s remarkable growth brought both opportunities and challenges. As our member base expanded, we encountered operational complexities in managing a system that could efficiently enforce member restrictions across all product surfaces. System outages and inconsistencies became unwelcome companions on our journey. Faced with these challenges, we reached a crucial juncture that demanded a fundamental overhaul of our entire system.

Our journey towards transformation centered on a set of principles, each geared towards optimizing our system:

  • Support high QPS (4-5 million QPS): The need to handle restriction enforcement for every request across LinkedIn’s extensive product offerings called for a system capable of sustaining a high QPS rate.
  • Ultra low latency (<5 ms): Ensuring a swift response time, with latency consistently under 5 milliseconds, was imperative to uphold the member experience across all workflows.
  • Five 9’s availability: We committed to achieving 99.999% availability, ensuring that every restriction was enforced without exception, safeguarding a seamless experience for our valued members.
  • Low operational overhead: To optimize resource allocation, we aimed to minimize the operational complexity associated with maintaining system availability and consistency.
  • Small delay in time to enforce: We set stringent benchmarks for minimizing the time it took to enforce restrictions after they were initiated.

With these principles as our North Star, we began to redesign our architecture from the ground up.

Figure 5. LinkedIn restriction enforcement system (2nd generation)

First, we migrated all member restrictions data to Espresso, LinkedIn’s custom-built NoSQL distributed document storage solution. This strategic move streamlined our data management. Every time a new member restriction record was created, a corresponding Espresso document with the MemberId as the key was generated. Espresso’s unique capability to emit Kafka messages containing the new document data and metadata played a pivotal role in ensuring data freshness and synchronization (LinkedIn’s Oracle integration solution did not have this capability when we designed the system, though it was introduced later).

Second, we made a strategic decision based on the CAP principle (Consistency, Availability, Partition Tolerance) to prioritize consistency and availability while forgoing partitioning of the data. This choice was informed by our past experiences with partitioned databases, which had proven to introduce latencies that didn’t align with our stringent latency goals. This decision, together with the absence of a caching solution like Redis or Couchbase at LinkedIn at the time, shaped our architectural direction.

In this new design, each server application host would bootstrap all member restriction records from the Espresso database tables. Espresso’s tight integration with LinkedIn’s Brooklin (a near real-time data streaming framework) enabled seamless data streaming through Kafka messages. Once all existing records were processed, we transitioned to listening to newer Kafka messages capturing the recently created restriction records, ensuring the availability and freshness of data across all hosts.
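A simplified stand-in for that flow is sketched below using the plain Apache Kafka consumer API; the real system consumes Espresso change events via Brooklin, and the broker address, topic name, group id, and payload format here are hypothetical.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public final class RestrictionStreamConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");           // placeholder broker address
    // Each host would need its own group (or manual partition assignment) so that every
    // host sees every record rather than splitting the stream across a consumer group.
    props.put("group.id", "restriction-enforcement-host-1");
    props.put("auto.offset.reset", "earliest");                 // replay from the beginning to bootstrap
    props.put("key.deserializer", LongDeserializer.class.getName());
    props.put("value.deserializer", StringDeserializer.class.getName());

    Map<Long, String> inMemoryRestrictions = new HashMap<>();   // stand-in for the in-memory store

    try (KafkaConsumer<Long, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("member-restrictions"));       // hypothetical topic name
      while (true) {
        // The same loop covers both phases: replaying existing records (bootstrap) and then
        // tailing newly emitted restriction events to keep the host fresh and consistent.
        ConsumerRecords<Long, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<Long, String> record : records) {
          inMemoryRestrictions.put(record.key(), record.value());
        }
      }
    }
  }
}
```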

This architecture delivered ultra-low latencies, eliminating the need for downstream calls to databases, relying instead on internal in-memory lookups. We conducted extensive benchmarking and optimized our data structures, selecting the FastUtil collections framework for its promising results in memory utilization and lookups. Because the in-memory records carried no TTL of their own, TTL expiry and removal of restriction records happened on the Espresso database, which emitted Kafka messages to trigger the corresponding actions on the server applications, ensuring immediate and consistent enforcement.
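The in-memory side of each host can be pictured as below. This is a sketch, not our production code: the exact FastUtil collection types and record shapes we benchmarked are simplified here to a primitive-keyed map of restriction types per member.

```java
import it.unimi.dsi.fastutil.longs.Long2ObjectOpenHashMap;
import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet;
import java.util.Set;

// A primitive-keyed map avoids boxing every member ID into a java.lang.Long,
// which keeps the per-entry heap overhead of the full restriction dataset low.
public final class InMemoryRestrictionStore {
  private final Long2ObjectOpenHashMap<Set<String>> restrictionsByMember = new Long2ObjectOpenHashMap<>();

  // Applied when a Kafka message announces a newly created restriction record.
  public synchronized void addRestriction(long memberId, String restrictionType) {
    Set<String> types = restrictionsByMember.get(memberId);
    if (types == null) {
      types = new ObjectOpenHashSet<>();
      restrictionsByMember.put(memberId, types);
    }
    types.add(restrictionType);
  }

  // Applied when Espresso emits a message for a TTL expiry or a lifted restriction.
  public synchronized void removeRestriction(long memberId, String restrictionType) {
    Set<String> types = restrictionsByMember.get(memberId);
    if (types != null) {
      types.remove(restrictionType);
      if (types.isEmpty()) {
        restrictionsByMember.remove(memberId);
      }
    }
  }

  // Request path: an internal in-memory lookup with no downstream database call.
  public synchronized boolean isRestricted(long memberId) {
    return restrictionsByMember.containsKey(memberId);
  }
}
```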

Our journey also involved horizontal scaling across LinkedIn data centers, allowing us to support high QPS. We further fine-tuned each host (JVM, GC, etc.) to handle very high QPS through optimizations and internal enhancements. With this transformation rolled out seamlessly across our 100+ clients, we saved over 16TB of memory that the first-generation design had allocated to client-side and server-side caching.

Shortcomings: However, one caveat re-emerged: bootstrapping all records during server restarts or deployments presented a bottleneck. Despite optimization efforts, this process could take 30+ minutes, posing a challenge in urgent situations. This drawback called for innovation to address the issue and minimize potential downtime.

Third Generation

The second generation of our architecture served us admirably for over five years, scaling in lockstep with LinkedIn’s own remarkable growth while maintaining robust restriction enforcement. However, as the platform evolved, we encountered challenges, particularly during GC (Garbage Collection) cycles that occasionally led to increased latencies. To tackle this, we engaged closely with the LinkedIn Java team to fine-tune our JVM applications, refining parameters and optimizing internal data structures and code to mitigate latency spikes.

A pivotal incident occurred when the volume of records organically swelled, accompanied by a surge in adversarial attacks on the LinkedIn platform. This surge resulted in a notable increase in restrictions. While we had allocated ample JVM heap space for each server host, some hosts surpassed the FastUtil-based Java HashMap’s load factor threshold. This breach prompted resizing this data structure, effectively doubling its capacity. Unfortunately, this expansion pushed the map data structure to the upper limits of the allocated JVM heap space, triggering GC events and culminating in elevated latencies.
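To illustrate the failure mode (and one common mitigation, shown here only as an example rather than what we ultimately shipped): an open-addressing map created with too small an expected capacity will rehash once its load-factor threshold is crossed, briefly holding both the old and new backing arrays on the heap. Sizing it up front avoids that resize. The figure used below is hypothetical.

```java
import it.unimi.dsi.fastutil.Hash;
import it.unimi.dsi.fastutil.longs.Long2ObjectOpenHashMap;
import java.util.Set;

public final class PreSizedRestrictionMap {
  // Hypothetical figure: size for the expected worst-case number of restricted members.
  private static final int EXPECTED_RESTRICTED_MEMBERS = 50_000_000;

  // Reserving capacity once at construction avoids the resize-and-copy that doubles the
  // backing arrays and can push a near-full heap into long GC pauses.
  private final Long2ObjectOpenHashMap<Set<String>> restrictionsByMember =
      new Long2ObjectOpenHashMap<>(EXPECTED_RESTRICTED_MEMBERS, Hash.DEFAULT_LOAD_FACTOR);

  public boolean isRestricted(long memberId) {
    return restrictionsByMember.containsKey(memberId);
  }
}
```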

The criticality of our system to LinkedIn’s operation cannot be overstated. Any rise in latencies or downtime carries the potential to impact all LinkedIn product surfaces significantly, diminishing the member experience. Thus, we embarked on a fresh journey to redesign our system, adhering to the principles that had guided us before and adding new ones to address the evolving landscape:

  • Faster data bootstrap: We aimed to expedite the processing of all restriction data to accommodate organic growth and escalating adversarial attacks effectively.
  • Migrate to off-heap memory: Leveraging the off-heap memory bucket was imperative, ensuring optimal resource utilization.

In pursuit of optimization, we leveraged cutting-edge LinkedIn technologies and innovation, leading to a refined system architecture like the following:

Figure 7: Restriction enforcement system using Venice - DaVinci client

Recognizing the urgency and criticality of the task, we conducted benchmark experiments in a time-bound manner. Speed and precision were paramount, given the inherent risks we faced. Our LinkedIn Venice team introduced the “DaVinci” framework, an ingenious client library that redefined cost-performance trade-offs. DaVinci operates as an “eager cache,” enabling each server host to process and store all restriction data in memory through Kafka messages, replicating part of the second-generation architecture’s functionality. Notably, we refrained from partitioning our data, a deliberate choice to minimize lookups and align with our stringent ultra-low latency requirements.

Furthermore, we revamped our data structures, adopting a more bitset-like approach to shrink our in-memory footprint further, thereby mitigating the risk of reaching the upper scaling limit for our hosts. This meticulous re-architecting effort has instilled us with profound confidence, ensuring our ability to navigate the foreseeable future, even in the face of organic and inorganic growth in restriction data. 
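We are not spelling out the exact layout here, but a "bitset-like" representation can be pictured roughly as follows: member IDs are mapped once to dense indexes, and each restriction type then costs a single bit per member instead of an object per record. All names and the indexing scheme below are assumptions for illustration.

```java
import it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap;
import java.util.BitSet;

public final class BitsetRestrictionIndex {
  private final Long2IntOpenHashMap memberIdToIndex = new Long2IntOpenHashMap();
  private final BitSet[] bitsByRestrictionType;
  private int nextIndex = 0;

  public BitsetRestrictionIndex(int restrictionTypeCount) {
    memberIdToIndex.defaultReturnValue(-1);   // sentinel for "member not seen yet"
    bitsByRestrictionType = new BitSet[restrictionTypeCount];
    for (int i = 0; i < restrictionTypeCount; i++) {
      bitsByRestrictionType[i] = new BitSet();
    }
  }

  public synchronized void set(long memberId, int restrictionType) {
    int index = memberIdToIndex.get(memberId);
    if (index == -1) {
      index = nextIndex++;
      memberIdToIndex.put(memberId, index);
    }
    bitsByRestrictionType[restrictionType].set(index);
  }

  public synchronized boolean isRestricted(long memberId, int restrictionType) {
    int index = memberIdToIndex.get(memberId);
    return index != -1 && bitsByRestrictionType[restrictionType].get(index);
  }
}
```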

Learnings

Over the years, across multiple team members and our partner teams at LinkedIn, this journey has been marked by critical learnings that have informed our strategies, decisions, and innovations. 

  • Start simple, scale thoughtfully: Complexity can hinder scalability. Start with simple solutions and scale thoughtfully. This approach prevents unnecessary complexity, ensuring streamlined development.
  • Humility and proactivity: Acknowledging system limits and proactively identifying design gaps are essential to maintaining system robustness during growth.
  • Collaboration drives efficiency: Collaborating with different teams enhances knowledge sharing, accelerates development, and reduces redundant efforts.
  • Benchmark and experiment: Rigorous benchmarking and time-bound experimentation enable quick adaptation to evolving challenges.
  • Continuous improvement: Ongoing optimization is essential to adapt to changing circumstances and ensure system resilience.

Conclusion

At LinkedIn, we're dedicated to keeping our community protected. We take a comprehensive approach to creating a safe and enjoyable experience for our members. Our anti-abuse platform uses advanced ML models and human expertise to quickly identify and address bad actors. By thoughtfully implementing a range of restrictions, we deter bad actors and make our platform more resilient, ultimately keeping the LinkedIn community thriving and secure. This new approach demonstrates the power of adaptability, innovation, and collaboration in ensuring positive experiences for our members.

Acknowledgments

We would like to extend our heartfelt acknowledgments to the dedicated team who have been the driving force behind this success. This is a massive collaborative effort of our core team, comprising brilliant minds from various domains, including Anti-Abuse Infra, Trust SRE, Venice, LinkedIn Java performance team and beyond. 

I would also like to extend thanks to Xiaofeng Wu, Shane Ma, Hongyi Zhang, James Verbus, Jithender Reddy Malladi, Apurv Kumar, Nick Hernandez, Katherine Vaiente, Francisco Tobon, Will Cheng, and Leonna Spilman for helping to review this blog and providing valuable feedback.