(Re)building Threat Detection and Incident Response at LinkedIn
November 9, 2022
Co-authors: Sagar Shah and Jeff Bollinger
LinkedIn connects and empowers more than 875 million members and has undergone tremendous growth over the past few years. As an integral part of the Information Security organization at LinkedIn, the Threat Detection and Incident Response team (aka SEEK) defends LinkedIn against computer security threats. As we continue to experience a rapid growth trajectory, the SEEK team decided to reimagine its capabilities and the scale of its monitoring and response solutions. What SEEK set out to do was akin to shooting for the moon, so we named the program “Moonbase.” Moonbase set the stage for significantly more mature capabilities that ultimately led to some impressive results. With Moonbase, we were able to reduce incident investigation times by 50%, expand threat detection coverage by 900%, and reduce our time to detect and contain security incidents from weeks or days to hours.
In this blog, we will discuss how LinkedIn rebuilt its security operations platform and teams, scaled to protect nearly 20,000 employees and more than 875 million members, and our approach and strategy to achieve this objective. In subsequent posts, we will do a deep dive into how we built and scaled threat detection, improved asset visibility, and enhanced our ability to respond to incidents within minutes, with many lessons learned along the way.
Software-defined Security Operations Center (SOC)
While these data points demonstrated some of the return on the investment, they didn’t show how much scaling to Moonbase improved the quality of life for our security analysts and engineers. With Moonbase, we were able to eliminate the need to manually search through CSV files on a remote file server, or the need to download logs and other datasets directly for processing and searching. The move to a software-defined and cloud-centric security operations center accelerated the team's ability to analyze more data in more detail, while reducing the toil of manual acquisition, transformation, and exploration of data.
Having scaled other initiatives in the past, we knew before we started the program rebuild that we would need strong guiding principles to keep us on track and within scope, and to set reasonable expectations.
Preserving Human Capital
As we thought about pursuing a Software-defined SOC framework for our threat detection and incident response program, preserving our human capital was one of the main driving factors. This principle continues to drive the team, leading to outcomes such as reduced toil through automation, maximized true positives, tight collaboration between our detection engineers and incident responders, and semi-automated triage and investigation activities.
It’s commonly said that security is everyone’s responsibility, and that includes threat detection and incident response. In our experience, centralizing all responsibilities of threat detection and incident response within a single team restricts progress. Democratization can be a force multiplier and a catalyst for rapid progress if approached pragmatically. For example, we developed a user attestation platform where a user is informed about suspicious activity, given context on why we think it is suspicious, and asked whether they recognize the activity. Depending on the user’s response and circumstantial factors, a workflow is triggered that could lead to an incident being opened for investigation. This helped reduce toil and the time to contain threats by offering an immediate response to unusual activity. Democratization has been applied to several other use cases, from gaining visibility to gathering threat detection ideas, with varying degrees of success.
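The attestation flow described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the platform's actual logic: the `UserResponse` and `Outcome` values, the `high_risk` flag standing in for "circumstantial factors", and the mapping between them are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class UserResponse(Enum):
    RECOGNIZED = "recognized"
    NOT_RECOGNIZED = "not_recognized"
    NO_RESPONSE = "no_response"


class Outcome(Enum):
    CLOSE_AND_LOG = "close_and_log"
    ANALYST_REVIEW = "analyst_review"
    OPEN_INCIDENT = "open_incident"


@dataclass(frozen=True)
class SuspiciousActivity:
    user: str
    summary: str      # context shown to the user about why this looks suspicious
    high_risk: bool   # hypothetical stand-in for "circumstantial factors"


def decide_outcome(activity: SuspiciousActivity, response: UserResponse) -> Outcome:
    """Map a user's attestation answer to the next workflow step."""
    if response is UserResponse.NOT_RECOGNIZED:
        # The user does not recognize the activity: open an incident.
        return Outcome.OPEN_INCIDENT
    if response is UserResponse.NO_RESPONSE or activity.high_risk:
        # Failsafe: silence, or a "yes" on a high-risk action, still gets human review.
        return Outcome.ANALYST_REVIEW
    # Recognized and low risk: keep the record, but do not page anyone.
    return Outcome.CLOSE_AND_LOG
```

Keeping the decision in one pure function makes the failsafe branches easy to review and test independently of whatever messaging system delivers the question to the user.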
Building for the future while addressing the present
The rebuilding of the threat detection and incident response program was done while still running the existing operations. With this approach, we were able to carve out space among the team to work on more strategic initiatives.
Security, scalability, and reliability of infrastructure
As the team increased its visibility tenfold, the demand for data normalization and searching continued to grow. Our platform, tooling, data, and processes need reliability and scalability in the most critical times, like during an incident or an active attack. The team ensured a focus on resiliency in the face of setbacks such as software bugs or system failures. To get ahead of potential problems, we committed to planning for failure, early warning, and recovery states to ensure our critical detection and response systems were available when most needed. Security of the LinkedIn platform is at the forefront of everything the team does and is etched into our thought process as we build these systems.
As we started thinking about the broad problem space we needed to address, a few fundamental questions came up.
The first was, “What visibility would the threat detection and incident response team need to be effective?” This fundamental question helped us shape our thinking and requirements about how to rebuild the function. There are several ways to approach building out an operational security response and detection engineering team. Whether that’s an incident response firm on retainer, an in-house security operations center, a managed service provider, or a combination of these, there are plenty of approaches that work for many organizations. At LinkedIn, we wanted to work with what we already had, which was a great engineering team and culture, the ability to take some intelligent risks, and the support of our peers to help build and maintain the pipeline and platform we needed to be effective.
The next question that we asked was, “How do we provide coverage for the threats affecting our infrastructure and users?” There are many areas that require attention when building out a Software-defined SOC, and it can be difficult to know where to prioritize your efforts. The long-term goal was inspired by the original intent of the SOCless concept, which suggests that nearly all incident handling would be automated, with a few human checkpoints to ensure quality and accuracy, paired with in-depth investigations as necessary. Given our team's skills and development-focused culture, the benefit of reducing the need for human interaction in security monitoring operations was an attractive idea. With this, we needed to build and maintain development, logging, and alerting pipelines, decide what data sources to consume, and decide what detections to prioritize.
“What are the most important assets we need to protect?” was the third question we asked. Due to the scope of Moonbase, we had to decide how to deliver the biggest impact on security before we had designed the new processes or completed the deployment. This meant we focused on the most important assets first, commonly known as “Crown Jewels.” Starting with systems we knew and understood, we could more easily test our detection pipeline. Many of the early onboarded data sources and subsequent detections were from the Crown Jewels. Ultimately, this was a quick start for us, but did not yet offer the comprehensive visibility and detection capabilities we needed.
This leads us to the last question we asked, “How can we improve the lives of our incident responders with a small team?” Early detections in the new system, while eventually iterated or tuned, led to classic analyst fatigue. To ease the burden and improve the lives of the responders, we built a simple tuning request functionality, which enables quick feedback and the option to pause a lower-quality detection. This principle has enabled us to maintain a reasonable expectation of results from analysts while reducing the potential for additional fatigue, alert overload, and job dissatisfaction. Additionally, we have focused on decentralizing the early phases of the security monitoring process, which has led to significantly less toil and investigation required from analysts. When a class of alerts can be sent with additional context to potential victims or sources of the alert, with the proper failsafes in place, the analyst can focus on responding to the true threats. Another example is directly notifying a system owner or responsible team whenever there are unusual or high-risk activities such as a new administrator, new two-factor accounts, credential changes, etc. If the activity is normal, expected, or a scheduled administration activity that can be confirmed by the system owners, again with the proper failsafes, there is no need to directly notify the incident response team. Tickets generated from security alerts are still created and logged for analysis and quality assurance, but periodically reviewing a select number of tickets for accuracy is preferable to manually communicating through cases that likely have no impact or threat.
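One way to picture the tuning request idea is a rule that recommends pausing a detection once analyst feedback shows it is mostly noise. The sketch below is an assumption-laden illustration, not the team's tooling: the disposition labels, the function name `tuning_recommendation`, and both thresholds are hypothetical.

```python
from collections import Counter
from typing import Iterable


def tuning_recommendation(
    dispositions: Iterable[str],
    min_alerts: int = 20,        # hypothetical: do not judge a detection on too few alerts
    min_precision: float = 0.2,  # hypothetical: below this share of true positives, pause
) -> str:
    """Return "pause_for_tuning" or "keep" based on analyst dispositions.

    Each disposition is expected to be "true_positive" or "false_positive";
    any other label is ignored.
    """
    counts = Counter(dispositions)
    total = counts["true_positive"] + counts["false_positive"]
    if total < min_alerts:
        return "keep"
    precision = counts["true_positive"] / total
    return "pause_for_tuning" if precision < min_precision else "keep"
```

Requiring a minimum number of triaged alerts before recommending a pause avoids disabling a new detection on the strength of one or two early false positives.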
Phase 1: Foundation and visibility
The first phase of rebuilding our security platform was our end-to-end infrastructure and process development. We already had a small team and some data sources, including some security-specific telemetry (endpoint and network detection technologies), so we focused on finding the right platform first and then on building strong processes and infrastructure around it. Given the rapidly growing business, a lean team, and other constraints, a few base requirements were established for the foundational platform(s).
Low operational overhead
This enables the team to focus on the analysis tasks meant for humans and lets automation do the work of data gathering and transformation. Constantly working through the critical, yet laborious extract, transform, and load processes while trying to hunt for and respond to threats is a quick recipe for burnout.
Built-in scalability and reliability
Reliability is naturally a critical component in any security system. With a small team, however, the platform must also be low maintenance and easy to scale. Additionally, time invested should be primarily focused on building on top of the platform, rather than keeping the platform operational. This is one of the strongest arguments for a cloud solution, as it allows us to work with our internal IT partners and cloud service providers. Through this, we can ensure a reliable backbone for our pipelines and alerting, along with other programs that coordinate these processes to gather data and context.
Rapid time to value
Building a functioning operational team takes time and practice, so when choosing the technology toolset, we need to see results quickly to focus on process improvement and developing other important capabilities.
To minimize context switching and inefficiencies, we want the data, context, and all entity information to be available in our alerting and monitoring system. This naturally leads toward a log management system and security information and event management (SIEM) overlays.
By the end of this phase, we wanted to be able to onboard a handful of data sources to a collaborative platform where we could run searches, write detections, and automatically respond to incidents. This was the beginning of our methodical approach to deploying detections through a traditional CI/CD pipeline.
Capturing data at scale, and in a heterogeneous environment, is a huge challenge on its own. Not all sources can use the same pipelines, and developing and maintaining hundreds of data pipelines is too much administrative overhead for a small team. We started our search for a solution internally and decided on a hybrid solution. LinkedIn already has a robust data streaming and processing infrastructure in the form of Kafka and Samza, which ships trillions of messages (17 Trillion messages per day!) and processes hundreds of petabytes of data. Most of the LinkedIn production and supporting infrastructure is already capable of transporting data through these pipelines, which made them an attractive early target. However, LinkedIn has organically grown and between acquisitions, software-as-a-service applications, different cloud platforms, and other factors, there needed to be other supported and reliable modes of transport. After analysis, the team developed four strategic modes of data collection including the ultimate fallback of REST APIs provided by the SIEM.
A simplified diagram of data collection pipelines
Most of our infrastructure is already capable of writing to Kafka. With reliability and scalability in mind, Kafka was a perfect choice for a primary data collection medium.
Some infrastructure, such as firewalls, cannot write to Kafka natively. We operate a cluster of Syslog collectors for anything that supports Syslog export but not Kafka messages.
Serverless data pipelines
These are employed mostly for collecting logs from SaaS, PaaS, and other cloud platforms.
The data collector REST API is the collection mechanism natively supported by the SIEM for accepting logs and storing them against known schemas. This is currently the most commonly used transport mechanism, scaling to hundreds of terabytes.
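The four collection modes above can be thought of as an ordered preference list. The following is a minimal sketch of that idea; the `DataSource` fields, the `choose_mode` function, and the exact precedence are assumptions made for illustration (the post only states that Kafka is the primary medium and the SIEM's REST API is the ultimate fallback).

```python
from dataclasses import dataclass
from enum import Enum


class CollectionMode(Enum):
    KAFKA = "kafka"
    SYSLOG = "syslog_collector"
    SERVERLESS = "serverless_pipeline"
    REST_API = "siem_rest_api"


@dataclass(frozen=True)
class DataSource:
    name: str
    can_write_kafka: bool = False
    supports_syslog: bool = False
    is_cloud_service: bool = False  # SaaS, PaaS, or other cloud platform


def choose_mode(source: DataSource) -> CollectionMode:
    """Pick a transport for a data source, trying the preferred modes first."""
    if source.can_write_kafka:
        return CollectionMode.KAFKA
    if source.supports_syslog:
        return CollectionMode.SYSLOG
    if source.is_cloud_service:
        return CollectionMode.SERVERLESS
    # Nothing else fits: fall back to the REST API provided by the SIEM.
    return CollectionMode.REST_API
```

For example, `choose_mode(DataSource("edge-firewall", supports_syslog=True))` selects the Syslog collectors, while a source with no flags set lands on the REST API fallback.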
Security infrastructure as code
The team has built deep experience with security tooling and platforms over the years. As we rebuilt our foundational infrastructure, we knew we needed to apply a more comprehensive engineering-first approach. One aspect of this engineering-first approach to defense infrastructure was treating everything as code to maintain consistency, reliability, and quality. This led to the development of the Moonbase continuous integration and continuous deployment (CI/CD) pipeline. Through the CI/CD pipeline, the team manages all detections, data source definitions and parsers, automated playbooks and serverless functions, Jupyter notebooks used for investigations, and more.
Having every engineer work on the development and improvement of the detection playbooks, as well as having the applied rigor that comes from the typical CI/CD review and testing stages, gives the team a strong, templateable, and heavily quality-assured playbook for detecting and responding to threats. Simple mistakes or detections that could lead to unnecessary resource usage are easily prevented through the peer review process. Our tailored comprehensive CI validations for each resource type help us to programmatically detect any issues in these artifacts within the PR validation process and improve the deployment success rate significantly. Change tracking is also much easier as pull request IDs can be added to investigation tickets for reference or used in other parts of the tuning process.
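A CI validation step for a detection artifact might look something like the sketch below. It is not the Moonbase validator: the field names, the allowed severities, and the lookback limit are a hypothetical schema invented for this example. The only external fact relied on is the MITRE ATT&CK ID format (`T` plus four digits, with an optional three-digit sub-technique suffix).

```python
import re
from typing import Any, Dict, List

# Hypothetical schema for a detection definition.
REQUIRED_FIELDS = (
    "name", "severity", "query", "lookback_hours", "data_sources", "mitre_techniques",
)
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}
MAX_LOOKBACK_HOURS = 24 * 14  # hypothetical guard against very expensive queries
TECHNIQUE_ID = re.compile(r"^T\d{4}(\.\d{3})?$")  # e.g. T1078 or T1078.004


def validate_detection(detection: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the artifact passes."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not detection.get(field):
            errors.append(f"missing or empty field: {field}")

    severity = detection.get("severity")
    if severity and severity not in ALLOWED_SEVERITIES:
        errors.append(f"unknown severity: {severity}")

    lookback = detection.get("lookback_hours")
    if isinstance(lookback, int) and not 0 < lookback <= MAX_LOOKBACK_HOURS:
        errors.append(f"lookback_hours out of range: {lookback}")

    for technique in detection.get("mitre_techniques") or []:
        if not TECHNIQUE_ID.match(technique):
            errors.append(f"malformed ATT&CK technique ID: {technique}")
    return errors
```

A PR validation job could run a check like this over every changed artifact and fail the build when any list of errors is non-empty, so that problems surface during review rather than at deployment.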
The Moonbase CI/CD pipeline is serverless, built on top of Azure DevOps. Azure Repos is a source code management solution similar to GitHub that we use for all our code, with deployment done through Azure Pipelines. Azure Pipelines is a robust CI/CD platform that supports multi-stage deployments, integration with Azure CLI tasks, integration with Azure Repos, PR-triggered CI builds, etc. This also helps us deploy the same resource to multiple SIEM instances within different scopes only by updating deployment configuration settings. We leverage both Azure Repos and Azure Pipelines to build, validate, and deploy detections and all other deployables, following a trunk-based development model. Artifacts like queries are enriched in the pipeline before deployment. These enrichments help not only with detections but also with tracking threat detection coverage, metrics for incidents, etc.
While there are many features of a CI/CD pipeline like constant validation, decentralized review, mandatory documentation, and metrics, the templating aspect is one of the stronger points of this detection engineering approach. The pipeline allows any analyst on the team to quickly deploy a new detection or update an existing one. In this screenshot from VSCode you can see an example detection template looking for unexpected activity from inactive service principals (SPN).
The pane on the right shows the actual detection within its template. The entire detection isn’t pictured here to obscure some internal-only information, but this snippet shows what a detection engineer has configured for this specific detection in the orange-colored text. Other items, like how far back to search and the severity of the alert, can be specified in the template.
Additionally, we explicitly configured the data sources needed to execute the detection within the template.
To assist in understanding our coverage of threats, we map each detection to its MITRE ATT&CK ID and Sub ID. Not only does this help us track our coverage, but it also enables us to write additional queries or detections that collate any technique or class of attacks into a single report.
Finally, the query itself is listed (in this case we’re using KQL to search the data).
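To illustrate how the ATT&CK mapping enables coverage tracking, the sketch below groups detections by technique and rolls sub-techniques up to their parent. The detection dictionaries and their `name` and `mitre_techniques` keys are hypothetical; the actual template fields are not shown in this post.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def coverage_by_technique(detections: Iterable[Mapping]) -> Dict[str, List[str]]:
    """Map each parent ATT&CK technique ID to the names of detections covering it."""
    coverage = defaultdict(set)
    for detection in detections:
        for technique_id in detection["mitre_techniques"]:
            # "T1078.004" rolls up to its parent technique "T1078".
            parent = technique_id.split(".")[0]
            coverage[parent].add(detection["name"])
    return {tid: sorted(names) for tid, names in sorted(coverage.items())}
```

Comparing the keys of this report against the full list of techniques relevant to an environment gives a quick view of which techniques have no detection at all.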
Our internally developed CLI generates boilerplate templates for new artifacts and lets engineers validate their changes locally, which helps them maintain quality, deployment velocity, and productivity.
It is important to create space for innovation. Earlier, the team often found themselves deep in operational toil, going from one issue to another. A very deliberate effort to pause operational work, which yielded a low return on investment, really helped the team make space for innovation. Staying true to the guiding principles of automating what makes sense, the team uses automation heavily to remove as much of the mundane and repeatable work as possible. The following are some automation tools and platforms that the team currently uses. These platforms and the automation have enabled the team to unlock efficiency and quality across the functions.
Automated playbooks and workflow automation
The team leverages a no-code workflow automation platform that comes tightly integrated with the cloud SIEM. It is a highly scalable integration platform that allows building solutions quickly due to the many built-in connectors across the product ecosystem like Jira, ServiceNow, and several custom tools that the team depends on. Some use cases include alert and incident enrichment, which makes all context required to triage an alert available to the engineers, and running automated post-alert playbooks for prioritization and running automated containment and remediation jobs, and other business workflows. Another use case is automated quality control and post-incident processing, which allows us to learn lessons from previous incidents.
Several complex automation jobs are written as serverless functions. These are easy to deploy, scalable, and have a very low operational overhead. These functions are used for ingesting data from on-prem and online sources, along with more complex automation jobs like the containment of compromised identities.
Resilience and reliability are broad topics and thus not covered in this blog. However, data and platform reliability are absolutely critical. A change in underlying data can have major cascading effects on detection and response. Data availability during incidents is key, too. Outside of the core monitoring of the platforms, the team relied on three different avenues for signals of degraded performance:
Looking at things like message drop rate, latency, and volume allows the team to quickly identify any issues before they impact detections and/or response activities.
Sending messages to ensure the pipeline is functional and that messages get from source to destination within expected timeframes.
Ideal behavior indicators (operational alerts)
Data sources are dynamic in nature. When onboarding a data source, a detection is developed to monitor its health. For example, an alert is sent when the number of unique source hosts decreases by more than a threshold percentage.
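Two of the signals above, the drop in unique source hosts and the synthetic message that must arrive within an expected timeframe, reduce to simple comparisons. The functions below are a hedged sketch: the names, the 30% threshold, and the five-minute delay are illustrative assumptions, not values from our platform.

```python
from datetime import datetime, timedelta
from typing import Optional


def host_count_dropped(
    baseline_hosts: int, current_hosts: int, threshold_pct: float = 30.0
) -> bool:
    """True when unique source hosts fell by more than threshold_pct versus the baseline."""
    if baseline_hosts <= 0:
        return False  # no baseline yet, nothing to compare against
    drop_pct = (baseline_hosts - current_hosts) * 100 / baseline_hosts
    return drop_pct > threshold_pct


def synthetic_message_late(
    sent_at: datetime,
    received_at: Optional[datetime],
    max_delay: timedelta = timedelta(minutes=5),
) -> bool:
    """True when a test message never arrived or took longer than max_delay end to end."""
    if received_at is None:
        return True
    return received_at - sent_at > max_delay
```

In practice the baseline would come from a trailing window (for example, the same hour on previous days), so that normal daily cycles in host counts do not trigger the alert.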
These are only some of the health checks and alerts we developed to ensure that our systems were logging and reporting properly. We try to graph availability as well as detection efficacy data to assist us in constantly re-evaluating our detection posture and accuracy.
What did we get for all this?
Threat detection and incident response is not a purely operational problem, and using an engineering lens paid off well for our team. If this is not something you're doing, we hope this post drives you toward that realization. Leaving the processes to organic growth is unsustainable and violates our guiding principles, which are designed to prevent burnout for the team and ensure a quality response. In the end, we achieved significant success and improvements. We were able to expand our data collection infrastructure by 10x, going from gigabytes to petabytes; our average time to triage incidents went from hours to minutes; we were able to maintain 99.99% uptime for the platform and connected tooling; and correlation was now possible, significantly reducing alert fatigue and improving precision. Additionally, automated response playbooks allowed us to reduce toil for response and automatically handle simple tasks like enrichment, case creation and updates, or additional evidence gathering. We were also able to quickly integrate threat intelligence into our enrichment pipeline, providing a much richer context for investigations.
More work to do
What we’ve covered in this blog represents the work of a handful of engineers, product and project managers, analysts, and developers over a relatively short period of time. We learned many valuable lessons over time and have since developed new ways of thinking about detection, automated response, and scaling. Whatever platform a team decides on using for security monitoring, it cannot be overstated how important a solid design and architecture will be to its eventual success.
In a future post, we’ll cover more details on how incident response ties in with our platforms and pipelines and a re-architected approach to detection engineering, including the lessons we learned.
This work would not have been possible without the talented team of people supporting our success.
Tech lead and architect: Sagar Shah
Technical program manager: Jacquie Bradley
Core contributors: Vishal Mujumdar, Amir Jalali, Alex Harding, Tom Leahy, Tanvi Kolte, Arthur Kao, Swathi Chandrasekar, Erik Miyake, Prateek Jain, Lalith Machiraju, Gaurav Gupta, Brian Pierini, Sergei Rousakov, Jeff Bollinger and several other partners within the organization.