Real-time analytics on network flow data with Apache Pinot

September 13, 2022

The LinkedIn infrastructure has thousands of services serving millions of queries per second. At this scale, having tools that provide observability into the LinkedIn infrastructure is imperative to ensure that issues in our infrastructure are quickly detected, diagnosed, and remediated. This level of visibility helps prevent the occurrence of outages so we can deliver the best experience for our members. To provide observability, there are various data points that need to be collected, such as metrics, events, logs, and flows. Once collected, the data points can then be processed and made available, in real-time, for engineers to use for alerting, troubleshooting, capacity planning, and other operations.

At LinkedIn, we developed InFlow to provide observability into network flows. A network flow describes the movement of a packet through a network and is the metadata of a packet sampled at a network device that describes the packet in terms of the 5-tuple: source IP, source port, destination IP, destination port, and protocol. It may also contain source and destination autonomous system numbers (ASNs), the IP address of the network device that has captured this flow, input and output interface indices of the network device where the traffic was sampled, and the number of bytes transferred.

Network devices can be configured to export this information to an external collector using various protocols. InFlow understands the industry standard sFlow and IPFIX protocols for collecting flows.

How LinkedIn leverages flow data

InFlow provides a rich set of time-series network data having over 50 dimensions such as source and destination sites, security zones, ASNs, IP address type, and protocol. With this data, various types of analytical queries can be run to get meaningful insights about network health and characteristics.

Figure 1. A screenshot from InFlow UI’s Top Services tab which shows the 5 services consuming the most network bandwidth and the variation of this traffic over the last 2 hours

Most commonly, InFlow is used for operational troubleshooting to get complete visibility into the traffic. For example, if there is an outage due to a network link capacity exhaustion, InFlow can be used to find out the top talkers for that link based on hosts/services that are consuming the most bandwidth (Figure 1) and based on the nature of the service, further steps can be taken to remediate the issue.

Flow data also provides source and destination ASN information, which can be used for optimizing cost, based on bandwidth consumption of different kinds of peering with external networks. It can also be used for analyzing data based on several dimensions for network operations. For example, finding the distribution of traffic by IPv4 or IPv6 flows or the distribution of traffic based on Type of Service (ToS) bits.

InFlow architecture overview

Figure 2. InFlow architecture

Figure 2 shows the overall InFlow architecture. The platform is divided into 3 main components: flow collector, flow enricher, and InFlow API with Pinot as a storage system. Each component has been modeled as an independent microservice to provide the following benefits:

It enforces the single responsibility principle and prevents the system from becoming a monolith.
Each of the components have different requirements in terms of scaling. Separate microservices ensure that each can be scaled independently.
This architecture creates loosely coupled pluggable services which can be reused for other scenarios.

Flow collection

InFlow receives 50k flows per second from over 100 different network devices on the LinkedIn backbone and edge devices. InFlow supports sFlow and IPFIX as protocols for collecting flows from network devices. This is based on the device’s vendor support for the protocols and minimal impact of flow export on the device’s performance. The InFlow collector receives and parses these incoming flows, aggregates the data into unique flows for a minute, and pushes them to a Kafka topic for raw flows.

Flow enrichment

The data processing pipeline for InFlow leverages Apache Kafka and Apache Samza for stream processing of incoming flow events. Our streaming pipeline processes 50k messages per second, enriching the data with 40 additional fields (like service, source and destination sites, security zones, ASNs, and IP address type), which are fetched from various internal services at LinkedIn. For example, our data center infrastructure management system, InOps, provides information on the site, security zone, security domain of the source, and destination IPs for a flow. The incoming raw flow messages are consumed by a stream processing job on Samza and after adding the additional enriched fields, the result is pushed to an enriched Kafka topic.

Data storage

InFlow requires storage of tens of TBs of data with a retention of 30 days. To support its real-time troubleshooting use case, the data must be queryable in real-time with sub-second latency so that engineers can query the data without any hassles during outages. For the storage layer, InFlow leverages Apache Pinot.

InFlow UI

Figure 3. A screenshot from InFlow UI’s Explore tab which provides a self-service interface for users to visualize flow data by grouping and filtering on different dimensions

The InFlow UI is a dashboard with some of the commonly used visualizations on flow data pre-populated that provides a rich interface where the data can be filtered or grouped by any of the 40 different dimension fields. The UI also has an Explore section, which allows for creation of ad-hoc queries. The UI is based on top of InFlow API, which is a middleware responsible for translating user input into Pinot queries and issuing them to the Pinot cluster.

Pinot as a storage layer

In the first version of InFlow, data was ingested from the enriched Kafka topic to HDFS. We leveraged Trin o for facilitating user queries on the data present in HDFS. However, the ETL and aggregation pipeline added a 15-20 minute delay to the pipeline, reducing the freshness of the data. Additionally, query latencies to HDFS using Presto were in the order of 15-30 seconds. This latency and delay was acceptable for doing historical data analytics, however, for real-time troubleshooting, the data needs to be available in real-time with a maximum delay of 1 minute.

Based on the query latency and data freshness requirements, we explored several storage solutions available at LinkedIn (like Espresso, Kusto, and Pinot) and decided on onboarding our data to Apache Pinot. When looking for solutions, we needed a reliable system providing real-time ingestion and sub-second query latencies. Pinot’s support for Lambda and Lamda-less architecture, real-time ingestion, and low latency at high throughput could help us achieve optimal results. Additionally, the Pinot team at LinkedIn is experimenting with supporting a new use case called Real-time Operational Metrics Analysis (ROMA), which enables engineers to slice and dice metrics along different combinations of dimensions to help monitor infrastructure near real-time, analyze the last few weeks/months/years of data to discover trends and patterns to forecast and plan capacity, and helps to find the root cause of outages quickly and reduce the time to recovery. These objectives aligned well with our problem statement of processing large numbers of metrics in real-time.

The Pinot ingestion pipeline consumes directly from the enriched Kafka topic and creates the segments on the Pinot servers, which improves the freshness of the data in the system to less than a minute. User requests from InFlow UI are converted to Pinot SQL queries and sent to the Pinot broker for processing. Since Pinot servers keep data and indices in cache-friendly data structures, the query latencies are a huge improvement from the previous version where data was queried from disk (HDFS).

Several optimizations were done to reach this query latency and ingestion parameters. Because the data volume for the input Kafka topic is high, several considerations were made to decide the optimal number of partitions in the topic to allow for parallel consumption into segments in Pinot after several experiments with the ingestion parameters. Most of our queries involved a regexp_like condition on the devicehostname column, which is the name of the network device that has exported the flow. This is used to narrow down on a specific plane of the network. regexp_like is inefficient as it cannot leverage any index so to resolve this, we set up an ingestion transformation using Pinot. These are various transformation functions that can be applied to your data before it is ingested into Pinot. The transformation created a derived column flowType, which classifies a flow based on the name of the network device that has exported this flow into a specific plane of the network. For example, if the exporting device is at the edge of our network, then this flow can be classified as an Internet-facing flow. The flowType column is now an indexed column used for equality comparisons instead of regexp_like and this helped improve query latency by 50%.

Queries from InFlow always request for data from a specific range in time. To improve query performance, timestamp based pruning was enabled on Pinot. This improved query latencies since only relevant segments are filtered in for processing based on the filter conditions on the timestamp column in queries. Based on the Pinot team’s input, indexes on the different dimension columns were set up to aid query performance.

Conclusion

Figure 4. Latency metric for InFlow API query for top flows in the last 12 hours before and after onboarding to Pinot

Following the successful onboarding of flow data to a real-time table on Pinot, freshness of data improved from 15 mins to 1 minute and query latencies were reduced by as much as 95%. For some of the more expensive queries, which took as much as 6 minutes using Presto queries, the query latency reduced to 4 seconds using Pinot.This has been helpful in making it easier for the network engineers at LinkedIn to easily get the data they need for troubleshooting or running real-time analytics on network flow data.

What’s next

The current network flow data only provides us with sampled flows from the LinkedIn backbone and edge network. Skyfall is an eBPF-based agent, developed at LinkedIn, that collects flow data and network metrics from the host’s kernel with minimal overhead. The agent captures all flows for the host without sampling and will be deployed across all servers in the LinkedIn fleet. This would provide us with a 100% coverage of flows across our data centers and enable us to support more use cases on flow data that require unsampled information such as security audit and validation based use cases. Because the agent collects more data and from more devices, the scale of data collected by Skyfall is expected to be 100 times that of InFlow. We are looking forward to leveraging the InFlow architecture to support this scale and provide real-time analytics on top of the rich set of metrics exported by the Skyfall agent. Another upcoming feature that we are excited about is leveraging InFlow data for anomaly detection and more traffic analytics.

Acknowledgements

Onboarding our data to Pinot was a collaborative effort and we would like to express our gratitude to Subbu Subramaniam, Sajjad Moradi, Florence Zhang, and the Pinot team at LinkedIn for their patience and efforts in understanding our requirements and working on the optimizations required for getting us to the optimal performance.

Thanks to Prashanth Kumar for the continuous dialogue in helping us understand the network engineering perspective on flow data. Thanks to Varoun P and Vishwa Mohan for their leadership and continued support.

Topics: Open Source Data Streaming/Processing