InFlow - Making the LinkedIn network visible
March 22, 2016
To maintain the high network availability needed to serve all LinkedIn applications, we need to monitor and analyze both network infrastructure and network usage patterns. At LinkedIn, we traditionally use SNMP to monitor traffic passing through the different layers of the network. The monitored data is visualized using inGraphs, and anomalies are sent as alerts to network engineers through our in-house alerting system, AutoAlerts. Other protocols, such as NETCONF and vendor-specific APIs, are also used to monitor the network.
However, most of these monitoring systems answer the question “How many bytes of data are transferred across the network?” but do not answer:
- What kind of data is getting transferred?
- Who (which service) is transferring the data?
In the past, this lack of data limited our ability to troubleshoot link-hogging issues, perform capacity planning, understand service usage patterns, detect anomalies in the network, and do traffic flow analysis. Having only approximate data about how our network infrastructure was used made it hard to utilize it effectively and scale it to meet ever-growing needs. To solve this problem, we needed a solution that could give us more insight into network traffic.
A brief explanation about flow information
NetFlow/IPFIX and sFlow are specifications implemented by device vendors to export flow information periodically. Flow information is a sample of the traffic passing through network gear, and it contains detailed information about the type of traffic being transferred at the network layer. A subset of the information that a device can export in each flow includes:
- Source and Destination (IP)
- Source and Destination (Port)
- Source and Destination (ASN)
- IP of the network gear that is exporting flow
- Input and Output interface index of the network gear where the traffic is being monitored
- Number of bytes transferred
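To make the list above concrete, a single flow can be pictured as one record holding these fields. The sketch below is only illustrative: the class name `FlowRecord`, the field names, and the sample values are assumptions chosen for readability, and they do not reflect the NetFlow/IPFIX or sFlow wire formats or InFlow's internal schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One sampled flow. Field names are hypothetical, not a vendor schema."""
    src_ip: str           # source IP address
    dst_ip: str           # destination IP address
    src_port: int         # source port
    dst_port: int         # destination port
    src_asn: int          # source autonomous system number
    dst_asn: int          # destination autonomous system number
    exporter_ip: str      # IP of the network gear exporting the flow
    input_if_index: int   # interface index where the traffic entered
    output_if_index: int  # interface index where the traffic left
    byte_count: int       # number of bytes reported in this sample


# Made-up example values, using private address and ASN ranges.
example = FlowRecord(
    src_ip="10.0.0.5", dst_ip="10.0.1.9",
    src_port=44321, dst_port=443,
    src_asn=64512, dst_asn=64513,
    exporter_ip="10.0.255.1",
    input_if_index=3, output_if_index=7,
    byte_count=1500,
)
```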
The InFlow application was built at LinkedIn to answer the who, what, when, where, and how of network traffic by processing the flow information exported from devices. InFlow can integrate with internal and third-party applications to enrich traffic information (“enriching” is the process of mapping an IP address to different possible values/dimensions). InFlow has a comprehensive and flexible reporting mechanism that helps network owners understand:
- Where traffic on the network is coming from and going to
- Which interfaces and devices are transferring more bytes of data
- Which peering links are effectively used
- Top talkers of applications on the network
- Traffic trends on the network over a period of time
- Which source and destination hosts/ports contribute to the traffic numbers
Network traffic data is processed and stored in the Hadoop environment. To make intelligent data-driven decisions, we need a simple and intuitive view of the data that can be presented to the user, and the InFlow user interface provides exactly that view of the processed data. Users can drill down to see hourly trends or the aggregate raw data. Aggregate raw data shows the raw samples collected from network gear.
We had to address two major challenges before getting any useful information from the collected flows:
- Data Quantity: the amount of sample data to process was huge and difficult to work with. While the volume varied with traffic, on average we observed one million flows per minute.
- Data Quality:
  - Flows were samples and not actuals. Flow traffic did not match the actual traffic that was graphed using SNMP.
  - Flows carry only source and destination IP information. Consuming them “as is” is challenging for engineers. It’s easier for users to view and consume the data when it is aggregated by service rather than by individual network gear nodes.
At LinkedIn, data-driven decisions are given the utmost importance. There is a plethora of tools, applications, and platforms that can process big data and produce useful information, such as Kafka, Gobblin, Cubert, Samza, and Pinot. Another LinkedIn core value is Leverage, and InFlow leverages the LinkedIn data analytics ecosystem to process the data collected from network gear.
Architecture of InFlow
As shown in the image, flows are collected from network devices (the blue discs) and enriched. Enriching happens within the Collectors, where flows are collected. This enables different consumers to use the enriched data without knowing the intricacies of enrichment. Enriched network traffic data is sent to Kafka as it arrives. Gobblin then transfers the data from Kafka to Hadoop. In Hadoop, the ETL jobs kick in and process the data for the last elapsed hour. The ETL is instrumental in extrapolating the raw traffic information and loading it into the data store. The data store keeps both aggregated and raw data. Storing aggregated data in the form needed by the user interface drastically decreases application launch time, while storing the raw data helps in identifying the end system that is responsible for generating traffic. The InFlow user interface consumes the data in the data store and renders it for the user.
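The hourly aggregation step in this pipeline can be illustrated with a small in-memory sketch. This is not the actual ETL job, which runs in Hadoop. The function name `aggregate_hourly`, the dictionary keys (`timestamp`, `src_service`, `bytes`), and the assumption that each flow already carries a service label from enrichment are all hypothetical.

```python
from collections import defaultdict


def aggregate_hourly(flows):
    """Sum bytes per (hour, source service).

    Each flow is a dict with hypothetical keys:
      timestamp   - epoch seconds when the flow was observed
      src_service - service label assumed to be added during enrichment
      bytes       - byte count for the flow
    """
    totals = defaultdict(int)
    for flow in flows:
        # Truncate the timestamp to the start of its hour.
        hour_start = flow["timestamp"] - flow["timestamp"] % 3600
        totals[(hour_start, flow["src_service"])] += flow["bytes"]
    return dict(totals)


sample_flows = [
    {"timestamp": 3600, "src_service": "service-a", "bytes": 10},
    {"timestamp": 3700, "src_service": "service-a", "bytes": 20},
    {"timestamp": 7200, "src_service": "service-a", "bytes": 5},
]
hourly = aggregate_hourly(sample_flows)
```

Pre-computing totals like these is what lets a user interface load quickly, while the raw rows stay available for drill-down.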
Flows have only source and destination IP information. As mentioned earlier, consuming them as such is difficult for engineers. This led to the need to group them in a consumable format. Each problem and each group of engineers needs different dimensions of the same data. For example, Site Reliability Engineers are interested in traffic generated by each service, whereas Network Engineers are interested in traffic on the peering network. The first step to visualize data in different dimensions is to enrich them. InFlow integrates with different internal/external applications using their APIs to map flow information to different values. The InFlow UI provides visualization to view each dimension.
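As a rough illustration of enrichment, the sketch below maps the source and destination IP of a flow to a service name using a static prefix table. In practice InFlow obtains these mappings from other applications through their APIs. The table `SERVICE_PREFIXES`, the service names, and the functions `service_for_ip` and `enrich` are hypothetical stand-ins for that lookup.

```python
import ipaddress

# Hypothetical prefix-to-service table standing in for an API lookup.
SERVICE_PREFIXES = {
    "10.0.0.0/24": "profile-service",
    "10.0.1.0/24": "search-service",
}

# Parse the prefixes once so each lookup is a simple membership test.
_NETWORKS = [
    (ipaddress.ip_network(prefix), service)
    for prefix, service in SERVICE_PREFIXES.items()
]


def service_for_ip(ip):
    """Return the service owning this IP, or 'unknown' if no prefix matches."""
    address = ipaddress.ip_address(ip)
    for network, service in _NETWORKS:
        if address in network:
            return service
    return "unknown"


def enrich(flow):
    """Return a copy of the flow with service dimensions added."""
    enriched = dict(flow)
    enriched["src_service"] = service_for_ip(flow["src_ip"])
    enriched["dst_service"] = service_for_ip(flow["dst_ip"])
    return enriched


enriched_flow = enrich({"src_ip": "10.0.0.5", "dst_ip": "192.0.2.1", "bytes": 100})
```

Other dimensions, such as peering link or data center, could be added the same way by attaching more lookup results to the flow.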
Flows exported by devices are samples. Sampled data gives an indication of the traffic, but not the exact amount of traffic. So with sampled data alone, it’s not possible to provide, for instance, the “amount of traffic generated by a particular service.” To answer this, the data is extrapolated. Extrapolation is the process of estimating the actual amount of traffic transferred between any source and destination from the sample data. The extrapolation algorithm works in the following sequence:
- Calculate the actual traffic on a router interface
- Calculate the traffic on a router interface that is captured using samples
- Calculate the scaling factor for a router interface using the expression: Scaling factor = (Actual Traffic) / (Traffic captured using samples)
- Multiply the scaling factor by the captured sampled bytes to extrapolate the actual bytes transferred
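The four steps above can be sketched in a few lines. This is a minimal illustration, not InFlow's ETL code. The function name `extrapolate`, the dictionary keys, and the interface identifier format are hypothetical, and the sketch assumes the actual per-interface byte counts are already available from a separate source such as interface counters.

```python
from collections import defaultdict


def extrapolate(flows, actual_bytes_by_interface):
    """Scale sampled bytes so per-interface totals match the actual traffic.

    flows: list of dicts with hypothetical keys 'interface' and 'sampled_bytes'.
    actual_bytes_by_interface: actual bytes per interface over the same
    period (step 1), assumed to come from a separate measurement.
    """
    # Step 2: total traffic captured by samples on each interface.
    sampled_totals = defaultdict(int)
    for flow in flows:
        sampled_totals[flow["interface"]] += flow["sampled_bytes"]

    result = []
    for flow in flows:
        interface = flow["interface"]
        sampled = sampled_totals[interface]
        # Step 3: scaling factor = actual traffic / traffic captured by samples.
        # Guard against an interface whose samples sum to zero bytes.
        factor = actual_bytes_by_interface[interface] / sampled if sampled else 0.0
        # Step 4: extrapolate this flow's bytes.
        result.append({**flow, "estimated_bytes": flow["sampled_bytes"] * factor})
    return result


sample_flows = [
    {"interface": "router1:3", "sampled_bytes": 100},
    {"interface": "router1:3", "sampled_bytes": 300},
]
estimates = extrapolate(sample_flows, {"router1:3": 4000})
```

In this made-up example the samples account for 400 bytes against 4000 actual bytes, so the scaling factor is 10 and the two flows are estimated at 1000 and 3000 bytes.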
Network engineers are able to pinpoint those services which are pushing huge volumes of traffic. This has reduced the turnaround time to resolution for network engineers when links are hogged. The ability to enrich flows in different dimensions has given a new perspective to traffic data.
Currently, the processing of network data in Hadoop is not real time, primarily due to how ETL jobs are set up. This presents a drawback when live issues are analyzed from a network standpoint. Our next goal is to process the flows in real time and thereby provide real-time analytics on the network traffic data. Other features envisaged to solve customer needs might include:
- Correlating flow information with BGP information to understand traffic path
- Enriching traffic with information that helps service owners
- Detecting DDoS attacks
InFlow is the result of effort from different teams and people. Special thanks to Vikas Kumar, Avinash Prasad, Pradeep Hodigere, Arun Manohar, Prashanth Kumar, Chintan Shah and the Data Services team @ LinkedIn.