inMesh: Real-Time Monitoring of Remote Sites

Alagar A

Senior Engineering Manager at LinkedIn

April 20, 2016

Many IT organizations support offices distributed across the world. As the number of remote sites increases, it becomes harder for operations teams to understand network and application behavior and performance from each site. By “site,” we mean remote offices, workplaces, and the internal data centers where applications are hosted.

At LinkedIn, with our corporate applications distributed across our internal data centers and the cloud, it is critical for the IT team to understand the performance of applications from all sites across the world. This led us to choose a customized solution to proactively monitor network and application performance from each remote site.

Although we have external solutions like ThousandEyes and Catchpoint, one of our requirements was the ability to perform synthetic transaction checks on internal applications that are not exposed to the internet.

Additionally, our solution had to ensure that any detected anomaly would trigger a notification to the respective network and application owners before users at the site reported the issue.

inMesh is a system developed at LinkedIn that finds anomalies in the remote sites and corporate data centers used by LinkedIn. Beyond detecting anomalies, inMesh also collects performance and quality metrics between our offices and data centers for easy visualization.

The tool collects and visualizes the following metrics for each site in real time:

Accessibility of intranet and internet sites
Packet loss, latency, and jitter between all sites and data centers, with QoS queues
Intranet and internet link failover check
Accessibility of cloud services that LinkedIn uses
Accessibility of critical corporate applications
Speed / throughput of network (intranet/internet)
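As a rough illustration of the link-quality metrics in this list, the sketch below derives packet loss, latency, and jitter from one round of probe round-trip times. The function name `summarize_probes` and the jitter definition (mean absolute difference between consecutive replies) are assumptions made for this example, not a description of how inMesh computes them.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one round of probes; None marks a probe that got no reply."""
    if not rtts_ms:
        raise ValueError("at least one probe is required")
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    latency_ms = mean(replies) if replies else None
    # Assumed jitter definition: mean absolute difference between
    # consecutive replies. Other definitions exist.
    deltas = [abs(b - a) for a, b in zip(replies, replies[1:])]
    jitter_ms = mean(deltas) if deltas else None
    return {"loss_pct": loss_pct, "latency_ms": latency_ms, "jitter_ms": jitter_ms}


if __name__ == "__main__":
    # Four probes, one of which was lost.
    print(summarize_probes([10.0, 12.0, None, 11.0]))
```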

These checks, as well as the general collection of metrics, happen at sub-minute granularity. With inMesh, the corporate network team is alerted to any network-related issues, while application issues are reported to the respective service owners.

The inMesh node is a server that runs a daemon called the remote executor, which listens on a specified port. Each inMesh node is deployed with the applications that are executed to find anomalies and collect data. One inMesh node is deployed for each monitored site.

A cluster of monitoring servers, located in a data center, is responsible for interacting with all inMesh nodes.

Network engineers and application owners want a site to be monitored from just before it starts functioning. There are a few challenges associated with this.

The monitoring server cluster has to be horizontally scalable, as it should be able to handle an arbitrary increase in the number of sites and inMesh nodes.
Deployment of new inMesh nodes must be fast.
inMesh nodes are monitored all the time. Since each inMesh node represents a site, it must be very responsive.
In order for inMesh nodes to be responsive, anomaly checks and metrics collection must happen asynchronously without blocking calls.
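The non-blocking requirement in the last item can be sketched as follows. inMesh itself uses the Twisted library; this example swaps in Python's standard-library `asyncio` only so that it runs without extra dependencies. The names `run_check` and `run_all`, the timeout value, and the stand-in commands are all hypothetical.

```python
import asyncio
import sys


async def run_check(name, argv, timeout_s=10.0):
    """Run one command without blocking the event loop.

    Returns (name, exit code, combined output); exit code is None on timeout.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return name, None, ""
    return name, proc.returncode, out.decode(errors="replace")


async def run_all(checks):
    """Run every check concurrently; results keep the order of `checks`."""
    return await asyncio.gather(*(run_check(n, argv) for n, argv in checks.items()))


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere; a real node would
    # run ping, traceroute, curl, or custom applications instead.
    demo = {
        "echo-a": [sys.executable, "-c", "print('a')"],
        "echo-b": [sys.executable, "-c", "print('b')"],
    }
    for name, code, output in asyncio.run(run_all(demo)):
        print(name, code, output.strip())
```

Because each command runs as a subprocess awaited by the event loop, a slow traceroute does not delay a fast ping, and the timeout keeps one hung command from stalling the node.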

The monitoring servers send commands such as ping, traceroute, curl, and custom application calls that perform synthetic transactions to the inMesh nodes, over the port on which the remote executor listens. The inMesh nodes execute these calls asynchronously using the Twisted library and transmit the output of each command back to the monitoring servers. This output is processed on the server to extract the required checks and metrics (for example, packet loss is extracted from the output of the ping command). The processed output is then sent to InGraphs, LinkedIn’s in-house time-series graphing solution, and anomalies are captured in LinkedIn’s alerting solution. Since we are dealing with time-series metrics, we have leveraged InGraphs so that inMesh does not have to store data locally. While collecting data, the monitoring servers also report the health status of the inMesh nodes frequently, and an alert is raised for any failure. All of these checks run frequently, so anomalies are reported within minutes.
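To make the packet-loss example concrete, here is a minimal sketch of extracting metrics from the summary lines that ping prints. It assumes the Linux iputils output format shown in `SAMPLE`; other platforms word these lines differently. The function name `parse_ping` and the metric keys are hypothetical.

```python
import re

# Patterns for the two summary lines printed at the end of a ping run.
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")

# Assumed sample in the Linux iputils format.
SAMPLE = """\
4 packets transmitted, 3 received, 25% packet loss, time 3004ms
rtt min/avg/max/mdev = 10.100/11.000/12.200/0.800 ms
"""


def parse_ping(output: str) -> dict:
    """Extract packet loss and, when present, round-trip times from ping output."""
    loss = LOSS_RE.search(output)
    if loss is None:
        raise ValueError("no packet-loss summary found in ping output")
    metrics = {"loss_pct": float(loss.group(1))}
    rtt = RTT_RE.search(output)
    if rtt is not None:  # the rtt line is absent when every packet was lost
        metrics["rtt_min_ms"] = float(rtt.group(1))
        metrics["rtt_avg_ms"] = float(rtt.group(2))
        metrics["rtt_max_ms"] = float(rtt.group(3))
    return metrics


if __name__ == "__main__":
    print(parse_ping(SAMPLE))
```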

With inMesh in place, remote sites are proactively monitored 24x7. Network engineers and application owners also learn about the issues users face at a site sooner than staff at the site report them.

We are planning to incorporate many new features in this system. Here are a few:

When anomalies are reported to network engineers and application owners, they would like to see more of the information captured through inMesh for troubleshooting, such as the entire output of ping, traceroute, and custom applications, along with any other information they need.

A database would be used to capture this information. When an anomaly is reported, the output from all applications (ping, traceroute, custom applications) is captured and stored in a database, mapped to time and site. This continues for as long as the anomaly is active for a site.

Since it is mapped to time and site, it would be easy to retrieve the information and show it visually.
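One way such a store could look is sketched below with SQLite from the Python standard library. This is purely illustrative: the post does not name a database, and the table name `anomaly_output`, its columns, and the helper functions are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS anomaly_output (
    site        TEXT NOT NULL,
    captured_at INTEGER NOT NULL,  -- Unix time in seconds
    command     TEXT NOT NULL,
    output      TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_site_time ON anomaly_output (site, captured_at);
"""


def open_store(path=":memory:"):
    """Open (or create) the store and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record(conn, site, captured_at, command, output):
    """Store the raw output of one command for one site at one point in time."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO anomaly_output VALUES (?, ?, ?, ?)",
            (site, captured_at, command, output),
        )


def fetch(conn, site, start, end):
    """Return (captured_at, command, output) rows for a site in a time window."""
    rows = conn.execute(
        "SELECT captured_at, command, output FROM anomaly_output "
        "WHERE site = ? AND captured_at BETWEEN ? AND ? ORDER BY captured_at",
        (site, start, end),
    )
    return rows.fetchall()
```

Indexing on site and time keeps the lookup for a single anomaly window cheap, which is what a per-site troubleshooting view would query.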

[Figure: wireframe for the visualization]

Currently, when the network goes down for a site, application owners are also notified unnecessarily, because their applications are not accessible from that site either. In the future, inMesh will be enhanced to correlate all anomalies and send “smart” notifications only to the owner of the problematic component.
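A very simple form of this correlation is sketched below: if a site has a network anomaly, application anomalies from the same site are suppressed so that only the network owner is notified. The `Anomaly` fields, the "network" and "application" labels, and the suppression rule are assumptions for illustration; the planned inMesh logic is not described in detail here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    site: str
    kind: str       # "network" or "application" (hypothetical labels)
    component: str
    owner: str


def smart_notifications(anomalies):
    """Drop application anomalies for sites that also have a network anomaly."""
    network_down = {a.site for a in anomalies if a.kind == "network"}
    return [
        a for a in anomalies
        if a.kind == "network" or a.site not in network_down
    ]


if __name__ == "__main__":
    reported = [
        Anomaly("site-a", "network", "wan-link", "network-team"),
        Anomaly("site-a", "application", "wiki", "wiki-owner"),
        Anomaly("site-b", "application", "wiki", "wiki-owner"),
    ]
    for a in smart_notifications(reported):
        print(a.owner, a.site, a.component)
```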

Many people have helped during the design, implementation, deployment, and operation of inMesh. Thanks to Curtis Salinas, Pradeep Hodigere, and Veerabahu Subramanian for helping with the design, and to Vinod Reddy for helping make it more useful for the network team. Special thanks to all the LinkedIn IT support folks who have helped set up inMesh nodes at each site.
