Autometrics: Self-service metrics collection
When I started at LinkedIn a little over a year ago, the company had just begun its move toward the metrics collection and visualization overhaul that is now reaching its peak. I've been fortunate to be part of that process as the primary SRE charged with making metrics easy to collect and consistently implemented.
In the Beginning
Back when I started, our monitoring system had the standard problems:
- Clunky interface, with lots of clicking to get to good data.
- Adding new people was time consuming, and the system could not keep up with the influx of new engineers.
- The poll-based system required knowing every server and the services it ran, but there was no way to read that mapping from the source of truth.
- A low-performance poller forced 5-minute intervals between collections.
- Too many touch points to add even a simple metric. Metrics were added by humans and so varied in spacing, naming, and sometimes spelling.
This meant a lot of waiting around for developers and SREs alike who wanted new metrics put in place. It also left us with an inconsistent set of metrics that were hard to work with. Needless to say, this would not scale, and it was only going to get harder.
Some early attempts were made to fix this by instrumenting a blanket set of standard metrics across the input and output interfaces of our services. These collections of metrics live in what we internally call "Sensors." To fetch the metrics we built a custom poller that read the service-to-server mapping from an internal datastore. This poller had a lot of issues: it could collect about once every two minutes, but it was fragile and depended on MX4J responding in a reasonable fashion, which it didn't always do. Still, it gave us the ability to work with these new metrics and think about how we really wanted to do it.
Fast forward to April of 2011. There were a number of end-of-year projects that had kept everyone busy. With those out of the way, it was time to get back to metrics and do things the right way. We had more than a handful of brainstorming sessions and decided we needed these things:
- A push-based system that always sent the metrics it had, whether or not they were registered in a data store
- Collect the standard cross-service metrics automatically from every service
- Allow a self-service mechanism for tagging custom metrics for automatic collection
- Programmatically name the metrics so we can build tooling based on the format
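To make the naming idea concrete, here is a hypothetical sketch. The exact format we settled on isn't described in this post, so the fields and their order below are assumptions; the point is that a fixed, parseable format lets tooling act on names programmatically.

```python
# Hypothetical naming scheme -- the fields and their order are
# assumptions. What matters is that the format is fixed and
# machine-parseable, so tools can key off any component of a name.
FIELDS = ("datacenter", "service", "host", "sensor", "metric")

def build_name(datacenter, service, host, sensor, metric):
    """Compose a metric name from its identifying fields."""
    return ".".join((datacenter, service, host, sensor, metric))

def parse_name(name):
    """Split a name back into its fields for automated tooling."""
    return dict(zip(FIELDS, name.split(".", len(FIELDS) - 1)))
```

With a format like this, a grapher can, for example, aggregate every host in a service just by varying the host field.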
Here's what we implemented:
Agent -- We needed an agent to locally collect the metrics and push them to our event bus. We wound up re-using an agent that already existed on all our systems to handle the Sensors. Dave Messink originally wrote the agent code and added functionality to serialize the sensor output as Avro and send it to our event bus.
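The agent's job (collect locally, serialize, push to the bus) can be sketched roughly as follows. JSON stands in for Avro here to keep the example dependency-free, and `publish` stands in for the real Kafka producer call; the sensor values are invented:

```python
import json
import time

def read_sensors():
    # Stand-in for reading the local Sensors; in the real agent these
    # come from the service's instrumented interfaces. Values invented.
    return {"http.errors_per_sec": 0.2, "http.qps": 512.0}

def serialize(host, sensors):
    # The real agent serializes with Avro; JSON keeps this sketch
    # free of third-party dependencies.
    return json.dumps({"host": host, "ts": int(time.time()),
                       "sensors": sensors})

def publish(message, send):
    # `send` stands in for a Kafka producer. Note there is no
    # registration step: the agent just pushes whatever it has.
    send(message)
```

The key property is the last comment: nothing has to exist in a data store before a metric starts flowing.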
Event Bus -- If you read this blog you already know that LinkedIn has an actively developed open source pub-sub messaging system called Kafka. The client for this product is very simple: the Python client is all of 68 lines, and about 20 of those are comments. This made work on the agent side very simple. Additionally, Kafka scales horizontally, using Apache ZooKeeper to track which brokers have which data. This made it the perfect way to transport our metrics and ensures we'll be able to grow easily when the time comes. The Kafka setup for metrics currently handles around 200GB of raw Avro data per day in a single data center without any problems.
Collector -- The collector was written by me in Python. I chose Python because this code would be supported and developed by the SRE organization, and Python is the official language of SRE at LinkedIn, so it made a lot of sense to use it. I worked around the lack of a native Python interface to Kafka by leveraging a consumer built into Kafka, the Console Consumer, which simply writes incoming data to STDOUT.
The collector reads the stream over a pipe and feeds it to a pluggable parser. The parser reads in the sensor data and yields small chunks back to the collector with enough data to create or update an RRD file. RRD is a simple time-series database format used by a number of graphing systems; our graphing frontend already read and displayed these files, so re-using the format was an obvious first step. A simple parser might look like:
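A parser in that spirit might be a generator like the one below. The input format here (one `name timestamp value` triple per line) is an assumption for illustration; the real sensor wire format is Avro and richer than this.

```python
def parse(stream):
    """Yield (metric_name, timestamp, value) tuples -- just enough
    for the collector to create or update the matching RRD file.

    Assumes one whitespace-delimited `name timestamp value` record
    per line, which is a simplification of the real sensor format.
    """
    for line in stream:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed records rather than dying
        name, ts, value = fields
        yield name, int(ts), float(value)
```

The collector iterates over the generator and hands each tuple to the RRD writer, so swapping parsers for a new input format is a one-line change.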
Hardware -- We have the collector write one RRD per metric. This allows maximum flexibility for adding new metrics, which is very hard in a multi-datasource RRD, but results in a lot of writes to disk. We decided to go with PCIe SSD cards from Virident, which gave us IOPS to spare to write out as many RRDs as we wanted.
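One-RRD-per-metric with rrdtool looks roughly like the following. The RRA arguments are chosen to match the retention schedule quoted in the stats later in this post; the data-source settings (a single GAUGE with a two-step heartbeat) are assumptions:

```python
def create_cmd(path, step=60):
    """Build the `rrdtool create` argv for a single-metric RRD.

    The RRAs match the retention described in this post: 1-minute
    data for 30 days, 5-minute for 90 days, 1-hour for 2 years.
    A single GAUGE data source is an assumption for illustration.
    """
    return ["rrdtool", "create", path, "--step", str(step),
            "DS:value:GAUGE:%d:U:U" % (2 * step),
            "RRA:AVERAGE:0.5:1:43200",   # 43200 x 60s  = 30 days
            "RRA:AVERAGE:0.5:5:25920",   # 25920 x 300s = 90 days
            "RRA:AVERAGE:0.5:60:17532"]  # 17532 x 1h   = 2 years

def update_cmd(path, ts, value):
    """Build the `rrdtool update` argv for one sample."""
    return ["rrdtool", "update", path, "%d:%s" % (ts, value)]
```

Each argv would be run via `subprocess.check_call`; with one file per metric, a new metric is just a new `create`, never a rebuild of an existing RRD.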
This system is now live inside LinkedIn, and developers can tag any sensor and metric they desire and have it show up automatically. Custom metrics are just ramping up, but with the early adopters and the cross-service whitelist we have the following stats:
- 500k+ metrics collected in a production data center every minute, or about 8,800 per second.
- The average service has about 400 metrics, although some services have thousands.
- 1-minute resolution is maintained for 30 days, 5-minute for 90 days, and 1-hour for 2 years.
- Each RRD is roughly 815KB and is written to two collectors to maintain data integrity if a collector is lost, giving us roughly 870GB of used disk space.
- We currently have 1.4TB of SSD in the production colo. Disk utilization bursts to 25% during heavy writes but is generally closer to 5%. We'll run out of disk space before we run out of IOPS.
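Those disk figures hold up as back-of-the-envelope arithmetic. The per-minute metric count below is derived from the quoted 8,800/second and is an assumption consistent with "500k+":

```python
metrics = 8800 * 60           # ~528k metrics per minute, per the post
rrd_kb = 815                  # size of one RRD, per the post
copies = 2                    # each RRD written to two collectors

per_second = metrics // 60    # back to the quoted write rate
total_gb = metrics * rrd_kb * copies / (1024.0 * 1024.0)
# total_gb lands just over 800GB, in the neighborhood of the quoted
# 870G once filesystem overhead and extra metrics are included.
```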
Self-Service in Operations
Things like instrumentation should be easy to do. We want to encourage people to put metrics in their code and get them somewhere they can be visualized. At LinkedIn you now just launch the code and you get metrics; it should never be harder than that. And this is only the beginning. LinkedIn SREs are building tools like inGraphs and AutoMetrics that focus on self-service, ease of use, high configurability, and sane defaults. We don't want to push buttons, and our developers don't want to open tickets. I bet the same is true at your company.