Query Analyzer: A Tool for Analyzing MySQL Queries Without Overhead
September 15, 2017
LinkedIn uses MySQL heavily: more than 500 internal applications rely on it. For easier management and better utilization, we built MySQL as a service on a multi-tenant architecture. One major disadvantage of this model, however, is that queries from one application can impact the others, which is not ideal.
While we had optimized the databases by tuning InnoDB, system, and MySQL server settings, we did not have control over the schema and the queries. We wanted to bring those under control by analyzing and optimizing queries. To do that, we needed complete information on the queries running on the database.
Why do we need Query Analyzer?
In order to better understand runtime application dynamics, we needed to dive into the SQL queries that are run by several hundred applications, understand their performance characteristics, and then dig into options to further tune them.
Due to obvious performance concerns, we did not consider using the slow query log. It can log every query crossing a configured time threshold to a file for later analysis, but it cannot capture all the queries: setting the threshold to 0 to capture everything is catastrophic, because with millions of queries hitting the server, logging all of them to a file leads to high I/O and drastically reduces throughput. So the slow query log was completely ruled out.
The next option we considered was MySQL Performance Schema, a feature for monitoring MySQL server execution at a low level (available as of MySQL 5.5.3). It provides a way to inspect the internal execution of the server at runtime. The main disadvantage of this approach, however, is that enabling or disabling performance_schema requires a restart. You can try enabling Performance Schema and then turning off all the consumers, but even this adds an overhead of around 8%; if you enable all the consumers, it adds around 20-25% overhead. Analyzing the results from Performance Schema is complex, and to address that, MySQL introduced the sys schema as of MySQL 5.7.7. But for historical viewing and graphing, we would still need to dump the data from Performance Schema to other servers.
Since neither of these approaches addressed all of our needs, we built Query Analyzer, which runs at the network layer to minimize overhead while effectively measuring every query.
How does Query Analyzer work?
Query Analyzer has three components:
1) The agent, which runs on the database server.
2) Centralized server, where query information is stored.
3) UI on top of centralized server, for displaying the SQL analytics across all captured databases.
High-level architecture of Query Analyzer
The agent is a service that runs on MySQL nodes. It captures TCP packets using raw sockets, then decodes the packets and reconstructs each query from the packet stream using the MySQL protocol. The agent measures the query response time as the difference between when the query's first packet arrives at the port and when the first response packet is sent back by the database. The query is then handed to a goroutine that identifies the fingerprint of the query (we use Percona's Go packages for this). The fingerprint is the query after all data values have been sanitized; a hash value calculated on the fingerprint serves as the KEY for the query, so each distinct query can be uniquely identified by its hash value. The agent stores the query’s hash value, total response time, count, user, and database name in a hashmap. When another query with the same hash value comes through, the agent simply increments the count and adds the query time to the total response time. Additionally, the agent maintains metadata in a second hashmap: the query hash value and fingerprint, a sample query at max time, first_seen, minimum time, and maximum time.
The agent collects query information for a preset interval. After that, it sends the information (the query hash value, sum_query_time, count, etc.) to the centralized host and resets the counters. The metadata is only sent when it changes, such as when a new query appears or a query hits a new minimum or maximum. The agent uses less than a few MB of memory to manage these data structures, and the network bandwidth used for sending query information is negligible.
Table 1: Query fingerprint examples

| Query | Sample query | Fingerprint |
|---|---|---|
| Query A | SELECT * FROM table WHERE value1='abc' | SELECT * FROM table WHERE value1='?' |
| Query B | SELECT * FROM table WHERE value1='abc' AND value2=430 | SELECT * FROM table WHERE value1='?' AND value2=? |
| Query C | SELECT * FROM table WHERE value1='xyz' AND value2=123 | SELECT * FROM table WHERE value1='?' AND value2=? |
| Query D | SELECT * FROM table WHERE VALUES IN (1,2,3) | SELECT * FROM table WHERE VALUES IN (?+) |
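The fingerprints above are produced by Percona's Go packages; a minimal sketch of the idea, using simplified regular expressions that would not cover every SQL construct, looks like this:

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified stand-ins for Percona's fingerprinting rules: string and
// numeric literals become '?', and IN-lists collapse to (?+).
var (
	stringLit = regexp.MustCompile(`'[^']*'`)        // 'abc' -> '?'
	numberLit = regexp.MustCompile(`\b\d+\b`)        // 430 -> ?
	inList    = regexp.MustCompile(`\(\s*\?(\s*,\s*\?)+\s*\)`) // (?,?,?) -> (?+)
)

// fingerprint sanitizes the data values out of a query so that queries
// differing only in literals map to the same key.
func fingerprint(q string) string {
	q = stringLit.ReplaceAllString(q, "'?'")
	q = numberLit.ReplaceAllString(q, "?")
	q = inList.ReplaceAllString(q, "(?+)")
	return q
}

func main() {
	fmt.Println(fingerprint("SELECT * FROM table WHERE value1='abc' AND value2=430"))
	fmt.Println(fingerprint("SELECT * FROM table WHERE VALUES IN (1,2,3)"))
}
```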
Table 2: Sample hashmap

| Query Hash (KEY) | Query Time | Count | User | DB |
|---|---|---|---|---|
Table 3: Sample metadata hashmap

| Query Hash | Fingerprint | First Seen | Sample at Max Time | Min Time | Max Time |
|---|---|---|---|---|---|
| 3C074D8459FDDCE3 | SELECT * FROM T1 WHERE a>? | 1 month | SELECT * FROM T1 WHERE a>0 | 1ms | 3ms |
| B414D9DF79E10545 | SELECT * FROM T2 WHERE b=? | 1 day | SELECT * FROM T2 WHERE b=430 | 1s | 5s |
| 791C5370A1021F19 | SELECT * FROM T3 WHERE c<? | 1 hour | SELECT * FROM T3 WHERE c<1000000 | 5ms | 7ms |
The UI for displaying the analytics runs on top of the centralized server, where all query information is stored. Users can select the host name and the time range for which they want to see the queries. Aggregated metrics about each query that was run during that time are shown, and you can click on any query to see the query trend graph.
An interesting aspect is the query load percent: the share of the total load on the server that a given query generated during that time. For example, assume there are three queries.
Query #1 took 2 seconds every time and was executed 100 times. The load it caused is 2*100=200.
Query #2 took 0.1 milliseconds every time and was executed 10M times. The load it caused is 0.0001*10,000,000=1000.
Query #3 took 10 milliseconds every time and was executed 1M times. The load it caused is 0.01*1,000,000=10000.
Thus, the total load on the server during that interval is 200+1000+10000=11200. The percentage load of each query is as follows.
Query #1 is 200/11200*100=1.79%
Query #2 is 1000/11200*100=8.93%
Query #3 is 10000/11200*100=89.29%
Note that the query the user should look at is Query #3, because it created 89.29% of the load, even though it takes only 10 milliseconds, compared to Query #1, which takes 2 seconds per execution, and Query #2, which was executed 10M times.
The UI is illustrated in the figure shown below. The hostname and table names are masked for security reasons.
Query Analyzer UI showing all the distinct queries
Clicking on any query gives the query trend and more information on the query.
Graph showing query trend
To show the impact on throughput (transactions per second), we ran benchmarks using sysbench on MySQL 5.6.29-76.2-log Percona Server (GPL) on an Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (12-core) machine. For these benchmarks, we kept increasing the number of sysbench threads and measured performance. We found that Query Analyzer did not affect throughput until we reached 128 concurrent threads. At 256 threads, we observed a 5% drop in transactions per second, but this was still better than Performance Schema, for which throughput dropped by 10%.
In our testing, Query Analyzer used less than 1% of the CPU; while this spiked to 5% when running with more than 128 threads, that amount is still negligible. Note that the number of threads here means the number of concurrent queries inside MySQL, excluding sleeping connections.
Benchmarking CPU utilization with various tools
For the initial version of Query Analyzer, we used MySQL to handle the metrics (essentially time-series data). There are two tables: query_history and query_info. The query_history table is where we save information from the query hashmap. Its columns are hostname, checksum, timestamp, count, query time, user, and db. The primary key is (hostname, checksum, timestamp); the table is range-partitioned by timestamp, with subpartitioning by key on hostname. It has indexes on (hostname, timestamp, querytime, count) and on checksum.
The query_info table saves the query metadata. Its columns are hostname, checksum, fingerprint, sample, first_seen, mintime, mintime_at, maxtime, maxtime_at, is_reviewed, reviewed_by, reviewed_on, and comments. It has (hostname, checksum) as the primary key and an index on checksum.
So far, we have not faced any issues with using MySQL for metrics. But sometimes, when plotting the query trend graph over a long time range, we observed some latency. To overcome that, we are planning to emit the metrics from MySQL to our internal monitoring tool, inGraphs.
The agent needs elevated privileges to open raw sockets. To alleviate potential security issues, instead of running it under sudo, you can grant the agent only the `cap_net_raw` capability (via `setcap`). Further, by setting the execute permission only for a particular user (chmod 100 or 500), you can restrict the agent to running under that specific user without sudo. Refer to https://linux.die.net/man/7/capabilities for more details.
The benefits of Query Analyzer have been numerous. It allows our database engineers to identify problematic queries at a single glance, compare a week-over-week overlay of query activity, and troubleshoot database slowdowns quickly and efficiently. Developers and business analysts can visualize query trends, check the query load in a staging environment before queries reach production, and obtain per-table and per-database metrics (number of inserts, updates, deletes, etc.) for business analysis. From a security standpoint, Query Analyzer alerts us whenever a new query hits the database, and we can audit the queries that access sensitive information. Lastly, analyzing the query load lets us ensure that queries are distributed evenly across servers, optimize our hardware, and conduct capacity planning more accurately.
While a timeline has not yet been defined, we have plans to eventually open source Query Analyzer and hope that it will be useful for everyone else.
I would like to thank the MySQL team at LinkedIn: Basavaiah Thambara and Alex Lurthu for design review, Kishore Govindaluri for developing the UI, and Naresh Kumar Vudutha for code review. Finally, this project would not have been possible without support from LinkedIn’s data leadership; thanks for all of your support.