Addressing the last mile problem with MySQL high availability

Sundar Raman Ganesh

Building reliability into large scale systems

April 22, 2022

MySQL is the first database choice for applications at LinkedIn that require a relational database store. Managing a large number of databases involves handling numerous complexities; the MySQL SRE team at LinkedIn is responsible for maintaining 99.99% uptime for the service provided to our internal customers. Over the years, we have made many changes to the way we operate our databases to bring down the Mean Time to Recover (MTTR).

A typical MySQL cluster in our ecosystem consists of at least three components:

A primary node accepting writes
One or more replicas for read scaling
A dedicated replica for backup and ETLs

Recently, we adopted the Openark Orchestrator to handle seamless failover of a given primary node to the most consistent replica in our topology. With the adoption of Orchestrator, we do not have to manually act on primary node failures in our ecosystem. Openark Orchestrator was our go-to choice when compared to other alternatives such as Master High Availability (MHA), MariaDB Replication Manager, or Cluster Control due to its anti-flapping mechanisms, failure detection, fault tolerance, handling of network partitions, extensibility with external systems, and support for both GTID and non-GTID MySQL hosts. Moreover, the Pseudo GTID mechanism incorporated by Orchestrator was quickly able to find and promote the most advanced replica in all of our non-GTID clusters.

Being a multi-tenant system that operates on-prem and in the cloud, we overcame certain challenges in adopting Openark Orchestrator in our ecosystem, a process which also had an impact on the service availability. In this blog post, we are sharing what we learned and how we overcame the challenges to create a more resilient ecosystem for our internal customers.

Specifically, we will discuss the top four challenges that impacted our service’s availability, and how we addressed them. These were:

The problem with huge transactions
Measuring the absolute replication lag
How to do candidate selection
The thundering problem

The problem with huge transactions

Because a single database server is shared between a variety of client applications, a single rogue transaction from an unoptimized query could potentially modify millions of rows in one of the databases on the server, causing performance implications for the other databases. These transactions have the potential to overload the I/O subsystem and stall the database server. In this situation, the Orchestrator is unable to get a response from the primary node, and the replicas also face issues in connecting to the primary. This causes the Orchestrator to initiate a failover. This problem is compounded by the application re-trying these transactions upon failure, and stalling the database operations repeatedly.

These transactions halt the database for many seconds and the Orchestrator is quick to catch the stalled state and initiate a failover, impacting the general availability of the MySQL platform.

Detecting huge transactions
We knew that MySQL stores the number of rows modified by any running transaction, and this number can be obtained by querying the trx_rows_modified of the innodb_trx table, in the information_schema database. The total number of rows being modified by the MySQL instance can be obtained by fetching the sum of this field. Knowing the scope of the transactions help us monitor the state of MySQL server before it reaches a failover state.

We introduced a check in our monitoring agent, a custom application built to monitor the health of the MySQL service running on each host, to get SUM(trx_rows_modified) and compare it against a user-defined threshold as shown in the following query:

If the threshold is breached, the primary node running the transaction would be downtimed in the Orchestrator to prevent failover; the downtime is ended when the node goes below the threshold. This threshold was tuned by analyzing transaction metrics of all of our clients.

We notify and later follow-up with the clients to optimize these transactions, and to have the issue that caused the threshold breach permanently resolved. We chose this approach because reviewing every new query wasn’t scalable for the application developers, or the MySQL SREs.

Measuring the absolute replication lag

Orchestrator by default relies on Seconds_behind_master (SBM) reported by the replica to calculate the replication lag. However, this metric is dependent on the replication I/O thread on the replica being connected to the primary; otherwise, it would report NULL. In a few instances when the I/O thread was lagging, the SBM would still report 0. This could potentially lead to the Orchestrator making an incorrect decision while promoting candidates during failover, when the primary node is unreachable. “Candidate” here refers to a replica in the cluster which we have chosen in advance, as explained further in this blog, that will become primary after the failover takes place. “Promotion” here refers to a series of steps done by Orchestrator as part of a failover to turn the candidate into the primary.

To resolve this issue, we needed a heartbeat mechanism to measure the primary-replica replication delay. We leveraged the monitoring agent to insert a heartbeat record with the current timestamp into our primary servers periodically. The time difference between the insertion timestamp in the heartbeat record and the current timestamp on the replica gives us the absolute replication lag.

The heartbeat table has the following schema:

Because our replicas have read_only turned on, the replication lag is reported only on the replicas by the Orchestrator. The other part to this is an Orchestrator configuration (FailMasterpromotionOnLagMinutes) that prevents data loss during failovers. We heuristically determined this value to give enough buffer time for the Orchestrator to determine and promote the most advanced candidate, by analyzing failover metrics from the past. With the previous configuration, we are successfully able to perform automated failovers with no loss of transactions.

How to do candidate selection

As mentioned earlier we have a dedicated replica in each cluster which serves as a candidate for the primary in the event of failover. Our monitoring system identifies these nodes in every topology and calls the Orchestrator RegisterCandidate API to mark these candidates as preferred candidates. The PromotionRule for these candidates is set to “prefer.” We chose the Prefer config value over Must because Orchestrator can still promote another replica in rare cases where the candidate host is also unavailable during failover. These preferences are updated as the nodes of the topology change (i.e., when new nodes are added or existing ones are removed from the cluster).

This is of prime importance in our ecosystem, as we do not want to increase the overhead of database management by handling twice the number of connections because replicas already serving read traffic also have to serve write traffic.

If preferred candidate is not available during failover, any other consistent replica would be chosen by the Orchestrator as the new preferred candidate.

Thundering problem

Because most of our servers operate out of on-prem data centers, we risk having multiple failovers in the event of a brief network partition to a particular switch, or when connectivity to the rack is interrupted. Although rare, we do want to limit the number of failovers done by the Orchestrator in such scenarios.

Orchestrator’s pre-failover hook throttles the number of failovers that can be performed within a two-minute span. The script run by the pre-failover hook returns a non-zero value if the number of DeadMaster events is more than the configured threshold. As a tradeoff, this check increases the time it takes to perform a failover by a few seconds. For that reason, this check is skipped while performing manual maintenance.

Conclusion

It certainly has been an exciting journey to get to higher availability for MySQL clusters at LinkedIn. With these iterative improvements over the years, we have increased the service availability for all of the clusters. The real value in our adoption of Orchestrator is that we can provide a more resilient ecosystem for our internal customers.

Acknowledgements

I would like to thank the entire MySQL SRE team, both past and present, for their continued efforts to improve the availability of MySQL as a service, namely: Amit Balode, for supporting our initiatives; Varun Arora, my manager, for his invaluable contributions to the project, guiding me through the implementation of improvements to MySQL high availability and reviewing the initial drafts of this blog. I would also like to thank my teammate Karthik Appigatla for his contribution to the project and review of the draft for the blog. I would like to thank Pranav Pandey for the implementation of the automatic Candidate Selection. Thanks to Shlomi Noach and other contributors for building and maintaining Openark Orchestrator. Many thanks to all other reviewers of the blog for their suggestions, namely Apoorv Purohit, Deepanshu Mehndiratta, Hannah Sills, Nishan Weragama, and Venu Ryali.

* Disclaimer: Some of the technical terms used in this blog are not part of inclusive language as these denote specific configuration parameters in open source software. We will update these terms once these are replaced in the upstream.

Topics: Open Source Data Management