Fixing Linux filesystem performance regressions
October 16, 2020
As companies grow, adapt, morph, and mature, one item remains the same: the need for reinvention. Technical infrastructure is no exception. As our member community grew, our priorities were to keep up with that growth, or as we say, ensure continuous “site up.” (Read: adding servers to scale from hundreds to hundreds of thousands.) We ran into challenges about how to plan for this type of scaling—in particular, in keeping the platform images and kernels installed on our servers up to date. We moved forward in fits and starts, reimaging our entire physical server fleet in ad-hoc all-hands efforts in order to respond to various extrinsic factors, such as publicly disclosed CPU bugs.
The learnings we took away from these prior efforts allowed us to build a more refined and automated process for reimaging servers going forward, and to more crisply define the lifecycle of the servers on which we deploy LinkedIn’s production stack. With this increased confidence, we undertook an effort to reimage all of the servers comprising Rain, LinkedIn’s private cloud, to CentOS with a modern kernel. This blog post aims to share a new set of learnings from our most recent effort.
At its start, the CentOS reimaging process went mostly according to plan. However, as we neared completion, we suddenly halted the process because of multiple reports of severe 99th percentile latency increases for a serving application when an instance of another application was being deployed on the same physical server. The problem only affected servers with the new image, so we had our work cut out for us to avoid a lengthy and discouraging rollback process across tens of thousands of servers. The bug itself could have disrupted service for our members, and a platform image rollback would carry yet another set of risks.
"Noisy neighbor" problems are well known in multi-tenant scheduling environments like most cloud platforms. To avoid the tragedy of the commons, abstractions such as containerization and time-sharing are introduced. However, abstractions are leaky—one tenant is often able to breach the agreement and unfairly exclude other users of a global resource.
In maintaining our private cloud, we have become familiar enough with this class of problem that we knew exactly where to look: load average, system CPU utilization, page cache utilization, free page scans, disk queues. Atop is a great tool for diagnosing a shared server at a glance.
The interesting thing about this problem is that no shared resource was being exhausted. This pointed in the direction of a mutual exclusion problem. Ways to diagnose mutual exclusion problems include inspecting the stack of each process, running echo l > /proc/sysrq-trigger to snapshot the stack of each active CPU, and using the perf top and/or perf record utilities. All of these techniques aim to determine where the system is spending its wall-clock time, since it did not seem to be spending enough of it executing the workload that serves our members.
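As a rough illustration of the "look at each process" step, the sketch below scans /proc for tasks in uninterruptible sleep (state D), which is where tasks waiting on I/O or on certain kernel locks tend to show up. It is not the tooling we used; the function names are hypothetical, and it only reads the state of each process's main thread from /proc/<pid>/stat.

```python
import os

def parse_stat_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    left, right = stat_line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_text), comm, state

def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state 'D')."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_state(f.read())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

Sampling blocked_tasks() a few times per second during a slow period gives a cheap first hint about which processes are stuck; the kernel stack of an interesting pid can then be read (as root) from /proc/<pid>/stack.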
Strangely, these efforts turned up nothing of interest. The system wasn’t busy; it was just slower for our workload, according to the wall clock, than the older platform image.
Fresh out of quick explanations, we attempted to create test cases to reproduce the problem. A valid test case would be fast on the old platform image and slow on the new one. One team created a test case which reproduced the problem by downloading several large artifacts in parallel. Another team then created a benchmark utilizing the fio test framework, which was fast on the old image but exceeded its configured runtime by multiple minutes on the new image. We determined that the problem existed in both kernels 4.19 and 5.4.
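The exact fio job is not reproduced in this post. As a hedged sketch of what a buffered, concurrent, sequential-write job could look like, the helper below renders an fio job file; the function name, job name, and every parameter value are illustrative assumptions, not the configuration we ran.

```python
def fio_job_file(directory, jobs=8, size="1g", block_size="1m", runtime_s=60):
    """Render an fio job file for parallel buffered sequential writes.

    All defaults are illustrative assumptions, not the values from the
    original benchmark.
    """
    lines = [
        "[global]",
        "ioengine=psync",           # synchronous pwrite() calls
        "direct=0",                 # buffered I/O through the page cache
        "rw=write",                 # sequential writes
        f"bs={block_size}",
        f"size={size}",             # amount written by each job
        f"directory={directory}",   # where the test files are created
        f"runtime={runtime_s}",     # upper bound on run time, in seconds
        "end_fsync=1",              # flush each file when its job finishes
        "group_reporting",
        "",
        "[parallel-writers]",
        f"numjobs={jobs}",          # number of concurrent writer processes
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file and passing that file to fio on both images gives a like-for-like comparison of completion time, provided the target directory sits on the filesystem under test.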
Concurrently, we deduced that this problem exclusively impacted older servers with HDD (rotating) root disks, which we arrange in a software RAID1 mirror configuration—newer servers with SSD (solid-state) root disks were unaffected. One engineer noticed that there was an upstream bug tracking an issue seemingly related to blk-mq. It seemed plausible that this was a blk-mq problem since blk-mq was introduced after the release of kernel 3.10 (our previous golden image kernel). It was also evident that the system was not actually busy while the latency problem existed, so it seemed to be reasonable to hypothesize that inefficient I/O submission was the root cause.
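Since the HDD versus SSD split mattered so much here, it helps to be able to classify devices quickly. The kernel exposes a per-device rotational flag in sysfs; the sketch below reads it for every block device. The function name is hypothetical, and note that the flag reflects what the device reports, which can be misleading behind some RAID controllers or in virtual machines.

```python
import os

def rotational_devices(sys_block="/sys/block"):
    """Map each block device name to True (rotating) or False (non-rotating)."""
    result = {}
    for name in sorted(os.listdir(sys_block)):
        flag_path = os.path.join(sys_block, name, "queue", "rotational")
        try:
            with open(flag_path) as f:
                # The file contains "1" for rotating media and "0" otherwise.
                result[name] = f.read().strip() == "1"
        except OSError:
            continue  # device has no queue directory or went away
    return result
```

On a host like the affected ones, the members of the root RAID1 mirror would be expected to report True.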
Applying the patch attached to that upstream bug (to reduce the number of queues in the scalable bitmap layer) did improve performance enough to move some teams forward. Based on this result, we explored the possibility that the regression was related to the scsi-mq migration and the new I/O schedulers it required. However, after trying a number of configurations, it was clear that the choice or configuration of the I/O scheduler had little to no impact on the problem.
Due to prior experience in our storage tier, we were also familiar with the ext4 regression introduced in Linux 4.9 for direct I/O workloads. There was no equivalent guidance that we could find addressing increased latency on normal buffered I/O workloads in kernel 4.x. With suspicion of ext4 aroused by that existing direct I/O issue, we decided to replace the ext4 filesystem with XFS on some test servers to determine whether ext4 was again at fault here.
Surprisingly, we found that the problem was indeed nonexistent on XFS. Remember, this problem exclusively affected servers with rotating disk storage, so what could cause a latency problem exclusively on ext4, only on rotating HDD devices, and on a server that isn’t busy?
Rooting out the cause
Lacking other palatable options and knowing that the problem was introduced sometime between kernel 3.10 and 4.19, we proceeded to take one of the affected servers out of rotation and bisect the kernel. Bisecting the kernel on a physical host in a datacenter came with its own set of complications and workarounds, requiring additional assistance across multiple teams.
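For readers unfamiliar with bisection, the idea is a binary search over an ordered history for the first version on which a test fails. The sketch below shows only that search logic; the names are hypothetical, and the is_bad callback stands in for the expensive real step of building a kernel, booting it on the test host, and running the benchmark. It also assumes a single good-to-bad transition, which real regressions do not always honor.

```python
def first_bad(candidates, is_bad):
    """Return the index of the first candidate for which is_bad() is true.

    Assumes candidates are ordered oldest to newest, the first one is
    known good, the last one is known bad, and there is exactly one
    good-to-bad transition in between.
    """
    lo, hi = 0, len(candidates) - 1   # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid                  # regression is at mid or earlier
        else:
            lo = mid                  # regression is after mid
    return hi
```

With seven candidate versions this needs at most three test runs, which is why bisection is practical even when each run requires a reboot of a physical host.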
To fix the regression in kernel 4.19, we reverted the following two commits, both of which were introduced during the 4.6 release cycle. The second had previously been reverted by Linus:
- Commit 06bd3c36a733 ("ext4: fix data exposure after a crash")
- Commit 1f60fbe72749 ("ext4: allow readdir()'s of large empty directories to be interrupted")
In kernel 5.4, we cherry-picked the following commit that was introduced during the 5.6 release cycle to fix the regression:
- "ext4: make dioread_nolock the default", from commit e5da4c933 (an ext4 merge)
- Interestingly, setting the dioread_nolock boot param had no effect.
- Our workload is buffered I/O; what should a change around dioread_nolock, a direct I/O knob, have to offer in this situation?
The following graph shows approximate testing results as we iterated to converge on the 3.10 kernel’s performance:
Convergence of kernel 4.x to 3.10 fio benchmark completion time during development.
Configuration matters as much as code
We stumbled on configuration regressions several times in the back and forth with the kernel vendor.
One unexpected example was that, at some point in these exchanges, the scsi-mq layer had been disabled in the vendor’s kernel config in favor of the legacy block I/O layer. According to popular wisdom, scsi-mq was expected to perform worse on single-queue devices, due to the additional overhead of concurrently submitting I/O and the risk of a globally suboptimal request ordering. We found the opposite in kernel 4.19: the old deadline I/O scheduler had worse tail latency under heavy I/O to a single-queue rotating HDD device than even the noop multi-queue scheduler.
Why did we find ourselves testing against the noop multi-queue scheduler? Again, due to a separate kernel build configuration error, the multi-queue schedulers were compiled as modules rather than built into the kernel. As modules, they were not yet available when the HDD block device was registered. As a result, the noop scheduler was applied instead of the scheduler that would have been more appropriate for the device (in this case, mq-deadline).
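A quick way to catch this class of misconfiguration is to check which scheduler is actually active on each device. The kernel lists the available schedulers in /sys/block/<device>/queue/scheduler and marks the active one with square brackets; on the multi-queue block layer the no-op choice appears as "none". The helper names below are hypothetical, and this is a minimal sketch rather than the check we deployed.

```python
def parse_scheduler(text):
    """Return (active, available) from the contents of queue/scheduler."""
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]       # the bracketed entry is the active one
            active = token
        available.append(token)
    if active is None and len(available) == 1:
        # Some kernels print a bare "none" when no scheduler is attached.
        active = available[0]
    return active, available

def read_scheduler(device, sys_block="/sys/block"):
    """Read and parse the scheduler file for one block device."""
    with open(f"{sys_block}/{device}/queue/scheduler") as f:
        return parse_scheduler(f.read())
```

If a rotating device reports "none" as active and mq-deadline is missing from the available list, the scheduler module was probably not loaded when the device was registered.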
We learned quite a bit from this exercise and, most importantly, were able to complete the system image upgrade on our private cloud. This ultimately benefits our members by enabling our engineers to safely ship new products and features.
One thing that Linux users need to know is that ext4 performance on kernel versions above 4.5 and below 5.6 suffers severely in the presence of concurrent sequential I/O on rotating disks. NVMe devices are not affected, nor are other filesystems. We highly recommend not using those kernels for ext4 on rotating storage media. Since the kernel 4.x series is EOL as of 2020, the only realistic option for most people is to upgrade to kernel 5.6 or above.
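As a first-pass filter for a fleet audit, the version range above can be checked against a kernel release string. The function name below is hypothetical, and the result is only a hint: vendor kernels may carry backports or reverts (as ours did) that change the behavior without changing the major and minor version.

```python
import re

def kernel_in_affected_range(release):
    """True if the release string is at least 4.6 and below 5.6.

    Only the leading major.minor numbers are examined; backported fixes
    or reverts in vendor kernels are not detected.
    """
    m = re.match(r"(\d+)\.(\d+)", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = (int(m.group(1)), int(m.group(2)))
    return (4, 6) <= version < (5, 6)
```

On a live host, the release string can be taken from platform.release() in the standard library. Combined with a check for ext4 on rotating media, this narrows down which servers are worth benchmarking.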
Vendor configuration options can also play a role in bisecting. Always ensure that you’re starting from a clean upstream release when tackling an upstream bug, and be prepared to find secondary regressions in vendor-internal configs and in-house patches that mask the root cause.
For LinkedIn, we incorporated this experience into our golden image qualification process by making our suite of I/O stress tests more comprehensive. This incident also spurred us to explore investing in automation for lightweight filesystem reprovisioning without reimaging, so that we could convert an in-service host from one filesystem to another without rebooting, merely by evacuating its applications and reformatting the partition where application data is stored.
In no particular order, these individuals each unblocked at least one key aspect of this investigation: Sergiy Zhuk, Kyle Reid, Pradeep Sanders, Tim Crofts, Mir Islam, Adam Debus, Hengyang Hu, Chris Stufflebeam, Navoday Tomar, and Sasha Levin at Microsoft.