The impact of slow NFS on data systems
June 23, 2020
Espresso is LinkedIn's de facto NoSQL database solution. It is an online, distributed, fault-tolerant database that powers most of LinkedIn’s applications including member profiles, InMail (LinkedIn's member-to-member messaging system), sections of the main LinkedIn homepage, our mobile applications, and more. Since Espresso caters to many critical features, its responsiveness and reliability are paramount. Espresso stores about 2PB of data (before replication) across its clusters, and its busiest cluster serves around 3M QPS.
Our backend systems play an important role in maintaining the integrity of the member experience. The Espresso backup service takes a complete snapshot of its databases once every day. This service runs on all serving storage nodes that contain shards of production data and these backups are persisted in a centralized network file system (NFS) cluster. Snapshots are vital for Espresso’s cluster health because they power functions such as offline processing and disaster recovery.
Though the backup process does not acquire any locks that can affect online operations, we identified an instance where NFS issues cascaded and affected the performance of the underlying Espresso storage nodes. In turn, this led to degraded performance during the backup and at times, even affected Espresso’s availability SLAs.
Triaging SLA misses during backup
When we started noticing these service degradation issues, Espresso’s response latencies to its upstream services were higher than our agreed-upon SLA. We also noted that our database service struggled to cope with I/O requests to other block devices during NFS issues.
To address the problem, we started with basic triage, such as increasing the NFS rsize and wsize, but didn’t see a noticeable impact. We also throttled our backup write speeds drastically. This helped improve our availability in the steady state, but we still saw sporadic drops in SLA across a few Espresso clusters from time to time during backups. At one point, we decided these workarounds were not sufficient. As we took a closer look, we discovered that the NFS degradation was affecting the online serving processes, not just the backups.
NFS server side back pressure
When the NFS server’s performance degrades, it shrinks its advertised TCP receive window to signal the client to back off and reduce its write rate. The NFS client queues the data in its send queue and slowly forwards the writes to the server. Let’s see what happens when we do a packet capture while writing to NFS.
In the initial packet capture, the receive buffer advertised by the NFS server shrinks to 9 and eventually to 0, causing the NFS clients to throttle. Up to this point, we had assumed NFS was simply slow, since response latencies for access, read, and write calls were higher. After analyzing packet captures using tcpdump, we confirmed that the NFS server was actually going through a performance degradation, which, in turn, caused the senders (the NFS clients) to slow down. After seeing a correlation between the drops in SLA and issues originating in the NFS servers, we started looking at how an independent backup process could affect Espresso’s serving process.
Reproducing the service degradation
In order to reproduce the issue, we simulated NFS server degradation by adding an iptables rule to drop all outgoing NFS packets. We observed increased CPU usage by kswapd, and also observed that NFS writes didn’t come to an immediate halt when we applied the rule.
Based on this experiment with an unreachable NFS server, we determined that NFS writes land in the storage node’s page cache and stay in the Dirty/Writeback buffers. They are flushed to the NFS server only when there is memory pressure. These flushed pages can’t be removed from the storage node’s page cache until the NFS server acknowledges the data; such pages are counted as NFS_Unstable.
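On Linux, these counters are visible in /proc/meminfo as Dirty, Writeback, and NFS_Unstable. A small sketch for watching them during a backup might look like the following (the helper name and sample values are illustrative, not from our production tooling):

```python
# Parse /proc/meminfo-style text and pull out the page-cache counters
# relevant here: Dirty, Writeback, and NFS_Unstable (all reported in kB).
def pending_nfs_kb(meminfo_text):
    wanted = {"Dirty", "Writeback", "NFS_Unstable"}
    counters = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            counters[key] = int(rest.split()[0])  # first field is the kB value
    return counters

# Illustrative snapshot of a node with writes backed up toward the NFS server.
sample = """\
Dirty:            204800 kB
Writeback:         51200 kB
NFS_Unstable:     102400 kB
"""
print(pending_nfs_kb(sample))
# On a live Linux host: pending_nfs_kb(open("/proc/meminfo").read())
```

Summing the three counters gives the amount of data the node is still holding on behalf of the slow server.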
From the nfs(5) man page, we know that the NFS client delays sending application writes to the server until one of four events occurs:
- Memory pressure forces reclamation of system memory resources.
- An application flushes file data explicitly with sync(2), msync(2), or fsync(3).
- An application closes a file with close(2).
- The file is locked/unlocked via fcntl(2).
Understanding Linux page cache internals
To understand how we can control the eviction of dirty pages from the Linux page cache, we need to understand a suite of sysctl parameters. Let’s discuss a few of these parameters in depth.
- The Linux kernel limits the number of dirty pages that can stay in the page cache via the sysctl values dirty_ratio and dirty_background_ratio.
- Here, dirty_background_ratio is the threshold (calculated as a percentage of available memory) at which kswapd is woken to evict dirty pages to disk; this causes no major performance impact beyond kswapd’s own CPU and I/O usage.
- The parameter dirty_ratio (calculated as a percentage of available memory) is the threshold at which any process requesting disk writes will itself start writing out dirty data to disk.
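To make the two ratios concrete, here is a small sketch that converts them into byte thresholds for a given amount of dirtyable memory. The helper name and the 10/20 example values (common kernel defaults) are illustrative, and the kernel’s real accounting of dirtyable memory is more involved than this simplification:

```python
# Estimate the byte thresholds implied by vm.dirty_background_ratio and
# vm.dirty_ratio. Below `background`, writeback is purely asynchronous;
# between the two, kswapd writes back in the background; above `hard`,
# writing processes must clean pages themselves (the pgscand case).
def dirty_thresholds(available_bytes, background_ratio, dirty_ratio):
    background = available_bytes * background_ratio // 100  # kswapd kicks in
    hard = available_bytes * dirty_ratio // 100             # writers stall
    return background, hard

# Example: 64 GiB of available memory with illustrative defaults of 10/20.
bg, hard = dirty_thresholds(64 * 2**30, 10, 20)
print(bg, hard)

# On a live host the ratios can be read from procfs, e.g.:
#   int(open("/proc/sys/vm/dirty_ratio").read())
```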
The number of page scans done in the background by kswapd is reported by sar as pgscank (the "k" signifies kswapd eviction), and the number of page scans done on demand when a process generates writes is reported by sar as pgscand (the "d" signifies on demand).
Sample sar output
As discussed, pgscank happens in the background and does not hold up other processes, though it momentarily increases kswapd’s CPU and I/O usage. pgscand, however, happens inline in the write path and can directly affect the performance of other processes. Let's take an example to understand how pgscand affects other processes:
- Process ‘A’ is generating writes
- During this time, the kernel sees the sum of Writeback, Dirty, and Unstable pages climb to dirty_ratio because of the degraded performance of the NFS server
- Process ‘A’ now has to wait until some of these dirty pages are freed up before its write request is satisfied
If the NFS server throttles the sender with a small receive window (as seen in the packet capture earlier), NFS writes accumulate in the page cache as dirty pages and breach dirty_ratio. Disk writes from other processes can now be delayed because they have to do an on-demand page scan.
Spikes in pgscand lined up exactly with the SLA drops we faced, and we were now able to tie our SLA drops to the poor performance of the NFS server.
pgscand happening in storage nodes during our sporadic SLA drops is depicted below.
95p latency by Espresso during the same period as the prior graph
Mitigating side effects of dirty pages
Our solution is to prevent NFS from dirtying too many pages, so that dirty_ratio is not reached during backups. We considered two isolation options:
- Use cgroups and limit memory and disk I/O of the backup process
- Leverage sysfs-class-bdi to limit writeback cache used by NFS
Before we moved forward with one of the two solutions, we wanted to reap early benefits of isolation. We decided to build a feedback mechanism on the NFS cluster’s health as we write to it, and rate-limit writes so that we don’t hit dirty_ratio on the client side. Our solution has the backup call fsync periodically, after a configured batch of NFS writes, and wait for fsync to return before sending more writes to NFS. As a result, the dirty pages are capped by the configured batch size of NFS writes.
A sample bare-bones script exemplifying this approach writes 100MB and waits for fsync to return before submitting more writes. To validate that the dirty pages are capped at 100MB, we simulated a slow NFS server by blocking the writes using iptables. The sum of Dirty, Writeback, and Unstable memory is capped at 100MB.
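A minimal sketch of such a script in Python (not our production backup code; the paths, names, and batch size are illustrative) could look like this:

```python
import os

BATCH_BYTES = 100 * 1024 * 1024  # fsync after every 100 MB of writes
CHUNK = 1024 * 1024              # stream in 1 MB chunks

def backup_stream(src_path, dst_path, batch_bytes=BATCH_BYTES):
    """Copy src_path to dst_path (e.g., a file on the NFS mount),
    calling fsync after every batch so that the client-side dirty
    pages stay capped at roughly batch_bytes."""
    written_since_sync = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            dst.write(chunk)
            written_since_sync += len(chunk)
            if written_since_sync >= batch_bytes:
                dst.flush()
                os.fsync(dst.fileno())  # blocks until the server commits
                written_since_sync = 0
        dst.flush()
        os.fsync(dst.fileno())  # flush the final partial batch
```

Because fsync blocks until the NFS server has committed the data, each batch naturally paces the writer to the server’s health.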
If the NFS server is not healthy, fsync takes a long time to return; this acts as feedback to the Espresso storage node. Though this eliminates the impact on the client side, the NFS servers bear the brunt, as fsync makes sure dirty data reaches the final backing store from all the caches, both on the client and the server.
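This feedback can also be made explicit by timing each fsync call: a slow fsync is itself the signal that the server is struggling. A hedged sketch (the helper name and the backoff budget in the comment are illustrative, not from our production code):

```python
import os
import time

def fsync_timed(fd):
    """Run fsync on fd and return how long it took, in seconds.
    A long duration indicates the backing server is degraded."""
    start = time.monotonic()
    os.fsync(fd)
    return time.monotonic() - start

# A writer could use this between batches, e.g.:
#   if fsync_timed(dst.fileno()) > 5.0:  # illustrative 5-second budget
#       time.sleep(backoff_seconds)      # give the server room to recover
```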
Assessing the impact
We assessed the impact on backup times when calling fsync after every 100MB of writes. Tests writing a 100GB dataset showed a 14% slowdown in write rate. This was a tradeoff we could accept to shield the online data serving process from impact during backups. The change helped us boost the platform’s availability.
This graph captures our monthly availability for one of our critical clusters. After we introduced the fsync change, the oscillations were greatly reduced from the first week of February, and availability stayed close to 99.999% (up from 99.992% in October).
As we’ve shared previously on this blog, “site up” is the LinkedIn SRE team’s first priority. For our online datastore, Espresso, we wanted to ensure the utmost resilience in the face of external degradations. In this journey, we unearthed the behavior of the page cache shared across block devices and discovered how one slow block device can affect the performance of processes even when those processes don’t use the slow block device.
I would like to thank our Storage team for helping us throughout the endeavor. In addition, I would like to thank Jia Zeng for incorporating fsync changes in Espresso. Also, a big thanks to Cyrus Dasadia for reviewing this blog post. Finally, I would like to acknowledge the constant encouragement and support from my manager, Sankar Hariharan.