Designing SSD-Friendly Applications

These days, solid-state drives (SSDs) are increasingly being adopted to alleviate the I/O performance bottlenecks of applications. Numerous measurement results have showcased the performance improvement SSDs bring compared with hard disk drives (HDDs). However, in most deployment scenarios, an SSD is simply treated as a “faster HDD,” so the potential of SSDs is not fully utilized. Although applications gain better performance when using SSDs as storage, the gains are mainly attributed to the higher IOPS and bandwidth that SSDs provide.

Improvements in application performance thanks to SSDs could be more significant if applications were designed to be SSD-friendly. In this blog post, we propose a set of SSD-friendly design changes at the application layer for three types of benefits: (1) improved application performance; (2) increased SSD I/O efficiency; and (3) longer SSD lifespan.

Note: An extended version of this blog post was published at IEEE COMPSAC 2016.

Let’s first quickly review some necessary technical background. Please refer to other resources for details.

Cell, page, block. The mainstream SSD today is NAND-based and stores digital bits in cells. Each SSD cell can store one bit (SLC, Single-Level Cell), two bits (MLC, Multi-Level Cell), three bits (TLC, Triple-Level Cell), or even four bits. Cells can only sustain a certain number of erasures; the more bits a cell stores, the lower the manufacturing cost, but the lower the endurance (i.e., the number of erasures the cell can sustain). A group of cells composes a page, which is the smallest storage unit to read or write. A typical page size is 4KB. Once erased (i.e., initialized or reset), a page can only be written once, so essentially there is no “overwriting” operation on an SSD. Pages are grouped into blocks. The typical block size is 512KB or 1MB, i.e., 128 or 256 pages.

I/O and Garbage Collection. There are three types of I/O operations: read, write, and erase. Reads and writes are performed in units of pages, and write latency can vary depending on the historical state of the disk. Erasing is performed in units of blocks and is slow, typically taking a few milliseconds. Garbage Collection (GC) in an SSD is needed to recycle used blocks and ensure the fast allocation of later writes. Maintaining a threshold of free blocks is necessary for prompt write response; otherwise applications may suffer from on-the-fly block erasures, which are slow.

Wear leveling and write amplification. SSD blocks can only survive a limited number of erasures, also known as program/erase (P/E) cycles. Once the maximum number is reached, the block dies. The typical number of P/E cycles is 100,000 for SLC blocks, 10,000 for MLC blocks, and a few thousand for TLC blocks. Some blocks may host very active data and are therefore erased frequently; to ensure capacity and performance, blocks need to be balanced in terms of how many erasures they have undergone. SSD controllers implement a “wear leveling” mechanism to achieve this: during wear leveling, data is moved around among blocks to allow for balanced wear-out. Partly because of this, the bytes actually written are a multiple of the logical bytes intended to be written, a phenomenon referred to as “write amplification.” These numbers and terms are important for understanding how applications can be tuned for optimal SSD performance.

Compared with the naive adoption of SSDs (i.e., using them as-is, with no application changes), SSD-friendly applications can gain three types of benefits:

Although moving from HDDs to SSDs typically means better application performance thanks to the better I/O performance, the naive adoption of SSDs without changing application designs may not achieve optimal performance. We have an I/O-bound application that writes to files to persist data. The maximum application throughput when working with an HDD is 142 queries per second (qps). This is the best result we can get, irrespective of various changes or tunings to the application design.

When moving to an SSD with the same application, the throughput increases to 20,000 qps, about 140x faster. This mainly comes from the higher IOPS provided by the SSD. Although the application throughput is significantly improved compared with an HDD, it is not the best that can potentially be achieved.

After optimizing the application design and making it SSD-friendly, the throughput increases to 100,000 qps, a 4x further improvement over the naive SSD adoption. The secret in this particular example is using multiple concurrent threads to perform I/O, which takes advantage of the SSD’s internal parallelism (described later). Note that multiple I/O threads do not work well with HDDs.

As noted earlier, the minimum internal I/O unit on an SSD is a page (e.g., 4KB), so even a single byte’s access (read or write) to an SSD has to happen at the page level. Partly due to this, an application’s writes to an SSD can result in larger physical writes on the SSD media, an undesirable phenomenon referred to as “write amplification” (WA). The ratio of physical bytes written to logical bytes written is referred to as the “write amplification factor,” or WA factor. Because of the WA factor, an SSD may be substantially under-utilized if the data structures or the I/O issued by applications are not SSD-friendly.

SSD cells wear out, and each cell can only sustain a limited number of P/E cycles. Practically, the life of an SSD depends on four factors: SSD size, maximum number of P/E cycles, write amplification factor, and application write rate. For example, consider a 1 TB SSD, an application with a write rate of 100 MB/s, and an MLC SSD with 10,000 P/E cycles (a typical value for today’s mainstream MLC SSDs). When the write amplification factor is four, the SSD lasts only about 10 months. A TLC (Triple-Level Cell) SSD, with 3,000 P/E cycles and a WA factor of 10, lasts only about one month. Given the high cost of SSDs, applications should be designed to be SSD-friendly in order to extend SSD lifespan.
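The arithmetic behind these lifespan estimates is simple: divide the total bytes the drive can absorb (capacity times P/E cycles) by the physical write rate (the application write rate times the WA factor). A minimal back-of-the-envelope sketch, using the illustrative numbers from the example above:

```python
def ssd_lifespan_days(capacity_tb, pe_cycles, wa_factor, write_rate_mb_s):
    """Rough lifespan estimate: total erasable bytes divided by the
    physical write rate (logical write rate x write amplification)."""
    total_writable_bytes = capacity_tb * 1e12 * pe_cycles
    physical_write_rate = write_rate_mb_s * 1e6 * wa_factor  # bytes per second
    return total_writable_bytes / physical_write_rate / 86400  # seconds -> days

# MLC example from the text: 1 TB, 10,000 P/E cycles, WA = 4, 100 MB/s writes
print(ssd_lifespan_days(1, 10_000, 4, 100))  # ~289 days, i.e., roughly 10 months
# TLC example from the text: 3,000 P/E cycles, WA = 10
print(ssd_lifespan_days(1, 3_000, 10, 100))  # ~35 days, i.e., roughly one month
```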

There are several places where SSD-friendly design choices can be made. Let’s quickly go over some changes that can be made at the file system, database, and data infrastructure tiers.

The file system tier deals with storage directly, so it’s natural that design changes are needed at this level to accommodate SSDs. Generally speaking, these design changes are focused on the three key differentiating characteristics of SSDs:

  1. Random access on par with sequential access;
  2. Blocks must be erased before they can be overwritten;
  3. Internal wear leveling, which causes write amplification.

There are two types of SSD-friendly file systems. The first is general-purpose file systems adapted for SSDs, mainly by supporting the TRIM command; examples include Ext4 and Btrfs. The second is file systems specially designed for SSDs. The basic idea there is to adopt a log-structured data layout (vs. a B-tree or HTree) to accommodate the SSD’s “read-modify-write” property. Examples are NVFS (non-volatile file system), JFFS/JFFS2, and F2FS.

The differences between SSDs and HDDs are particularly important in database design. For decades, database components such as query processing, query optimization, and query evaluation have been tuned with HDD characteristics in mind. One example of this tuning is the assumption that random access is much slower than sequential access. With SSDs, many of these assumptions no longer hold, hence SSD-friendly databases have been designed. There are mainly two types of databases that work well with SSDs:

  1. Flash-only databases such as Aerospike, which benefit greatly from flash-friendly join algorithms; and
  2. Hybrid flash-HDD databases, which judiciously use flash to cache data.

For distributed data systems, there is a debate over where to load data from: local disk on the same machine or remote memory on another machine. In the past, the argument favored the latter, which was faster; Memcached is an example of this. Traditional HDD access latency is on the order of milliseconds, while the latency of remote memory access, which includes both RAM access latency and network transmission latency, is on the order of microseconds. The I/O bandwidth of remote memory is also on par with, or higher than, that of a local HDD.

With SSDs as the storage, local disk becomes more favorable than remote memory access. The I/O latency of an SSD is reduced to the microsecond level, while its I/O bandwidth can be an order of magnitude higher than that of an HDD. These results motivate new designs at the data infrastructure tier, such as co-locating data with the applications as much as possible to avoid additional nodes and the associated network hops, which reduces both complexity and cost.

One of the companies adopting this kind of design is Netflix. Previously, memcached was used to cache data behind the company’s Cassandra layer. Assuming the need is to cache 10TB of data, if each memcached node holds 100GB of data in RAM, then 100 memcached nodes need to be deployed. With 10 Cassandra nodes, it is possible to completely remove the memcached layer by equipping each Cassandra node with a 1TB SSD: instead of 100 memcached nodes, only 10 SSDs are needed, a huge cost savings. The query latency, as reported, is on par with the original design, and the SSD-based infrastructure is more scalable, with a much higher IOPS capacity.

At the application tier, we can also make SSD-friendly design changes to gain the three types of benefits (i.e., better application performance, more efficient I/O, and longer SSD life). These changes are organized into three categories: data structure, I/O handling, and threading.

Conventional HDDs are known to have substantial seek time, hence applications that use HDDs are often optimized to perform in-place updates, which do not require seeking. For instance, an application that persists data to storage shows significantly different IOPS for random updates vs. in-place updates: with random updates, the workload achieves only about 170 qps, while in-place updates on the same HDD achieve 280 qps, much higher.

When designing applications to work with SSDs, such concerns are no longer valid. In-place updating gains no IOPS benefit compared with non-in-place updating; moreover, in-place updating actually incurs a performance penalty on an SSD. SSD pages containing data cannot be directly overwritten, so when updating stored data, the corresponding SSD page has to be read into the SSD buffer first. After the updates are applied, the data is then written to a clean page. This “read-modify-write” process on an SSD is in sharp contrast to the direct “write-only” behavior on an HDD. By comparison, a non-in-place update on an SSD does not incur the reading and modifying steps (i.e., it only writes), hence it is faster. With an SSD, the same application above achieves about 20,000 qps regardless of whether it performs random updates or in-place updates.
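To make the contrast concrete, here is a minimal sketch of the two update styles; the file names, record size, and log format are illustrative and not part of the original application:

```python
import os

RECORD_SIZE = 128  # illustrative fixed-size record

def update_in_place(path, record_no, payload):
    """In-place update: overwrite the record at its existing offset.
    On an SSD, this forces a read-modify-write of the underlying page."""
    with open(path, "r+b") as f:
        f.seek(record_no * RECORD_SIZE)
        f.write(payload.ljust(RECORD_SIZE, b"\0"))

def update_out_of_place(log_path, record_no, payload):
    """Out-of-place (log-style) update: append a new version of the record;
    stale versions are reclaimed later by a compaction pass."""
    with open(log_path, "ab") as f:
        f.write(record_no.to_bytes(8, "little") + payload.ljust(RECORD_SIZE, b"\0"))
```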

For almost all applications that deal with storage, data stored on disk is not accessed with equal probability. Consider a social network application that needs to track active users’ activities. A naive solution for user data storage would simply pack all users together (e.g., in the same files on the SSD) based on trivial properties such as registration time. When updating the activities of hot users, the SSD has to access data (i.e., read/modify/write) at the page level, so if a user’s data is smaller than a page, nearby users’ data will also be accessed together. If the nearby users’ data is not needed, the bundled I/O not only wastes I/O bandwidth but also unnecessarily wears out the SSD.

To alleviate this performance concern, hot data should be separated from cold data when using an SSD as storage. The separation can be done at different levels and in different ways, such as in different files, different portions of a file, or different tables.
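As an illustration, here is a small sketch that routes records to separate files by update frequency; the threshold, file names, and User fields are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class User:
    user_id: int
    updates_per_day: float  # observed update frequency

HOT_UPDATE_THRESHOLD = 100  # hypothetical cut-off separating hot from cold users

def storage_file_for(user: User) -> str:
    """Keep frequently updated (hot) users in one file and rarely updated
    (cold) users in another, so page-level writes for hot data do not also
    touch, and wear out, pages holding cold data."""
    if user.updates_per_day >= HOT_UPDATE_THRESHOLD:
        return "users_hot.dat"
    return "users_cold.dat"
```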

The smallest update unit on an SSD is the page (e.g., 4KB); hence, even a single-bit update results in at least a 4KB write to the SSD. The bytes actually written could be far larger than 4KB due to write amplification. Reading is similar: a single-byte read results in at least a 4KB read, and the bytes actually read could also be much larger than 4KB, since the OS uses a read-ahead mechanism that aggressively reads file data in advance in the hope of improving the cache hit rate when reading files.

Such read/write characteristics favor the use of compact data structures when persisting data on an SSD. An SSD-friendly data structure should avoid scattered updates. The benefits of compact data structures are faster application performance, more efficient storage I/O, and longer SSD life.
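One way to avoid scattered small writes is to coalesce records into page-sized, contiguous writes. A minimal sketch, where the page size and the append-only buffering policy are assumptions for illustration:

```python
import os

PAGE_SIZE = 4096  # typical SSD page size, assumed here

class BufferedAppender:
    """Coalesce many small records into page-aligned appends so each SSD
    page is written once, rather than being rewritten per tiny update."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self._buf = bytearray()

    def append(self, record: bytes):
        self._buf += record
        while len(self._buf) >= PAGE_SIZE:
            os.write(self._fd, bytes(self._buf[:PAGE_SIZE]))
            del self._buf[:PAGE_SIZE]

    def close(self):
        if self._buf:
            os.write(self._fd, bytes(self._buf))  # flush the final partial page
        os.close(self._fd)
```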

SSDs typically feature GC mechanisms that reclaim used blocks ahead of time for later use. GC can work in either a background or a foreground fashion. The SSD controller typically maintains a threshold number of free blocks; whenever the number of free blocks drops below the threshold, background GC kicks in. Since background GC happens asynchronously (i.e., it is non-blocking), it does not affect the application’s I/O latency. If, however, blocks are requested faster than background GC can supply them, foreground GC is triggered. During foreground GC, blocks have to be erased on the fly (i.e., blocking) before applications can use them, and the write latency experienced by the applications issuing writes suffers. Specifically, a foreground GC operation that frees a block can take more than several milliseconds, resulting in large application I/O latencies. For this reason, it is better to avoid long, heavy writes so that foreground GC is never triggered.
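A simple application-level mitigation is to pace sustained bulk writes so they stay below a budget that background GC can keep up with. A minimal sketch; the 50 MB/s budget is purely illustrative, and the right value depends on the drive:

```python
import time

WRITE_BUDGET_BYTES_PER_SEC = 50 * 1024 * 1024  # illustrative 50 MB/s budget

def paced_write(write_fn, chunks):
    """Issue writes while pacing the sustained rate, giving the drive's
    background GC time to keep up so foreground GC is less likely to kick in."""
    start = time.monotonic()
    written = 0
    for chunk in chunks:
        write_fn(chunk)
        written += len(chunk)
        expected_elapsed = written / WRITE_BUDGET_BYTES_PER_SEC
        actual_elapsed = time.monotonic() - start
        if expected_elapsed > actual_elapsed:
            time.sleep(expected_elapsed - actual_elapsed)
```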

We conducted experiments to see how the write rate affects I/O performance. The write rate was varied from 10 MB/s to 800 MB/s, running for two hours at each rate. For each test case, we recorded the maximum write latency as well as the number of “large” latencies (larger than 50 ms). When the write rate is a light 10 MB/s, the largest I/O latency is 8 ms. The maximum latency increases with higher write rates, reaching 92 ms at 800 MB/s. No latencies larger than 50 ms are observed at write rates of 10 MB/s or 50 MB/s, while 61 such latencies are observed at 800 MB/s.

Figure: number of large (>50 ms) write latencies and maximum write latency (ms) at each write rate.

The SSD usage level (i.e., how full the disk is) impacts the write amplification factor and the write performance during GC. During GC, blocks need to be erased to create free blocks, and erasing a block requires preserving any pages that still contain valid data. The number of blocks that must be compacted to free one block is determined by how full the disk is. Assuming the disk is A% full, on average 1/(1−A%) blocks need to be compacted to free one block. Clearly, the fuller an SSD is, the more blocks must be moved around to free one block, which takes more resources and results in longer I/O waits. For instance, if A = 80%, about five blocks of data are moved around to free one block; when A = 95%, about 20 blocks are moved.

Examining the number of pages that need to be preserved during GC, the impact is even larger. During GC, the live pages that contain valid data need to be copied around. Assuming each block has P pages, freeing one block requires copying P × A/(1−A) live pages. Assuming P = 128: when A = 80%, 512 pages are copied; when A = 95%, it is 2,432 pages! Hence, to ensure GC efficiency, an SSD should not be fully occupied by live data.

Figure: number of blocks compacted and number of pages copied to free one block, as a function of disk fullness.
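The numbers above follow directly from the two formulas; a quick sketch reproduces them, using the P = 128 pages-per-block figure from the text:

```python
def gc_cost(fullness, pages_per_block=128):
    """Blocks compacted and valid pages copied to free one block, assuming
    valid data is spread evenly across blocks at the given fullness."""
    blocks_compacted = 1.0 / (1.0 - fullness)
    pages_copied = pages_per_block * fullness / (1.0 - fullness)
    return blocks_compacted, pages_copied

print(gc_cost(0.80))  # ≈ (5 blocks, 512 pages)
print(gc_cost(0.95))  # ≈ (20 blocks, 2432 pages)
```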

An SSD has multiple levels of internal parallelism: channel, package, chip, and plane. A single I/O thread cannot fully utilize this parallelism, resulting in longer access times. Using multiple threads takes advantage of the internal parallelism: the SSD can distribute read and write operations across the available channels efficiently, providing a high level of internal I/O concurrency. For instance, we used an application to perform 10KB write I/O. With one I/O thread, it achieves 115MB/s; two threads roughly double the throughput, and four threads double it again. Eight threads achieve about 500MB/s.

A natural question is: how small is “small”? The answer is that any I/O size that cannot fully utilize the internal parallelism is considered “small”. For instance, with a 4KB page size and a parallelism level of 16, the threshold should be around 64KB.

Figure: throughput of many threads issuing small I/Os vs. fewer threads issuing large I/Os.

This design change is the flip side of the previous one. For large I/Os, a single request can already take advantage of the SSD’s internal parallelism, hence fewer threads (i.e., one or two) suffice to achieve maximum I/O throughput; throughput-wise, adding more threads brings no benefit.

Moreover, multiple threads incur other problems and overheads, such as resource contention between threads (e.g., on the SSD mapping table) and interference with background activities such as OS-level read-ahead and write-back. For instance, based on our experiments, when the write size is 10MB, one thread achieves 414MB/s and two threads achieve 816MB/s, while eight threads actually drop to 500MB/s.
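To make the threading advice concrete, here is a minimal sketch of issuing small writes from a thread pool; the I/O size, thread count, and file layout are illustrative assumptions, and for large I/Os the same code would simply be run with one or two workers:

```python
import os
from concurrent.futures import ThreadPoolExecutor

IO_SIZE = 10 * 1024   # illustrative "small" I/O size (10KB)
NUM_THREADS = 8       # illustrative; use one or two workers for large I/Os
NUM_WRITES = 10_000

def writer(path, offsets):
    """Each worker issues its own pwrite() calls, keeping multiple I/Os in
    flight so the SSD's channels and planes can be used in parallel."""
    fd = os.open(path, os.O_WRONLY)
    buf = b"x" * IO_SIZE
    try:
        for off in offsets:
            os.pwrite(fd, buf, off)
    finally:
        os.close(fd)

def parallel_small_writes(path):
    # Pre-size the file, then split the offsets among the worker threads.
    with open(path, "wb") as f:
        f.truncate(IO_SIZE * NUM_WRITES)
    offsets = [i * IO_SIZE for i in range(NUM_WRITES)]
    shards = [offsets[i::NUM_THREADS] for i in range(NUM_THREADS)]
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        for shard in shards:
            pool.submit(writer, path, shard)
```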

Applications that use SSDs generally perform better than those that use HDDs. However, without changes to application designs, applications cannot achieve optimal performance, because SSDs work differently from HDDs. To unlock the full performance potential of SSDs, application designs must be SSD-friendly. The SSD-friendly design changes we proposed in this blog post work at the application layer, but they can also help drive storage-layer and cross-layer designs.