Introducing data compaction in Ambry
May 8, 2019
Three years ago, LinkedIn announced and open sourced Ambry, a distributed, highly available and horizontally scalable immutable object store optimized to store and serve media. As we stated at the time, this was an important update given how vital media content is for any website to increase user engagement, virality, and monetization. Since its introduction, we have invested heavily in making Ambry more performant, scalable, and operable, while simultaneously reducing the cost of storage and serving, and increasing its appeal to a large variety of use cases beyond media storage.
Ambry’s feature set provides many options to manage the lifecycle of an object. In particular, it provides the ability to:
- Explicitly delete objects that were created at any point in time.
- Set a time-to-live (TTL) for an object at creation time; once the TTL elapses, Ambry cuts off access to the object for user requests (auto expiry).
These two operations result in objects that are defunct but still resident on disk and open up the possibility of reclaiming the space occupied by them.
As Ambry grew in scale and onboarded more types of use cases at LinkedIn, including cases that used Ambry as a temporary cache for sharing large objects between services (to avoid having to deal with sending large objects, services store the object in Ambry with a TTL and share the Ambry ID), it became increasingly necessary to automatically and periodically reclaim space occupied by these defunct objects.
In fact, Ambry’s storage footprint at LinkedIn is in the petabytes, with nearly 10% of objects that are uploaded to Ambry having a TTL of a week and between 30-40% of them either expiring or being deleted within 3-6 months. So, as a cost-to-serve reduction initiative, the benefit of regularly cleaning up defunct objects is straightforward, as it enables a more efficient use of the available storage and reduces cost.
This blog post is about the major data compaction initiative we’ve undertaken to make effective use of available storage by efficiently cleaning up defunct objects.
Data compaction in Ambry
In general, data compaction (or “compaction” for short) involves one or a combination of techniques including but not limited to: more compact representations of data, removal of unnecessary redundancy, and cleanup of defunct objects. In Ambry, data compaction refers to the cleanup of objects in storage that are either deleted or past their TTL. The following sections go into more detail, including a discussion of the storage abstractions, changes we made to make them more amenable to compaction, the compaction process, its execution strategy, and more.
Before we get into the details of compaction in Ambry, here is a quick primer on the design of Ambry’s storage layer.
Ambry partitions the total storage available in the cluster into logical partitions. Each of these partitions is then mapped to a number of physical replicas. The physical representation of a particular replica is a Store. A single storage server hosts multiple Store instances, which are distributed over all the disks available on the server. Figure 1 represents the layout of a storage server.
Figure 1: Storage server
Each Store instance comprises a Log and an Index. The Log is an append-only storage abstraction in which objects are laid out sequentially. New objects are always written at the logical end. In order to enable quick access to a particular object, the Store employs the Index, which is represented by a set of files (index segments) that map an object’s ID to its location in the Log. When certain criteria (e.g., number of entries, size of index segment, etc.) are met, an index segment file is sealed (made read-only) and written to disk as a Sorted String Table (SSTable) and a new index segment is created for subsequent additions.
Since the Log is an append-only data structure, deletions are represented as tombstones; an object is not deleted physically in the context of a delete request, but instead a special record that invalidates it is written both in the Log and Index. Objects that are invalidated due to their TTL elapsing do not insert any extra records in the Log (the expiry of an object is determined by looking at its creation time and TTL and comparing it with the current time), but are not cleaned up either. The Store will stop serving this object, but it will still be a part of the segment that it was created in. Both these situations leave the affected objects irrelevant and, quite literally, a waste of space.
Compaction in Ambry can be defined as the process of sweeping the Log to expunge invalidated objects. Since the Log is an append-only data structure, compaction can be achieved by creating a new Log with only the valid objects and discarding the one with invalidated objects.
The first version of Ambry did not have an automated process for compaction and space could only be reclaimed through operationally complex procedures leveraging replication that involved possible downtime. At a smaller scale, it was still possible to employ these procedures to reclaim space, but with the growth in usage of Ambry at LinkedIn, it became necessary to design and implement an automated process for compaction.
Change in Log architecture
When we set out to design an automated process, we outlined some operational goals for our final product, primary among them being simplicity (to trigger, monitor, and tune), efficient resource usage (both disk space and I/O bandwidth), and zero down time. In the process of designing a solution, we recognized the need to re-think and improve the existing Log architecture to make it more amenable to these goals.
Log as a single file
The initial design of Ambry modelled the Log as one single file whose size equalled that of the partition that it represented. Rewriting such a Log to rid it of invalidated objects requires a large temporary space (size equal to that of the partition), wastes I/O cycles, and creates periods where writes are disabled. To overcome these limitations, we re-architected the Log into segments.
Log as segments
As a first step towards implementing a well-crafted compaction process, we decided to model the Log as an abstraction over a set of equal-sized files called log segments. These files are ordered logically, share no records and are indexed by a non overlapping set of index segments. New writes are directed to the “lowest” (in the logical order) log segment that isn’t full. Once a log segment has reached its size limit, it is sealed (made read-only) and subsequent writes are directed to the next log segment in the logical order.
Figure 2: Log and Index as segments
A log segment is logically represented by two numbers. One of them represents the relative position of the log segment in the Log abstraction. The other represents the generation of the log segment, which is incremented as it undergoes compaction. An index segment is logically represented by two pieces of information. The first is a reference to the log segment it indexes and the other represents its relative position among all the index segments of the log segment. Figure 2 shows how the log and index segments are organized.
By treating the Log as a collection of segments rather than a single file, the following holds:
- Only the last (appendable) log segment and the corresponding index segments persist user writes. This affords a non-blocking implementation where writes can proceed on the last segment while we compact other segments.
- The (in)validity of an object can be determined independently of any other object present in the Store or cluster.
- A log segment and its associated index segment(s) can be re-written independently of other log segment(s) in the Log. This means that:
The size of the temporary space required is equal to that of a single segment;
Log segments can be processed in parallel if required.
Using the refined Log architecture, it is easier to design a framework for compaction.
Design of compaction
In the context of log segments, compaction is a process that sweeps through a single log segment or a list of contiguous log segments, selects valid objects, and copies them into new segment(s). At the end of the process, it substitutes the original segment(s) with the newly-created one(s). The next section describes these phases of compaction in more detail.
Phases of compaction
Tasks performed in this phase can include sanity checks (e.g., ensuring that the list of log segments to compact is in the right order, is contiguous, and does not include the last log segment), stopping any other jobs that perform maintenance on the candidate log segment(s), securing temporary working space(s) (each is equal to the size of one log segment), and creating a data structure to track and persist compaction progress (for resuming after restarts or recovering from abrupt terminations).
This phase represents the core of the compaction process. In this phase, valid objects (objects that have neither been deleted nor have auto-expired) from the candidate log segment(s) are copied to the temporary space. Since this phase is I/O intensive, copying is throttled and progress is rigorously recorded to minimize the recovery work required in case of shutdowns or crashes. Figure 3 describes a high-level algorithm for this phase at the granularity of a log segment.
Figure 3: Log segment copy algorithm
In this phase, the newly created log segment(s) and index segment(s) replace the old ones in the Log and Index. The key to the switch is the atomic replacement in the Index because user requests reach log segment(s) only through the Index. Once the replacement in the Index is complete, all subsequent read operations will be served from the new log segment(s). The implementation ensures that this phase takes no locks and does not interrupt any user reads or writes.
This phase refers to the necessary cleanup of the files on disk that represent the replaced log segment(s) and index segment(s). Due to in-progress request serving, it is possible that the old log segment(s) are still in use for a short period of time. This phase ensures that the transition is smooth and returns the newly-freed segments to the temporary space pool after the transition is complete. It may optionally also restart any services that were stopped in the preparation phase and finalize the compaction progress tracker to indicate completion.
Figure 4 depicts the states before and after compaction of two log segments whose valid data fits into one log segment. Notice the increment in the generation number of Segment 0.
Figure 4: Store before and after compaction
Limited temporary space
Compaction is designed to work as long as there is a temporary working space, at least as large as a single log segment, available. In the scenario of limited temporary space, the cycles above are repeated as long as there are candidate log segment(s) to compact. This is possible because every iteration of all four phases returns at least one log segment back to the pool. Figure 5 is an algorithm for compaction of multiple log segments taking into account the fact that temporary working space may be limited.
Figure 5: Overall compaction algorithm
Completing the picture
So far, we’ve discussed the process of compaction. To complete the discussion, we have to examine two critical pieces of the puzzle:
- Selection strategy: choosing the log segment(s) eligible for compaction.
- Execution strategy: the actual engine that executes compaction.
Selection strategies can have diverse goals. While one strategy may prefer a full sweep, another might want to avoid wasting limited I/O bandwidth by intelligently selecting the log segment(s) that are suitable for compaction, while yet another may prefer an amalgamation of the two. To that end, the design and implementation of compaction allow for custom selection strategies. At the time of writing this post, Ambry implements two of these strategies:
- CompactAll: All the log segment(s) except the one currently in use for user requests are selected for compaction. While simple, this strategy could consume a lot of I/O bandwidth (e.g., low rate of deletes/auto-expiries, differences in the characteristics of recent vs. old data). However, it is suitable for data compliance (erasing deleted/expired objects completely in a bounded timeline) and to detect corruption and disk problems.
- StatsBased: In this strategy, a dedicated module collects statistics about the number and total size of invalid objects and runs a simple cost-benefit algorithm (cost being the amount of data that will be copied, benefit being the number of log segment(s) that will be freed) to select a contiguous series of log segment(s) that are eligible for compaction. This strategy aims to minimize I/O bandwidth use but does not guarantee that all invalid objects will eventually be cleaned up.
Owing to the design, each Store could theoretically execute compaction independently of other Store instances (in fact, compactions on non overlapping log segment(s) inside a single Store can execute independently of each other). But practical considerations, like available disk bandwidth, necessitate the introduction of some dependencies between Store instances that share common physical resources. The granularity of such sharing can be naturally limited to some level of coordination between all Store instances on the same disk. This is captured by the fact that both the threads available to execute compaction and the usable disk bandwidth are regulated at the level of the disk (see Figure 6).
Figure 6: Compaction control flow
Testing and operationalization
Bugs in the design or implementation of compaction could result in the partial or total loss of data. The current implementation comes with a comprehensive set of unit tests covering most scenarios. Operationalizing compaction at LinkedIn was a considered process with rigorous testing and a deliberate and slow rollout. We developed tools (available in the repository) for comparison of objects in the Log and Index before and after compaction and for logical comparison between replicas. We ran these tools aggressively over synthetic data for many days and effected shutdowns and crashes at specific and random points in the execution to ensure correctness of recovery. Once testing over synthetic data was complete, we ran compaction for a few weeks in our performance and staging environments and used replicas of the same partition on other storage servers to compare logical equivalence.
In production, we ran compaction for many weeks on a single storage server. We watched for anomalies (primarily for error rate changes and a change in the number of objects not found) and we checked for logical equivalence against replicas of the same partition on other storage servers aggressively using specialized tooling (available in the repository). The next stage of the rollout was to one site, followed by all sites except one, which we preserved as a safeguard. Once completely confident in the implementation, we rolled it out globally.
Learnings from production
Deploying compaction at scale uncovered a few challenges related to efficient and non-intrusive use of I/O bandwidth, including:
Though throttled, compaction can still interfere with the latency of user requests during periods of high user traffic because the configuration may assume average traffic.
By nature, compaction transfers bytes at a high volume to the page cache. This reduces the amount of page cache available for user reads and writes, in addition to evicting memory-mapped index segments from RAM.
A global configuration for throttling is not feasible because of the disparate capabilities of disks in a heterogeneous cluster. On the other hand, maintaining custom configurations for each server is error-prone, imprecise, and costly.
We have addressed some of these challenges (e.g., we flush writes by the compaction process more often to disk in order to reduce the number of dirty pages in the page cache) and continue to work on more mature solutions.
Further, we learned that tuning compaction requires thinking about and making multiple trade-offs based on usage patterns. These can range from the decision on the size of the log segments (too large and compaction will not reclaim space fast enough, since it can only reclaim space in log-segment-sized chunks; too small and there will be too many files on disk and in in-memory structures, which may affect performance) to the frequency of running compaction (this can be automated by more intelligent selection policy implementations).
We continue to iterate on compaction in two directions: performance and supporting features that the compaction paradigm unlocks. A limited, inexhaustive list includes:
- Feedback-driven throttling: Manual configuration of throttling is not feasible in a heterogeneous cluster that has different types of disks and experiences different levels and distributions of traffic. An alternate approach could be to define a minimum rate and let the throttler adjust dynamically based on feedback. Feedback can be modelled in different ways and may include multiple variables, including user request latency, time of day, and percentage of disk bandwidth used.
- Use of direct I/O: Using direct I/O can be explored to avoid the use of page cache during compaction. While this may slow down the compaction process (because writes no longer have the luxury of writing to the page cache), it is expected to positively impact user request latency.
- Using compaction for operations like container deletion or to mark objects as eligible for cold storage: Compaction can be configured to regularly and reliably scan storage cluster-wide and so is the natural place to do these operations.
- Improvement of the statistics collection and the selection process: The implementation of compaction is orthogonal to the process of selecting compaction candidate(s). This allows adopters to plug in custom implementations.
LinkedIn has a strong commitment towards open source and the majority of the development of Ambry happens on our GitHub repository. The version that powers LinkedIn is the latest stable version from the repository, rather than a private, internal one. Contributions are welcome and encouraged. The repository contains guidelines for contribution and comprehensive documentation is available both in the wiki and inline in the code. You can also ask questions or start a discussion at firstname.lastname@example.org.
Compaction has enabled Ambry to use the storage available to it more efficiently, reduce operational costs, and be more attractive for adoption. The process of designing and implementing compaction in Ambry has had positive impact beyond the requirement of reducing storage costs. The components we have added as part of this effort have formed cornerstones for other projects. As a concrete example, the statistics collection module has enabled the publication of cluster-wide statistics about accounts and containers, providing Ambry tenants insights about their usage.
Initially envisioned as a system optimization, compaction also has other applications. In the context of compliance, compaction can serve as a mechanism to completely “forget” defunct data. It can also serve as a health monitor for storage, as it forces access to all data on the disk, potentially uncovering any corruption or bad sectors. Furthermore, the introduction of the powerful new paradigm of asynchronous execution of cluster-wide operations unlocks new possibilities and use cases.
Building an object store that can serve a multitude of use cases while keeping costs low is both challenging and fun. From design to operationalization, this project has enjoyed the contributions, suggestions, and support of many people. Big thanks for contributions in development from Priyesh Narayanan, Ashish Singhai, Ivo Dimitrov, Ze Mao, Sivabalan Narayanan, Xun Yin, Casey Getz, Rob Block, Yingyi Zhang, and David Harju. Dmitry Nikiforov, Bharat Patel, and the extended site reliability team of Ambry have been pivotal in operationalizing and tuning compaction to work well in production. Executing and productionalizing compaction would not have been possible without the constant support and leadership from Matthew Wise, Ashish Singhai, and Ivo Dimitrov. Last but not the least, thanks to Shubham Gupta, Walaa Eldin Moustafa, Banu Muthukumar, Jaren Anderson, and Stephen Lynch for helping refine this post with their valuable reviews and feedback.