mirror of
https://source.quilibrium.com/quilibrium/ceremonyclient.git
synced 2025-01-26 15:47:11 +00:00
758 lines
36 KiB
Markdown
758 lines
36 KiB
Markdown
# Pebble vs RocksDB: Implementation Differences
|
||
|
||
RocksDB is a key-value store implemented using a Log-Structured
|
||
Merge-Tree (LSM). This document is not a primer on LSMs. There exist
|
||
some decent
|
||
[introductions](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/)
|
||
on the web, or try chapter 3 of [Designing Data-Intensive
|
||
Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321).
|
||
|
||
Pebble inherits the RocksDB file formats, has a similar API, and
|
||
shares many implementation details, but it also has many differences
|
||
that improve performance, reduce implementation complexity, or extend
|
||
functionality. This document highlights some of the more important
|
||
differences.
|
||
|
||
* [Internal Keys](#internal-keys)
|
||
* [Indexed Batches](#indexed-batches)
|
||
* [Large Batches](#large-batches)
|
||
* [Commit Pipeline](#commit-pipeline)
|
||
* [Range Deletions](#range-deletions)
|
||
* [Flush and Compaction Pacing](#flush-and-compaction-pacing)
|
||
* [Write Throttling](#write-throttling)
|
||
* [Other Differences](#other-differences)
|
||
|
||
## Internal Keys
|
||
|
||
The external RocksDB API accepts keys and values. Due to the LSM
|
||
structure, keys are never updated in place, but overwritten with new
|
||
versions. Inside RocksDB, these versioned keys are known as Internal
|
||
Keys. An Internal Key is composed of the user specified key, a
|
||
sequence number and a kind. On disk, sstables always store Internal
|
||
Keys.
|
||
|
||
```
|
||
+-------------+------------+----------+
|
||
| UserKey (N) | SeqNum (7) | Kind (1) |
|
||
+-------------+------------+----------+
|
||
```
|
||
|
||
The `Kind` field indicates the type of key: set, merge, delete, etc.
|
||
|
||
While Pebble inherits the Internal Key encoding for format
|
||
compatibility, it diverges from RocksDB in how it manages Internal
|
||
Keys in its implementation. In RocksDB, Internal Keys are represented
|
||
either in encoded form (as a string) or as a `ParsedInternalKey`. The
|
||
latter is a struct with the components of the Internal Key as three
|
||
separate fields.
|
||
|
||
```c++
|
||
struct ParsedInternalKey {
|
||
Slice user_key;
|
||
uint64 seqnum;
|
||
uint8 kind;
|
||
}
|
||
```
|
||
|
||
The component format is convenient: changing the `SeqNum` or `Kind` is
|
||
field assignment. Extracting the `UserKey` is a field
|
||
reference. However, RocksDB tends to only use `ParsedInternalKey`
|
||
locally. The major internal APIs, such as `InternalIterator`, operate
|
||
using encoded internal keys (i.e. strings) for parameters and return
|
||
values.
|
||
|
||
To give a concrete example of the overhead this causes, consider
|
||
`Iterator::Seek(user_key)`. The external `Iterator` is implemented on
|
||
top of an `InternalIterator`. `Iterator::Seek` ends up calling
|
||
`InternalIterator::Seek`. Both Seek methods take a key, but
|
||
`InternalIterator::Seek` expects an encoded Internal Key. This is both
|
||
error prone and expensive. The key passed to `Iterator::Seek` needs to
|
||
be copied into a temporary string in order to append the `SeqNum` and
|
||
`Kind`. In Pebble, Internal Keys are represented in memory using an
|
||
`InternalKey` struct that is the analog of `ParsedInternalKey`. All
|
||
internal APIs use `InternalKeys`, with the exception of the lowest
|
||
level routines for decoding data from sstables. In Pebble, since the
|
||
interfaces all take and return the `InternalKey` struct, we don’t need
|
||
to allocate to construct the Internal Key from the User Key, but
|
||
RocksDB sometimes needs to allocate, and encode (i.e. make a
|
||
copy). The use of the encoded form also causes RocksDB to pass encoded
|
||
keys to the comparator routines, sometimes decoding the keys multiple
|
||
times during the course of processing.
|
||
|
||
## Indexed Batches
|
||
|
||
In RocksDB, a batch is the unit for all write operations. Even writing
|
||
a single key is transformed internally to a batch. The batch internal
|
||
representation is a contiguous byte buffer with a fixed 12-byte
|
||
header, followed by a series of records.
|
||
|
||
```
|
||
+------------+-----------+--- ... ---+
|
||
| SeqNum (8) | Count (4) | Entries |
|
||
+------------+-----------+--- ... ---+
|
||
```
|
||
|
||
Each record has a 1-byte kind tag prefix, followed by 1 or 2 length
|
||
prefixed strings (varstring):
|
||
|
||
```
|
||
+----------+-----------------+-------------------+
|
||
| Kind (1) | Key (varstring) | Value (varstring) |
|
||
+----------+-----------------+-------------------+
|
||
```
|
||
|
||
(The `Kind` indicates if there are 1 or 2 varstrings. `Set`, `Merge`,
|
||
and `DeleteRange` have 2 varstrings, while `Delete` has 1.)
|
||
|
||
Adding a mutation to a batch involves appending a new record to the
|
||
buffer. This format is extremely fast for writes, but the lack of
|
||
indexing makes it untenable to use directly for reads. In order to
|
||
support iteration, a separate indexing structure is created. Both
|
||
RocksDB and Pebble use a skiplist for the indexing structure, but with
|
||
a clever twist. Rather than the skiplist storing a copy of the key, it
|
||
simply stores the offset of the record within the mutation buffer. The
|
||
result is that the skiplist acts a multi-map (i.e. a map that can have
|
||
duplicate entries for a given key). The iteration order for this map
|
||
is constructed so that records sort on key, and for equal keys they
|
||
sort on descending offset. Newer records for the same key appear
|
||
before older records.
|
||
|
||
While the indexing structure for batches is nearly identical between
|
||
RocksDB and Pebble, how the index structure is used is completely
|
||
different. In RocksDB, a batch is indexed using the
|
||
`WriteBatchWithIndex` class. The `WriteBatchWithIndex` class provides
|
||
a `NewIteratorWithBase` method that allows iteration over the merged
|
||
view of the batch contents and an underlying "base" iterator created
|
||
from the database. `BaseDeltaIterator` contains logic to iterate over
|
||
the batch entries and the base iterator in parallel which allows us to
|
||
perform reads on a snapshot of the database as though the batch had
|
||
been applied to it. On the surface this sounds reasonable, yet the
|
||
implementation is incomplete. Merge and DeleteRange operations are not
|
||
supported. The reason they are not supported is because handling them
|
||
is complex and requires duplicating logic that already exists inside
|
||
RocksDB for normal iterator processing.
|
||
|
||
Pebble takes a different approach to iterating over a merged view of a
|
||
batch's contents and the underlying database: it treats the batch as
|
||
another level in the LSM. Recall that an LSM is composed of zero or
|
||
more memtable layers and zero or more sstable layers. Internally, both
|
||
RocksDB and Pebble contain a `MergingIterator` that knows how to merge
|
||
the operations from different levels, including processing overwritten
|
||
keys, merge operations, and delete range operations. The challenge
|
||
with treating the batch as another level to be used by a
|
||
`MergingIterator` is that the records in a batch do not have a
|
||
sequence number. The sequence number in the batch header is not
|
||
assigned until the batch is committed. The solution is to give the
|
||
batch records temporary sequence numbers. We need these temporary
|
||
sequence numbers to be larger than any other sequence number in the
|
||
database so that the records in the batch are considered newer than
|
||
any committed record. This is accomplished by reserving the high-bit
|
||
in the 56-bit sequence number for use as a marker for batch sequence
|
||
numbers. The sequence number for a record in an uncommitted batch is:
|
||
|
||
```
|
||
RecordOffset | (1<<55)
|
||
```
|
||
|
||
Newer records in a given batch will have a larger sequence number than
|
||
older records in the batch. And all of the records in a batch will
|
||
have larger sequence numbers than any committed record in the
|
||
database.
|
||
|
||
The end result is that Pebble's batch iterators support all of the
|
||
functionality of regular database iterators with minimal additional
|
||
code.
|
||
|
||
## Large Batches
|
||
|
||
The size of a batch is limited only by available memory, yet the
|
||
required memory is not just the batch representation. When a batch is
|
||
committed, the commit operation iterates over the records in the batch
|
||
from oldest to newest and inserts them into the current memtable. The
|
||
memtable is an in-memory structure that buffers mutations that have
|
||
been committed (written to the Write Ahead Log), but not yet written
|
||
to an sstable. Internally, a memtable uses a skiplist to index
|
||
records. Each skiplist entry has overhead for the index links and
|
||
other metadata that is a dozen bytes at minimum. A large batch
|
||
composed of many small records can require twice as much memory when
|
||
inserted into a memtable than it required in the batch. And note that
|
||
this causes a temporary increase in memory requirements because the
|
||
batch memory is not freed until it is completely committed.
|
||
|
||
A non-obvious implementation restriction present in both RocksDB and
|
||
Pebble is that there is a one-to-one correspondence between WAL files
|
||
and memtables. That is, a given WAL file has a single memtable
|
||
associated with it and vice-versa. While this restriction could be
|
||
removed, doing so is onerous and intricate. It should also be noted
|
||
that committing a batch involves writing it to a single WAL file. The
|
||
combination of restrictions results in a batch needing to be written
|
||
entirely to a single memtable.
|
||
|
||
What happens if a batch is too large to fit in a memtable? Memtables
|
||
are generally considered to have a fixed size, yet this is not
|
||
actually true in RocksDB. In RocksDB, the memtable skiplist is
|
||
implemented on top of an arena structure. An arena is composed of a
|
||
list of fixed size chunks, with no upper limit set for the number of
|
||
chunks that can be associated with an arena. So RocksDB handles large
|
||
batches by allowing a memtable to grow beyond its configured
|
||
size. Concretely, while RocksDB may be configured with a 64MB memtable
|
||
size, a 1GB batch will cause the memtable to grow to accomodate
|
||
it. Functionally, this is good, though there is a practical problem: a
|
||
large batch is first written to the WAL, and then added to the
|
||
memtable. Adding the large batch to the memtable may consume so much
|
||
memory that the system runs out of memory and is killed by the
|
||
kernel. This can result in a death loop because upon restarting as the
|
||
batch is read from the WAL and applied to the memtable again.
|
||
|
||
In Pebble, the memtable is also implemented using a skiplist on top of
|
||
an arena. Significantly, the Pebble arena is a fixed size. While the
|
||
RocksDB skiplist uses pointers, the Pebble skiplist uses offsets from
|
||
the start of the arena. The fixed size arena means that the Pebble
|
||
memtable cannot expand arbitrarily. A batch that is too large to fit
|
||
in the memtable causes the current mutable memtable to be marked as
|
||
immutable and the batch is wrapped in a `flushableBatch` structure and
|
||
added to the list of immutable memtables. Because the `flushableBatch`
|
||
is readable as another layer in the LSM, the batch commit can return
|
||
as soon as the `flushableBatch` has been added to the immutable
|
||
memtable list.
|
||
|
||
Internally, a `flushableBatch` provides iterator support by sorting
|
||
the batch contents (the batch is sorted once, when it is added to the
|
||
memtable list). Sorting the batch contents and insertion of the
|
||
contents into a memtable have the same big-O time, but the constant
|
||
factor dominates here. Sorting is significantly faster and uses
|
||
significantly less memory due to not having to copy the batch records.
|
||
|
||
Note that an effect of this large batch support is that Pebble can be
|
||
configured as an efficient on-disk sorter: specify a small memtable
|
||
size, disable the WAL, and set a large L0 compaction threshold. In
|
||
order to sort a large amount of data, create batches that are larger
|
||
than the memtable size and commit them. When committed these batches
|
||
will not be inserted into a memtable, but instead sorted and then
|
||
written out to L0. The fully sorted data can later be read and the
|
||
normal merging process will take care of the final ordering.
|
||
|
||
## Commit Pipeline
|
||
|
||
The commit pipeline is the component which manages the steps in
|
||
committing write batches, such as writing the batch to the WAL and
|
||
applying its contents to the memtable. While simple conceptually, the
|
||
commit pipeline is crucial for high performance. In the absence of
|
||
concurrency, commit performance is limited by how fast a batch can be
|
||
written (and synced) to the WAL and then added to the memtable, both
|
||
of which are outside of the purview of the commit pipeline.
|
||
|
||
To understand the challenge here, it is useful to have a conception of
|
||
the WAL (write-ahead log). The WAL contains a record of all of the
|
||
batches that have been committed to the database. As a record is
|
||
written to the WAL it is added to the memtable. Each record is
|
||
assigned a sequence number which is used to distinguish newer updates
|
||
from older ones. Conceptually the WAL looks like:
|
||
|
||
```
|
||
+--------------------------------------+
|
||
| Batch(SeqNum=1,Count=9,Records=...) |
|
||
+--------------------------------------+
|
||
| Batch(SeqNum=10,Count=5,Records=...) |
|
||
+--------------------------------------+
|
||
| Batch(SeqNum=15,Count=7,Records...) |
|
||
+--------------------------------------+
|
||
| ... |
|
||
+--------------------------------------+
|
||
```
|
||
|
||
Note that each WAL entry is precisely the batch representation
|
||
described earlier in the [Indexed Batches](#indexed-batches)
|
||
section. The monotonically increasing sequence numbers are a critical
|
||
component in allowing RocksDB and Pebble to provide fast snapshot
|
||
views of the database for reads.
|
||
|
||
If concurrent performance was not a concern, the commit pipeline could
|
||
simply be a mutex which serialized writes to the WAL and application
|
||
of the batch records to the memtable. Concurrent performance is a
|
||
concern, though.
|
||
|
||
The primary challenge in concurrent performance in the commit pipeline
|
||
is maintaining two invariants:
|
||
|
||
1. Batches need to be written to the WAL in sequence number order.
|
||
2. Batches need to be made visible for reads in sequence number
|
||
order. This invariant arises from the use of a single sequence
|
||
number which indicates which mutations are visible.
|
||
|
||
The second invariant deserves explanation. RocksDB and Pebble both
|
||
keep track of a visible sequence number. This is the sequence number
|
||
for which records in the database are visible during reads. The
|
||
visible sequence number exists because committing a batch is an atomic
|
||
operation, yet adding records to the memtable is done without an
|
||
exclusive lock (the skiplists used by both Pebble and RocksDB are
|
||
lock-free). When the records from a batch are being added to the
|
||
memtable, a concurrent read operation may see those records, but will
|
||
skip over them because they are newer than the visible sequence
|
||
number. Once all of the records in the batch have been added to the
|
||
memtable, the visible sequence number is atomically incremented.
|
||
|
||
So we have four steps in committing a write batch:
|
||
|
||
1. Write the batch to the WAL
|
||
2. Apply the mutations in the batch to the memtable
|
||
3. Bump the visible sequence number
|
||
4. (Optionally) sync the WAL
|
||
|
||
Writing the batch to the WAL is actually very fast as it is just a
|
||
memory copy. Applying the mutations in the batch to the memtable is by
|
||
far the most CPU intensive part of the commit pipeline. Syncing the
|
||
WAL is the most expensive from a wall clock perspective.
|
||
|
||
With that background out of the way, let's examine how RocksDB commits
|
||
batches. This description is of the traditional commit pipeline in
|
||
RocksDB (i.e. the one used by CockroachDB).
|
||
|
||
RocksDB achieves concurrency in the commit pipeline by grouping
|
||
concurrently committed batches into a batch group. Each group is
|
||
assigned a "leader" which is the first batch to be added to the
|
||
group. The batch group is written atomically to the WAL by the leader
|
||
thread, and then the individual batches making up the group are
|
||
concurrently applied to the memtable. Lastly, the visible sequence
|
||
number is bumped such that all of the batches in the group become
|
||
visible in a single atomic step. While a batch group is being applied,
|
||
other concurrent commits are added to a waiting list. When the group
|
||
commit finishes, the waiting commits form the next group.
|
||
|
||
There are two criticisms of the batch grouping approach. The first is
|
||
that forming a batch group involves copying batch contents. RocksDB
|
||
partially alleviates this for large batches by placing a limit on the
|
||
total size of a group. A large batch will end up in its own group and
|
||
not be copied, but the criticism still applies for small batches. Note
|
||
that there are actually two copies here. The batch contents are
|
||
concatenated together to form the group, and then the group contents
|
||
are written into an in memory buffer for the WAL before being written
|
||
to disk.
|
||
|
||
The second criticism is about the thread synchronization points. Let's
|
||
consider what happens to a commit which becomes the leader:
|
||
|
||
1. Lock commit mutex
|
||
2. Wait to become leader
|
||
3. Form (concatenate) batch group and write to the WAL
|
||
4. Notify followers to apply their batch to the memtable
|
||
5. Apply own batch to memtable
|
||
6. Wait for followers to finish
|
||
7. Bump visible sequence number
|
||
8. Unlock commit mutex
|
||
9. Notify followers that the commit is complete
|
||
|
||
The follower's set of operations looks like:
|
||
|
||
1. Lock commit mutex
|
||
2. Wait to become follower
|
||
3. Wait to be notified that it is time to apply batch
|
||
4. Unlock commit mutex
|
||
5. Apply batch to memtable
|
||
6. Wait to be notified that commit is complete
|
||
|
||
The thread synchronization points (all of the waits and notifies) are
|
||
overhead. Reducing that overhead can improve performance.
|
||
|
||
The Pebble commit pipeline addresses both criticisms. The main
|
||
innovation is a commit queue that mirrors the commit order. The Pebble
|
||
commit pipeline looks like:
|
||
|
||
1. Lock commit mutex
|
||
* Add batch to commit queue
|
||
* Assign batch sequence number
|
||
* Write batch to the WAL
|
||
2. Unlock commit mutex
|
||
3. Apply batch to memtable (concurrently)
|
||
4. Publish batch sequence number
|
||
|
||
Pebble does not use the concept of a batch group. Each batch is
|
||
individually written to the WAL, but note that the WAL write is just a
|
||
memory copy into an internal buffer in the WAL.
|
||
|
||
Step 4 deserves further scrutiny as it is where the invariant on the
|
||
visible batch sequence number is maintained. Publishing the batch
|
||
sequence number cannot simply bump the visible sequence number because
|
||
batches with earlier sequence numbers may still be applying to the
|
||
memtable. If we were to ratchet the visible sequence number without
|
||
waiting for those applies to finish, a concurrent reader could see
|
||
partial batch contents. Note that RocksDB has experimented with
|
||
allowing these semantics with its unordered writes option.
|
||
|
||
We want to retain the atomic visibility of batch commits. The publish
|
||
batch sequence number step needs to ensure that we don't ratchet the
|
||
visible sequence number until all batches with earlier sequence
|
||
numbers have applied. Enter the commit queue: a lock-free
|
||
single-producer, multi-consumer queue. Batches are added to the commit
|
||
queue with the commit mutex held, ensuring the same order as the
|
||
sequence number assignment. After a batch finishes applying to the
|
||
memtable, it atomically marks the batch as applied. It then removes
|
||
the prefix of applied batches from the commit queue, bumping the
|
||
visible sequence number, and marking the batch as committed (via a
|
||
`sync.WaitGroup`). If the first batch in the commit queue has not be
|
||
applied we wait for our batch to be committed, relying on another
|
||
concurrent committer to perform the visible sequence ratcheting for
|
||
our batch. We know a concurrent commit is taking place because if
|
||
there was only one batch committing it would be at the head of the
|
||
commit queue.
|
||
|
||
There are two possibilities when publishing a sequence number. The
|
||
first is that there is an unapplied batch at the head of the
|
||
queue. Consider the following scenario where we're trying to publish
|
||
the sequence number for batch `B`.
|
||
|
||
```
|
||
+---------------+-------------+---------------+-----+
|
||
| A (unapplied) | B (applied) | C (unapplied) | ... |
|
||
+---------------+-------------+---------------+-----+
|
||
```
|
||
|
||
The publish routine will see that `A` is unapplied and then simply
|
||
wait for `B's` done `sync.WaitGroup` to be signalled. This is safe
|
||
because `A` must still be committing. And if `A` has concurrently been
|
||
marked as applied, the goroutine publishing `A` will then publish
|
||
`B`. What happens when `A` publishes its sequence number? The commit
|
||
queue state becomes:
|
||
|
||
```
|
||
+-------------+-------------+---------------+-----+
|
||
| A (applied) | B (applied) | C (unapplied) | ... |
|
||
+-------------+-------------+---------------+-----+
|
||
```
|
||
|
||
The publish routine pops `A` from the queue, ratchets the sequence
|
||
number, then pops `B` and ratchets the sequence number again, and then
|
||
finds `C` and stops. A detail that it is important to notice is that
|
||
the committer for batch `B` didn't have to do any more work. An
|
||
alternative approach would be to have `B` wakeup and ratchet its own
|
||
sequence number, but that would serialize the remainder of the commit
|
||
queue behind that goroutine waking up.
|
||
|
||
The commit queue reduces the number of thread synchronization
|
||
operations required to commit a batch. There is no leader to notify,
|
||
or followers to wait for. A commit either publishes its own sequence
|
||
number, or performs one synchronization operation to wait for a
|
||
concurrent committer to publish its sequence number.
|
||
|
||
## Range Deletions
|
||
|
||
Deletion of an individual key in RocksDB and Pebble is accomplished by
|
||
writing a deletion tombstone. A deletion tombstone shadows an existing
|
||
value for a key, causing reads to treat the key as not present. The
|
||
deletion tombstone mechanism works well for deleting small sets of
|
||
keys, but what happens if you want to all of the keys within a range
|
||
of keys that might number in the thousands or millions? A range
|
||
deletion is an operation which deletes an entire range of keys with a
|
||
single record. In contrast to a point deletion tombstone which
|
||
specifies a single key, a range deletion tombstone (a.k.a. range
|
||
tombstone) specifies a start key (inclusive) and an end key
|
||
(exclusive). This single record is much faster to write than thousands
|
||
or millions of point deletion tombstones, and can be done blindly --
|
||
without iterating over the keys that need to be deleted. The downside
|
||
to range tombstones is that they require additional processing during
|
||
reads. How the processing of range tombstones is done significantly
|
||
affects both the complexity of the implementation, and the efficiency
|
||
of read operations in the presence of range tombstones.
|
||
|
||
A range tombstone is composed of a start key, end key, and sequence
|
||
number. Any key that falls within the range is considered deleted if
|
||
the key's sequence number is less than the range tombstone's sequence
|
||
number. RocksDB stores range tombstones segregated from point
|
||
operations in a special range deletion block within each sstable.
|
||
Conceptually, the range tombstones stored within an sstable are
|
||
truncated to the boundaries of the sstable, though there are
|
||
complexities that cause this to not actually be physically true.
|
||
|
||
In RocksDB, the main structure implementing range tombstone processing
|
||
is the `RangeDelAggregator`. Each read operation and iterator has its
|
||
own `RangeDelAggregator` configured for the sequence number the read
|
||
is taking place at. The initial implementation of `RangeDelAggregator`
|
||
built up a "skyline" for the range tombstones visible at the read
|
||
sequence number.
|
||
|
||
```
|
||
10 +---+
|
||
9 | |
|
||
8 | |
|
||
7 | +----+
|
||
6 | |
|
||
5 +-+ | +----+
|
||
4 | | | |
|
||
3 | | | +---+
|
||
2 | | | |
|
||
1 | | | |
|
||
0 | | | |
|
||
abcdefghijklmnopqrstuvwxyz
|
||
```
|
||
|
||
The above diagram shows the skyline created for the range tombstones
|
||
`[b,j)#5`, `[d,h)#10`, `[f,m)#7`, `[p,u)#5`, and `[t,y)#3`. The
|
||
skyline is queried for each key read to see if the key should be
|
||
considered deleted or not. The skyline structure is stored in a binary
|
||
tree, making the queries an O(logn) operation in the number of
|
||
tombstones, though there is an optimization to make this O(1) for
|
||
`next`/`prev` iteration. Note that the skyline representation loses
|
||
information about the range tombstones. This requires the structure to
|
||
be rebuilt on every read which has a significant performance impact.
|
||
|
||
The initial skyline range tombstone implementation has since been
|
||
replaced with a more efficient lookup structure. See the
|
||
[DeleteRange](https://rocksdb.org/blog/2018/11/21/delete-range.html)
|
||
blog post for a good description of both the original implementation
|
||
and the new (v2) implementation. The key change in the new
|
||
implementation is to "fragment" the range tombstones that are stored
|
||
in an sstable. The fragmented range tombstones provide the same
|
||
benefit as the skyline representation: the ability to binary search
|
||
the fragments in order to find the tombstone covering a key. But
|
||
unlike the skyline approach, the fragmented tombstones can be cached
|
||
on a per-sstable basis. In the v2 approach, `RangeDelAggregator` keeps
|
||
track of the fragmented range tombstones for each sstable encountered
|
||
during a read or iterator, and logically merges them together.
|
||
|
||
Fragmenting range tombstones involves splitting range tombstones at
|
||
overlap points. Let's consider the tombstones in the skyline example
|
||
above:
|
||
|
||
```
|
||
10: d---h
|
||
7: f------m
|
||
5: b-------j p----u
|
||
3: t----y
|
||
```
|
||
|
||
Fragmenting the range tombstones at the overlap points creates a
|
||
larger number of range tombstones:
|
||
|
||
```
|
||
10: d-f-h
|
||
7: f-h-j--m
|
||
5: b-d-f-h-j p---tu
|
||
3: tu---y
|
||
```
|
||
|
||
While the number of tombstones is larger there is a significant
|
||
advantage: we can order the tombstones by their start key and then
|
||
binary search to find the set of tombstones overlapping a particular
|
||
point. This is possible because due to the fragmenting, all the
|
||
tombstones that overlap a range of keys will have the same start and
|
||
end key. The v2 `RangeDelAggregator` and associated classes perform
|
||
fragmentation of range tombstones stored in each sstable and those
|
||
fragmented tombstones are then cached.
|
||
|
||
In summary, in RocksDB `RangeDelAggregator` acts as an oracle for
|
||
answering whether a key is deleted at a particular sequence
|
||
number. Due to caching of fragmented tombstones, the v2 implementation
|
||
of `RangeDelAggregator` implementation is significantly faster to
|
||
populate than v1, yet the overall approach to processing range
|
||
tombstones remains similar.
|
||
|
||
Pebble takes a different approach: it integrates range tombstones
|
||
processing directly into the `mergingIter` structure. `mergingIter` is
|
||
the internal structure which provides a merged view of the levels in
|
||
an LSM. RocksDB has a similar class named
|
||
`MergingIterator`. Internally, `mergingIter` maintains a heap over the
|
||
levels in the LSM (note that each memtable and L0 table is a separate
|
||
"level" in `mergingIter`). In RocksDB, `MergingIterator` knows nothing
|
||
about range tombstones, and it is thus up to higher-level code to
|
||
process range tombstones using `RangeDelAggregator`.
|
||
|
||
While the separation of `MergingIterator` and range tombstones seems
|
||
reasonable at first glance, there is an optimization that RocksDB does
|
||
not perform which is awkward with the `RangeDelAggregator` approach:
|
||
skipping swaths of deleted keys. A range tombstone often shadows more
|
||
than one key. Rather than iterating over the deleted keys, it is much
|
||
quicker to seek to the end point of the range tombstone. The challenge
|
||
in implementing this optimization is that a key might be newer than
|
||
the range tombstone and thus shouldn't be skipped. An insight to be
|
||
utilized is that the level structure itself provides sufficient
|
||
information. A range tombstone at `Ln` is guaranteed to be newer than
|
||
any key it overlaps in `Ln+1`.
|
||
|
||
Pebble utilizes the insight above to integrate range deletion
|
||
processing with `mergingIter`. A `mergingIter` maintains a point
|
||
iterator and a range deletion iterator per level in the LSM. In this
|
||
context, every L0 table is a separate level, as is every
|
||
memtable. Within a level, when a range deletion contains a point
|
||
operation the sequence numbers must be checked to determine if the
|
||
point operation is newer or older than the range deletion
|
||
tombstone. The `mergingIter` maintains the invariant that the range
|
||
deletion iterators for all levels newer that the current iteration key
|
||
are positioned at the next (or previous during reverse iteration)
|
||
range deletion tombstone. We know those levels don't contain a range
|
||
deletion tombstone that covers the current key because if they did the
|
||
current key would be deleted. The range deletion iterator for the
|
||
current key's level is positioned at a range tombstone covering or
|
||
past the current key. The position of all of other range deletion
|
||
iterators is unspecified. Whenever a key from those levels becomes the
|
||
current key, their range deletion iterators need to be
|
||
positioned. This lazy positioning avoids seeking the range deletion
|
||
iterators for keys that are never considered.
|
||
|
||
For a full example, consider the following setup:
|
||
|
||
```
|
||
p0: o
|
||
r0: m---q
|
||
|
||
p1: n p
|
||
r1: g---k
|
||
|
||
p2: b d i
|
||
r2: a---e q----v
|
||
|
||
p3: e
|
||
r3:
|
||
```
|
||
|
||
The diagram above shows is showing 4 levels, with `pX` indicating the
|
||
point operations in a level and `rX` indicating the range tombstones.
|
||
|
||
If we start iterating from the beginning, the first key we encounter
|
||
is `b` in `p2`. When the mergingIter is pointing at a valid entry, the
|
||
range deletion iterators for all of the levels less that the current
|
||
key's level are positioned at the next range tombstone past the
|
||
current key. So `r0` will point at `[m,q)` and `r1` at `[g,k)`. When
|
||
the key `b` is encountered, we check to see if the current tombstone
|
||
for `r0` or `r1` contains it, and whether the tombstone for `r2`,
|
||
`[a,e)`, contains and is newer than `b`.
|
||
|
||
Advancing the iterator finds the next key at `d`. This is in the same
|
||
level as the previous key `b` so we don't have to reposition any of
|
||
the range deletion iterators, but merely check whether `d` is now
|
||
contained by any of the range tombstones at higher levels or has
|
||
stepped past the range tombstone in its own level. In this case, there
|
||
is nothing to be done.
|
||
|
||
Advancing the iterator again finds `e`. Since `e` comes from `p3`, we
|
||
have to position the `r3` range deletion iterator, which is empty. `e`
|
||
is past the `r2` tombstone of `[a,e)` so we need to advance the `r2`
|
||
range deletion iterator to `[q,v)`.
|
||
|
||
The next key is `i`. Because this key is in `p2`, a level above `e`,
|
||
we don't have to reposition any range deletion iterators and instead
|
||
see that `i` is covered by the range tombstone `[g,k)`. The iterator
|
||
is immediately advanced to `n` which is covered by the range tombstone
|
||
`[m,q)` causing the iterator to advance to `o` which is visible.
|
||
|
||
## Flush and Compaction Pacing
|
||
|
||
Flushes and compactions in LSM trees are problematic because they
|
||
contend with foreground traffic, resulting in write and read latency
|
||
spikes. Without throttling the rate of flushes and compactions, they
|
||
occur "as fast as possible" (which is not entirely true, since we
|
||
have a `bytes_per_sync` option). This instantaneous usage of CPU and
|
||
disk IO results in potentially huge latency spikes for writes and
|
||
reads which occur in parallel to the flushes and compactions.
|
||
|
||
RocksDB attempts to solve this issue by offering an option to limit
|
||
the speed of flushes and compactions. A maximum `bytes/sec` can be
|
||
specified through the options, and background IO usage will be limited
|
||
to the specified amount. Flushes are given priority over compactions,
|
||
but they still use the same rate limiter. Though simple to implement
|
||
and understand, this option is fragile for various reasons.
|
||
|
||
1) If the rate limit is configured too low, the DB will stall and
|
||
write throughput will be affected.
|
||
2) If the rate limit is configured too high, the write and read
|
||
latency spikes will persist.
|
||
3) A different configuration is needed per system depending on the
|
||
speed of the storage device.
|
||
4) Write rates typically do not stay the same throughout the lifetime
|
||
of the DB (higher throughput during certain times of the day, etc) but
|
||
the rate limit cannot be configured during runtime.
|
||
|
||
RocksDB also offers an
|
||
["auto-tuned" rate limiter](https://rocksdb.org/blog/2017/12/18/17-auto-tuned-rate-limiter.html)
|
||
which uses a simple multiplicative-increase, multiplicative-decrease
|
||
algorithm to dynamically adjust the background IO rate limit depending
|
||
on how much of the rate limiter has been exhausted in an interval.
|
||
This solves the problem of having a static rate limit, but Pebble
|
||
attempts to improve on this with a different pacing mechanism.
|
||
|
||
Pebble's pacing mechanism uses separate rate limiters for flushes and
|
||
compactions. Both the flush and compaction pacing mechanisms work by
|
||
attempting to flush and compact only as fast as needed and no faster.
|
||
This is achieved differently for flushes versus compactions.
|
||
|
||
For flush pacing, Pebble keeps the rate at which the memtable is
|
||
flushed at the same rate as user writes. This ensures that disk IO
|
||
used by flushes remains steady. When a mutable memtable becomes full
|
||
and is marked immutable, it is typically flushed as fast as possible.
|
||
Instead of flushing as fast as possible, what we do is look at the
|
||
total number of bytes in all the memtables (mutable + queue of
|
||
immutables) and subtract the number of bytes that have been flushed in
|
||
the current flush. This number gives us the total number of bytes
|
||
which remain to be flushed. If we keep this number steady at a constant
|
||
level, we have the invariant that the flush rate is equal to the write
|
||
rate.
|
||
|
||
When the number of bytes remaining to be flushed falls below our
|
||
target level, we slow down the speed of flushing. We keep a minimum
|
||
rate at which the memtable is flushed so that flushes proceed even if
|
||
writes have stopped. When the number of bytes remaining to be flushed
|
||
goes above our target level, we allow the flush to proceed as fast as
|
||
possible, without applying any rate limiting. However, note that the
|
||
second case would indicate that writes are occurring faster than the
|
||
memtable can flush, which would be an unsustainable rate. The LSM
|
||
would soon hit the memtable count stall condition and writes would be
|
||
completely stopped.
|
||
|
||
For compaction pacing, Pebble uses an estimation of compaction debt,
|
||
which is the number of bytes which need to be compacted before no
|
||
further compactions are needed. This estimation is calculated by
|
||
looking at the number of bytes that have been flushed by the current
|
||
flush routine, adding those bytes to the size of the level 0 sstables,
|
||
then seeing how many bytes exceed the target number of bytes for the
|
||
level 0 sstables. We multiply the number of bytes exceeded by the
|
||
level ratio and add that number to the compaction debt estimate.
|
||
We repeat this process until the final level, which gives us a final
|
||
compaction debt estimate for the entire LSM tree.
|
||
|
||
Like with flush pacing, we want to keep the compaction debt at a
|
||
constant level. This ensures that compactions occur only as fast as
|
||
needed and no faster. If the compaction debt estimate falls below our
|
||
target level, we slow down compactions. We maintain a minimum
|
||
compaction rate so that compactions proceed even if flushes have
|
||
stopped. If the compaction debt goes above our target level, we let
|
||
compactions proceed as fast as possible without any rate limiting.
|
||
Just like with flush pacing, this would indicate that writes are
|
||
occurring faster than the background compactions can keep up with,
|
||
which is an unsustainable rate. The LSM's read amplification would
|
||
increase and the L0 file count stall condition would be hit.
|
||
|
||
With the combined flush and compaction pacing mechanisms, flushes and
|
||
compactions only occur as fast as needed and no faster, which reduces
|
||
latency spikes for user read and write operations.
|
||
|
||
## Write throttling
|
||
|
||
RocksDB adds artificial delays to user writes when certain thresholds
|
||
are met, such as `l0_slowdown_writes_threshold`. These artificial
|
||
delays occur when the system is close to stalling to lessen the write
|
||
pressure so that flushing and compactions can catch up. On the surface
|
||
this seems good, since write stalls would seemingly be eliminated and
|
||
replaced with gradual slowdowns. Closed loop write latency benchmarks
|
||
would show the elimination of abrupt write stalls, which seems
|
||
desirable.
|
||
|
||
However, this doesn't do anything to improve latencies in an open loop
|
||
model, which is the model more likely to resemble real world use
|
||
cases. Artificial delays increase write latencies without a clear
|
||
benefit. Writes stalls in an open loop system would indicate that
|
||
writes are generated faster than the system could possibly handle,
|
||
which adding artificial delays won't solve.
|
||
|
||
For this reason, Pebble doesn't add artificial delays to user writes
|
||
and writes are served as quickly as possible.
|
||
|
||
### Other Differences
|
||
|
||
* `internalIterator` API which minimizes indirect (virtual) function
|
||
calls
|
||
* Previous pointers in the memtable and indexed batch skiplists
|
||
* Elision of per-key lower/upper bound checks in long range scans
|
||
* Improved `Iterator` API
|
||
+ `SeekPrefixGE` for prefix iteration
|
||
+ `SetBounds` for adjusting the bounds on an existing `Iterator`
|
||
* Simpler `Get` implementation
|