mirror of
https://source.quilibrium.com/quilibrium/ceremonyclient.git
synced 2024-12-26 00:25:17 +00:00
758 lines
36 KiB
Markdown
758 lines
36 KiB
Markdown
|
# Pebble vs RocksDB: Implementation Differences
|
|||
|
|
|||
|
RocksDB is a key-value store implemented using a Log-Structured
|
|||
|
Merge-Tree (LSM). This document is not a primer on LSMs. There exist
|
|||
|
some decent
|
|||
|
[introductions](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/)
|
|||
|
on the web, or try chapter 3 of [Designing Data-Intensive
|
|||
|
Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321).
|
|||
|
|
|||
|
Pebble inherits the RocksDB file formats, has a similar API, and
|
|||
|
shares many implementation details, but it also has many differences
|
|||
|
that improve performance, reduce implementation complexity, or extend
|
|||
|
functionality. This document highlights some of the more important
|
|||
|
differences.
|
|||
|
|
|||
|
* [Internal Keys](#internal-keys)
|
|||
|
* [Indexed Batches](#indexed-batches)
|
|||
|
* [Large Batches](#large-batches)
|
|||
|
* [Commit Pipeline](#commit-pipeline)
|
|||
|
* [Range Deletions](#range-deletions)
|
|||
|
* [Flush and Compaction Pacing](#flush-and-compaction-pacing)
|
|||
|
* [Write Throttling](#write-throttling)
|
|||
|
* [Other Differences](#other-differences)
|
|||
|
|
|||
|
## Internal Keys
|
|||
|
|
|||
|
The external RocksDB API accepts keys and values. Due to the LSM
|
|||
|
structure, keys are never updated in place, but overwritten with new
|
|||
|
versions. Inside RocksDB, these versioned keys are known as Internal
|
|||
|
Keys. An Internal Key is composed of the user specified key, a
|
|||
|
sequence number and a kind. On disk, sstables always store Internal
|
|||
|
Keys.
|
|||
|
|
|||
|
```
|
|||
|
+-------------+------------+----------+
|
|||
|
| UserKey (N) | SeqNum (7) | Kind (1) |
|
|||
|
+-------------+------------+----------+
|
|||
|
```
|
|||
|
|
|||
|
The `Kind` field indicates the type of key: set, merge, delete, etc.
|
|||
|
|
|||
|
While Pebble inherits the Internal Key encoding for format
|
|||
|
compatibility, it diverges from RocksDB in how it manages Internal
|
|||
|
Keys in its implementation. In RocksDB, Internal Keys are represented
|
|||
|
either in encoded form (as a string) or as a `ParsedInternalKey`. The
|
|||
|
latter is a struct with the components of the Internal Key as three
|
|||
|
separate fields.
|
|||
|
|
|||
|
```c++
|
|||
|
struct ParsedInternalKey {
|
|||
|
Slice user_key;
|
|||
|
uint64 seqnum;
|
|||
|
uint8 kind;
|
|||
|
}
|
|||
|
```
|
|||
|
|
|||
|
The component format is convenient: changing the `SeqNum` or `Kind` is
|
|||
|
field assignment. Extracting the `UserKey` is a field
|
|||
|
reference. However, RocksDB tends to only use `ParsedInternalKey`
|
|||
|
locally. The major internal APIs, such as `InternalIterator`, operate
|
|||
|
using encoded internal keys (i.e. strings) for parameters and return
|
|||
|
values.
|
|||
|
|
|||
|
To give a concrete example of the overhead this causes, consider
|
|||
|
`Iterator::Seek(user_key)`. The external `Iterator` is implemented on
|
|||
|
top of an `InternalIterator`. `Iterator::Seek` ends up calling
|
|||
|
`InternalIterator::Seek`. Both Seek methods take a key, but
|
|||
|
`InternalIterator::Seek` expects an encoded Internal Key. This is both
|
|||
|
error prone and expensive. The key passed to `Iterator::Seek` needs to
|
|||
|
be copied into a temporary string in order to append the `SeqNum` and
|
|||
|
`Kind`. In Pebble, Internal Keys are represented in memory using an
|
|||
|
`InternalKey` struct that is the analog of `ParsedInternalKey`. All
|
|||
|
internal APIs use `InternalKeys`, with the exception of the lowest
|
|||
|
level routines for decoding data from sstables. In Pebble, since the
|
|||
|
interfaces all take and return the `InternalKey` struct, we don’t need
|
|||
|
to allocate to construct the Internal Key from the User Key, but
|
|||
|
RocksDB sometimes needs to allocate, and encode (i.e. make a
|
|||
|
copy). The use of the encoded form also causes RocksDB to pass encoded
|
|||
|
keys to the comparator routines, sometimes decoding the keys multiple
|
|||
|
times during the course of processing.
|
|||
|
|
|||
|
## Indexed Batches
|
|||
|
|
|||
|
In RocksDB, a batch is the unit for all write operations. Even writing
|
|||
|
a single key is transformed internally to a batch. The batch internal
|
|||
|
representation is a contiguous byte buffer with a fixed 12-byte
|
|||
|
header, followed by a series of records.
|
|||
|
|
|||
|
```
|
|||
|
+------------+-----------+--- ... ---+
|
|||
|
| SeqNum (8) | Count (4) | Entries |
|
|||
|
+------------+-----------+--- ... ---+
|
|||
|
```
|
|||
|
|
|||
|
Each record has a 1-byte kind tag prefix, followed by 1 or 2 length
|
|||
|
prefixed strings (varstring):
|
|||
|
|
|||
|
```
|
|||
|
+----------+-----------------+-------------------+
|
|||
|
| Kind (1) | Key (varstring) | Value (varstring) |
|
|||
|
+----------+-----------------+-------------------+
|
|||
|
```
|
|||
|
|
|||
|
(The `Kind` indicates if there are 1 or 2 varstrings. `Set`, `Merge`,
|
|||
|
and `DeleteRange` have 2 varstrings, while `Delete` has 1.)
|
|||
|
|
|||
|
Adding a mutation to a batch involves appending a new record to the
|
|||
|
buffer. This format is extremely fast for writes, but the lack of
|
|||
|
indexing makes it untenable to use directly for reads. In order to
|
|||
|
support iteration, a separate indexing structure is created. Both
|
|||
|
RocksDB and Pebble use a skiplist for the indexing structure, but with
|
|||
|
a clever twist. Rather than the skiplist storing a copy of the key, it
|
|||
|
simply stores the offset of the record within the mutation buffer. The
|
|||
|
result is that the skiplist acts a multi-map (i.e. a map that can have
|
|||
|
duplicate entries for a given key). The iteration order for this map
|
|||
|
is constructed so that records sort on key, and for equal keys they
|
|||
|
sort on descending offset. Newer records for the same key appear
|
|||
|
before older records.
|
|||
|
|
|||
|
While the indexing structure for batches is nearly identical between
|
|||
|
RocksDB and Pebble, how the index structure is used is completely
|
|||
|
different. In RocksDB, a batch is indexed using the
|
|||
|
`WriteBatchWithIndex` class. The `WriteBatchWithIndex` class provides
|
|||
|
a `NewIteratorWithBase` method that allows iteration over the merged
|
|||
|
view of the batch contents and an underlying "base" iterator created
|
|||
|
from the database. `BaseDeltaIterator` contains logic to iterate over
|
|||
|
the batch entries and the base iterator in parallel which allows us to
|
|||
|
perform reads on a snapshot of the database as though the batch had
|
|||
|
been applied to it. On the surface this sounds reasonable, yet the
|
|||
|
implementation is incomplete. Merge and DeleteRange operations are not
|
|||
|
supported. The reason they are not supported is because handling them
|
|||
|
is complex and requires duplicating logic that already exists inside
|
|||
|
RocksDB for normal iterator processing.
|
|||
|
|
|||
|
Pebble takes a different approach to iterating over a merged view of a
|
|||
|
batch's contents and the underlying database: it treats the batch as
|
|||
|
another level in the LSM. Recall that an LSM is composed of zero or
|
|||
|
more memtable layers and zero or more sstable layers. Internally, both
|
|||
|
RocksDB and Pebble contain a `MergingIterator` that knows how to merge
|
|||
|
the operations from different levels, including processing overwritten
|
|||
|
keys, merge operations, and delete range operations. The challenge
|
|||
|
with treating the batch as another level to be used by a
|
|||
|
`MergingIterator` is that the records in a batch do not have a
|
|||
|
sequence number. The sequence number in the batch header is not
|
|||
|
assigned until the batch is committed. The solution is to give the
|
|||
|
batch records temporary sequence numbers. We need these temporary
|
|||
|
sequence numbers to be larger than any other sequence number in the
|
|||
|
database so that the records in the batch are considered newer than
|
|||
|
any committed record. This is accomplished by reserving the high-bit
|
|||
|
in the 56-bit sequence number for use as a marker for batch sequence
|
|||
|
numbers. The sequence number for a record in an uncommitted batch is:
|
|||
|
|
|||
|
```
|
|||
|
RecordOffset | (1<<55)
|
|||
|
```
|
|||
|
|
|||
|
Newer records in a given batch will have a larger sequence number than
|
|||
|
older records in the batch. And all of the records in a batch will
|
|||
|
have larger sequence numbers than any committed record in the
|
|||
|
database.
|
|||
|
|
|||
|
The end result is that Pebble's batch iterators support all of the
|
|||
|
functionality of regular database iterators with minimal additional
|
|||
|
code.
|
|||
|
|
|||
|
## Large Batches
|
|||
|
|
|||
|
The size of a batch is limited only by available memory, yet the
|
|||
|
required memory is not just the batch representation. When a batch is
|
|||
|
committed, the commit operation iterates over the records in the batch
|
|||
|
from oldest to newest and inserts them into the current memtable. The
|
|||
|
memtable is an in-memory structure that buffers mutations that have
|
|||
|
been committed (written to the Write Ahead Log), but not yet written
|
|||
|
to an sstable. Internally, a memtable uses a skiplist to index
|
|||
|
records. Each skiplist entry has overhead for the index links and
|
|||
|
other metadata that is a dozen bytes at minimum. A large batch
|
|||
|
composed of many small records can require twice as much memory when
|
|||
|
inserted into a memtable than it required in the batch. And note that
|
|||
|
this causes a temporary increase in memory requirements because the
|
|||
|
batch memory is not freed until it is completely committed.
|
|||
|
|
|||
|
A non-obvious implementation restriction present in both RocksDB and
|
|||
|
Pebble is that there is a one-to-one correspondence between WAL files
|
|||
|
and memtables. That is, a given WAL file has a single memtable
|
|||
|
associated with it and vice-versa. While this restriction could be
|
|||
|
removed, doing so is onerous and intricate. It should also be noted
|
|||
|
that committing a batch involves writing it to a single WAL file. The
|
|||
|
combination of restrictions results in a batch needing to be written
|
|||
|
entirely to a single memtable.
|
|||
|
|
|||
|
What happens if a batch is too large to fit in a memtable? Memtables
|
|||
|
are generally considered to have a fixed size, yet this is not
|
|||
|
actually true in RocksDB. In RocksDB, the memtable skiplist is
|
|||
|
implemented on top of an arena structure. An arena is composed of a
|
|||
|
list of fixed size chunks, with no upper limit set for the number of
|
|||
|
chunks that can be associated with an arena. So RocksDB handles large
|
|||
|
batches by allowing a memtable to grow beyond its configured
|
|||
|
size. Concretely, while RocksDB may be configured with a 64MB memtable
|
|||
|
size, a 1GB batch will cause the memtable to grow to accomodate
|
|||
|
it. Functionally, this is good, though there is a practical problem: a
|
|||
|
large batch is first written to the WAL, and then added to the
|
|||
|
memtable. Adding the large batch to the memtable may consume so much
|
|||
|
memory that the system runs out of memory and is killed by the
|
|||
|
kernel. This can result in a death loop because upon restarting as the
|
|||
|
batch is read from the WAL and applied to the memtable again.
|
|||
|
|
|||
|
In Pebble, the memtable is also implemented using a skiplist on top of
|
|||
|
an arena. Significantly, the Pebble arena is a fixed size. While the
|
|||
|
RocksDB skiplist uses pointers, the Pebble skiplist uses offsets from
|
|||
|
the start of the arena. The fixed size arena means that the Pebble
|
|||
|
memtable cannot expand arbitrarily. A batch that is too large to fit
|
|||
|
in the memtable causes the current mutable memtable to be marked as
|
|||
|
immutable and the batch is wrapped in a `flushableBatch` structure and
|
|||
|
added to the list of immutable memtables. Because the `flushableBatch`
|
|||
|
is readable as another layer in the LSM, the batch commit can return
|
|||
|
as soon as the `flushableBatch` has been added to the immutable
|
|||
|
memtable list.
|
|||
|
|
|||
|
Internally, a `flushableBatch` provides iterator support by sorting
|
|||
|
the batch contents (the batch is sorted once, when it is added to the
|
|||
|
memtable list). Sorting the batch contents and insertion of the
|
|||
|
contents into a memtable have the same big-O time, but the constant
|
|||
|
factor dominates here. Sorting is significantly faster and uses
|
|||
|
significantly less memory due to not having to copy the batch records.
|
|||
|
|
|||
|
Note that an effect of this large batch support is that Pebble can be
|
|||
|
configured as an efficient on-disk sorter: specify a small memtable
|
|||
|
size, disable the WAL, and set a large L0 compaction threshold. In
|
|||
|
order to sort a large amount of data, create batches that are larger
|
|||
|
than the memtable size and commit them. When committed these batches
|
|||
|
will not be inserted into a memtable, but instead sorted and then
|
|||
|
written out to L0. The fully sorted data can later be read and the
|
|||
|
normal merging process will take care of the final ordering.
|
|||
|
|
|||
|
## Commit Pipeline
|
|||
|
|
|||
|
The commit pipeline is the component which manages the steps in
|
|||
|
committing write batches, such as writing the batch to the WAL and
|
|||
|
applying its contents to the memtable. While simple conceptually, the
|
|||
|
commit pipeline is crucial for high performance. In the absence of
|
|||
|
concurrency, commit performance is limited by how fast a batch can be
|
|||
|
written (and synced) to the WAL and then added to the memtable, both
|
|||
|
of which are outside of the purview of the commit pipeline.
|
|||
|
|
|||
|
To understand the challenge here, it is useful to have a conception of
|
|||
|
the WAL (write-ahead log). The WAL contains a record of all of the
|
|||
|
batches that have been committed to the database. As a record is
|
|||
|
written to the WAL it is added to the memtable. Each record is
|
|||
|
assigned a sequence number which is used to distinguish newer updates
|
|||
|
from older ones. Conceptually the WAL looks like:
|
|||
|
|
|||
|
```
|
|||
|
+--------------------------------------+
|
|||
|
| Batch(SeqNum=1,Count=9,Records=...) |
|
|||
|
+--------------------------------------+
|
|||
|
| Batch(SeqNum=10,Count=5,Records=...) |
|
|||
|
+--------------------------------------+
|
|||
|
| Batch(SeqNum=15,Count=7,Records...) |
|
|||
|
+--------------------------------------+
|
|||
|
| ... |
|
|||
|
+--------------------------------------+
|
|||
|
```
|
|||
|
|
|||
|
Note that each WAL entry is precisely the batch representation
|
|||
|
described earlier in the [Indexed Batches](#indexed-batches)
|
|||
|
section. The monotonically increasing sequence numbers are a critical
|
|||
|
component in allowing RocksDB and Pebble to provide fast snapshot
|
|||
|
views of the database for reads.
|
|||
|
|
|||
|
If concurrent performance was not a concern, the commit pipeline could
|
|||
|
simply be a mutex which serialized writes to the WAL and application
|
|||
|
of the batch records to the memtable. Concurrent performance is a
|
|||
|
concern, though.
|
|||
|
|
|||
|
The primary challenge in concurrent performance in the commit pipeline
|
|||
|
is maintaining two invariants:
|
|||
|
|
|||
|
1. Batches need to be written to the WAL in sequence number order.
|
|||
|
2. Batches need to be made visible for reads in sequence number
|
|||
|
order. This invariant arises from the use of a single sequence
|
|||
|
number which indicates which mutations are visible.
|
|||
|
|
|||
|
The second invariant deserves explanation. RocksDB and Pebble both
|
|||
|
keep track of a visible sequence number. This is the sequence number
|
|||
|
for which records in the database are visible during reads. The
|
|||
|
visible sequence number exists because committing a batch is an atomic
|
|||
|
operation, yet adding records to the memtable is done without an
|
|||
|
exclusive lock (the skiplists used by both Pebble and RocksDB are
|
|||
|
lock-free). When the records from a batch are being added to the
|
|||
|
memtable, a concurrent read operation may see those records, but will
|
|||
|
skip over them because they are newer than the visible sequence
|
|||
|
number. Once all of the records in the batch have been added to the
|
|||
|
memtable, the visible sequence number is atomically incremented.
|
|||
|
|
|||
|
So we have four steps in committing a write batch:
|
|||
|
|
|||
|
1. Write the batch to the WAL
|
|||
|
2. Apply the mutations in the batch to the memtable
|
|||
|
3. Bump the visible sequence number
|
|||
|
4. (Optionally) sync the WAL
|
|||
|
|
|||
|
Writing the batch to the WAL is actually very fast as it is just a
|
|||
|
memory copy. Applying the mutations in the batch to the memtable is by
|
|||
|
far the most CPU intensive part of the commit pipeline. Syncing the
|
|||
|
WAL is the most expensive from a wall clock perspective.
|
|||
|
|
|||
|
With that background out of the way, let's examine how RocksDB commits
|
|||
|
batches. This description is of the traditional commit pipeline in
|
|||
|
RocksDB (i.e. the one used by CockroachDB).
|
|||
|
|
|||
|
RocksDB achieves concurrency in the commit pipeline by grouping
|
|||
|
concurrently committed batches into a batch group. Each group is
|
|||
|
assigned a "leader" which is the first batch to be added to the
|
|||
|
group. The batch group is written atomically to the WAL by the leader
|
|||
|
thread, and then the individual batches making up the group are
|
|||
|
concurrently applied to the memtable. Lastly, the visible sequence
|
|||
|
number is bumped such that all of the batches in the group become
|
|||
|
visible in a single atomic step. While a batch group is being applied,
|
|||
|
other concurrent commits are added to a waiting list. When the group
|
|||
|
commit finishes, the waiting commits form the next group.
|
|||
|
|
|||
|
There are two criticisms of the batch grouping approach. The first is
|
|||
|
that forming a batch group involves copying batch contents. RocksDB
|
|||
|
partially alleviates this for large batches by placing a limit on the
|
|||
|
total size of a group. A large batch will end up in its own group and
|
|||
|
not be copied, but the criticism still applies for small batches. Note
|
|||
|
that there are actually two copies here. The batch contents are
|
|||
|
concatenated together to form the group, and then the group contents
|
|||
|
are written into an in memory buffer for the WAL before being written
|
|||
|
to disk.
|
|||
|
|
|||
|
The second criticism is about the thread synchronization points. Let's
|
|||
|
consider what happens to a commit which becomes the leader:
|
|||
|
|
|||
|
1. Lock commit mutex
|
|||
|
2. Wait to become leader
|
|||
|
3. Form (concatenate) batch group and write to the WAL
|
|||
|
4. Notify followers to apply their batch to the memtable
|
|||
|
5. Apply own batch to memtable
|
|||
|
6. Wait for followers to finish
|
|||
|
7. Bump visible sequence number
|
|||
|
8. Unlock commit mutex
|
|||
|
9. Notify followers that the commit is complete
|
|||
|
|
|||
|
The follower's set of operations looks like:
|
|||
|
|
|||
|
1. Lock commit mutex
|
|||
|
2. Wait to become follower
|
|||
|
3. Wait to be notified that it is time to apply batch
|
|||
|
4. Unlock commit mutex
|
|||
|
5. Apply batch to memtable
|
|||
|
6. Wait to be notified that commit is complete
|
|||
|
|
|||
|
The thread synchronization points (all of the waits and notifies) are
|
|||
|
overhead. Reducing that overhead can improve performance.
|
|||
|
|
|||
|
The Pebble commit pipeline addresses both criticisms. The main
|
|||
|
innovation is a commit queue that mirrors the commit order. The Pebble
|
|||
|
commit pipeline looks like:
|
|||
|
|
|||
|
1. Lock commit mutex
|
|||
|
* Add batch to commit queue
|
|||
|
* Assign batch sequence number
|
|||
|
* Write batch to the WAL
|
|||
|
2. Unlock commit mutex
|
|||
|
3. Apply batch to memtable (concurrently)
|
|||
|
4. Publish batch sequence number
|
|||
|
|
|||
|
Pebble does not use the concept of a batch group. Each batch is
|
|||
|
individually written to the WAL, but note that the WAL write is just a
|
|||
|
memory copy into an internal buffer in the WAL.
|
|||
|
|
|||
|
Step 4 deserves further scrutiny as it is where the invariant on the
|
|||
|
visible batch sequence number is maintained. Publishing the batch
|
|||
|
sequence number cannot simply bump the visible sequence number because
|
|||
|
batches with earlier sequence numbers may still be applying to the
|
|||
|
memtable. If we were to ratchet the visible sequence number without
|
|||
|
waiting for those applies to finish, a concurrent reader could see
|
|||
|
partial batch contents. Note that RocksDB has experimented with
|
|||
|
allowing these semantics with its unordered writes option.
|
|||
|
|
|||
|
We want to retain the atomic visibility of batch commits. The publish
|
|||
|
batch sequence number step needs to ensure that we don't ratchet the
|
|||
|
visible sequence number until all batches with earlier sequence
|
|||
|
numbers have applied. Enter the commit queue: a lock-free
|
|||
|
single-producer, multi-consumer queue. Batches are added to the commit
|
|||
|
queue with the commit mutex held, ensuring the same order as the
|
|||
|
sequence number assignment. After a batch finishes applying to the
|
|||
|
memtable, it atomically marks the batch as applied. It then removes
|
|||
|
the prefix of applied batches from the commit queue, bumping the
|
|||
|
visible sequence number, and marking the batch as committed (via a
|
|||
|
`sync.WaitGroup`). If the first batch in the commit queue has not be
|
|||
|
applied we wait for our batch to be committed, relying on another
|
|||
|
concurrent committer to perform the visible sequence ratcheting for
|
|||
|
our batch. We know a concurrent commit is taking place because if
|
|||
|
there was only one batch committing it would be at the head of the
|
|||
|
commit queue.
|
|||
|
|
|||
|
There are two possibilities when publishing a sequence number. The
|
|||
|
first is that there is an unapplied batch at the head of the
|
|||
|
queue. Consider the following scenario where we're trying to publish
|
|||
|
the sequence number for batch `B`.
|
|||
|
|
|||
|
```
|
|||
|
+---------------+-------------+---------------+-----+
|
|||
|
| A (unapplied) | B (applied) | C (unapplied) | ... |
|
|||
|
+---------------+-------------+---------------+-----+
|
|||
|
```
|
|||
|
|
|||
|
The publish routine will see that `A` is unapplied and then simply
|
|||
|
wait for `B's` done `sync.WaitGroup` to be signalled. This is safe
|
|||
|
because `A` must still be committing. And if `A` has concurrently been
|
|||
|
marked as applied, the goroutine publishing `A` will then publish
|
|||
|
`B`. What happens when `A` publishes its sequence number? The commit
|
|||
|
queue state becomes:
|
|||
|
|
|||
|
```
|
|||
|
+-------------+-------------+---------------+-----+
|
|||
|
| A (applied) | B (applied) | C (unapplied) | ... |
|
|||
|
+-------------+-------------+---------------+-----+
|
|||
|
```
|
|||
|
|
|||
|
The publish routine pops `A` from the queue, ratchets the sequence
|
|||
|
number, then pops `B` and ratchets the sequence number again, and then
|
|||
|
finds `C` and stops. A detail that it is important to notice is that
|
|||
|
the committer for batch `B` didn't have to do any more work. An
|
|||
|
alternative approach would be to have `B` wakeup and ratchet its own
|
|||
|
sequence number, but that would serialize the remainder of the commit
|
|||
|
queue behind that goroutine waking up.
|
|||
|
|
|||
|
The commit queue reduces the number of thread synchronization
|
|||
|
operations required to commit a batch. There is no leader to notify,
|
|||
|
or followers to wait for. A commit either publishes its own sequence
|
|||
|
number, or performs one synchronization operation to wait for a
|
|||
|
concurrent committer to publish its sequence number.
|
|||
|
|
|||
|
## Range Deletions
|
|||
|
|
|||
|
Deletion of an individual key in RocksDB and Pebble is accomplished by
|
|||
|
writing a deletion tombstone. A deletion tombstone shadows an existing
|
|||
|
value for a key, causing reads to treat the key as not present. The
|
|||
|
deletion tombstone mechanism works well for deleting small sets of
|
|||
|
keys, but what happens if you want to all of the keys within a range
|
|||
|
of keys that might number in the thousands or millions? A range
|
|||
|
deletion is an operation which deletes an entire range of keys with a
|
|||
|
single record. In contrast to a point deletion tombstone which
|
|||
|
specifies a single key, a range deletion tombstone (a.k.a. range
|
|||
|
tombstone) specifies a start key (inclusive) and an end key
|
|||
|
(exclusive). This single record is much faster to write than thousands
|
|||
|
or millions of point deletion tombstones, and can be done blindly --
|
|||
|
without iterating over the keys that need to be deleted. The downside
|
|||
|
to range tombstones is that they require additional processing during
|
|||
|
reads. How the processing of range tombstones is done significantly
|
|||
|
affects both the complexity of the implementation, and the efficiency
|
|||
|
of read operations in the presence of range tombstones.
|
|||
|
|
|||
|
A range tombstone is composed of a start key, end key, and sequence
|
|||
|
number. Any key that falls within the range is considered deleted if
|
|||
|
the key's sequence number is less than the range tombstone's sequence
|
|||
|
number. RocksDB stores range tombstones segregated from point
|
|||
|
operations in a special range deletion block within each sstable.
|
|||
|
Conceptually, the range tombstones stored within an sstable are
|
|||
|
truncated to the boundaries of the sstable, though there are
|
|||
|
complexities that cause this to not actually be physically true.
|
|||
|
|
|||
|
In RocksDB, the main structure implementing range tombstone processing
|
|||
|
is the `RangeDelAggregator`. Each read operation and iterator has its
|
|||
|
own `RangeDelAggregator` configured for the sequence number the read
|
|||
|
is taking place at. The initial implementation of `RangeDelAggregator`
|
|||
|
built up a "skyline" for the range tombstones visible at the read
|
|||
|
sequence number.
|
|||
|
|
|||
|
```
|
|||
|
10 +---+
|
|||
|
9 | |
|
|||
|
8 | |
|
|||
|
7 | +----+
|
|||
|
6 | |
|
|||
|
5 +-+ | +----+
|
|||
|
4 | | | |
|
|||
|
3 | | | +---+
|
|||
|
2 | | | |
|
|||
|
1 | | | |
|
|||
|
0 | | | |
|
|||
|
abcdefghijklmnopqrstuvwxyz
|
|||
|
```
|
|||
|
|
|||
|
The above diagram shows the skyline created for the range tombstones
|
|||
|
`[b,j)#5`, `[d,h)#10`, `[f,m)#7`, `[p,u)#5`, and `[t,y)#3`. The
|
|||
|
skyline is queried for each key read to see if the key should be
|
|||
|
considered deleted or not. The skyline structure is stored in a binary
|
|||
|
tree, making the queries an O(logn) operation in the number of
|
|||
|
tombstones, though there is an optimization to make this O(1) for
|
|||
|
`next`/`prev` iteration. Note that the skyline representation loses
|
|||
|
information about the range tombstones. This requires the structure to
|
|||
|
be rebuilt on every read which has a significant performance impact.
|
|||
|
|
|||
|
The initial skyline range tombstone implementation has since been
|
|||
|
replaced with a more efficient lookup structure. See the
|
|||
|
[DeleteRange](https://rocksdb.org/blog/2018/11/21/delete-range.html)
|
|||
|
blog post for a good description of both the original implementation
|
|||
|
and the new (v2) implementation. The key change in the new
|
|||
|
implementation is to "fragment" the range tombstones that are stored
|
|||
|
in an sstable. The fragmented range tombstones provide the same
|
|||
|
benefit as the skyline representation: the ability to binary search
|
|||
|
the fragments in order to find the tombstone covering a key. But
|
|||
|
unlike the skyline approach, the fragmented tombstones can be cached
|
|||
|
on a per-sstable basis. In the v2 approach, `RangeDelAggregator` keeps
|
|||
|
track of the fragmented range tombstones for each sstable encountered
|
|||
|
during a read or iterator, and logically merges them together.
|
|||
|
|
|||
|
Fragmenting range tombstones involves splitting range tombstones at
|
|||
|
overlap points. Let's consider the tombstones in the skyline example
|
|||
|
above:
|
|||
|
|
|||
|
```
|
|||
|
10: d---h
|
|||
|
7: f------m
|
|||
|
5: b-------j p----u
|
|||
|
3: t----y
|
|||
|
```
|
|||
|
|
|||
|
Fragmenting the range tombstones at the overlap points creates a
|
|||
|
larger number of range tombstones:
|
|||
|
|
|||
|
```
|
|||
|
10: d-f-h
|
|||
|
7: f-h-j--m
|
|||
|
5: b-d-f-h-j p---tu
|
|||
|
3: tu---y
|
|||
|
```
|
|||
|
|
|||
|
While the number of tombstones is larger there is a significant
|
|||
|
advantage: we can order the tombstones by their start key and then
|
|||
|
binary search to find the set of tombstones overlapping a particular
|
|||
|
point. This is possible because due to the fragmenting, all the
|
|||
|
tombstones that overlap a range of keys will have the same start and
|
|||
|
end key. The v2 `RangeDelAggregator` and associated classes perform
|
|||
|
fragmentation of range tombstones stored in each sstable and those
|
|||
|
fragmented tombstones are then cached.
|
|||
|
|
|||
|
In summary, in RocksDB `RangeDelAggregator` acts as an oracle for
|
|||
|
answering whether a key is deleted at a particular sequence
|
|||
|
number. Due to caching of fragmented tombstones, the v2 implementation
|
|||
|
of `RangeDelAggregator` implementation is significantly faster to
|
|||
|
populate than v1, yet the overall approach to processing range
|
|||
|
tombstones remains similar.
|
|||
|
|
|||
|
Pebble takes a different approach: it integrates range tombstones
|
|||
|
processing directly into the `mergingIter` structure. `mergingIter` is
|
|||
|
the internal structure which provides a merged view of the levels in
|
|||
|
an LSM. RocksDB has a similar class named
|
|||
|
`MergingIterator`. Internally, `mergingIter` maintains a heap over the
|
|||
|
levels in the LSM (note that each memtable and L0 table is a separate
|
|||
|
"level" in `mergingIter`). In RocksDB, `MergingIterator` knows nothing
|
|||
|
about range tombstones, and it is thus up to higher-level code to
|
|||
|
process range tombstones using `RangeDelAggregator`.
|
|||
|
|
|||
|
While the separation of `MergingIterator` and range tombstones seems
|
|||
|
reasonable at first glance, there is an optimization that RocksDB does
|
|||
|
not perform which is awkward with the `RangeDelAggregator` approach:
|
|||
|
skipping swaths of deleted keys. A range tombstone often shadows more
|
|||
|
than one key. Rather than iterating over the deleted keys, it is much
|
|||
|
quicker to seek to the end point of the range tombstone. The challenge
|
|||
|
in implementing this optimization is that a key might be newer than
|
|||
|
the range tombstone and thus shouldn't be skipped. An insight to be
|
|||
|
utilized is that the level structure itself provides sufficient
|
|||
|
information. A range tombstone at `Ln` is guaranteed to be newer than
|
|||
|
any key it overlaps in `Ln+1`.
|
|||
|
|
|||
|
Pebble utilizes the insight above to integrate range deletion
|
|||
|
processing with `mergingIter`. A `mergingIter` maintains a point
|
|||
|
iterator and a range deletion iterator per level in the LSM. In this
|
|||
|
context, every L0 table is a separate level, as is every
|
|||
|
memtable. Within a level, when a range deletion contains a point
|
|||
|
operation the sequence numbers must be checked to determine if the
|
|||
|
point operation is newer or older than the range deletion
|
|||
|
tombstone. The `mergingIter` maintains the invariant that the range
|
|||
|
deletion iterators for all levels newer that the current iteration key
|
|||
|
are positioned at the next (or previous during reverse iteration)
|
|||
|
range deletion tombstone. We know those levels don't contain a range
|
|||
|
deletion tombstone that covers the current key because if they did the
|
|||
|
current key would be deleted. The range deletion iterator for the
|
|||
|
current key's level is positioned at a range tombstone covering or
|
|||
|
past the current key. The position of all of other range deletion
|
|||
|
iterators is unspecified. Whenever a key from those levels becomes the
|
|||
|
current key, their range deletion iterators need to be
|
|||
|
positioned. This lazy positioning avoids seeking the range deletion
|
|||
|
iterators for keys that are never considered.
|
|||
|
|
|||
|
For a full example, consider the following setup:
|
|||
|
|
|||
|
```
|
|||
|
p0: o
|
|||
|
r0: m---q
|
|||
|
|
|||
|
p1: n p
|
|||
|
r1: g---k
|
|||
|
|
|||
|
p2: b d i
|
|||
|
r2: a---e q----v
|
|||
|
|
|||
|
p3: e
|
|||
|
r3:
|
|||
|
```
|
|||
|
|
|||
|
The diagram above shows is showing 4 levels, with `pX` indicating the
|
|||
|
point operations in a level and `rX` indicating the range tombstones.
|
|||
|
|
|||
|
If we start iterating from the beginning, the first key we encounter
|
|||
|
is `b` in `p2`. When the mergingIter is pointing at a valid entry, the
|
|||
|
range deletion iterators for all of the levels less that the current
|
|||
|
key's level are positioned at the next range tombstone past the
|
|||
|
current key. So `r0` will point at `[m,q)` and `r1` at `[g,k)`. When
|
|||
|
the key `b` is encountered, we check to see if the current tombstone
|
|||
|
for `r0` or `r1` contains it, and whether the tombstone for `r2`,
|
|||
|
`[a,e)`, contains and is newer than `b`.
|
|||
|
|
|||
|
Advancing the iterator finds the next key at `d`. This is in the same
|
|||
|
level as the previous key `b` so we don't have to reposition any of
|
|||
|
the range deletion iterators, but merely check whether `d` is now
|
|||
|
contained by any of the range tombstones at higher levels or has
|
|||
|
stepped past the range tombstone in its own level. In this case, there
|
|||
|
is nothing to be done.
|
|||
|
|
|||
|
Advancing the iterator again finds `e`. Since `e` comes from `p3`, we
|
|||
|
have to position the `r3` range deletion iterator, which is empty. `e`
|
|||
|
is past the `r2` tombstone of `[a,e)` so we need to advance the `r2`
|
|||
|
range deletion iterator to `[q,v)`.
|
|||
|
|
|||
|
The next key is `i`. Because this key is in `p2`, a level above `e`,
|
|||
|
we don't have to reposition any range deletion iterators and instead
|
|||
|
see that `i` is covered by the range tombstone `[g,k)`. The iterator
|
|||
|
is immediately advanced to `n` which is covered by the range tombstone
|
|||
|
`[m,q)` causing the iterator to advance to `o` which is visible.
|
|||
|
|
|||
|
## Flush and Compaction Pacing
|
|||
|
|
|||
|
Flushes and compactions in LSM trees are problematic because they
|
|||
|
contend with foreground traffic, resulting in write and read latency
|
|||
|
spikes. Without throttling the rate of flushes and compactions, they
|
|||
|
occur "as fast as possible" (which is not entirely true, since we
|
|||
|
have a `bytes_per_sync` option). This instantaneous usage of CPU and
|
|||
|
disk IO results in potentially huge latency spikes for writes and
|
|||
|
reads which occur in parallel to the flushes and compactions.
|
|||
|
|
|||
|
RocksDB attempts to solve this issue by offering an option to limit
|
|||
|
the speed of flushes and compactions. A maximum `bytes/sec` can be
|
|||
|
specified through the options, and background IO usage will be limited
|
|||
|
to the specified amount. Flushes are given priority over compactions,
|
|||
|
but they still use the same rate limiter. Though simple to implement
|
|||
|
and understand, this option is fragile for various reasons.
|
|||
|
|
|||
|
1) If the rate limit is configured too low, the DB will stall and
|
|||
|
write throughput will be affected.
|
|||
|
2) If the rate limit is configured too high, the write and read
|
|||
|
latency spikes will persist.
|
|||
|
3) A different configuration is needed per system depending on the
|
|||
|
speed of the storage device.
|
|||
|
4) Write rates typically do not stay the same throughout the lifetime
|
|||
|
of the DB (higher throughput during certain times of the day, etc) but
|
|||
|
the rate limit cannot be configured during runtime.
|
|||
|
|
|||
|
RocksDB also offers an
|
|||
|
["auto-tuned" rate limiter](https://rocksdb.org/blog/2017/12/18/17-auto-tuned-rate-limiter.html)
|
|||
|
which uses a simple multiplicative-increase, multiplicative-decrease
|
|||
|
algorithm to dynamically adjust the background IO rate limit depending
|
|||
|
on how much of the rate limiter has been exhausted in an interval.
|
|||
|
This solves the problem of having a static rate limit, but Pebble
|
|||
|
attempts to improve on this with a different pacing mechanism.
|
|||
|
|
|||
|
Pebble's pacing mechanism uses separate rate limiters for flushes and
|
|||
|
compactions. Both the flush and compaction pacing mechanisms work by
|
|||
|
attempting to flush and compact only as fast as needed and no faster.
|
|||
|
This is achieved differently for flushes versus compactions.
|
|||
|
|
|||
|
For flush pacing, Pebble keeps the rate at which the memtable is
|
|||
|
flushed at the same rate as user writes. This ensures that disk IO
|
|||
|
used by flushes remains steady. When a mutable memtable becomes full
|
|||
|
and is marked immutable, it is typically flushed as fast as possible.
|
|||
|
Instead of flushing as fast as possible, what we do is look at the
|
|||
|
total number of bytes in all the memtables (mutable + queue of
|
|||
|
immutables) and subtract the number of bytes that have been flushed in
|
|||
|
the current flush. This number gives us the total number of bytes
|
|||
|
which remain to be flushed. If we keep this number steady at a constant
|
|||
|
level, we have the invariant that the flush rate is equal to the write
|
|||
|
rate.
|
|||
|
|
|||
|
When the number of bytes remaining to be flushed falls below our
|
|||
|
target level, we slow down the speed of flushing. We keep a minimum
|
|||
|
rate at which the memtable is flushed so that flushes proceed even if
|
|||
|
writes have stopped. When the number of bytes remaining to be flushed
|
|||
|
goes above our target level, we allow the flush to proceed as fast as
|
|||
|
possible, without applying any rate limiting. However, note that the
|
|||
|
second case would indicate that writes are occurring faster than the
|
|||
|
memtable can flush, which would be an unsustainable rate. The LSM
|
|||
|
would soon hit the memtable count stall condition and writes would be
|
|||
|
completely stopped.
|
|||
|
|
|||
|
For compaction pacing, Pebble uses an estimation of compaction debt,
|
|||
|
which is the number of bytes which need to be compacted before no
|
|||
|
further compactions are needed. This estimation is calculated by
|
|||
|
looking at the number of bytes that have been flushed by the current
|
|||
|
flush routine, adding those bytes to the size of the level 0 sstables,
|
|||
|
then seeing how many bytes exceed the target number of bytes for the
|
|||
|
level 0 sstables. We multiply the number of bytes exceeded by the
|
|||
|
level ratio and add that number to the compaction debt estimate.
|
|||
|
We repeat this process until the final level, which gives us a final
|
|||
|
compaction debt estimate for the entire LSM tree.
|
|||
|
|
|||
|
Like with flush pacing, we want to keep the compaction debt at a
|
|||
|
constant level. This ensures that compactions occur only as fast as
|
|||
|
needed and no faster. If the compaction debt estimate falls below our
|
|||
|
target level, we slow down compactions. We maintain a minimum
|
|||
|
compaction rate so that compactions proceed even if flushes have
|
|||
|
stopped. If the compaction debt goes above our target level, we let
|
|||
|
compactions proceed as fast as possible without any rate limiting.
|
|||
|
Just like with flush pacing, this would indicate that writes are
|
|||
|
occurring faster than the background compactions can keep up with,
|
|||
|
which is an unsustainable rate. The LSM's read amplification would
|
|||
|
increase and the L0 file count stall condition would be hit.
|
|||
|
|
|||
|
With the combined flush and compaction pacing mechanisms, flushes and
|
|||
|
compactions only occur as fast as needed and no faster, which reduces
|
|||
|
latency spikes for user read and write operations.
|
|||
|
|
|||
|
## Write throttling
|
|||
|
|
|||
|
RocksDB adds artificial delays to user writes when certain thresholds
|
|||
|
are met, such as `l0_slowdown_writes_threshold`. These artificial
|
|||
|
delays occur when the system is close to stalling to lessen the write
|
|||
|
pressure so that flushing and compactions can catch up. On the surface
|
|||
|
this seems good, since write stalls would seemingly be eliminated and
|
|||
|
replaced with gradual slowdowns. Closed loop write latency benchmarks
|
|||
|
would show the elimination of abrupt write stalls, which seems
|
|||
|
desirable.
|
|||
|
|
|||
|
However, this doesn't do anything to improve latencies in an open loop
|
|||
|
model, which is the model more likely to resemble real world use
|
|||
|
cases. Artificial delays increase write latencies without a clear
|
|||
|
benefit. Writes stalls in an open loop system would indicate that
|
|||
|
writes are generated faster than the system could possibly handle,
|
|||
|
which adding artificial delays won't solve.
|
|||
|
|
|||
|
For this reason, Pebble doesn't add artificial delays to user writes
|
|||
|
and writes are served as quickly as possible.
|
|||
|
|
|||
|
### Other Differences
|
|||
|
|
|||
|
* `internalIterator` API which minimizes indirect (virtual) function
|
|||
|
calls
|
|||
|
* Previous pointers in the memtable and indexed batch skiplists
|
|||
|
* Elision of per-key lower/upper bound checks in long range scans
|
|||
|
* Improved `Iterator` API
|
|||
|
+ `SeekPrefixGE` for prefix iteration
|
|||
|
+ `SetBounds` for adjusting the bounds on an existing `Iterator`
|
|||
|
* Simpler `Get` implementation
|