mirror of
https://source.quilibrium.com/quilibrium/ceremonyclient.git
synced 2025-01-15 18:25:40 +00:00
962 lines
47 KiB
Markdown
962 lines
47 KiB
Markdown
|
- Feature Name: Range Keys
|
||
|
- Status: draft
|
||
|
- Start Date: 2021-10-18
|
||
|
- Authors: Sumeer Bhola, Jackson Owens
|
||
|
- RFC PR: #1341
|
||
|
- Pebble Issues:
|
||
|
https://github.com/cockroachdb/pebble/issues/1339
|
||
|
- Cockroach Issues:
|
||
|
https://github.com/cockroachdb/cockroach/issues/70429
|
||
|
https://github.com/cockroachdb/cockroach/issues/70412
|
||
|
|
||
|
** Design Draft**
|
||
|
|
||
|
# Summary
|
||
|
|
||
|
An ongoing effort within CockroachDB to preserve MVCC history across all SQL
|
||
|
operations (see cockroachdb/cockroach#69380) requires a more efficient method of
|
||
|
deleting ranges of MVCC history.
|
||
|
|
||
|
This document describes an extension to Pebble introducing first-class support
|
||
|
for range keys. Range keys map a range of keyspace to a value. Optionally, the
|
||
|
key range may include an suffix encoding a version (eg, MVCC timestamp). Pebble
|
||
|
iterators may be configured to surface range keys during iteration, or to mask
|
||
|
point keys at lower MVCC timestamps covered by range keys.
|
||
|
|
||
|
CockroachDB will make use of these range keys to enable history-preserving
|
||
|
removal of contiguous ranges of MVCC keys with constant writes, and efficient
|
||
|
iteration past deleted versions.
|
||
|
|
||
|
# Background
|
||
|
|
||
|
A previous CockroachDB RFC cockroach/cockroachdb#69380 describes the motivation
|
||
|
for the larger project of migrating MVCC-noncompliant operations into MVCC
|
||
|
compliance. Implemented with the existing MVCC primitives, some operations like
|
||
|
removal of an index or table would require performing writes linearly
|
||
|
proportional to the size of the table. Dropping a large table using existing
|
||
|
MVCC point-delete primitives would be prohibitively expensive. The desire for a
|
||
|
sublinear delete of an MVCC range motivates this work.
|
||
|
|
||
|
The detailed design for MVCC compliant bulk operations ([high-level
|
||
|
description](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210825_mvcc_bulk_ops.md);
|
||
|
detailed design draft for DeleteRange in internal
|
||
|
[doc](https://docs.google.com/document/d/1ItxpitNwuaEnwv95RJORLCGuOczuS2y_GoM2ckJCnFs/edit#heading=h.x6oktstoeb9t)),
|
||
|
ran into complexity by placing range operations above the Pebble layer, such
|
||
|
that Pebble sees these as points. The complexity causes are various: (a) which
|
||
|
key (start or end) to anchor this range on, when represented as a point (there
|
||
|
are performance consequences), (b) rewriting on CockroachDB range splits (and
|
||
|
concerns about rewrite volume), (c) fragmentation on writes and complexity
|
||
|
thereof (and performance concerns for reads when not fragmenting), (d) inability
|
||
|
to efficiently skip older MVCC versions that are masked by a `[k1,k2)@ts` (where
|
||
|
ts is the MVCC timestamp).
|
||
|
|
||
|
Pebble currently has only one kind of key that is associated with a range:
|
||
|
`RANGEDEL [k1, k2)#seq`, where [k1, k2) is supplied by the caller, and is used
|
||
|
to efficiently remove a set of point keys.
|
||
|
|
||
|
First-class support for range keys in Pebble eliminates all these issues.
|
||
|
Additionally, it allows for future extensions like efficient transactional range
|
||
|
operations. This issue describes how this feature would work from the
|
||
|
perspective of a user of Pebble (like CockroachDB), and sketches some
|
||
|
implementation details.
|
||
|
|
||
|
# Design
|
||
|
|
||
|
## Interface
|
||
|
|
||
|
### New `Comparer` requirements
|
||
|
|
||
|
The Pebble `Comparer` type allows users to optionally specify a `Split` function
|
||
|
that splits a user key into a prefix and a suffix. This Split allows users
|
||
|
implementing MVCC (Multi-Version Concurrency Control) to inform Pebble which
|
||
|
part of the key encodes the user key and which part of the key encodes the
|
||
|
version (eg, a timestamp). Pebble does not dictate the encoding of an MVCC
|
||
|
version, only that the version form a suffix on keys.
|
||
|
|
||
|
The range keys design described in this RFC introduces stricter requirements for
|
||
|
user-provided `Split` implementations and the ordering of keys:
|
||
|
|
||
|
1. The user key consisting of just a key prefix `k` must sort before all
|
||
|
other user keys containing that prefix. Specifically
|
||
|
`Compare(k[:Split(k)], k) < 0` where `Split(k) < len(k)`.
|
||
|
2. A key consisting of a bare suffix must be a valid key and comparable. The
|
||
|
ordering of the empty key prefix with any suffixes must be consistent with
|
||
|
the ordering of those same suffixes applied to any other key prefix.
|
||
|
Specifically `Compare(k[Split(k):], k2[Split(k2):]) == Compare(k, k2)` where
|
||
|
`Compare(k[:Split(k)], k2[:Split(k2)]) == 0`.
|
||
|
|
||
|
The details of why these new requirements are necessary are explained in the
|
||
|
implementation section.
|
||
|
|
||
|
### Writes
|
||
|
|
||
|
This design introduces three new write operations:
|
||
|
|
||
|
- `RangeKeySet([k1, k2), [optional suffix], <value>)`: This represents the
|
||
|
mapping `[k1, k2)@suffix => value`. Keys `k1` and `k2` must not contain a
|
||
|
suffix (i.e., `Split(k1)==len(k1)` and `Split(k2)==len(k2))`.
|
||
|
|
||
|
- `RangeKeyUnset([k1, k2), [optional suffix])`: This removes a mapping
|
||
|
previously applied by `RangeKeySet`. The unset may use a smaller key range
|
||
|
than the original `RangeKeySet`, in which case only part of the range is
|
||
|
deleted. The unset only applies to range keys with a matching optional suffix.
|
||
|
If the optional suffix is absent in both the RangeKeySet and RangeKeyUnset,
|
||
|
they are considered matching.
|
||
|
|
||
|
- `RangeKeyDelete([k1, k2))`: This removes all range keys within the provided
|
||
|
key span. It behaves like an `Unset` unencumbered by suffix restrictions.
|
||
|
|
||
|
For example, consider `RangeKeySet([a,d), foo)` (i.e., no suffix). If
|
||
|
there is a later call `RangeKeyUnset([b,c))`, the resulting state seen by
|
||
|
a reader is `[a,b) => foo`, `[c,d) => foo`. Note that the value is not
|
||
|
modified when the key is fragmented.
|
||
|
|
||
|
Partially overlapping `RangeKeySet`s with the same suffix overwrite one
|
||
|
another. For example, consider `RangeKeySet([a,d), foo)`, followed by
|
||
|
`RangeKeySet([c,e), bar)`. The resulting state is `[a,c) => foo`, `[c,e)
|
||
|
=> bar`.
|
||
|
|
||
|
Point keys (eg, traditional keys defined at a singular byte string key) and
|
||
|
range keys do not overwrite one another. They have a parallel existence. Point
|
||
|
deletes only apply to points. Range unsets only apply to range keys. However,
|
||
|
users may configure iterators to mask point keys covered by newer range keys.
|
||
|
This masking behavior is explicitly requested by the user in the context of the
|
||
|
iteration. Masking is described in more detail below.
|
||
|
|
||
|
There exist separate range delete operations for point keys and range keys. A
|
||
|
`RangeKeyDelete` may remove part of a range key, just like the new
|
||
|
`RangeKeyUnset` operation introduced earlier. `RangeKeyDelete`s differ from
|
||
|
`RangeKeyUnset`s, because the latter requires that the suffix matches and
|
||
|
applies only to range keys. `RangeKeyDelete`s completely clear all existing
|
||
|
range keys within their span at all suffix values.
|
||
|
|
||
|
The optional suffix in `RangeKeySet` and `RangeKeyUnset` operations is related
|
||
|
to the pebble `Comparer.Split` operation which is explicitly documented as being
|
||
|
for [MVCC
|
||
|
keys](https://github.com/cockroachdb/pebble/blob/e95e73745ce8a85d605ef311d29a6574db8ed3bf/internal/base/comparer.go#L69-L88),
|
||
|
without mandating exactly how the versions are represented. `RangeKeySet` and
|
||
|
`RangeKeyUnset` keys with different suffixes do not interact logically, although
|
||
|
Pebble will observably fragment ranges at intersection points.
|
||
|
|
||
|
### Iteration
|
||
|
|
||
|
A user iterating over a key interval [k1,k2) can request:
|
||
|
|
||
|
- **[I1]** An iterator over only point keys.
|
||
|
|
||
|
- **[I2]** A combined iterator over point and range keys. This is what
|
||
|
we mainly discuss below in the implementation discussion.
|
||
|
|
||
|
- **[I3]** An iterator over only range keys. In the CockroachDB use
|
||
|
case, range keys will need to be subject to MVCC GC just like
|
||
|
point keys — this iterator may be useful for that purpose.
|
||
|
|
||
|
The `pebble.Iterator` type will be extended to provide accessors for
|
||
|
range keys for use in the combined and exclusively range iteration
|
||
|
modes.
|
||
|
|
||
|
```
|
||
|
// HasPointAndRange indicates whether there exists a point key, a range key or
|
||
|
// both at the current iterator position.
|
||
|
HasPointAndRange() (hasPoint, hasRange bool)
|
||
|
|
||
|
// RangeKeyChanged indicates whether the most recent iterator positioning
|
||
|
// operation resulted in the iterator stepping into or out of a new range key.
|
||
|
// If true previously returned range key bounds and data has been invalidated.
|
||
|
// If false, previously obtained range key bounds, suffix and value slices are
|
||
|
// still valid and may continue to be read.
|
||
|
RangeKeyChanged() bool
|
||
|
|
||
|
// Key returns the key of the current key/value pair, or nil if done. If
|
||
|
// positioned at an iterator position that only holds a range key, Key()
|
||
|
// always returns the start bound of the range key. Otherwise, it returns
|
||
|
// the point key's key.
|
||
|
Key() []byte
|
||
|
|
||
|
// RangeBounds returns the start (inclusive) and end (exclusive) bounds of the
|
||
|
// range key covering the current iterator position. RangeBounds returns nil
|
||
|
// bounds if there is no range key covering the current iterator position, or
|
||
|
// the iterator is not configured to surface range keys.
|
||
|
//
|
||
|
// If valid, the returned start bound is less than or equal to Key() and the
|
||
|
// returned end bound is greater than Key().
|
||
|
RangeBounds() (start, end []byte)
|
||
|
|
||
|
// Value returns the value of the current key/value pair, or nil if done.
|
||
|
// The caller should not modify the contents of the returned slice, and
|
||
|
// its contents may change on the next call to Next.
|
||
|
//
|
||
|
// Only valid if HasPointAndRange() returns true for hasPoint.
|
||
|
Value() []byte
|
||
|
|
||
|
// RangeKeys returns the range key values and their suffixes covering the
|
||
|
// current iterator position. The range bounds may be retrieved separately
|
||
|
// through RangeBounds().
|
||
|
RangeKeys() []RangeKey
|
||
|
|
||
|
type RangeKey struct {
|
||
|
Suffix []byte
|
||
|
Value []byte
|
||
|
}
|
||
|
```
|
||
|
|
||
|
When a combined iterator exposes range keys, it exposes all the range
|
||
|
keys covering `Key`. During iteration with a combined iterator, an
|
||
|
iteration position may surface just a point key, just a range key or
|
||
|
both at the currently-positioned `Key`.
|
||
|
|
||
|
Described another way, a Pebble combined iterator guarantees that it
|
||
|
will stop at all positions within the keyspace where:
|
||
|
1. There exists a point key at that position.
|
||
|
2. There exists a range key that logically begins at that postition.
|
||
|
|
||
|
In addition to the above positions, a Pebble iterator may also stop at keys
|
||
|
in-between the above positions due to fragmentation. Range keys are defined over
|
||
|
continuous spans of keyspace. Range keys with different suffix values may
|
||
|
overlap each other arbitrarily. To surface these arbitrarily overlapping spans
|
||
|
in an understandable and efficient way, the Pebble iterator surfaces range keys
|
||
|
fragmented at intersection points. Consider the following sequence of writes:
|
||
|
|
||
|
```
|
||
|
RangeKeySet([a,z), @1, 'apple')
|
||
|
RangeKeySet([c,e), @3, 'banana')
|
||
|
RangeKeySet([e,m), @5, 'orange')
|
||
|
RangeKeySet([b,k), @7, 'kiwi')
|
||
|
```
|
||
|
|
||
|
This yields a database containing overlapping range keys:
|
||
|
```
|
||
|
@7 → kiwi |-----------------)
|
||
|
@5 → orange |---------------)
|
||
|
@3 → banana |---)
|
||
|
@1 → apple |-------------------------------------------------)
|
||
|
a b c d e f g h i j k l m n o p q r s t u v w x y z
|
||
|
```
|
||
|
|
||
|
During iteration, these range keys are surfaced using the bounds of their
|
||
|
intersection points. For example, a scan across the keyspace containing only
|
||
|
these range keys would observe the following iterator positions:
|
||
|
|
||
|
```
|
||
|
Key() = a RangeKeyBounds() = [a,b) RangeKeys() = {(@1,apple)}
|
||
|
Key() = b RangeKeyBounds() = [b,c) RangeKeys() = {(@7,kiwi), (@1,apple)}
|
||
|
Key() = c RangeKeyBounds() = [c,e) RangeKeys() = {(@7,kiwi), (@3,banana), (@1,apple)}
|
||
|
Key() = e RangeKeyBounds() = [e,k) RangeKeys() = {(@7,kiwi), (@5,orange), (@1,apple)}
|
||
|
Key() = k RangeKeyBounds() = [k,m) RangeKeys() = {(@5,orange), (@1,apple)}
|
||
|
Key() = m RangeKeyBounds() = [m,z) RangeKeys() = {(@1,apple)}
|
||
|
```
|
||
|
|
||
|
This fragmentation produces a more understandable interface, and avoids forcing
|
||
|
iterators to read all range keys within the bounds of the broadest range key.
|
||
|
Consider this example:
|
||
|
|
||
|
```
|
||
|
iterator pos [ ] - sstable bounds
|
||
|
|
|
||
|
L1: [a----v1@t2--|-h] [l-----unset@t1----u]
|
||
|
L2: [e---|------v1@t1----------r]
|
||
|
a b c d e f g h i j k l m n o p q r s t u v w x y z
|
||
|
```
|
||
|
|
||
|
If the iterator is positioned at a point key `g`, there are two overlapping
|
||
|
physical range keys: `[a,h)@t2→v1` and `[e,r)@t1→v1`.
|
||
|
|
||
|
However, the `RangeKeyUnset([l,u), @t1)` removes part of the `[e,r)@t1→v1` range
|
||
|
key, truncating it to the bounds `[e,l)`. The iterator must return the truncated
|
||
|
bounds that correctly respect the `RangeKeyUnset`. However, when the range keys
|
||
|
are stored within a log-structured merge tree like Pebble, the `RangeKeyUnset`
|
||
|
may not be contained within the level's sstable that overlaps the current point
|
||
|
key. Searching for the unset could require reading an unbounded number of
|
||
|
sstables, losing the log-structured merge tree's property that bounds read
|
||
|
amplification to the number of levels in the tree.
|
||
|
|
||
|
Fragmenting range keys to intersection points avoids this problem. The iterator
|
||
|
positioned at `g` only surfaces range key state with the bounds `[e,h)`, the
|
||
|
widest bounds in which it can guarantee t2→v1 and t1→v1 without loading
|
||
|
additional sstables.
|
||
|
|
||
|
#### Iteration order
|
||
|
|
||
|
Recall that the user-provided `Comparer.Split(k)` function divides all user keys
|
||
|
into a prefix and a suffix, such that the prefix is `k[:Split(k)]`, and the
|
||
|
suffix is `k[Split(k):]`. If a key does not contain a suffix, the key equals the
|
||
|
prefix.
|
||
|
|
||
|
An iterator that is configured to surface range keys alongside point keys will
|
||
|
surface all range keys covering the current `Key()` position. Revisiting an
|
||
|
earlier example with the addition of three new point key-value pairs:
|
||
|
a→artichoke, b@2→beet and t@3→turnip. Consider '@<number>' to form the suffix
|
||
|
where present, with `<number>` denoting a MVCC timestamp. Higher, more-recent
|
||
|
timestamps sort before lower, older timestamps.
|
||
|
|
||
|
```
|
||
|
. a → artichoke
|
||
|
@7 → kiwi |-----------------)
|
||
|
@5 → orange |---------------)
|
||
|
. b@2 b@2 → beet
|
||
|
@3 → banana |---) . t@3 t@3 → turnip
|
||
|
@1 → apple |-------------------------------------------------)
|
||
|
a b c d e f g h i j k l m n o p q r s t u v w x y z
|
||
|
```
|
||
|
|
||
|
An iterator configured to surface both point and range keys will visit the
|
||
|
following iterator positions during forward iteration:
|
||
|
|
||
|
```
|
||
|
Key() HasPointAndRange() Value() RangeKeyBounds() RangeKeys()
|
||
|
a (true, true) artichoke [a,b) {(@1,apple)}
|
||
|
b (false, true) - [b,c) {(@7,kiwi), (@1,apple)}
|
||
|
b@2 (true, true) beet [b,c) {(@7,kiwi), (@1,apple)}
|
||
|
c (false, true) - [c,e) {(@7,kiwi), (@3,banana), (@1,apple)}
|
||
|
e (false, true) - [e,k) {(@7,kiwi), (@5,orange), (@1,apple)}
|
||
|
k (false, true) - [k,m) {(@5,orange), (@1,apple)}
|
||
|
m (false, true) - [m,z) {(@1,apple)}
|
||
|
t@3 (true, true) turnip [m,z) {(@1,apple)}
|
||
|
```
|
||
|
|
||
|
Note that:
|
||
|
|
||
|
- While positioned over a point key (eg, Key() = 'a', 'b@2' or t@3'), the
|
||
|
iterator exposes both the point key's value through Value() and the
|
||
|
overlapping range keys values through `RangeKeys()`.
|
||
|
|
||
|
- There can be multiple range keys covering a `Key()`, each with a different
|
||
|
suffix.
|
||
|
|
||
|
- There cannot be multiple range keys covering a `Key()` with the same suffix,
|
||
|
since the most-recently committed one (eg, the one with the highest sequence
|
||
|
number) will win, just like for point keys.
|
||
|
|
||
|
- If the iterator has configured lower and/or upper bounds, they will truncate
|
||
|
the range key to those bounds. For example, if the above iterator had an upper
|
||
|
bound 'y', the `[m,z)` range key would be surfaced with the bounds `[m,y)`
|
||
|
instead.
|
||
|
|
||
|
#### Masking
|
||
|
|
||
|
Range key masking provides additional, optional functionality designed
|
||
|
specifically for the use case of implementing a MVCC-compatible delete range.
|
||
|
|
||
|
When constructing an iterator that iterators over both point and range keys, a
|
||
|
user may request that range keys mask point keys. Masking is configured with a
|
||
|
suffix parameter that determines which range keys may mask point keys. Only
|
||
|
range keys with suffixes that sort after the mask's suffix mask point keys. A
|
||
|
range key that meets this condition only masks points with suffixes that sort
|
||
|
after the range key's suffix.
|
||
|
|
||
|
```
|
||
|
type IterOptions struct {
|
||
|
// ...
|
||
|
RangeKeyMasking RangeKeyMasking
|
||
|
}
|
||
|
|
||
|
// RangeKeyMasking configures automatic hiding of point keys by range keys.
|
||
|
// A non-nil Suffix enables range-key masking. When enabled, range keys with
|
||
|
// suffixes ≥ Suffix behave as masks. All point keys that are contained within
|
||
|
// a masking range key's bounds and have suffixes greater than the range key's
|
||
|
// suffix are automatically skipped.
|
||
|
//
|
||
|
// Specifically, when configured with a RangeKeyMasking.Suffix _s_, and there
|
||
|
// exists a range key with suffix _r_ covering a point key with suffix _p_, and
|
||
|
//
|
||
|
// _s_ ≤ _r_ < _p_
|
||
|
//
|
||
|
// then the point key is elided.
|
||
|
//
|
||
|
// Range-key masking may only be used when iterating over both point keys and
|
||
|
// range keys.
|
||
|
type RangeKeyMasking struct {
|
||
|
// Suffix configures which range keys may mask point keys. Only range keys
|
||
|
// that are defined at suffixes greater than or equal to Suffix will mask
|
||
|
// point keys.
|
||
|
Suffix []byte
|
||
|
// Filter is an optional field that may be used to improve performance of
|
||
|
// range-key masking through a block-property filter defined over key
|
||
|
// suffixes. If non-nil, Filter is called by Pebble to construct a
|
||
|
// block-property filter mask at iterator creation. The filter is used to
|
||
|
// skip whole point-key blocks containing point keys with suffixes greater
|
||
|
// than a covering range-key's suffix.
|
||
|
//
|
||
|
// To use this functionality, the caller must create and configure (through
|
||
|
// Options.BlockPropertyCollectors) a block-property collector that records
|
||
|
// the maxmimum suffix contained within a block. The caller then must write
|
||
|
// and provide a BlockPropertyFilterMask implementation on that same
|
||
|
// property. See the BlockPropertyFilterMask type for more information.
|
||
|
Filter func() BlockPropertyFilterMask
|
||
|
}
|
||
|
```
|
||
|
|
||
|
Example: A user may construct an iterator with `RangeKeyMasking.Suffix` set to
|
||
|
`@50`. The range key `[a, c)@60` would mask nothing, because `@60` is a more
|
||
|
recent timestamp than `@50`. However a range key `[a,c)@30` would mask `a@20`
|
||
|
and `apple@10` but not `apple@40`. A range key can only mask keys with MVCC
|
||
|
timestamps older than the range key's own timestamp. Only range keys with
|
||
|
suffixes (eg, MVCC timestamps) may mask anything at all.
|
||
|
|
||
|
The pebble Iterator surfaces all range keys when masking is enabled. Only point
|
||
|
keys are ever skipped, and only when they are contained within the bounds of a
|
||
|
range key with a more-recent suffix, and the range key's suffix is older than
|
||
|
the timestamp encoded in `RangeKeyMasking.Sufffix`.
|
||
|
|
||
|
## Implementation
|
||
|
|
||
|
### Write operations
|
||
|
|
||
|
This design introduces three new Pebble write operations: `RangeKeySet`,
|
||
|
`RangeKeyUnset` and `RangeKeyDelete`. Internally, these operations are
|
||
|
represented as internal keys with new corresponding key kinds encoded as a part
|
||
|
of the key trailer. These keys are stored within special range key blocks
|
||
|
separate from point keys, but within the same sstable. The range key blocks hold
|
||
|
`RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys, but do not hold keys
|
||
|
of any other kind. Within the memtables, these range keys are stored in a
|
||
|
separate skip list.
|
||
|
|
||
|
- `RangeKeySet([k1,k2), @suffix, value)` is encoded as a `k1.RANGEKEYSET` key
|
||
|
with a value encoding the tuple `(k2,@suffix,value)`.
|
||
|
- `RangeKeyUnset([k1,k2), @suffix)` is encoded as a `k1.RANGEUNSET` key
|
||
|
with a value encoding the tuple `(k2,@suffix)`.
|
||
|
- `RangeKeyDelete([k1,k2)` is encoded as a `k1.RANGEKEYDELETE` key with a value
|
||
|
encoding `k2`.
|
||
|
|
||
|
Range keys are physically fragmented as an artifact of the log-structured merge
|
||
|
tree structure and internal sstable boundaries. This fragmentation is essential
|
||
|
for preserving the performance characteristics of a log-structured merge tree.
|
||
|
Although the public interface operations for `RangeKeySet` and `RangeKeyUnset`
|
||
|
require both boundary keys `[k1,k2)` to always be bare prefixes (eg, to not have
|
||
|
a suffix), internally these keys may be fragmented to bounds containing
|
||
|
suffixes.
|
||
|
|
||
|
Example: If a user attempts to write `RangeKeySet([a@v1, c@v2), @v3, value)`,
|
||
|
Pebble will return an error to the user. If a user writes `RangeKeySet([a, c),
|
||
|
@v3, value)`, Pebble will allow the write and may later internally fragment the
|
||
|
`RangeKeySet` into three internal keys:
|
||
|
- `RangeKeySet([a, a@v1), @v3, value)`
|
||
|
- `RangeKeySet([a@v1, c@v2), @v3, value)`
|
||
|
- `RangeKeySet([c@v2, c), @v3, value)`
|
||
|
|
||
|
This fragmentation preserve log-structured merge tree performance
|
||
|
characteristics because it allows a range key to be split across many sstables,
|
||
|
while preserving locality between range keys and point keys. Consider a
|
||
|
`RangeKeySet([a,z), @1, foo)` on a database that contains millions of point keys
|
||
|
in the range [a,z). If the [a,z) range key was not permitted to be fragmented
|
||
|
internally, it would either need to be stored completely separately from the
|
||
|
point keys in a separate sstable or in a single intractably large sstable
|
||
|
containing all the overlapping point keys. Fragmentation allows locality,
|
||
|
ensuring point keys and range keys in the same region of the keyspace can be
|
||
|
stored in the same sstable.
|
||
|
|
||
|
`RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys are assigned sequence
|
||
|
numbers, like other internal keys. Log-structured merge tree level invariants
|
||
|
are valid across range key, point keys and between the two. That is:
|
||
|
|
||
|
1. The point key `k1#s2` cannot be at a lower level than `k2#s1` where
|
||
|
`k1==k2` and `s1 < s2`. This is the invariant implemented by all LSMs.
|
||
|
2. `RangeKeySet([k1,k2))#s2` cannot be at a lower level than
|
||
|
`RangeKeySet([k3,k4))#s1` where `[k1,k2)` overlaps `[k3,k4)` and `s1 < s2`.
|
||
|
3. `RangeKeySet([k1,k2))#s2` cannot be at a lower level than a point key
|
||
|
`k3#s1` where `k3 \in [k1,k2)` and `s1 < s2`.
|
||
|
|
||
|
Like other tombstones, the `RangeKeyUnset` and `RangeKeyDelete` keys are elided
|
||
|
when they fall to the bottomost level of the LSM and there is no snapshot
|
||
|
preventing its elision. There is no additional garbage collection problem
|
||
|
introduced by these keys.
|
||
|
|
||
|
There is no Merge operation that affects range keys.
|
||
|
|
||
|
#### Physical representation
|
||
|
|
||
|
`RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys are keyed by their
|
||
|
start key. This poses an obstacle. We must be able to support multiple range
|
||
|
keys at the same sequence number, because all keys within an ingested sstable
|
||
|
adopt the same sequence number. Duplicate internal keys (keys with equal user
|
||
|
keys, sequence numbers and kinds) are prohibited within Pebble. To resolve this
|
||
|
issue, fragments with the same bounds are merged within snapshot stripes into a
|
||
|
single physical key-value, representing multiple logical key-value pairs:
|
||
|
|
||
|
```
|
||
|
k1.RangeKeySet#s2 → (k2,[(@t2,v2),(@t1,v1)])
|
||
|
```
|
||
|
|
||
|
Within a physical key-value pair, suffix-value pairs are stored sorted by
|
||
|
suffix, descending. This has a minor advantage of reducing iteration-time
|
||
|
user-key comparisons when there exist multiple range keys in a table.
|
||
|
|
||
|
Unlike other Pebble keys, the `RangeKeySet` and `RangeKeyUnset` keys have values
|
||
|
that encode fields of data known to Pebble. The value that the user sets in a
|
||
|
call to `RangeKeySet` is opaque to Pebble, but the physical representation of
|
||
|
the `RangeKeySet`'s value is known. This encoding is a sequence of fields:
|
||
|
|
||
|
* End key, `varstring`, encodes the end user key of the fragment.
|
||
|
* A series of (suffix, value) tuples representing the logical range keys that
|
||
|
were merged into this one physical `RangeKeySet` key:
|
||
|
* Suffix, `varstring`
|
||
|
* Value, `varstring`
|
||
|
|
||
|
Similarly, `RangeKeyUnset` keys are merged within snapshot stripes and have a
|
||
|
physical representation like:
|
||
|
|
||
|
```
|
||
|
k1.RangeKeyUnset#s2 → (k2,[(@t2),(@t1)])
|
||
|
```
|
||
|
|
||
|
A `RangeKeyUnset` key's value is encoded as:
|
||
|
* End key, `varstring`, encodes the end user key of the fragment.
|
||
|
* A series of suffix `varstring`s.
|
||
|
|
||
|
When `RangeKeySet` and `RangeKeyUnset` fragments with identical bounds meet
|
||
|
within the same snapshot stripe within a compaction, any of the
|
||
|
`RangeKeyUnset`'s suffixes that exist within the `RangeKeySet` key are removed.
|
||
|
|
||
|
A `RangeKeyDelete` key has no additional data beyond its end key, which is
|
||
|
encoded directly in the value.
|
||
|
|
||
|
NB: `RangeKeySet` and `RangeKeyUnset` keys are not merged within batches or the
|
||
|
memtable. That's okay, because batches are append-only and indexed batches will
|
||
|
refragment and merge the range keys on-demand. In the memtable, every key is
|
||
|
guaranteed to have a unique sequence number.
|
||
|
|
||
|
### Sequence numbers
|
||
|
|
||
|
Like all Pebble keys, `RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` are
|
||
|
assigned sequence numbers when committed. As described above, overlapping
|
||
|
`RangeKeySet`s and `RangeKeyUnset`s are fragmented to have matching start and
|
||
|
end bounds. Then the resulting exactly-overlapping range key fragments are
|
||
|
merged into a single internal key-value pair, within the same snapshot stripe
|
||
|
and sstable. The original, unmerged internal keys each have their own sequence
|
||
|
numbers, indicating the moment they were committed within the history of all
|
||
|
write operations.
|
||
|
|
||
|
Recall that sequence numbers are used within Pebble to determine which keys
|
||
|
appear live to which iterators. When an iterator is constructed, it takes note
|
||
|
of the current _visible sequence number_, and for the lifetime of the iterator,
|
||
|
only surfaces keys less than that sequence number. Similarly, snapshots read the
|
||
|
current _visible sequence number_, remember it, but also leave a note asking
|
||
|
compactions to preserve history at that sequence number. The space between
|
||
|
snapshotted sequence numbers is referred to as a _snapshot stripe_, and
|
||
|
operations cannot drop or otherwise mutate keys unless they fall within the same
|
||
|
_snapshot stripe_. For example a `k.MERGE#5` key may not be merged with a
|
||
|
`k.MERGE#1` operation if there's an open snapshot at `#3`.
|
||
|
|
||
|
The new `RangeKeySet`, `RangeKeyUnset` and `RangeKeyDelete` keys behave
|
||
|
similarly. Overlapping range keys won't be merged if there's an open snapshot
|
||
|
separating them. Consider a range key `a-z` written at sequence number `#1` and
|
||
|
a point key `d.SET#2`. A combined point-and-range iterator using a sequence
|
||
|
number `#3` and positioned at `d` will surface both the range key `a-z` and the
|
||
|
point key `d`.
|
||
|
|
||
|
In the context of masking, the suffix-based masking of range keys can cause
|
||
|
potentially unexpected behavior. A range key `[a,z)@10` may be committed as
|
||
|
sequence number `#1`. Afterwards, a point key `d@5#2` may be committed. An
|
||
|
iterator that is configured with range-key masking with suffix `@20` would mask
|
||
|
the point key `d@5#2` because although `d@5#2`'s sequence number is higher,
|
||
|
range-key masking uses suffixes to impose order, not sequence numbers.
|
||
|
|
||
|
### Boundaries for sstables
|
||
|
|
||
|
Range keys follow the same relationship to sstable bounadries as the existing
|
||
|
`RANGEDEL` tombstones. The bounds of an internal range key are user keys. Every
|
||
|
range key is limited by its containing sstable's bounds.
|
||
|
|
||
|
Consider these keys, annotated with sequence numbers:
|
||
|
|
||
|
```
|
||
|
Point keys: a#50, b#70, b#49, b#48, c#47, d#46, e#45, f#44
|
||
|
Range key: [a,e)#60
|
||
|
```
|
||
|
|
||
|
We have created three versions of `b` in this example. In previous versions,
|
||
|
Pebble could split output sstables during a compaction such that the different
|
||
|
`b` versions span more than one sstable. This creates problems for `RANGEDEL`s
|
||
|
which span these two sstables which are discussed in the section on [improperly
|
||
|
truncated RANGEDELS](https://github.com/cockroachdb/pebble/blob/master/docs/range_deletions.md#improperly-truncated-range-deletes).
|
||
|
We manage to tolerate this for `RANGEDEL`s since their semantics are defined by
|
||
|
the system, which is not true for these range keys where the actual semantics
|
||
|
are up to the user.
|
||
|
|
||
|
Pebble now disallows such sstable split points. In this example, by postponing
|
||
|
the sstable split point to the user key c, we can cleanly split the range key
|
||
|
into `[a,c)#60` and `[c,e)#60`. The sstable end bound for the first sstable
|
||
|
(sstable bounds are inclusive) will be c#inf (where inf is the largest possible
|
||
|
seqnum, which is unused except for these cases), and sstable start bound for the
|
||
|
second sstable will be c#60.
|
||
|
|
||
|
The above example deals exclusively with point and range keys without suffixes.
|
||
|
Consider this example with suffixed keys, and compaction outputs split in the
|
||
|
middle of the `b` prefix:
|
||
|
|
||
|
```
|
||
|
first sstable: points: a@100, a@30, b@100, b@40 ranges: [a,c)@50
|
||
|
second sstable: points: b@30, c@40, d@40, e@30, ranges: [c,e)@50
|
||
|
```
|
||
|
|
||
|
When the compaction code decides to defer `b@30` to the next sstable and finish
|
||
|
the first sstable, the range key `[a,c)@50` is sitting in the fragmenter. The
|
||
|
compaction must split the range key at the bounds determined by the user key.
|
||
|
The compaction uses the first point key of the next sstable, in this case
|
||
|
`b@30`, to truncate the range key. The compaction flushes the fragment
|
||
|
`[a,b@30)@50` to the first sstable and updates the existing fragment to begin at
|
||
|
`b@30`.
|
||
|
|
||
|
If a range key extends into the next file, the range key's truncated end is used
|
||
|
for the purposes of determining the sstable end boundary. The first sstable's
|
||
|
end boundary becomes `b@30#inf`, signifying the range key does not cover `b@30`.
|
||
|
The second sstable's start boundary is `b@30`.
|
||
|
|
||
|
### Block property collectors
|
||
|
|
||
|
Separate block property collectors may be configured to collect separate
|
||
|
properties about range keys. This is necessary for CockroachDB's MVCC block
|
||
|
property collectors to ensure the sstable-level properties are correct.
|
||
|
|
||
|
### Iteration
|
||
|
|
||
|
This design extends the `*pebble.Iterator` with the ability to iterate over
|
||
|
exclusively range keys, range keys and point keys together or exclusively point
|
||
|
keys (the previous behavior).
|
||
|
|
||
|
- Pebble already requires that the prefix `k` follows the same key validity
|
||
|
rules as `k@suffix`.
|
||
|
|
||
|
- Previously, Pebble did not require that a user key consisting of just a prefix
|
||
|
`k` sort before the same prefix with a non-empty suffix. CockroachDB has
|
||
|
adopted this behavior since it results in the following clean behavior:
|
||
|
`RANGEDEL` over [k1, k2) deletes all versioned keys which have prefixes in the
|
||
|
interval [k1, k2). Pebble will now require this behavior for all users using
|
||
|
MVCC keys. Specifically, it must hold that `Compare(k[:Split(k)], k) < 0` if
|
||
|
`Split(k) < len(k)`.
|
||
|
|
||
|
# TKTK: Discuss merging iterator
|
||
|
|
||
|
#### Determinism
|
||
|
|
||
|
Range keys will be split based on boundaries of sstables in an LSM. Users of an
|
||
|
LSM typically expect that two different LSMs with different sstable settings
|
||
|
that receive the same writes should output the same key-value pairs when
|
||
|
iterating. To provide this behavior, the iterator implementation may be
|
||
|
configured to defragment range keys during iteration time. The defragmentation
|
||
|
behavior would be:
|
||
|
|
||
|
- Two visible ranges `[k1,k2)@suffix1=>val1`, `[k2,k3)@suffix2=>val2` are
|
||
|
defragmented if suffix1==suffix2 and val1==val2, and become [k1,k3).
|
||
|
|
||
|
- Defragmentation during user iteration does not consider the sequence number.
|
||
|
This is necessary since LSM state can be exported to another LSM via the use
|
||
|
of sstable ingestion, which can collapse different seqnums to the same seqnum.
|
||
|
We would like both LSMs to look identical to the user when iterating.
|
||
|
|
||
|
The above defragmentation is conceptually simple, but hard to implement
|
||
|
efficiently, since it requires stepping ahead from the current position to
|
||
|
defragment range keys. This stepping ahead could switch sstables while there are
|
||
|
still points to be consumed in a previous sstable. This determinism is useful
|
||
|
for testing and verification purposes:
|
||
|
|
||
|
- Randomized and metamorphic testing is used extensively to reliably test
|
||
|
software including Pebble and CockroachDB. Defragmentation provides
|
||
|
the determinism necessary for this form of testing.
|
||
|
|
||
|
- CockroachDB's replica divergence detector requires a consistent view of the
|
||
|
database on each replica.
|
||
|
|
||
|
In order to provide determinism, Pebble constructs an internal range key
|
||
|
iterator stack that's separate from the point iterator stack, even when
|
||
|
performing combined iteration over both range and point keys. The separate range
|
||
|
key iterator allows the internal range key iterator to move independently of the
|
||
|
point key iterator. This allows the range key iterator to independently visit
|
||
|
adjacent sstables in order to defragment their range keys if necessary, without
|
||
|
repositioning the point iterator.
|
||
|
|
||
|
Two spans [k1,k2) and [k3, k4) of range keys are defragmented if their bounds
|
||
|
abut and their user observable-state is identical. That is, `k2==k3` and each
|
||
|
spans' contains exactly the same set of range key (<suffix>, <tuple>) pairs. In
|
||
|
order to support `RangeKeyUnset` and `RangeKeyDelete`, defragmentation must be
|
||
|
applied _after_ resolving unset and deletes.
|
||
|
|
||
|
#### Merging iteration
|
||
|
|
||
|
Recall that range keys are stored in the same sstables as point keys. In a
|
||
|
log-structured merge tree, these sstables are distributed across levels. Within
|
||
|
a level, sstables are non-overlapping but between levels sstables may overlap
|
||
|
arbitrarily. During iteration, keys across levels must be merged together. For
|
||
|
point keys, this is typically done with a heap.
|
||
|
|
||
|
Range keys too must be merged across levels, and the earlier described
|
||
|
fragmentation at intersection boundaries must be applied. To implement this, a
|
||
|
range key merging iterator is defined.
|
||
|
|
||
|
A merging iterator is initialized with an arbitrary number of child iterators
|
||
|
over fragmented spans. Each child iterator exposes fragmented range keys, such
|
||
|
that overlapping range keys are surfaced in a single span with a single set of
|
||
|
bounds. Range keys from one child iterator may overlap key spans from another
|
||
|
child iterator arbitrarily. The high-level algorithm is:
|
||
|
|
||
|
1. Initialize a heap with bound keys from child iterators' range keys.
|
||
|
2. Find the next [or previous, if in reverse] two unique user keys' from bounds.
|
||
|
3. Consider the span formed between the two unique user keys a candidate span.
|
||
|
4. Determine if any of the child iterators' spans overlap the candidate span.
|
||
|
4a. If any of the child iterator's current bounds are end keys (during
|
||
|
forward iteration) or start keys (during reverse iteration), then all the
|
||
|
spans with that bound overlap the candidate span.
|
||
|
4b. If no spans overlap, forget the smallest (forward iteration) or largest
|
||
|
(reverse iteration) unique user key and advance the iterators to the next
|
||
|
unique user key. Start again from 3.
|
||
|
|
||
|
Consider the example:
|
||
|
|
||
|
```
|
||
|
i0: b---d e-----h
|
||
|
i1: a---c h-----k
|
||
|
i2: a------------------------------p
|
||
|
|
||
|
fragments: a-b-c-d-e-----h-----k----------p
|
||
|
```
|
||
|
|
||
|
None of the individual child iterators contain a span with the exact bounds
|
||
|
[c,d), but the merging iterator must produce a span [c,d). To accomplish this,
|
||
|
the merging iterator visits every span between unique boundary user keys. In the
|
||
|
above example, this is:
|
||
|
|
||
|
```
|
||
|
[a,b), [b,c), [c,d), [d,e), [e, h), [h, k), [k, p)
|
||
|
```
|
||
|
|
||
|
The merging iterator first initializes the heap to prepare for iteration. The
|
||
|
description below discusses the mechanics of forward iteration after a call to
|
||
|
First, but the mechanics are similar for reverse iteration and other positioning
|
||
|
methods.
|
||
|
|
||
|
During a call to First, the heap is initialized by seeking every level to the
|
||
|
first bound of the first fragment. In the above example, this seeks the child
|
||
|
iterators to:
|
||
|
|
||
|
```
|
||
|
i0: (b, boundKindStart, [ [b,d) ])
|
||
|
i1: (a, boundKindStart, [ [a,c) ])
|
||
|
i2: (a, boundKindStart, [ [a,p) ])
|
||
|
```
|
||
|
|
||
|
After fixing up the heap, the root of the heap is the bound with the smallest
|
||
|
user key ('a' in the example). During forward iteration, the root of the heap's
|
||
|
user key is the start key of next merged span. The merging iterator records this
|
||
|
key as the start key. The heap may contain other levels with range keys that
|
||
|
also have the same user key as a bound of a range key, so the merging iterator
|
||
|
pulls from the heap until it finds the first bound greater than the recorded
|
||
|
start key.
|
||
|
|
||
|
In the above example, this results in the bounds `[a,b)` and child iterators in
|
||
|
the following positions:
|
||
|
|
||
|
```
|
||
|
i0: (b, boundKindStart, [ [b,d) ])
|
||
|
i1: (c, boundKindEnd, [ [a,c) ])
|
||
|
i2: (p, boundKindEnd, [ [a,p) ])
|
||
|
```
|
||
|
|
||
|
With the user key bounds of the next merged span established, the merging
|
||
|
iterator must determine which, if any, of the range keys overlap the span.
|
||
|
During forward iteration any child iterator that is now positioned at an end
|
||
|
boundary has an overlapping span. (Justification: The child iterator's end
|
||
|
boundary is ≥ the new end bound. The child iterator's range key's corresponding
|
||
|
start boundary must be ≤ the new start bound since there were no other user keys
|
||
|
between the new span's bounds. So the fragments associated with the iterator's
|
||
|
current end boundary have start and end bounds such that start ≤ <new start
|
||
|
bound> < <new end bound> ≤ end).
|
||
|
|
||
|
The merging iterator iterates over the levels, collecting keys from any child
|
||
|
iterators positioned at end boundaries. In the above example, i1 and i2 are
|
||
|
positioned at end boundaries, so the merging iterator collects the keys of [a,c)
|
||
|
and [a,p). These spans contain the merging iterator's [a,b) span, but they may
|
||
|
also extend beyond the new span's start and end. The merging iterator returns
|
||
|
the keys with the new start and end bounds, preserving the underlying keys'
|
||
|
sequence numbers, key kinds and values.
|
||
|
|
||
|
It may be the case that the merging iterator finds no levels positioned at span
|
||
|
end boundaries in which case the span overlaps with nothing. In this case the
|
||
|
merging iterator loops, repeating the above process again until it finds a span
|
||
|
that does contain keys.
|
||
|
|
||
|
#### Efficient masking
|
||
|
|
||
|
Recollect that in the earlier example from the iteration interface, during
|
||
|
forward iteration an iterator would output the following keys:
|
||
|
|
||
|
```
|
||
|
Key() HasPointAndRange() Value() RangeKeyBounds() RangeKeys()
|
||
|
a (true, true) artichoke [a,b) {(@1,apple)}
|
||
|
b (false, true) - [b,c) {(@7,kiwi), (@1,apple)}
|
||
|
b@2 (true, true) beet [b,c) {(@7,kiwi), (@1,apple)}
|
||
|
c (false, true) - [c,e) {(@7,kiwi), (@3,banana), (@1,apple)}
|
||
|
e (false, true) - [e,k) {(@7,kiwi), (@5,orange), (@1,apple)}
|
||
|
k (false, true) - [k,m) {(@5,orange), (@1,apple)}
|
||
|
m (false, true) - [m,z) {(@1,apple)}
|
||
|
t@3 (true, true) turnip [m,z) {(@1,apple)}
|
||
|
```
|
||
|
|
||
|
When implementing an MVCC "soft delete range" operation using range keys, the
|
||
|
range key `[b,k)@7→kiwi` may represent that all keys within the range [b,k) are
|
||
|
deleted at MVCC timestamp @7. During iteration, it would be desirable if the
|
||
|
caller could indicate that it does not want to observe any "soft deleted" point
|
||
|
keys, and the iterator can safely skip them. Note that in a MVCC system, whether
|
||
|
or not a key is soft deleted depends on the timestamp at which the database is
|
||
|
read.
|
||
|
|
||
|
This is implemented through "range key masking," where a range key may act as a
|
||
|
mask, hiding point keys with MVCC timestamps beneath the range key. This
|
||
|
iterator option requires that the client configure the iterator with a MVCC
|
||
|
timestamp `suffix` representing the timestamp at which history should be read.
|
||
|
All range keys with suffixes (MVCC timestamps) less than or equal to the
|
||
|
configured suffix serve as masks. All point keys with suffixes (MVCC timestamps)
|
||
|
less than a covering, masking range key's suffix are hidden.
|
||
|
|
||
|
Specifically, when configured with a RangeKeyMasking.Suffix _s_, and there
|
||
|
exists a range key with suffix _r_ covering a point key with suffix _p_, and _s_
|
||
|
≤ _r_ < _p_ then the point key is elided.
|
||
|
|
||
|
In the above example, if `RangeKeyMasking.Suffix` is set to `@7`, every range
|
||
|
key serves as a mask and the point key `b@2` is hidden during iteration because
|
||
|
it's contained within the masking `[b,k)@7→kiwi` range key. Note that `t@3`
|
||
|
would _not_ be masked, because its timestamp `@3` is more recent than the only
|
||
|
range key that covers it (`[a,z)@1→apple`).
|
||
|
|
||
|
If `RangeKeyMasking.Suffix` were set to `@6` (a historical, point-in-time read),
|
||
|
the `[b,k)@7→kiwi` range key would no longer serve as a mask, and `b@2` would be
|
||
|
visible.
|
||
|
|
||
|
To efficiently implement masking, we cannot rely on the LSM invariant since
|
||
|
`b@100` can be at a lower level than `[a,e)@50`. Instead, we build on
|
||
|
block-property filters, supporting special use of a MVCC timestamp block
|
||
|
property in order to skip blocks wholly containing point keys that are masked by
|
||
|
a range key. The client may configure a block-property collector to record the
|
||
|
highest MVCC timestamps of point keys within blocks.
|
||
|
|
||
|
During read time, when positioned within a range key with a suffix ≤
|
||
|
`RangeKeyMasking.Suffix`, the iterator configures sstable readers to use a
|
||
|
block-property filter to skip any blocks for which the highest MVCC timestamp is
|
||
|
less than the provided suffix. Additionally, these iterators must consult index
|
||
|
block bounds to ensure the block-property filter is not applied beyond the
|
||
|
bounds of the masking range key.
|
||
|
|
||
|
### CockroachDB use
|
||
|
|
||
|
CockroachDB initially will only use range keys to represent MVCC range
|
||
|
tombstones. See the MVCC range tombstones tech note for more details:
|
||
|
|
||
|
https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/mvcc-range-tombstones.md
|
||
|
|
||
|
### Alternatives
|
||
|
|
||
|
#### A1. Automatic elision of range keys that don't cover keys
|
||
|
|
||
|
We could decide that range keys:
|
||
|
|
||
|
- Don't contribute to `MVCCStats` themselves.
|
||
|
- May be elided by Pebble when they cover zero point keys.
|
||
|
|
||
|
This means that CockroachDB garbage collection does not need to explicitly
|
||
|
remove the range keys, only the point keys they deleted. This option is clean
|
||
|
when paired with `RANGEDEL`s dropping both point and range keys. CockroachDB can
|
||
|
issue `RANGEDEL`s whenever it wants to drop a contiguous swath of points, and
|
||
|
not worry about the fact that it might also need to update the MVCC stats for
|
||
|
overlapping range keys.
|
||
|
|
||
|
However, this option makes deterministic iteration over defragmented range keys
|
||
|
for replica divergence detection challenging, because internal fragmentation may
|
||
|
elide regions of a range key at any point. Producing a normalized form would
|
||
|
require storing state in the value (ie, the original start key) and
|
||
|
recalculating the smallest and largest extant covered point keys within the
|
||
|
range key and replica bounds. This would require maintaining _O_(range-keys)
|
||
|
state during the `storage.ComputeStatsForRange` pass over a replica's combined
|
||
|
point and range iterator.
|
||
|
|
||
|
This likely forces replica divergence detection to use other means (eg, altering
|
||
|
the checksum of covered points) to incorporate MVCC range tombstone state.
|
||
|
|
||
|
This option is also highly tailored to the MVCC Delete Range use case. Other
|
||
|
range key usages, like ranged intents, would not want this behavior, so we don't
|
||
|
consider it further.
|
||
|
|
||
|
#### A2. Separate LSM of range keys
|
||
|
|
||
|
There are two viable options for where to store range keys. They may be encoded
|
||
|
within the same sstables as points in separate blocks, or in separate sstables
|
||
|
forming a parallel range-key LSM. We examine the tradeoffs between storing range
|
||
|
keys in the same sstable in different blocks ("shared sstables") or separate
|
||
|
sstables forming a parallel LSM ("separate sstables"):
|
||
|
|
||
|
- Storing range keys in separate sstables is possible because the only
|
||
|
iteractions between range keys and point keys happens at a global level.
|
||
|
Masking is defined over suffixes. It may be extended to be defined over
|
||
|
sequence numbers too (see 'Sequence numbers' section below), but that is
|
||
|
optional. Unlike range deletion tombstones, range keys have no effect on point
|
||
|
keys during compactions.
|
||
|
|
||
|
- With separate sstables, reads may need to open additional sstable(s) and read
|
||
|
additional blocks. The number of additional sstables is the number of nonempty
|
||
|
levels in the range-key LSM, so it grows logarithmically with the number of
|
||
|
range keys. For each sstable, a read must read the index block and a data
|
||
|
block.
|
||
|
|
||
|
- With our expectation of few range keys, the range-key LSM is expected to be
|
||
|
small, with one or two levels. Heuristics around sstable boundaries may
|
||
|
prevent unnecessary range-key reads when there is no covering range key. Range
|
||
|
key sstables and blocks are expected to have much higher table and block cache
|
||
|
hit rates, since they are orders of magnitude less dense. Reads in any
|
||
|
overlapping point sstables all access the same range key sstables.
|
||
|
|
||
|
- With shared sstables, `SeekPrefixGE` cannot use bloom filters to entirely
|
||
|
eliminate sstables that contain range keys. Pebble does not always use bloom
|
||
|
filters in L6, so once a range key is compacted into L6 its impact to
|
||
|
`SeekPrefixGE` is lessened. With separate sstables, `SeekPrefixGE` can always
|
||
|
use bloom filters for point-key sstables. If there are any overlapping
|
||
|
range-key sstables, the read must read them.
|
||
|
|
||
|
- With shared sstables, range keys create dense sstable boundaries. A range key
|
||
|
spanning an sstable boundary leaves no gap between the sstables' bounds. This
|
||
|
can force ingested sstables into higher levels of the LSM, even if the
|
||
|
sstables' point key spans don't overlap. This problem was previously observed
|
||
|
with wide `RANGEDEL` tombstones and was mitigated by prioritizing compaction
|
||
|
of sstables that contain `RANGEDEL` keys. We could do the same with range
|
||
|
keys, but the write amplification is expected to be much worse. The `RANGEDEL`
|
||
|
tombstones drop keys and eventually are dropped themselves as long as there is
|
||
|
not an open snapshot. Range keys do not drop data and are expected to persist
|
||
|
in L6 for long durations, always requiring ingested sstables to be inserted
|
||
|
into L5 or above.
|
||
|
|
||
|
- With separate sstables, compaction logic is separate, which helps avoid
|
||
|
complexity of tricky sstable boundary conditions. Because there are expected
|
||
|
to be an order of magnitude fewer range keys, we could impose the constraint
|
||
|
that a prefix cannot be split across multiple range key sstables. The
|
||
|
simplified compaction logic comes at the cost of higher levels, iterators, etc
|
||
|
all needing to deal with the concept of two parallel LSMs.
|
||
|
|
||
|
- With shared sstables, the LSM invariant is maintained between range keys and
|
||
|
point keys. For example, if the point key `b@20` is committed, and
|
||
|
subsequently a range key `RangeKey([a,c), @25, ...)` is committed, the range
|
||
|
key will never fall below the covered point `b@20` within the LSM.
|
||
|
|
||
|
We decide to share sstables, because preserving the LSM invariant between range
|
||
|
keys and point keys is expected to be useful in the long-term.
|
||
|
|
||
|
#### A3. Sequence number masking
|
||
|
|
||
|
In the CockroachDB MVCC range tombstone use case, a point key should never be
|
||
|
written below an existing range key with a higher timestamp. The MVCC range
|
||
|
tombstone use case would allow us to dictate that an overlapping range key with
|
||
|
a higher sequence number always masks range keys with lower sequence numbers.
|
||
|
Adding this additional masking scope would avoid the comparatively costly suffix
|
||
|
comparison when a point key _is_ masked by a range key. We need to consider how
|
||
|
sequence number masking might be affected by the merging of range keys within
|
||
|
snapshot stripes.
|
||
|
|
||
|
Consider the committing of range key `[a,z)@{t1}#10`, followed by point keys
|
||
|
`d@t2#11` and `m@t2#11`, followed by range key `[j,z)@{t3}#12`. This sequencing
|
||
|
respects the expected timestamp, sequence number relationship in CockroachDB's
|
||
|
use case. If all keys are flushed within the same sstable, fragmentation and
|
||
|
merging overlapping fragments yields range keys `[a,j)@{t1}#10`,
|
||
|
`[j,z)@{t3,t1}#12`. The key `d@t2#11` must not be masked because it's not
|
||
|
covered by the new range key, and indeed that's the case because the covering
|
||
|
range key's fragment is unchanged `[a,j)@{t1}#10`.
|
||
|
|
||
|
For now we defer this optimization, with the expectation that we may not be able
|
||
|
to preserve this relationship between sequence numbers and suffixes in all range
|
||
|
key use cases.
|