reftable | |
-------- | |
Overview | |
~~~~~~~~ | |
Problem statement | |
^^^^^^^^^^^^^^^^^ | |
Some repositories contain a lot of references (e.g. android at 866k, | |
rails at 31k). The existing packed-refs format takes up a lot of space | |
(e.g. 62M), and does not scale with additional references. Lookup of a | |
single reference requires linearly scanning the file. | |
Atomic pushes modifying multiple references require copying the entire | |
packed-refs file, which can be a considerable amount of data moved | |
(e.g. 62M in, 62M out) for even small transactions (2 refs modified). | |
Repositories with many loose references occupy a large number of disk | |
blocks from the local file system, as each reference is its own file | |
storing 41 bytes (and another file for the corresponding reflog). This | |
negatively affects the number of inodes available when a large number of | |
repositories are stored on the same filesystem. Readers can be penalized | |
due to the larger number of syscalls required to traverse and read the | |
`$GIT_DIR/refs` directory. | |
Objectives | |
^^^^^^^^^^ | |
* Near constant time lookup for any single reference, even when the | |
repository is cold and not in process or kernel cache. | |
* Near constant time verification if an object name is referred to by at least | |
one reference (for allow-tip-sha1-in-want). | |
* Efficient enumeration of an entire namespace, such as `refs/tags/`. | |
* Support atomic push with `O(size_of_update)` operations. | |
* Combine reflog storage with ref storage for small transactions. | |
* Separate reflog storage for base refs and historical logs. | |
Description | |
^^^^^^^^^^^ | |
A reftable file is a portable binary file format customized for | |
reference storage. References are sorted, enabling linear scans, binary | |
search lookup, and range scans. | |
Storage in the file is organized into variable sized blocks. Prefix | |
compression is used within a single block to reduce disk space. Block | |
size and alignment is tunable by the writer. | |
Performance | |
^^^^^^^^^^^ | |
Space used, packed-refs vs. reftable:
[cols=",>,>,>,>,>",options="header",]
|===============================================================
|repository |packed-refs |reftable |% original |avg ref |avg obj
|android |62.2 M |36.1 M |58.0% |33 bytes |5 bytes
|rails |1.8 M |1.1 M |57.7% |29 bytes |4 bytes
|git |78.7 K |48.1 K |61.0% |50 bytes |4 bytes
|git (heads) |332 b |269 b |81.0% |33 bytes |0 bytes
|===============================================================
Scan (read 866k refs), by reference name lookup (single ref from 866k
refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs):
[cols=",>,>,>,>",options="header",]
|=========================================================
|format |cache |scan |by name |by SHA-1
|packed-refs |cold |402 ms |409,660.1 usec |412,535.8 usec
|packed-refs |hot | |6,844.6 usec |20,110.1 usec
|reftable |cold |112 ms |33.9 usec |323.2 usec
|reftable |hot | |20.2 usec |320.8 usec
|=========================================================
Space used for 149,932 log entries for 43,061 refs, reflog vs. reftable:
[cols=",>,>",options="header",]
|================================
|format |size |avg entry
|$GIT_DIR/logs |173 M |1209 bytes
|reftable |5 M |37 bytes
|================================
Details | |
~~~~~~~ | |
Peeling | |
^^^^^^^ | |
References stored in a reftable are peeled, a record for an annotated | |
(or signed) tag records both the tag object, and the object it refers | |
to. This is analogous to storage in the packed-refs format. | |
Reference name encoding | |
^^^^^^^^^^^^^^^^^^^^^^^ | |
Reference names are an uninterpreted sequence of bytes that must pass | |
linkgit:git-check-ref-format[1] as a valid reference name. | |
Key unicity | |
^^^^^^^^^^^ | |
Each entry must have a unique key; repeated keys are disallowed. | |
Network byte order | |
^^^^^^^^^^^^^^^^^^ | |
All multi-byte, fixed width fields are in network byte order. | |
Varint encoding | |
^^^^^^^^^^^^^^^ | |
Varint encoding is identical to the ofs-delta encoding method used | |
within pack files. | |
The decoder works as follows:
....
val = buf[ptr] & 0x7f
while (buf[ptr] & 0x80) {
  ptr++
  val = ((val + 1) << 7) | (buf[ptr] & 0x7f)
}
....
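The decoder above can be paired with an encoder to check the round trip. The following is a minimal Python sketch, not the reference implementation; the function names `encode_varint` and `decode_varint` are illustrative and do not come from the format description.

```python
def encode_varint(value):
    """Encode a non-negative integer as an ofs-delta style varint."""
    if value < 0:
        raise ValueError("varint values must be non-negative")
    out = [value & 0x7f]            # final byte: continuation bit clear
    value >>= 7
    while value:
        value -= 1                  # mirrors the "+ 1" in the decoder
        out.append(0x80 | (value & 0x7f))
        value >>= 7
    return bytes(reversed(out))     # most significant group is written first


def decode_varint(buf, ptr=0):
    """Return (value, position of the first byte after the varint)."""
    val = buf[ptr] & 0x7f
    while buf[ptr] & 0x80:
        ptr += 1
        val = ((val + 1) << 7) | (buf[ptr] & 0x7f)
    return val, ptr + 1
```

The `- 1` in the encoder is what distinguishes this scheme from a plain base-128 varint: it ensures every value has exactly one encoding, so `128` is written as the two bytes `0x80 0x00`.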
Ordering | |
^^^^^^^^ | |
Blocks are lexicographically ordered by their first reference. | |
Directory/file conflicts | |
^^^^^^^^^^^^^^^^^^^^^^^^ | |
The reftable format accepts both `refs/heads/foo` and | |
`refs/heads/foo/bar` as distinct references. | |
This property is useful for retaining log records in reftable, but may | |
confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain | |
references. Users of reftable may choose to continue to reject `foo` and | |
`foo/bar` type conflicts to prevent problems for peers. | |
File format | |
~~~~~~~~~~~ | |
Structure | |
^^^^^^^^^ | |
A reftable file has the following high-level structure: | |
....
first_block {
  header
  first_ref_block
}
ref_block*
ref_index*
obj_block*
obj_index*
log_block*
log_index*
footer
....
A log-only file omits the `ref_block`, `ref_index`, `obj_block` and | |
`obj_index` sections, containing only the file header and log block: | |
....
first_block {
  header
}
log_block*
log_index*
footer
....
In a log-only file the first log block immediately follows the file
header, without padding to block alignment.
Block size | |
^^^^^^^^^^ | |
The file's block size is arbitrarily determined by the writer, and does | |
not have to be a power of 2. The block size must be larger than the | |
longest reference name or log entry used in the repository, as | |
references cannot span blocks. | |
Powers of two that are friendly to the virtual memory system or | |
filesystem (such as 4k or 8k) are recommended. Larger sizes (64k) can | |
yield better compression, with a possible increased cost incurred by | |
readers during access. | |
The largest block size is `16777215` bytes (15.99 MiB). | |
Block alignment | |
^^^^^^^^^^^^^^^ | |
Writers may choose to align blocks at multiples of the block size by | |
including `padding` filled with NUL bytes at the end of a block to round | |
out to the chosen alignment. When alignment is used, writers must | |
specify the alignment with the file header's `block_size` field. | |
Block alignment is not required by the file format. Unaligned files must | |
set `block_size = 0` in the file header, and omit `padding`. Unaligned | |
files with more than one ref block must include the link:#Ref-index[ref | |
index] to support fast lookup. Readers must be able to read both aligned | |
and non-aligned files. | |
Very small files (e.g. a single ref block) may omit `padding` and the ref | |
index to reduce total file size. | |
Header (version 1) | |
^^^^^^^^^^^^^^^^^^ | |
A 24-byte header appears at the beginning of the file: | |
....
'REFT'
uint8( version_number = 1 )
uint24( block_size )
uint64( min_update_index )
uint64( max_update_index )
....
Aligned files must specify `block_size` to configure readers with the | |
expected block alignment. Unaligned files must set `block_size = 0`. | |
The `min_update_index` and `max_update_index` describe bounds for the | |
`update_index` field of all log records in this file. When reftables are | |
used in a stack for link:#Update-transactions[transactions], these | |
fields can order the files such that the prior file's | |
`max_update_index + 1` is the next file's `min_update_index`. | |
Header (version 2) | |
^^^^^^^^^^^^^^^^^^ | |
A 28-byte header appears at the beginning of the file: | |
....
'REFT'
uint8( version_number = 2 )
uint24( block_size )
uint64( min_update_index )
uint64( max_update_index )
uint32( hash_id )
....
The header is identical to `version_number=1`, with the 4-byte hash ID
("sha1" for SHA1 and "s256" for SHA-256) appended to the header.
For maximum backward compatibility, it is recommended to use version 1 when | |
writing SHA1 reftables. | |
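As an illustration of the two header layouts, the sketch below parses either version from the first bytes of a file. It is a hedged example rather than Git's implementation; the function name `parse_header` and the dictionary keys are chosen here for illustration.

```python
import struct


def parse_header(data):
    """Parse a version 1 (24-byte) or version 2 (28-byte) file header."""
    if data[:4] != b"REFT":
        raise ValueError("bad magic")
    version = data[4]
    if version not in (1, 2):
        raise ValueError("unrecognized version %d" % version)
    header = {
        "version": version,
        # uint24 in network byte order; 0 means the file is unaligned
        "block_size": int.from_bytes(data[5:8], "big"),
    }
    header["min_update_index"], header["max_update_index"] = struct.unpack(
        ">QQ", data[8:24]
    )
    if version == 2:
        header["hash_id"] = data[24:28]   # b"sha1" or b"s256"
    return header
```

A reader would typically read the first 28 bytes of the file, call a function like this, and ignore the last 4 bytes when the version is 1.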
First ref block | |
^^^^^^^^^^^^^^^ | |
The first ref block shares the same block as the file header, and is 24 | |
bytes smaller than all other blocks in the file. The first block | |
immediately begins after the file header, at position 24. | |
If the first block is a log block (a log-only file), its block header | |
begins immediately at position 24. | |
Ref block format | |
^^^^^^^^^^^^^^^^ | |
A ref block is written as: | |
....
'r'
uint24( block_len )
ref_record+
uint24( restart_offset )+
uint16( restart_count )
padding?
....
Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which | |
encodes the number of bytes in the block up to, but not including the | |
optional `padding`. This is always less than or equal to the file's | |
block size. In the first ref block, `block_len` includes 24 bytes for | |
the file header. | |
The 2-byte `restart_count` stores the number of entries in the | |
`restart_offset` list, which must not be empty. Readers can use | |
`restart_count` to binary search between restarts before starting a | |
linear scan. | |
Exactly `restart_count` 3-byte `restart_offset` values precede the
`restart_count`. Offsets are relative to the start of the block and
refer to the first byte of any `ref_record` whose name has not been | |
prefix compressed. Entries in the `restart_offset` list must be sorted, | |
ascending. Readers can start linear scans from any of these records. | |
A variable number of `ref_record` fill the middle of the block, | |
describing reference names and values. The format is described below. | |
As the first ref block shares the first file block with the file header, | |
all `restart_offset` in the first block are relative to the start of the | |
file (position 0), and include the file header. This forces the first | |
`restart_offset` to be `28`. | |
ref record | |
++++++++++ | |
A `ref_record` describes a single reference, storing both the name and | |
its value(s). Records are formatted as: | |
....
varint( prefix_length )
varint( (suffix_length << 3) | value_type )
suffix
varint( update_index_delta )
value?
....
The `prefix_length` field specifies how many leading bytes of the prior | |
reference record's name should be copied to obtain this reference's | |
name. This must be 0 for the first reference in any block, and also must | |
be 0 for any `ref_record` whose offset is listed in the `restart_offset` | |
table at the end of the block. | |
Recovering a reference name from any `ref_record` is a simple concat: | |
....
this_name = prior_name[0..prefix_length] + suffix
....
The `suffix_length` value gives the number of bytes in `suffix`, which
are appended to the copied prefix to complete the reference name.
The `update_index` that last modified the reference can be obtained by | |
adding `update_index_delta` to the `min_update_index` from the file | |
header: `min_update_index + update_index_delta`. | |
The `value` follows. Its format is determined by `value_type`, one of | |
the following: | |
* `0x0`: deletion; no value data (see transactions, below) | |
* `0x1`: one object name; value of the ref | |
* `0x2`: two object names; value of the ref, peeled target | |
* `0x3`: symbolic reference: `varint( target_len ) target` | |
Symbolic references use `0x3`, followed by the complete name of the | |
reference target. No compression is applied to the target name. | |
Types `0x4..0x7` are reserved for future use. | |
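Putting the pieces of a `ref_record` together, a reader could decode consecutive records roughly as follows. This is a sketch under stated assumptions: object names are 20 bytes (SHA-1), and the names `read_varint`, `read_ref_record` and `hash_len` are illustrative rather than part of the format.

```python
def read_varint(buf, pos):
    """Decode an ofs-delta style varint; return (value, next position)."""
    val = buf[pos] & 0x7f
    while buf[pos] & 0x80:
        pos += 1
        val = ((val + 1) << 7) | (buf[pos] & 0x7f)
    return val, pos + 1


def read_ref_record(buf, pos, prior_name, min_update_index, hash_len=20):
    """Decode one ref_record starting at pos; return (record, next position)."""
    prefix_len, pos = read_varint(buf, pos)
    packed, pos = read_varint(buf, pos)
    suffix_len, value_type = packed >> 3, packed & 0x7
    # Prefix compression: reuse the leading bytes of the previous name.
    name = prior_name[:prefix_len] + buf[pos:pos + suffix_len]
    pos += suffix_len
    delta, pos = read_varint(buf, pos)
    record = {
        "name": name,
        "value_type": value_type,
        "update_index": min_update_index + delta,
    }
    if value_type == 0x1:                       # one object name
        record["value"] = buf[pos:pos + hash_len]
        pos += hash_len
    elif value_type == 0x2:                     # value plus peeled target
        record["value"] = buf[pos:pos + hash_len]
        record["peeled"] = buf[pos + hash_len:pos + 2 * hash_len]
        pos += 2 * hash_len
    elif value_type == 0x3:                     # symbolic reference
        target_len, pos = read_varint(buf, pos)
        record["target"] = buf[pos:pos + target_len]
        pos += target_len
    elif value_type != 0x0:                     # 0x0 is a deletion, no data
        raise ValueError("reserved value_type %#x" % value_type)
    return record, pos
```

When scanning a block, the caller passes the previous record's name as `prior_name`, and an empty byte string at the start of a block or at a restart point.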
Ref index | |
^^^^^^^^^ | |
The ref index stores the name of the last reference from every ref block | |
in the file, enabling reduced disk seeks for lookups. Any reference can | |
be found by searching the index, identifying the containing block, and | |
searching within that block. | |
The index may be organized into a multi-level index, where the 1st level | |
index block points to additional ref index blocks (2nd level), which may | |
in turn point to either additional index blocks (e.g. 3rd level) or ref | |
blocks (leaf level). Disk reads required to access a ref go up with | |
higher index levels. Multi-level indexes may be required to ensure no | |
single index block exceeds the file format's max block size of | |
`16777215` bytes (15.99 MiB). To achieve constant O(1) disk seeks for | |
lookups the index must be a single level, which is permitted to exceed | |
the file's configured block size, but not the format's max block size of | |
15.99 MiB. | |
If present, the ref index block(s) appear after the last ref block.
If there are at least 4 ref blocks, a ref index block should be written | |
to improve lookup times. Cold reads using the index require 2 disk reads | |
(read index, read block), and binary searching < 4 blocks also requires | |
<= 2 reads. Omitting the index block from smaller files saves space. | |
If the file is unaligned and contains more than one ref block, the ref | |
index must be written. | |
Index block format: | |
....
'i'
uint24( block_len )
index_record+
uint24( restart_offset )+
uint16( restart_count )
padding?
....
The index blocks begin with `block_type = 'i'` and a 3-byte `block_len` | |
which encodes the number of bytes in the block, up to but not including | |
the optional `padding`. | |
The `restart_offset` and `restart_count` fields are identical in format, | |
meaning and usage as in ref blocks. | |
To reduce the number of reads required for random access in very large | |
files the index block may be larger than other blocks. However, readers | |
must hold the entire index in memory to benefit from this, so it's a | |
time-space tradeoff in both file size and reader memory. | |
Increasing the file's block size decreases the index size. Alternatively | |
a multi-level index may be used, keeping index blocks within the file's | |
block size, but increasing the number of blocks that need to be | |
accessed. | |
index record | |
++++++++++++ | |
An index record describes the last entry in another block. Index records | |
are written as: | |
....
varint( prefix_length )
varint( (suffix_length << 3) | 0 )
suffix
varint( block_position )
....
Index records use prefix compression exactly like `ref_record`. | |
Index records store `block_position` after the suffix, specifying the | |
absolute position in bytes (from the start of the file) of the block | |
that ends with this reference. Readers can seek to `block_position` to | |
begin reading the block header. | |
Readers must examine the block header at `block_position` to determine | |
if the next block is another level index block, or the leaf-level ref | |
block. | |
Reading the index | |
+++++++++++++++++ | |
Readers loading the ref index must first read the footer (below) to | |
obtain `ref_index_position`. If not present, the position will be 0. The | |
`ref_index_position` is for the 1st level root of the ref index. | |
Obj block format | |
^^^^^^^^^^^^^^^^ | |
Object blocks are optional. Writers may choose to omit object blocks, | |
especially if readers will not use the object name to ref mapping. | |
Object blocks use unique, abbreviated 2-31 byte object name keys, mapping to | |
ref blocks containing references pointing to that object directly, or as | |
the peeled value of an annotated tag. Like ref blocks, object blocks use | |
the file's standard block size. The abbreviation length is available in | |
the footer as `obj_id_len`. | |
To save space in small files, object blocks may be omitted if the ref | |
index is not present, as brute force search will only need to read a few | |
ref blocks. When missing, readers should brute force a linear search of | |
all references to lookup by object name. | |
An object block is written as: | |
....
'o'
uint24( block_len )
obj_record+
uint24( restart_offset )+
uint16( restart_count )
padding?
....
Fields are identical to ref block. Binary search using the restart table | |
works the same as in reference blocks. | |
Because object names are abbreviated by writers to the shortest unique | |
abbreviation within the reftable, obj key lengths have a variable length. Their | |
length must be at least 2 bytes. Readers must compare only for common prefix | |
match within an obj block or obj index. | |
obj record | |
++++++++++ | |
An `obj_record` describes a single object abbreviation, and the blocks | |
containing references using that unique abbreviation: | |
....
varint( prefix_length )
varint( (suffix_length << 3) | cnt_3 )
suffix
varint( cnt_large )?
varint( position_delta )*
....
Like in reference blocks, abbreviations are prefix compressed within an | |
obj block. On large reftables with many unique objects, higher block | |
sizes (64k), and higher restart interval (128), a `prefix_length` of 2 | |
or 3 and `suffix_length` of 3 may be common in obj records (unique | |
abbreviation of 5-6 raw bytes, 10-12 hex digits). | |
Each record contains `position_count` number of positions for matching | |
ref blocks. For 1-7 positions the count is stored in `cnt_3`. When | |
`cnt_3 = 0` the actual count follows in a varint, `cnt_large`. | |
The use of `cnt_3` bets most objects are pointed to by only a single | |
reference, some may be pointed to by a couple of references, and very | |
few (if any) are pointed to by more than 7 references. | |
A special case exists when `cnt_3 = 0` and `cnt_large = 0`: there are no | |
`position_delta`, but at least one reference starts with this | |
abbreviation. A reader that needs exact reference names must scan all | |
references to find which specific references have the desired object. | |
Writers should use this format when the `position_delta` list would have | |
overflowed the file's block size due to a high number of references | |
pointing to the same object. | |
The first `position_delta` is the position from the start of the file. | |
Additional `position_delta` entries are sorted ascending and relative to | |
the prior entry, e.g. a reader would perform: | |
....
pos = position_delta[0]
prior = pos
for (j = 1; j < position_count; j++) {
  pos = prior + position_delta[j]
  prior = pos
}
....
With a position in hand, a reader must linearly scan the ref block, | |
starting from the first `ref_record`, testing each reference's object names | |
(for `value_type = 0x1` or `0x2`) for full equality. Faster searching by | |
object name within a single ref block is not supported by the reftable format. | |
Smaller block sizes reduce the number of candidates this step must | |
consider. | |
Obj index | |
^^^^^^^^^ | |
The obj index stores the abbreviation from the last entry for every obj | |
block in the file, enabling reduced disk seeks for all lookups. It is | |
formatted exactly the same as the ref index, but refers to obj blocks. | |
The obj index should be present if obj blocks are present, as obj blocks | |
should only be written in larger files. | |
Readers loading the obj index must first read the footer (below) to | |
obtain `obj_index_position`. If not present, the position will be 0. | |
Log block format | |
^^^^^^^^^^^^^^^^ | |
Unlike ref and obj blocks, log blocks are always unaligned. | |
Log blocks are variable in size, and do not match the `block_size` | |
specified in the file header or footer. Writers should choose an | |
appropriate buffer size to prepare a log block for deflation, such as | |
`2 * block_size`. | |
A log block is written as: | |
....
'g'
uint24( block_len )
zlib_deflate {
  log_record+
  uint24( restart_offset )+
  uint16( restart_count )
}
....
Log blocks look similar to ref blocks, except `block_type = 'g'`. | |
The 4-byte block header is followed by the deflated block contents using | |
zlib deflate. The `block_len` in the header is the inflated size | |
(including 4-byte block header), and should be used by readers to | |
preallocate the inflation output buffer. A log block's `block_len` may | |
exceed the file's block size. | |
Offsets within the log block (e.g. `restart_offset`) still include the | |
4-byte header. Readers may prefer prefixing the inflation output buffer | |
with the 4-byte header. | |
Within the deflate container, a variable number of `log_record` describe | |
reference changes. The log record format is described below. See ref | |
block format (above) for a description of `restart_offset` and | |
`restart_count`. | |
Because log blocks have no alignment or padding between blocks, readers | |
must keep track of the bytes consumed by the inflater to know where the | |
next log block begins. | |
log record | |
++++++++++ | |
Log record keys are structured as: | |
....
ref_name '\0' reverse_int64( update_index )
....
where `update_index` is the unique transaction identifier. The | |
`update_index` field must be unique within the scope of a `ref_name`. | |
See the update transactions section below for further details. | |
The `reverse_int64` function inverts the value so that lexicographical
ordering of the network byte order encoding sorts the more recent records,
which have higher `update_index` values, first:
....
reverse_int64(int64 t) {
  return 0xffffffffffffffff - t;
}
....
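To see the effect on sort order, the key construction can be sketched in Python. The helper name `log_key` is hypothetical; the layout follows the key structure given above.

```python
import struct


def reverse_int64(t):
    """Invert a 64-bit update_index so larger values sort first."""
    return 0xffffffffffffffff - t


def log_key(ref_name, update_index):
    """Build ref_name '\\0' reverse_int64(update_index) as bytes."""
    return ref_name + b"\0" + struct.pack(">Q", reverse_int64(update_index))


# Keys for one reference, created oldest first.
keys = [log_key(b"refs/heads/main", i) for i in (1, 2, 3)]
# Plain bytewise sorting places update_index 3 (the newest) first.
newest_first = sorted(keys)
```

Because the inverted index is stored big-endian, a bytewise comparison of keys groups records by reference name and orders each group from newest to oldest.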
Log records have a similar starting structure to ref and index records, | |
utilizing the same prefix compression scheme applied to the log record | |
key described above. | |
....
varint( prefix_length )
varint( (suffix_length << 3) | log_type )
suffix
log_data {
  old_id
  new_id
  varint( name_length ) name
  varint( email_length ) email
  varint( time_seconds )
  sint16( tz_offset )
  varint( message_length ) message
}?
....
Log record entries use `log_type` to indicate what follows: | |
* `0x0`: deletion; no log data. | |
* `0x1`: standard git reflog data using `log_data` above. | |
The `log_type = 0x0` is mostly useful for `git stash drop`, removing an | |
entry from the reflog of `refs/stash` in a transaction file (below), | |
without needing to rewrite larger files. Readers reading a stack of | |
reflogs must treat this as a deletion. | |
For `log_type = 0x1`, the `log_data` section follows | |
linkgit:git-update-ref[1] logging and includes: | |
* two object names (old id, new id) | |
* varint string of committer's name | |
* varint string of committer's email | |
* varint time in seconds since epoch (Jan 1, 1970) | |
* 2-byte timezone offset in minutes (signed) | |
* varint string of message | |
`tz_offset` is the signed number of minutes east of GMT for the
committer's timezone at the time of the update. For example `GMT-0800`
is encoded in reftable as `sint16(-480)` and `GMT+0230` is `sint16(150)`.
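The two examples above can be reproduced with a short sketch. It assumes `sint16` is a two's complement value in network byte order, consistent with the byte order rule stated earlier; the helper name `encode_tz_offset` and its `+HHMM` string input are illustrative only.

```python
import struct


def encode_tz_offset(tz):
    """Convert a '+HHMM' or '-HHMM' string to the 2-byte tz_offset field."""
    sign = -1 if tz[0] == "-" else 1
    hours, minutes = int(tz[1:3]), int(tz[3:5])
    # ">h" is a big-endian signed 16-bit integer.
    return struct.pack(">h", sign * (hours * 60 + minutes))


def decode_tz_offset(raw):
    """Return the signed offset in minutes from the 2-byte field."""
    return struct.unpack(">h", raw)[0]
```

For instance, `decode_tz_offset(encode_tz_offset("-0800"))` yields `-480`, matching the first example.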
The committer email does not contain `<` or `>`; it is the value normally
found between the `<>` in a git commit object header.
The `message_length` may be 0, in which case there was no message | |
supplied for the update. | |
Contrary to traditional reflog (which is a file), renames are encoded as | |
a combination of ref deletion and ref creation. A deletion is a log | |
record with a zero new_id, and a creation is a log record with a zero old_id. | |
Reading the log | |
+++++++++++++++ | |
Readers accessing the log must first read the footer (below) to | |
determine the `log_position`. The first block of the log begins at | |
`log_position` bytes since the start of the file. The `log_position` is | |
not block aligned. | |
Importing logs | |
++++++++++++++ | |
When importing from `$GIT_DIR/logs` writers should globally order all | |
log records roughly by timestamp while preserving file order, and assign | |
unique, increasing `update_index` values for each log line. Newer log | |
records get higher `update_index` values. | |
Although an import may write only a single reftable file, the reftable | |
file must span many unique `update_index`, as each log line requires its | |
own `update_index` to preserve semantics. | |
Log index | |
^^^^^^^^^ | |
The log index stores the log key | |
(`refname \0 reverse_int64(update_index)`) for the last log record of | |
every log block in the file, supporting bounded-time lookup. | |
A log index block must be written if 2 or more log blocks are written to | |
the file. If present, the log index appears after the last log block. | |
There is no padding used to align the log index to block alignment. | |
Log index format is identical to ref index, except the keys are 9 bytes | |
longer to include `'\0'` and the 8-byte `reverse_int64(update_index)`. | |
Records use `block_position` to refer to the start of a log block. | |
Reading the index | |
+++++++++++++++++ | |
Readers loading the log index must first read the footer (below) to | |
obtain `log_index_position`. If not present, the position will be 0. | |
Footer | |
^^^^^^ | |
After the last block of the file, a file footer is written. It begins | |
like the file header, but is extended with additional data. | |
....
HEADER
uint64( ref_index_position )
uint64( (obj_position << 5) | obj_id_len )
uint64( obj_index_position )
uint64( log_position )
uint64( log_index_position )
uint32( CRC-32 of above )
....
If a section is missing (e.g. ref index) the corresponding position | |
field (e.g. `ref_index_position`) will be 0. | |
* `obj_position`: byte position for the first obj block. | |
* `obj_id_len`: number of bytes used to abbreviate object names in | |
obj blocks. | |
* `log_position`: byte position for the first log block. | |
* `ref_index_position`: byte position for the start of the ref index. | |
* `obj_index_position`: byte position for the start of the obj index. | |
* `log_index_position`: byte position for the start of the log index. | |
The size of the footer is 68 bytes for version 1, and 72 bytes for | |
version 2. | |
Reading the footer | |
++++++++++++++++++ | |
Readers must first read the file start to determine the version | |
number. Then they seek to `file_length - FOOTER_LENGTH` to access the | |
footer. A trusted external source (such as `stat(2)`) is necessary to | |
obtain `file_length`. When reading the footer, readers must verify: | |
* 4-byte magic is correct | |
* 1-byte version number is recognized | |
* 4-byte CRC-32 matches the preceding bytes of the footer (64 bytes in
version 1, 68 bytes in version 2), including magic and version
Once verified, the other fields of the footer can be accessed. | |
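The verification steps can be sketched as follows. This example assumes the CRC-32 is the common variant computed by zlib's `crc32`, and that the whole file is available in memory as `data`; the function name `parse_footer` and the dictionary keys are illustrative.

```python
import struct
import zlib

HEADER_LEN = {1: 24, 2: 28}


def parse_footer(data):
    """Locate, verify and decode the footer of an in-memory reftable file."""
    version = data[4]                        # read the file start first
    if data[:4] != b"REFT" or version not in HEADER_LEN:
        raise ValueError("not a recognized reftable file")
    header_len = HEADER_LEN[version]
    footer_len = header_len + 5 * 8 + 4      # 68 for v1, 72 for v2
    footer = data[len(data) - footer_len:]
    if footer[:4] != b"REFT" or footer[4] != version:
        raise ValueError("footer magic or version mismatch")
    (crc,) = struct.unpack(">I", footer[-4:])
    if zlib.crc32(footer[:-4]) != crc:
        raise ValueError("footer CRC-32 mismatch")
    ref_index, obj_packed, obj_index, log, log_index = struct.unpack(
        ">5Q", footer[header_len:header_len + 40]
    )
    return {
        "ref_index_position": ref_index,
        "obj_position": obj_packed >> 5,     # upper bits: position
        "obj_id_len": obj_packed & 0x1f,     # lower 5 bits: abbreviation length
        "obj_index_position": obj_index,
        "log_position": log,
        "log_index_position": log_index,
    }
```

A position of 0 in the returned dictionary means the corresponding section is absent.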
Empty tables | |
++++++++++++ | |
A reftable may be empty. In this case, the file starts with a header | |
and is immediately followed by a footer. | |
Binary search | |
^^^^^^^^^^^^^ | |
Binary search within a block is supported by the `restart_offset` fields | |
at the end of the block. Readers can binary search through the restart | |
table to locate between which two restart points the sought reference or | |
key should appear. | |
Each record identified by a `restart_offset` stores the complete key in | |
the `suffix` field of the record, making the compare operation during | |
binary search straightforward. | |
Once a restart point lexicographically before the sought reference has | |
been identified, readers can linearly scan through the following record | |
entries to locate the sought record, terminating if the current record | |
sorts after (and therefore the sought key is not present). | |
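The two-phase search can be illustrated on a simplified in-memory model. In this sketch `records` is a sorted list of already decoded keys and `restarts` holds the indexes of the records at restart points; a real reader would work with byte offsets and decode records as it scans. The function name `lookup` is hypothetical.

```python
def lookup(records, restarts, wanted):
    """Return the index of wanted in records, or None if it is absent."""
    # Phase 1: binary search for the last restart point whose key is <= wanted.
    lo, hi = 0, len(restarts)
    while lo < hi:
        mid = (lo + hi) // 2
        if records[restarts[mid]] <= wanted:
            lo = mid + 1
        else:
            hi = mid
    slot = lo - 1
    if slot < 0:
        return None                  # wanted sorts before the first record
    # Phase 2: linear scan from that restart point.
    for i in range(restarts[slot], len(records)):
        if records[i] == wanted:
            return i
        if records[i] > wanted:
            return None              # passed the spot where it would be
    return None


records = [b"refs/heads/a", b"refs/heads/b", b"refs/heads/c",
           b"refs/tags/v1", b"refs/tags/v2"]
restarts = [0, 2, 4]
found = lookup(records, restarts, b"refs/tags/v1")
```

Only the keys at restart points are compared during the binary search, which is why those records must store their complete key without prefix compression.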
Restart point selection | |
+++++++++++++++++++++++ | |
Writers determine the restart points at file creation. The process is | |
arbitrary, but every 16 or 64 records is recommended. Every 16 may be | |
more suitable for smaller block sizes (4k or 8k), every 64 for larger | |
block sizes (64k). | |
More frequent restart points reduce prefix compression and increase
space consumed by the restart table, both of which increase file size.
Less frequent restart points make prefix compression more effective,
decreasing overall file size, with increased penalties for readers
walking through more records after the binary search step.
A maximum of `65535` restart points per block is supported. | |
Considerations | |
~~~~~~~~~~~~~~ | |
Lightweight refs dominate | |
^^^^^^^^^^^^^^^^^^^^^^^^^ | |
The reftable format assumes the vast majority of references are single | |
object names valued with common prefixes, such as Gerrit Code Review's
`refs/changes/` namespace, GitHub's `refs/pull/` namespace, or many
lightweight tags in the `refs/tags/` namespace.
Annotated tags storing the peeled object cost an additional object name per | |
reference. | |
Low overhead | |
^^^^^^^^^^^^ | |
A reftable with very few references (e.g. git.git with 5 heads) is 269 | |
bytes for reftable, vs. 332 bytes for packed-refs. This supports | |
reftable scaling down for transaction logs (below). | |
Block size | |
^^^^^^^^^^ | |
For a Gerrit Code Review type repository with many change refs, larger | |
block sizes (64 KiB) and less frequent restart points (every 64) yield | |
better compression due to more references within the block compressing | |
against the prior reference. | |
Larger block sizes reduce the index size, as the reftable will require | |
fewer blocks to store the same number of references. | |
Minimal disk seeks | |
^^^^^^^^^^^^^^^^^^ | |
Assuming the index block has been loaded into memory, binary searching | |
for any single reference requires exactly 1 disk seek to load the | |
containing block. | |
Scans and lookups dominate | |
^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
Scanning all references and lookup by name (or namespace such as | |
`refs/heads/`) are the most common activities performed on repositories. | |
Object names are stored directly with references to optimize this use case. | |
Logs are infrequently read | |
^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
Logs are infrequently accessed, but can be large. Deflating log blocks | |
saves disk space, with some increased penalty at read time. | |
Logs are stored in an isolated section from refs, reducing the burden on | |
reference readers that want to ignore logs. Further, historical logs can | |
be isolated into log-only files. | |
Logs are read backwards | |
^^^^^^^^^^^^^^^^^^^^^^^ | |
Logs are frequently accessed backwards (most recent N records for master | |
to answer `master@{4}`), so log records are grouped by reference, and | |
sorted descending by update index. | |
Repository format | |
~~~~~~~~~~~~~~~~~ | |
Version 1 | |
^^^^^^^^^ | |
A repository must set its `$GIT_DIR/config` to configure reftable: | |
....
[core]
    repositoryformatversion = 1
[extensions]
    refStorage = reftable
....
Layout | |
^^^^^^ | |
A collection of reftable files is stored in the `$GIT_DIR/reftable/` directory.
Their names should have a random element, such that each filename is globally
unique; this helps avoid spurious failures on Windows, where open files cannot
be removed or overwritten. It is suggested to use
`${min_update_index}-${max_update_index}-${random}.ref` as a naming convention.
Log-only files use the `.log` extension, while ref-only and mixed ref
and log files use the `.ref` extension.
The stack ordering file is `$GIT_DIR/reftable/tables.list` and lists the | |
current files, one per line, in order, from oldest (base) to newest | |
(most recent): | |
....
$ cat .git/reftable/tables.list
00000001-00000001-RANDOM1.log
00000002-00000002-RANDOM2.ref
00000003-00000003-RANDOM3.ref
....
Readers must read `$GIT_DIR/reftable/tables.list` to determine which | |
files are relevant right now, and search through the stack in reverse | |
order (last reftable is examined first). | |
Reftable files not listed in `tables.list` may be new (and about to be | |
added to the stack by the active writer), or ancient and ready to be | |
pruned. | |
Backward compatibility | |
^^^^^^^^^^^^^^^^^^^^^^ | |
Older clients should continue to recognize the directory as a git | |
repository so they don't look for an enclosing repository in parent | |
directories. To this end, a reftable-enabled repository must contain the | |
following dummy files:
* `.git/HEAD`, a regular file containing `ref: refs/heads/.invalid`. | |
* `.git/refs/`, a directory | |
* `.git/refs/heads`, a regular file | |
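A minimal Python sketch of creating these placeholder entries follows. The function name `write_compat_stubs` is hypothetical, and the trailing newline in `HEAD` and the empty content of `refs/heads` are assumptions, since the list above does not specify them.

```python
import os

def write_compat_stubs(git_dir: str) -> None:
    """Create the dummy entries that let older clients recognize the repository."""
    # HEAD points at a ref name that cannot exist, so older clients
    # see a repository but cannot resolve a branch through it.
    with open(os.path.join(git_dir, "HEAD"), "w") as head:
        head.write("ref: refs/heads/.invalid\n")

    # refs/ must be a directory.
    os.makedirs(os.path.join(git_dir, "refs"), exist_ok=True)

    # refs/heads must be a regular file (not a directory).
    # Assumption: its content is left empty here.
    with open(os.path.join(git_dir, "refs", "heads"), "w"):
        pass
```

Because `refs/heads` is a regular file, an older client cannot create loose branch files beneath it.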
Readers | |
^^^^^^^ | |
Readers can obtain a consistent snapshot of the reference space by
following these steps:
1. Open and read the `tables.list` file. | |
2. Open each of the reftable files that it mentions. | |
3. If any of the files is missing, go back to step 1.
4. Read from the now-open files as long as necessary. | |
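The reader steps above can be sketched in Python as follows. The function name `open_snapshot` and the `max_attempts` bound are assumptions for this example; the steps themselves simply retry until every listed file can be opened.

```python
import os

def open_snapshot(reftable_dir: str, max_attempts: int = 10):
    """Return (names, open files) for a consistent view of the stack."""
    for _ in range(max_attempts):
        # Step 1: read the current stack, oldest table first.
        with open(os.path.join(reftable_dir, "tables.list")) as f:
            names = [line.strip() for line in f if line.strip()]

        # Step 2: open every table the list mentions.
        handles = []
        try:
            for name in names:
                handles.append(open(os.path.join(reftable_dir, name), "rb"))
        except FileNotFoundError:
            # Step 3: a table vanished (for example after a compaction),
            # so drop what was opened and reread tables.list.
            for handle in handles:
                handle.close()
            continue

        # Step 4: the caller reads from the open files as long as necessary.
        return names, handles
    raise RuntimeError("could not obtain a consistent snapshot")
```

On systems where an open file survives deletion, holding the handles keeps the snapshot readable even if a later compaction removes the tables.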
Update transactions | |
^^^^^^^^^^^^^^^^^^^ | |
Although reftables are immutable, mutations are supported by writing a | |
new reftable and atomically appending it to the stack: | |
1. Acquire `tables.list.lock`. | |
2. Read `tables.list` to determine current reftables. | |
3. Select `update_index` to be the most recent file's
`max_update_index + 1`.
4. Prepare temp reftable `tmp_XXXXXX`, including log entries. | |
5. Rename `tmp_XXXXXX` to `${update_index}-${update_index}-${random}.ref`. | |
6. Copy `tables.list` to `tables.list.lock`, appending file from (5). | |
7. Rename `tables.list.lock` to `tables.list`. | |
During step 4 the new file's `min_update_index` and `max_update_index` | |
are both set to the `update_index` selected by step 3. All log records | |
for the transaction use the same `update_index` in their keys. This | |
enables later correlation of which references were updated by the same | |
transaction. | |
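The seven steps can be sketched in Python as shown here. This is an illustration under stated assumptions, not how any particular Git implementation does it: the function name `append_table` is hypothetical, the table content is an opaque `payload` rather than a real reftable, and `max_update_index` is parsed from the newest file name (assuming the suggested naming convention with zero-padded decimal indexes) instead of being read from the table itself.

```python
import os
import secrets
import tempfile

def append_table(reftable_dir: str, payload: bytes) -> str:
    """Append one new table to the stack and return its file name."""
    list_path = os.path.join(reftable_dir, "tables.list")
    lock_path = list_path + ".lock"
    # Step 1: mode "x" fails with FileExistsError if the lock is already held.
    lock = open(lock_path, "x")
    try:
        with lock:
            # Step 2: read the current stack.
            with open(list_path) as f:
                names = [line.strip() for line in f if line.strip()]
            # Step 3: assumption, max_update_index is taken from the file name.
            update_index = int(names[-1].split("-")[1]) + 1 if names else 1
            # Step 4: prepare the temporary table.
            tmp_fd, tmp_path = tempfile.mkstemp(prefix="tmp_", dir=reftable_dir)
            with os.fdopen(tmp_fd, "wb") as tmp:
                tmp.write(payload)
            # Step 5: give the table its final, unique name.
            new_name = (f"{update_index:08d}-{update_index:08d}-"
                        f"{secrets.token_hex(4)}.ref")
            os.rename(tmp_path, os.path.join(reftable_dir, new_name))
            # Step 6: write the old stack plus the new table into the lock file.
            lock.write("".join(name + "\n" for name in names + [new_name]))
        # Step 7: atomically publish the new stack.
        os.replace(lock_path, list_path)
    except BaseException:
        # Abort: release the lock so other writers can proceed.
        if os.path.exists(lock_path):
            os.unlink(lock_path)
        raise
    return new_name
```

Readers never observe a partially written `tables.list`, because the new content only becomes visible through the final rename.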
Because a single `tables.list.lock` file is used to manage locking, the | |
repository is single-threaded for writers. Writers may have to busy-spin | |
(with backoff) around creating `tables.list.lock`, for up to an | |
acceptable wait period, aborting if the repository is too busy to | |
mutate. Application servers wrapped around repositories (e.g. Gerrit | |
Code Review) can layer their own lock/wait queue to improve fairness to | |
writers. | |
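The busy-spin with backoff might look like the following Python sketch. The function name `acquire_lock`, the 5 second default wait, the doubling delay capped at 250 ms, and the random jitter are all assumptions chosen for illustration; the text above only requires retrying for up to an acceptable wait period.

```python
import os
import random
import time

def acquire_lock(lock_path: str, timeout: float = 5.0) -> int:
    """Create lock_path exclusively, retrying with backoff until timeout."""
    deadline = time.monotonic() + timeout
    delay = 0.001  # initial backoff in seconds (assumed value)
    while True:
        try:
            # O_CREAT | O_EXCL fails if another writer already holds the lock.
            return os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            if time.monotonic() >= deadline:
                # The repository is too busy to mutate; abort.
                raise TimeoutError(f"{lock_path} is busy")
            # Jitter keeps competing writers from retrying in lockstep.
            time.sleep(delay * random.uniform(0.5, 1.5))
            delay = min(delay * 2, 0.25)
```

The caller owns the returned file descriptor and is responsible for either renaming the lock file into place or closing and removing it.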
Reference deletions | |
^^^^^^^^^^^^^^^^^^^ | |
Deletion of any reference can be explicitly stored by setting the `type` | |
to `0x0` and omitting the `value` field of the `ref_record`. This serves | |
as a tombstone, overriding any assertions about the existence of the | |
reference from earlier files in the stack. | |
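To illustrate how a tombstone overrides older tables, the sketch below models each reftable as a plain Python dictionary and represents a deletion record (`type` `0x0`) as `None`. Both choices, and the function name `lookup`, are assumptions made for brevity and say nothing about the on-disk encoding.

```python
from typing import Dict, List, Optional

# Each table maps ref name -> value; None stands in for a deletion record.
Table = Dict[str, Optional[str]]

def lookup(stack: List[Table], ref_name: str) -> Optional[str]:
    """Resolve ref_name against a stack ordered oldest to newest."""
    # The newest table is examined first; the first table that mentions
    # the reference decides the outcome.
    for table in reversed(stack):
        if ref_name in table:
            return table[ref_name]  # None here means "deleted"
    return None  # no table mentions the reference

stack = [
    {"refs/heads/main": "1111", "refs/heads/topic": "2222"},  # base
    {"refs/heads/topic": None},                               # deletes topic
    {"refs/heads/main": "3333"},                              # updates main
]
```

Here `lookup(stack, "refs/heads/topic")` stops at the tombstone in the middle table and never consults the base table that still holds the old value.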
Compaction | |
^^^^^^^^^^ | |
A partial stack of reftables can be compacted by merging references | |
using a straightforward merge join across reftables, selecting the most | |
recent value for output, and omitting deleted references that do not | |
appear in remaining, lower reftables. | |
A compacted reftable should set its `min_update_index` to the smallest | |
of the input files' `min_update_index`, and its `max_update_index` | |
likewise to the largest input `max_update_index`. | |
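The merge rules can be sketched in Python as below. Tables are modelled as dictionaries with `min`, `max`, and `refs` entries, and a deleted reference is represented by `None`; these shapes, the function name `compact`, and the `lower_ref_names` parameter are assumptions for illustration. A real compactor would stream a merge join over sorted tables rather than build dictionaries in memory.

```python
def compact(tables, lower_ref_names=frozenset()):
    """Merge tables (ordered oldest to newest) into one table.

    lower_ref_names: names of references that appear in the remaining,
    lower tables that are not part of this compaction.
    """
    merged = {}
    for table in tables:
        # Later (newer) tables overwrite values from earlier ones.
        merged.update(table["refs"])

    # A tombstone is only needed while a lower table still mentions the ref.
    kept = {
        name: value
        for name, value in merged.items()
        if value is not None or name in lower_ref_names
    }

    return {
        "min": min(t["min"] for t in tables),
        "max": max(t["max"] for t in tables),
        "refs": dict(sorted(kept.items())),
    }

b = {"min": 2, "max": 2, "refs": {"refs/heads/a": "1", "refs/heads/b": "2"}}
c = {"min": 3, "max": 4,
     "refs": {"refs/heads/a": None, "refs/heads/b": "3", "refs/heads/c": None}}
result = compact([b, c], lower_ref_names={"refs/heads/c"})
```

In this example the tombstone for `refs/heads/a` is dropped because no lower table mentions it, while the one for `refs/heads/c` is kept so that it continues to hide the older value.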
For the sake of illustration, assume the stack currently consists of
reftable files (from oldest to newest): A, B, C, and D. The compactor is | |
going to compact B and C, leaving A and D alone. | |
1. Obtain lock `tables.list.lock` and read the `tables.list` file. | |
2. Obtain locks `B.lock` and `C.lock`. Ownership of these locks | |
prevents other processes from trying to compact these files. | |
3. Release `tables.list.lock`. | |
4. Compact `B` and `C` into a temp file | |
`${min_update_index}-${max_update_index}_XXXXXX`. | |
5. Reacquire lock `tables.list.lock`. | |
6. Verify that `B` and `C` are still in the stack, in that order. This | |
should always be the case, assuming that other processes are adhering to | |
the locking protocol. | |
7. Rename `${min_update_index}-${max_update_index}_XXXXXX` to | |
`${min_update_index}-${max_update_index}-${random}.ref`. | |
8. Write the new stack to `tables.list.lock`, replacing `B` and `C`
with the file from (7).
9. Rename `tables.list.lock` to `tables.list`. | |
10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing | |
readers to backtrack. | |
This strategy permits compactions to proceed independently of updates. | |
Each reftable (compacted or not) is uniquely identified by its name, so | |
open reftables can be cached by their name. | |
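Steps 6 and 8 of the compaction procedure amount to finding the compacted tables as a contiguous run in the current stack and substituting the new table for them. The Python sketch below shows one way to do that; the function name `replace_run` and the use of plain lists of file names are assumptions for this example.

```python
def replace_run(stack, compacted, new_name):
    """Return a new stack with the contiguous run `compacted` replaced."""
    n = len(compacted)
    for i in range(len(stack) - n + 1):
        # Step 6: the compacted tables must still be present, in order.
        if stack[i:i + n] == compacted:
            # Step 8: everything before and after the run is kept as is,
            # including tables appended by writers during the compaction.
            return stack[:i] + [new_name] + stack[i + n:]
    raise RuntimeError("compacted tables are no longer in the stack in order")

# "E" was appended by a concurrent update while B and C were being compacted.
new_stack = replace_run(["A", "B", "C", "D", "E"], ["B", "C"], "BC")
```

Because the stack is reread under the lock before the substitution, an update that landed during the compaction is preserved.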
Windows | |
^^^^^^^ | |
On Windows, and other systems that do not allow deleting open files or
renaming over them, compaction may succeed, but other readers may prevent
obsolete tables from being deleted.
On these platforms, the following strategy can be followed: on closing a | |
reftable stack, reload `tables.list`, and delete any tables no longer mentioned | |
in `tables.list`. | |
Irregular program exit may still leave unused files behind. In this case, a
cleanup operation should proceed as follows:
* take a lock `tables.list.lock` to prevent concurrent modifications | |
* refresh the reftable stack, by reading `tables.list` | |
* for each `*.ref` file, remove it if | |
** it is not mentioned in `tables.list`, and | |
** its max update_index is not beyond the max update_index of the stack | |
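The cleanup procedure can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that each table's max update index can be parsed from a file name following the suggested `${min_update_index}-${max_update_index}-${random}.ref` convention; an implementation could read it from the table itself instead.

```python
import os

def max_index_from_name(name: str):
    """Parse max_update_index from the file name, or return None."""
    parts = name.split("-")
    try:
        return int(parts[1])
    except (IndexError, ValueError):
        return None

def cleanup_unreferenced(reftable_dir: str) -> list:
    """Remove stale *.ref files and return the names that were removed."""
    list_path = os.path.join(reftable_dir, "tables.list")
    lock_path = list_path + ".lock"
    removed = []
    # Take the lock; mode "x" fails if another process already holds it.
    with open(lock_path, "x"):
        pass
    try:
        # Refresh the stack.
        with open(list_path) as f:
            live = [line.strip() for line in f if line.strip()]
        indexes = [max_index_from_name(n) for n in live]
        stack_max = max((i for i in indexes if i is not None), default=0)

        for name in sorted(os.listdir(reftable_dir)):
            if not name.endswith(".ref") or name in live:
                continue
            index = max_index_from_name(name)
            # Files beyond the stack's max update index are left alone,
            # as they may be new tables that have not been listed yet.
            if index is not None and index <= stack_max:
                os.unlink(os.path.join(reftable_dir, name))
                removed.append(name)
    finally:
        os.unlink(lock_path)
    return removed
```

Files whose names cannot be parsed are skipped rather than deleted, which errs on the side of leaving data in place.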
Alternatives considered | |
~~~~~~~~~~~~~~~~~~~~~~~ | |
bzip packed-refs | |
^^^^^^^^^^^^^^^^ | |
`bzip2` can significantly shrink a large packed-refs file (e.g. 62 MiB | |
compresses to 23 MiB, 37%). However the bzip format does not support | |
random access to a single reference. Readers must inflate and discard | |
while performing a linear scan. | |
Breaking packed-refs into chunks (individually compressing each chunk) | |
would reduce the amount of data a reader must inflate, but still leaves | |
the problem of indexing chunks to support readers efficiently locating | |
the correct chunk. | |
Given the compression achieved by reftable's encoding, it does not seem | |
necessary to add the complexity of bzip/gzip/zlib. | |
Michael Haggerty's alternate format | |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
Michael Haggerty proposed | |
link:https://lore.kernel.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe%2BwGvEJ7A%40mail.gmail.com/[an | |
alternate] format to reftable on the Git mailing list. This format uses | |
smaller chunks, without the restart table, and avoids block alignment | |
with padding. Reflog entries immediately follow each ref, and are thus | |
interleaved between refs. | |
Performance testing indicates reftable is faster for lookups (51% | |
faster, 11.2 usec vs. 5.4 usec), although reftable produces a slightly | |
larger file (+ ~3.2%, 28.3M vs 29.2M): | |
[cols=">,>,>,>",options="header",] | |
|===================================== | |
|format |size |seek cold |seek hot | |
|mh-alt |28.3 M |23.4 usec |11.2 usec | |
|reftable |29.2 M |19.9 usec |5.4 usec | |
|===================================== | |
JGit Ketch RefTree | |
^^^^^^^^^^^^^^^^^^ | |
https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html[JGit Ketch] | |
proposed | |
link:https://lore.kernel.org/git/CAJo%3DhJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg%40mail.gmail.com/[RefTree], | |
an encoding of references inside Git tree objects stored as part of the | |
repository's object database. | |
The RefTree format adds additional load on the object database storage | |
layer (more loose objects, more objects in packs), and relies heavily on | |
the packer's delta compression to save space. Namespaces which are flat | |
(e.g. thousands of tags in refs/tags) initially create very large loose | |
objects, and so RefTree does not address the problem of copying many | |
references to modify a handful. | |
Flat namespaces are not efficiently searchable in RefTree, as tree | |
objects in canonical formatting cannot be binary searched. This fails | |
the need to handle a large number of references in a single namespace, | |
such as GitHub's `refs/pulls`, or a project with many tags. | |
LMDB | |
^^^^ | |
David Turner proposed | |
https://lore.kernel.org/git/[email protected]/[using | |
LMDB], as LMDB is lightweight (64k of runtime code) and has a GPL-compatible
license.
A downside of LMDB is its reliance on a single C implementation. This | |
makes embedding inside JGit (a popular reimplementation of Git) | |
difficult, and hoisting onto virtual storage (for JGit DFS) virtually | |
impossible. | |
A common format that can be supported by all major Git implementations | |
(git-core, JGit, libgit2) is strongly preferred. | |