Spaces:
Running
Running
GIT bitmap v1 format | |
==================== | |
== Pack and multi-pack bitmaps | |
Bitmaps store reachability information about the set of objects in a packfile, | |
or a multi-pack index (MIDX). The former is defined obviously, and the latter is | |
defined as the union of objects in packs contained in the MIDX. | |
A bitmap may belong to either one pack, or the repository's multi-pack index (if | |
it exists). A repository may have at most one bitmap. | |
An object is uniquely described by its bit position within a bitmap: | |
- If the bitmap belongs to a packfile, the __n__th bit corresponds to | |
the __n__th object in pack order. For a function `offset` which maps | |
objects to their byte offset within a pack, pack order is defined as | |
follows: | |
o1 <= o2 <==> offset(o1) <= offset(o2) | |
- If the bitmap belongs to a MIDX, the __n__th bit corresponds to the | |
__n__th object in MIDX order. With an additional function `pack` which | |
maps objects to the pack they were selected from by the MIDX, MIDX order | |
is defined as follows: | |
o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2) | |
+ | |
The ordering between packs is done according to the MIDX's .rev file. | |
Notably, the preferred pack sorts ahead of all other packs. | |
The on-disk representation (described below) of a bitmap is the same regardless | |
of whether or not that bitmap belongs to a packfile or a MIDX. The only | |
difference is the interpretation of the bits, which is described above. | |
Certain bitmap extensions are supported (see: Appendix B). No extensions are | |
required for bitmaps corresponding to packfiles. For bitmaps that correspond to | |
MIDXs, both the bit-cache and rev-cache extensions are required. | |
== On-disk format | |
* A header appears at the beginning: | |
4-byte signature: :: {'B', 'I', 'T', 'M'} | |
2-byte version number (network byte order): :: | |
The current implementation only supports version 1 | |
of the bitmap index (the same one as JGit). | |
2-byte flags (network byte order): :: | |
The following flags are supported: | |
** {empty} | |
BITMAP_OPT_FULL_DAG (0x1) REQUIRED: ::: | |
This flag must always be present. It implies that the | |
bitmap index has been generated for a packfile or | |
multi-pack index (MIDX) with full closure (i.e. where | |
every single object in the packfile/MIDX can find its | |
parent links inside the same packfile/MIDX). This is a | |
requirement for the bitmap index format, also present in | |
JGit, that greatly reduces the complexity of the | |
implementation. | |
** {empty} | |
BITMAP_OPT_HASH_CACHE (0x4): ::: | |
If present, the end of the bitmap file contains | |
`N` 32-bit name-hash values, one per object in the | |
pack/MIDX. The format and meaning of the name-hash is | |
described below. | |
** {empty} | |
BITMAP_OPT_LOOKUP_TABLE (0x10): ::: | |
If present, the end of the bitmap file contains a table | |
containing a list of `N` <commit_pos, offset, xor_row> | |
triplets. The format and meaning of the table is described | |
below. | |
+ | |
NOTE: Unlike the xor_offset used to compress an individual bitmap, | |
`xor_row` stores an *absolute* index into the lookup table, not a location | |
relative to the current entry. | |
4-byte entry count (network byte order): :: | |
The total count of entries (bitmapped commits) in this bitmap index. | |
20-byte checksum: :: | |
The SHA1 checksum of the pack/MIDX this bitmap index | |
belongs to. | |
* 4 EWAH bitmaps that act as type indexes | |
+ | |
Type indexes are serialized after the hash cache in the shape | |
of four EWAH bitmaps stored consecutively (see Appendix A for | |
the serialization format of an EWAH bitmap). | |
+ | |
There is a bitmap for each Git object type, stored in the following | |
order: | |
+ | |
- Commits | |
- Trees | |
- Blobs | |
- Tags | |
+ | |
In each bitmap, the `n`th bit is set to true if the `n`th object | |
in the packfile or multi-pack index is of that type. | |
+ | |
The obvious consequence is that the OR of all 4 bitmaps will result | |
in a full set (all bits set), and the AND of all 4 bitmaps will | |
result in an empty bitmap (no bits set). | |
* N entries with compressed bitmaps, one for each indexed commit | |
+ | |
Where `N` is the total amount of entries in this bitmap index. | |
Each entry contains the following: | |
** {empty} | |
4-byte object position (network byte order): :: | |
The position **in the index for the packfile or | |
multi-pack index** where the bitmap for this commit is | |
found. | |
** {empty} | |
1-byte XOR-offset: :: | |
The xor offset used to compress this bitmap. For an entry | |
in position `x`, a XOR offset of `y` means that the actual | |
bitmap representing this commit is composed by XORing the | |
bitmap for this entry with the bitmap in entry `x-y` (i.e. | |
the bitmap `y` entries before this one). | |
+ | |
NOTE: This compression can be recursive. In order to | |
XOR this entry with a previous one, the previous entry needs | |
to be decompressed first, and so on. | |
+ | |
The hard-limit for this offset is 160 (an entry can only be | |
xor'ed against one of the 160 entries preceding it). This | |
number is always positive, and hence entries are always xor'ed | |
with **previous** bitmaps, not bitmaps that will come afterwards | |
in the index. | |
** {empty} | |
1-byte flags for this bitmap: :: | |
At the moment the only available flag is `0x1`, which hints | |
that this bitmap can be re-used when rebuilding bitmap indexes | |
for the repository. | |
** The compressed bitmap itself, see Appendix A. | |
* {empty} | |
TRAILER: :: | |
Trailing checksum of the preceding contents. | |
== Appendix A: Serialization format for an EWAH bitmap | |
Ewah bitmaps are serialized in the same protocol as the JAVAEWAH | |
library, making them backwards compatible with the JGit | |
implementation: | |
- 4-byte number of bits of the resulting UNCOMPRESSED bitmap | |
- 4-byte number of words of the COMPRESSED bitmap, when stored | |
- N x 8-byte words, as specified by the previous field | |
+ | |
This is the actual content of the compressed bitmap. | |
- 4-byte position of the current RLW for the compressed | |
bitmap | |
All words are stored in network byte order for their corresponding | |
sizes. | |
The compressed bitmap is stored in a form of run-length encoding, as | |
follows. It consists of a concatenation of an arbitrary number of | |
chunks. Each chunk consists of one or more 64-bit words | |
H L_1 L_2 L_3 .... L_M | |
H is called RLW (run length word). It consists of (from lower to higher | |
order bits): | |
- 1 bit: the repeated bit B | |
- 32 bits: repetition count K (unsigned) | |
- 31 bits: literal word count M (unsigned) | |
The bitstream represented by the above chunk is then: | |
- K repetitions of B | |
- The bits stored in `L_1` through `L_M`. Within a word, bits at | |
lower order come earlier in the stream than those at higher | |
order. | |
The next word after `L_M` (if any) must again be a RLW, for the next | |
chunk. For efficient appending to the bitstream, the EWAH stores a | |
pointer to the last RLW in the stream. | |
== Appendix B: Optional Bitmap Sections | |
These sections may or may not be present in the `.bitmap` file; their | |
presence is indicated by the header flags section described above. | |
Name-hash cache | |
--------------- | |
If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains | |
a cache of 32-bit values, one per object in the pack/MIDX. The value at | |
position `i` is the hash of the pathname at which the `i`th object | |
(counting in index or multi-pack index order) in the pack/MIDX can be found. | |
This can be fed into the delta heuristics to compare objects with similar | |
pathnames. | |
The hash algorithm used is: | |
hash = 0; | |
while ((c = *name++)) | |
if (!isspace(c)) | |
hash = (hash >> 2) + (c << 24); | |
Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. | |
If implementations want to choose a different hashing scheme, they are | |
free to do so, but MUST allocate a new header flag (because comparing | |
hashes made under two different schemes would be pointless). | |
Commit lookup table | |
------------------- | |
If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)` | |
bytes (preceding the name-hash cache and trailing hash) of the `.bitmap` | |
file contains a lookup table specifying the information needed to get | |
the desired bitmap from the entries without parsing previous unnecessary | |
bitmaps. | |
For a `.bitmap` containing `nr_entries` reachability bitmaps, the table | |
contains a list of `nr_entries` <commit_pos, offset, xor_row> triplets | |
(sorted in the ascending order of `commit_pos`). The content of i'th | |
triplet is - | |
* {empty} | |
commit_pos (4 byte integer, network byte order): :: | |
It stores the object position of a commit (in the midx or pack | |
index). | |
* {empty} | |
offset (8 byte integer, network byte order): :: | |
The offset from which that commit's bitmap can be read. | |
* {empty} | |
xor_row (4 byte integer, network byte order): :: | |
The position of the triplet whose bitmap is used to compress | |
this one, or `0xffffffff` if no such bitmap exists. | |