Spaces:
Running
Running
Git Commit-Graph Design Notes | |
============================= | |
Git walks the commit graph for many reasons, including: | |
1. Listing and filtering commit history. | |
2. Computing merge bases. | |
These operations can become slow as the commit count grows. The merge | |
base calculation shows up in many user-facing commands, such as 'merge-base' | |
or 'status' and can take minutes to compute depending on history shape. | |
There are two main costs here: | |
1. Decompressing and parsing commits. | |
2. Walking the entire graph to satisfy topological order constraints. | |
The commit-graph file is a supplemental data structure that accelerates | |
commit graph walks. If a user downgrades or disables the 'core.commitGraph' | |
config setting, then the existing object database is sufficient. The file is stored | |
as "commit-graph" either in the .git/objects/info directory or in the info | |
directory of an alternate. | |
The commit-graph file stores the commit graph structure along with some | |
extra metadata to speed up graph walks. By listing commit OIDs in | |
lexicographic order, we can identify an integer position for each commit | |
and refer to the parents of a commit using those integer positions. We | |
use binary search to find initial commits and then use the integer | |
positions for fast lookups during the walk. | |
A consumer may load the following info for a commit from the graph: | |
1. The commit OID. | |
2. The list of parents, along with their integer position. | |
3. The commit date. | |
4. The root tree OID. | |
5. The generation number (see definition below). | |
Values 1-4 satisfy the requirements of parse_commit_gently(). | |
There are two definitions of generation number: | |
1. Corrected committer dates (generation number v2) | |
2. Topological levels (generation number v1) | |
Define "corrected committer date" of a commit recursively as follows: | |
* A commit with no parents (a root commit) has corrected committer date | |
equal to its committer date. | |
* A commit with at least one parent has corrected committer date equal to | |
the maximum of its committer date and one more than the largest corrected | |
committer date among its parents. | |
* As a special case, a root commit with timestamp zero has corrected commit | |
date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO | |
(that is, an uncomputed corrected commit date). | |
Define the "topological level" of a commit recursively as follows: | |
* A commit with no parents (a root commit) has topological level of one. | |
* A commit with at least one parent has topological level one more than | |
the largest topological level among its parents. | |
Equivalently, the topological level of a commit A is one more than the | |
length of a longest path from A to a root commit. The recursive definition | |
is easier to use for computation and observing the following property: | |
If A and B are commits with generation numbers N and M, respectively, | |
and N <= M, then A cannot reach B. That is, we know without searching | |
that B is not an ancestor of A because it is further from a root commit | |
than A. | |
Conversely, when checking if A is an ancestor of B, then we only need | |
to walk commits until all commits on the walk boundary have generation | |
number at most N. If we walk commits using a priority queue seeded by | |
generation numbers, then we always expand the boundary commit with highest | |
generation number and can easily detect the stopping condition. | |
The property applies to both versions of generation number, that is both | |
corrected committer dates and topological levels. | |
This property can be used to significantly reduce the time it takes to | |
walk commits and determine topological relationships. Without generation | |
numbers, the general heuristic is the following: | |
If A and B are commits with commit time X and Y, respectively, and | |
X < Y, then A _probably_ cannot reach B. | |
In absence of corrected commit dates (for example, old versions of Git or | |
mixed generation graph chains), | |
this heuristic is currently used whenever the computation is allowed to | |
violate topological relationships due to clock skew (such as "git log" | |
with default order), but is not used when the topological order is | |
required (such as merge base calculations, "git log --graph"). | |
In practice, we expect some commits to be created recently and not stored | |
in the commit-graph. We can treat these commits as having "infinite" | |
generation number and walk until reaching commits with known generation | |
number. | |
We use the macro GENERATION_NUMBER_INFINITY to mark commits not | |
in the commit-graph file. If a commit-graph file was written by a version | |
of Git that did not compute generation numbers, then those commits will | |
have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. | |
Since the commit-graph file is closed under reachability, we can guarantee | |
the following weaker condition on all commits: | |
If A and B are commits with generation numbers N and M, respectively, | |
and N < M, then A cannot reach B. | |
Note how the strict inequality differs from the inequality when we have | |
fully-computed generation numbers. Using strict inequality may result in | |
walking a few extra commits, but the simplicity in dealing with commits | |
with generation number *_INFINITY or *_ZERO is valuable. | |
We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose | |
topological levels (generation number v1) are computed to be at least | |
this value. We limit at this value since it is the largest value that | |
can be stored in the commit-graph file using the 30 bits available | |
to topological levels. This presents another case where a commit can | |
have generation number equal to that of a parent. | |
Design Details | |
-------------- | |
- The commit-graph file is stored in a file named 'commit-graph' in the | |
.git/objects/info directory. This could be stored in the info directory | |
of an alternate. | |
- The core.commitGraph config setting must be on to consume graph files. | |
- The file format includes parameters for the object ID hash function, | |
so a future change of hash algorithm does not require a change in format. | |
- Commit grafts and replace objects can change the shape of the commit | |
history. The latter can also be enabled/disabled on the fly using | |
`--no-replace-objects`. This leads to difficultly storing both possible | |
interpretations of a commit id, especially when computing generation | |
numbers. The commit-graph will not be read or written when | |
replace-objects or grafts are present. | |
- Shallow clones create grafts of commits by dropping their parents. This | |
leads the commit-graph to think those commits have generation number 1. | |
If and when those commits are made unshallow, those generation numbers | |
become invalid. Since shallow clones are intended to restrict the commit | |
history to a very small set of commits, the commit-graph feature is less | |
helpful for these clones, anyway. The commit-graph will not be read or | |
written when shallow commits are present. | |
Commit-Graphs Chains | |
-------------------- | |
Typically, repos grow with near-constant velocity (commits per day). Over time, | |
the number of commits added by a fetch operation is much smaller than the | |
number of commits in the full history. By creating a "chain" of commit-graphs, | |
we enable fast writes of new commit data without rewriting the entire commit | |
history -- at least, most of the time. | |
## File Layout | |
A commit-graph chain uses multiple files, and we use a fixed naming convention | |
to organize these files. Each commit-graph file has a name | |
`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex- | |
valued hash stored in the footer of that file (which is a hash of the file's | |
contents before that hash). For a chain of commit-graph files, a plain-text | |
file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the | |
hashes for the files in order from "lowest" to "highest". | |
For example, if the `commit-graph-chain` file contains the lines | |
``` | |
{hash0} | |
{hash1} | |
{hash2} | |
``` | |
then the commit-graph chain looks like the following diagram: | |
+-----------------------+ | |
| graph-{hash2}.graph | | |
+-----------------------+ | |
| | |
+-----------------------+ | |
| | | |
| graph-{hash1}.graph | | |
| | | |
+-----------------------+ | |
| | |
+-----------------------+ | |
| | | |
| | | |
| | | |
| graph-{hash0}.graph | | |
| | | |
| | | |
| | | |
+-----------------------+ | |
Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of | |
commits in `graph-{hash1}.graph`, and X2 be the number of commits in | |
`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`, | |
then we interpret this as being the commit in position (X0 + X1 + i), and that | |
will be used as its "graph position". The commits in `graph-{hash2}.graph` use these | |
positions to refer to their parents, which may be in `graph-{hash1}.graph` or | |
`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking | |
its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + | |
X2). | |
Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data | |
specifying the hashes of all files in the lower layers. In the above example, | |
`graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains | |
`{hash0}` and `{hash1}`. | |
## Merging commit-graph files | |
If we only added a new commit-graph file on every write, we would run into a | |
linear search problem through many commit-graph files. Instead, we use a merge | |
strategy to decide when the stack should collapse some number of levels. | |
The diagram below shows such a collapse. As a set of new commits are added, it | |
is determined by the merge strategy that the files should collapse to | |
`graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and | |
the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}` | |
file. | |
+---------------------+ | |
| | | |
| (new commits) | | |
| | | |
+---------------------+ | |
| | | |
+-----------------------+ +---------------------+ | |
| graph-{hash2} |->| | | |
+-----------------------+ +---------------------+ | |
| | | | |
+-----------------------+ +---------------------+ | |
| | | | | |
| graph-{hash1} |->| | | |
| | | | | |
+-----------------------+ +---------------------+ | |
| tmp_graphXXX | |
+-----------------------+ | |
| | | |
| | | |
| | | |
| graph-{hash0} | | |
| | | |
| | | |
| | | |
+-----------------------+ | |
During this process, the commits to write are combined, sorted and we write the | |
contents to a temporary file, all while holding a `commit-graph-chain.lock` | |
lock-file. When the file is flushed, we rename it to `graph-{hash3}` | |
according to the computed `{hash3}`. Finally, we write the new chain data to | |
`commit-graph-chain.lock`: | |
``` | |
{hash3} | |
{hash0} | |
``` | |
We then close the lock-file. | |
## Merge Strategy | |
When writing a set of commits that do not exist in the commit-graph stack of | |
height N, we default to creating a new file at level N + 1. We then decide to | |
merge with the Nth level if one of two conditions hold: | |
1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in | |
level N is less than X times the number of commits in level N + 1. | |
2. `--max-commits=<C>` is specified with non-zero C and the number of commits | |
in level N + 1 is more than C commits. | |
This decision cascades down the levels: when we merge a level we create a new | |
set of commits that then compares to the next level. | |
The first condition bounds the number of levels to be logarithmic in the total | |
number of commits. The second condition bounds the total number of commits in | |
a `graph-{hashN}` file and not in the `commit-graph` file, preventing | |
significant performance issues when the stack merges and another process only | |
partially reads the previous stack. | |
The merge strategy values (2 for the size multiple, 64,000 for the maximum | |
number of commits) could be extracted into config settings for full | |
flexibility. | |
## Handling Mixed Generation Number Chains | |
With the introduction of generation number v2 and generation data chunk, the | |
following scenario is possible: | |
1. "New" Git writes a commit-graph with the corrected commit dates. | |
2. "Old" Git writes a split commit-graph on top without corrected commit dates. | |
A naive approach of using the newest available generation number from | |
each layer would lead to violated expectations: the lower layer would | |
use corrected commit dates which are much larger than the topological | |
levels of the higher layer. For this reason, Git inspects the topmost | |
layer to see if the layer is missing corrected commit dates. In such a case | |
Git only uses topological level for generation numbers. | |
When writing a new layer in split commit-graph, we write corrected commit | |
dates if the topmost layer has corrected commit dates written. This | |
guarantees that if a layer has corrected commit dates, all lower layers | |
must have corrected commit dates as well. | |
When merging layers, we do not consider whether the merged layers had corrected | |
commit dates. Instead, the new layer will have corrected commit dates if the | |
layer below the new layer has corrected commit dates. | |
While writing or merging layers, if the new layer is the only layer, it will | |
have corrected commit dates when written by compatible versions of Git. Thus, | |
rewriting split commit-graph as a single file (`--split=replace`) creates a | |
single layer with corrected commit dates. | |
## Deleting graph-{hash} files | |
After a new tip file is written, some `graph-{hash}` files may no longer | |
be part of a chain. It is important to remove these files from disk, eventually. | |
The main reason to delay removal is that another process could read the | |
`commit-graph-chain` file before it is rewritten, but then look for the | |
`graph-{hash}` files after they are deleted. | |
To allow holding old split commit-graphs for a while after they are unreferenced, | |
we update the modified times of the files when they become unreferenced. Then, | |
we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}` | |
files whose modified times are older than a given expiry window. This window | |
defaults to zero, but can be changed using command-line arguments or a config | |
setting. | |
## Chains across multiple object directories | |
In a repo with alternates, we look for the `commit-graph-chain` file starting | |
in the local object directory and then in each alternate. The first file that | |
exists defines our chain. As we look for the `graph-{hash}` files for | |
each `{hash}` in the chain file, we follow the same pattern for the host | |
directories. | |
This allows commit-graphs to be split across multiple forks in a fork network. | |
The typical case is a large "base" repo with many smaller forks. | |
As the base repo advances, it will likely update and merge its commit-graph | |
chain more frequently than the forks. If a fork updates their commit-graph after | |
the base repo, then it should "reparent" the commit-graph chain onto the new | |
chain in the base repo. When reading each `graph-{hash}` file, we track | |
the object directory containing it. During a write of a new commit-graph file, | |
we check for any changes in the source object directory and read the | |
`commit-graph-chain` file for that source and create a new file based on those | |
files. During this "reparent" operation, we necessarily need to collapse all | |
levels in the fork, as all of the files are invalid against the new base file. | |
It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph` | |
files in this scenario. It falls to the user to define the proper settings for | |
their custom environment: | |
1. When merging levels in the base repo, the unreferenced files may still be | |
referenced by chains from fork repos. | |
2. The expiry time should be set to a length of time such that every fork has | |
time to recompute their commit-graph chain to "reparent" onto the new base | |
file(s). | |
3. If the commit-graph chain is updated in the base, the fork will not have | |
access to the new chain until its chain is updated to reference those files. | |
(This may change in the future [5].) | |
Related Links | |
------------- | |
[0] https://bugs.chromium.org/p/git/issues/detail?id=8 | |
Chromium work item for: Serialized Commit Graph | |
[1] https://lore.kernel.org/git/[email protected]/ | |
An abandoned patch that introduced generation numbers. | |
[2] https://lore.kernel.org/git/[email protected]/ | |
Discussion about generation numbers on commits and how they interact | |
with fsck. | |
[3] https://lore.kernel.org/git/[email protected]/ | |
More discussion about generation numbers and not storing them inside | |
commit objects. A valuable quote: | |
"I think we should be moving more in the direction of keeping | |
repo-local caches for optimizations. Reachability bitmaps have been | |
a big performance win. I think we should be doing the same with our | |
properties of commits. Not just generation numbers, but making it | |
cheap to access the graph structure without zlib-inflating whole | |
commit objects (i.e., packv4 or something like the "metapacks" I | |
proposed a few years ago)." | |
[4] https://lore.kernel.org/git/[email protected]/T/#u | |
A patch to remove the ahead-behind calculation from 'status'. | |
[5] https://lore.kernel.org/git/[email protected]/ | |
A discussion of a "two-dimensional graph position" that can allow reading | |
multiple commit-graph chains at the same time. | |