Git Sparse-Index Design Document
================================

The sparse-checkout feature allows users to focus a working directory on
a subset of the files at HEAD. The cone mode patterns, enabled by
`core.sparseCheckoutCone`, allow for very fast pattern matching to
discover which files at HEAD belong in the sparse-checkout cone.

Three important scale dimensions for a Git working directory are:

* `HEAD`: How many files are present at `HEAD`?

* Populated: How many files are within the sparse-checkout cone?

* Modified: How many files has the user modified in the working directory?

We will use big-O notation -- O(X) -- to denote how expensive certain
operations are in terms of these dimensions.

These dimensions are ordered by their magnitude: users (typically) modify
fewer files than are populated, and we can only populate files at `HEAD`.

Problems occur if there is an extreme imbalance in these dimensions. For
example, if `HEAD` contains millions of paths but the populated set has
only tens of thousands, then commands like `git status` and `git add` can
be dominated by work that costs O(`HEAD`) instead of O(Populated). Most of
that cost is in parsing and rewriting the index, which is filled mostly
with files at `HEAD` that are marked with the `SKIP_WORKTREE` bit.

The sparse-index intends to take these commands that read and modify the
index from O(`HEAD`) to O(Populated). To do this, we need to modify the
index format in a significant way: add "sparse directory" entries.

With cone mode patterns, it is possible to detect when an entire
directory will have its contents outside of the sparse-checkout definition.
Instead of listing all of the files it contains as individual entries, a
sparse-index contains an entry with the directory name, referencing the
object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
If we need to discover the details for paths within that directory, we
can parse trees to find that list.

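
The collapsing idea can be illustrated with a small Python toy model. This
is only a sketch under stated assumptions, not Git's data structure: entries
are plain strings, a name ending in `/` stands for a sparse-directory entry,
and the helper name `collapse_to_sparse` is hypothetical. A real entry would
also carry the tree's object ID and the `SKIP_WORKTREE` bit.

```python
def collapse_to_sparse(paths, cone_dirs):
    """Return index entry names, collapsing directories outside the cone.

    Toy model: a name ending in "/" stands for a sparse-directory entry.
    """
    cone = [d.rstrip("/") + "/" for d in cone_dirs]
    entries = []
    for path in sorted(paths):
        parts = path.split("/")
        sparse_dir = None
        # Walk the leading directories from shallowest to deepest.
        for depth in range(1, len(parts)):
            prefix = "/".join(parts[:depth]) + "/"
            if any(prefix.startswith(c) for c in cone):
                break  # inside the cone: keep the file itself
            if not any(c.startswith(prefix) for c in cone):
                sparse_dir = prefix  # topmost directory fully outside the cone
                break
        if sparse_dir is None:
            entries.append(path)
        elif not entries or entries[-1] != sparse_dir:
            # Paths sharing a prefix are adjacent once sorted, so comparing
            # with the previous entry is enough to avoid duplicates.
            entries.append(sparse_dir)
    return entries


head_paths = [
    "README",
    "docs/a.txt",
    "docs/b.txt",
    "src/lib/x.c",
    "src/lib/y.c",
    "src/main.c",
    "vendor/big/1",
    "vendor/big/2",
]
sparse_entries = collapse_to_sparse(head_paths, ["src/lib"])
print(len(head_paths), "paths at HEAD ->", len(sparse_entries), "index entries")
print(sparse_entries)
```

In this model `docs/` and `vendor/` each shrink to a single entry no matter
how many files they hold, which is where the O(Populated) behavior comes
from. Files directly inside `src/` stay listed because cone mode keeps the
immediate files of every parent of a cone directory.
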
At time of writing, sparse-directory entries violate expectations about the
index format and its in-memory data structure. There are many consumers in
the codebase that expect to iterate through all of the index entries and
see only files. In fact, these loops expect to see a reference to every
staged file. One way to handle this is to parse trees as the index is
loaded, replacing each sparse-directory entry with all of the files within
that tree. However, parsing trees is slower than parsing the index format,
so this is slower than reading a full index would have been. The plan is
to make all of these integrations "sparse aware" so that this expansion
through tree parsing is unnecessary and they use fewer resources than when
using a full index.

The implementation plan below follows four phases to slowly integrate with
the sparse-index. The intention is to incrementally update Git commands to
interact safely with the sparse-index without significant slowdowns. This
may not always be possible, but the hope is that the primary commands that
users need in their daily work are dramatically improved.

Phase I: Format and initial speedups
------------------------------------

During this phase, Git learns to enable the sparse-index and safely parse
one. Protections are put in place so that every consumer of the in-memory
data structure can operate with its current assumption of every file at
`HEAD`.

At first, every index parse will call a helper method,
`ensure_full_index()`, which scans the index for sparse-directory entries
(pointing to trees) and replaces them with the full list of paths (with
blob contents) by parsing tree objects. This will be slower in all cases.
The only noticeable change in behavior will be that the serialized index
file contains sparse-directory entries.

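
The expansion step can be sketched in Python as follows. This is a toy
model, not the C implementation: the object store is an in-memory dictionary,
the object IDs are made-up labels, and the names `TREES`, `expand_tree`, and
`ensure_full_index_sketch` are assumptions introduced for illustration.

```python
# Toy object store: tree ID -> {name: (kind, object ID)}.
# All IDs here are invented labels, not real object IDs.
TREES = {
    "t-docs": {"a.txt": ("blob", "b1"), "img": ("tree", "t-img")},
    "t-img": {"logo.png": ("blob", "b2")},
}


def expand_tree(prefix, tree_id, trees):
    """Yield (path, blob ID) for every blob reachable from tree_id."""
    for name, (kind, oid) in sorted(trees[tree_id].items()):
        if kind == "tree":
            # Recurse: this is the tree parsing the text describes as slow.
            yield from expand_tree(prefix + name + "/", oid, trees)
        else:
            yield (prefix + name, oid)


def ensure_full_index_sketch(entries, trees):
    """Replace each sparse-directory entry with the files below its tree."""
    full = []
    for path, oid in entries:
        if path.endswith("/"):  # sparse-directory entry pointing at a tree
            full.extend(expand_tree(path, oid, trees))
        else:                   # ordinary file entry pointing at a blob
            full.append((path, oid))
    return full


sparse_index = [("README", "b0"), ("docs/", "t-docs"), ("src/main.c", "b3")]
for path, oid in ensure_full_index_sketch(sparse_index, TREES):
    print(path, oid)
```

The cost of the expansion grows with the number of trees and blobs below
each sparse directory, which is why the later phases aim to avoid calling it
at all for common commands.
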
To start, we use a new required index extension, `sdir`, to allow
inserting sparse-directory entries into indexes with file format
versions 2, 3, and 4. This prevents Git versions that do not understand
the sparse-index from operating on one, while allowing tools that do not
understand the sparse-index to operate on repositories as long as they do
not interact with the index. A new format, index v5, will be introduced
that includes sparse-directory entries by default. It might also
introduce other features that have been considered for improving the
index.

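
Why a "required" extension keeps older Git versions away can be sketched in
Python. The sketch relies on the rule from Git's index format documentation
that an extension whose four-byte signature starts with an uppercase ASCII
letter is optional and may be ignored, while any other unknown extension
must stop the reader. The function names and the `old_git` set are
hypothetical and only model that rule.

```python
def extension_is_optional(signature: bytes) -> bool:
    """Optional extensions have a signature starting with 'A'..'Z'."""
    if len(signature) != 4:
        raise ValueError("extension signatures are four bytes")
    return b"A" <= signature[:1] <= b"Z"


def can_read_index(signatures, understood):
    """A reader may proceed only if every unknown extension is optional."""
    return all(
        sig in understood or extension_is_optional(sig) for sig in signatures
    )


# Hypothetical reader that predates the sparse-index.
old_git = {b"TREE", b"REUC", b"link", b"UNTR"}
index_extensions = [b"TREE", b"sdir"]

print(can_read_index(index_extensions, old_git))
print(can_read_index(index_extensions, old_git | {b"sdir"}))
```

Because `sdir` begins with a lowercase letter, a reader that does not know
it has to refuse the index instead of misreading sparse-directory entries as
files.
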
Next, consumers of the index will be guarded against operating on a
sparse-index by inserting calls to `ensure_full_index()` or
`expand_index_to_path()`. If a specific path is requested, then those
lookups will be protected from within the `index_file_exists()` and
`index_name_pos()` API calls: they will call `ensure_full_index()` if
necessary. The intention here is to preserve existing behavior when
interacting with a sparse-checkout. We don't want a change to happen by
accident, without tests. Many of these locations may not need any change
before removing the guards, but we should not do so without tests to
ensure the expected behavior happens.

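
A guarded lookup of this kind might look like the following Python sketch.
It is a toy model and not the real API: entries are sorted strings, a name
ending in `/` is a sparse-directory entry, and `name_pos`,
`covering_sparse_dir`, and `guarded_name_pos` are hypothetical names. The
`expand` callback stands in for the expansion helper.

```python
from bisect import bisect_left


def name_pos(entries, path):
    """Binary search in the sorted entries; -1 if the exact name is absent."""
    i = bisect_left(entries, path)
    return i if i < len(entries) and entries[i] == path else -1


def covering_sparse_dir(entries, path):
    """Return the sparse-directory entry that contains path, or None."""
    parts = path.split("/")
    for depth in range(1, len(parts)):
        prefix = "/".join(parts[:depth]) + "/"
        if name_pos(entries, prefix) >= 0:
            return prefix
    return None


def guarded_name_pos(entries, path, expand):
    """Look up path, expanding first only if a sparse directory hides it."""
    if covering_sparse_dir(entries, path) is not None:
        entries = expand(entries)
    return entries, name_pos(entries, path)


sparse = ["README", "docs/", "src/lib/x.c", "src/main.c", "vendor/"]
full = ["README", "docs/a.txt", "docs/img/logo.png",
        "src/lib/x.c", "src/main.c", "vendor/big/1"]
expansions = []


def expand(entries):
    expansions.append(1)  # record that the expensive path was taken
    return full


_, pos_inside_cone = guarded_name_pos(sparse, "src/main.c", expand)
_, pos_outside_cone = guarded_name_pos(sparse, "docs/a.txt", expand)
print(pos_inside_cone, pos_outside_cone, len(expansions))
```

Lookups for populated paths never trigger the expansion in this model; only
a request for a path below a sparse directory pays for it.
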
It may be desirable to _change_ the behavior of some commands in the
presence of a sparse index or more generally in any sparse-checkout
scenario. In such cases, these changes should be carefully communicated
and tested. No such behavior changes are intended during this phase.

A scan of the codebase shows that not every iteration over the cache
entries needs an `ensure_full_index()` check. The basic reasons include:

1. The loop is scanning for entries with non-zero stage. These entries
   are not collapsed into a sparse-directory entry.

2. The loop is scanning for submodules. These entries are not collapsed
   into a sparse-directory entry.

3. The loop is part of the index API, especially around reading or
   writing the format.

4. The loop is checking for correct order of cache entries and that is
   correct if and only if the sparse-directory entries are in the correct
   location.

5. The loop ignores entries with the `SKIP_WORKTREE` bit set, or is
   otherwise already aware of sparse directory entries.

6. The sparse-index is disabled at this point when using the split-index
   feature, so no effort is made to protect the split-index API.

Even after inserting these guards, we will keep expanding sparse-indexes
for most Git commands using the `command_requires_full_index` repository
setting. This setting will be on by default and disabled one builtin at a
time until we have sufficient confidence that all of the index operations
are properly guarded.

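
The opt-out mechanism can be modeled with a short Python sketch. Only the
setting name `command_requires_full_index` and its default come from the
text; the `RepoSettings` class, the `SPARSE_AWARE` registry, and the helper
functions are hypothetical and exist only to illustrate the control flow.

```python
from dataclasses import dataclass


@dataclass
class RepoSettings:
    # On by default, as described above.
    command_requires_full_index: bool = True


# Hypothetical registry of builtins that have been made sparse aware.
SPARSE_AWARE = {"status", "add"}


def settings_for(command):
    """Each integrated builtin turns the setting off for itself."""
    return RepoSettings(command_requires_full_index=command not in SPARSE_AWARE)


def load_index(on_disk_entries, settings, expand):
    """Expand right after reading unless the command opted out."""
    if settings.command_requires_full_index:
        return expand(on_disk_entries)
    return on_disk_entries


def fake_expand(entries):
    # Stand-in for the tree-parsing expansion.
    return ["docs/a.txt", "docs/b.txt"]


print(load_index(["docs/"], settings_for("status"), fake_expand))
print(load_index(["docs/"], settings_for("commit"), fake_expand))
```

Keeping the default at "expand" means a builtin that nobody has audited yet
behaves exactly as it did with a full index, at the cost of the extra
expansion work.
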
To complete this phase, the commands `git status` and `git add` will be
integrated with the sparse-index so that they operate with O(Populated)
performance. They will be carefully tested for operations within and
outside the sparse-checkout definition.

Phase II: Careful integrations
------------------------------

This phase focuses on ensuring that all index extensions and APIs work
well with a sparse-index. This requires significant increases to our test
coverage, especially for operations that interact with the working
directory outside of the sparse-checkout definition. Some of the current
behaviors may not be the desirable ones, as reflected by some tests
already marked for failure in `t1092-sparse-checkout-compatibility.sh`.

The index extensions that may require special integrations are:

* FS Monitor
* Untracked cache

While integrating with these features, we should look for patterns that
might lead to better APIs for interacting with the index. Coalescing
common usage patterns into an API call can reduce the number of places
where sparse-directories need to be handled carefully.

Phase III: Important command speedups
-------------------------------------

At this point, the patterns for testing and implementing sparse-directory
logic should be relatively stable. This phase focuses on updating some of
the most common builtins that use the index to operate as O(Populated).
Here is a potential list of commands that could be valuable to integrate
at this point:

* `git commit`
* `git checkout`
* `git merge`
* `git rebase`

Hopefully, commands such as `git merge` and `git rebase` can benefit
instead from merge algorithms that do not use the index as a data
structure, such as the merge-ORT strategy. As these topics mature, we
may enable the ORT strategy by default for repositories using the
sparse-index feature.

Along with `git status` and `git add`, these commands cover the majority
of users' interactions with the working directory. In addition, we can
integrate with these commands:

* `git grep`
* `git rm`

These have been proposed as commands whose behavior could change when in a
repo with a sparse-checkout definition. It would be good to include this
behavior automatically when using a sparse-index. The switch in behavior
needs to be communicated clearly to the user.

This phase is the first where parallel work might be possible without too
many conflicts between topics.

Phase IV: The long tail
-----------------------

This last phase is less a "phase" and more "the new normal" after all of
the previous work.

To start, the `command_requires_full_index` option could be removed in
favor of expanding only when hitting an API guard.

There are many Git commands that could use special attention to operate as
O(Populated), while some might be so rare that it is acceptable to leave
them with additional overhead when a sparse-index is present.

Here are some commands that might be useful to update:

* `git sparse-checkout set`
* `git am`
* `git clean`
* `git stash`