Spaces:
Running
Running
Partial Clone Design Notes | |
========================== | |
The "Partial Clone" feature is a performance optimization for Git that | |
allows Git to function without having a complete copy of the repository. | |
The goal of this work is to allow Git better handle extremely large | |
repositories. | |
During clone and fetch operations, Git downloads the complete contents | |
and history of the repository. This includes all commits, trees, and | |
blobs for the complete life of the repository. For extremely large | |
repositories, clones can take hours (or days) and consume 100+GiB of disk | |
space. | |
Often in these repositories there are many blobs and trees that the user | |
does not need such as: | |
1. files outside of the user's work area in the tree. For example, in | |
a repository with 500K directories and 3.5M files in every commit, | |
we can avoid downloading many objects if the user only needs a | |
narrow "cone" of the source tree. | |
2. large binary assets. For example, in a repository where large build | |
artifacts are checked into the tree, we can avoid downloading all | |
previous versions of these non-mergeable binary assets and only | |
download versions that are actually referenced. | |
Partial clone allows us to avoid downloading such unneeded objects *in | |
advance* during clone and fetch operations and thereby reduce download | |
times and disk usage. Missing objects can later be "demand fetched" | |
if/when needed. | |
A remote that can later provide the missing objects is called a | |
promisor remote, as it promises to send the objects when | |
requested. Initially Git supported only one promisor remote, the origin | |
remote from which the user cloned and that was configured in the | |
"extensions.partialClone" config option. Later support for more than | |
one promisor remote has been implemented. | |
Use of partial clone requires that the user be online and the origin | |
remote or other promisor remotes be available for on-demand fetching | |
of missing objects. This may or may not be problematic for the user. | |
For example, if the user can stay within the pre-selected subset of | |
the source tree, they may not encounter any missing objects. | |
Alternatively, the user could try to pre-fetch various objects if they | |
know that they are going offline. | |
Non-Goals | |
--------- | |
Partial clone is a mechanism to limit the number of blobs and trees downloaded | |
*within* a given range of commits -- and is therefore independent of and not | |
intended to conflict with existing DAG-level mechanisms to limit the set of | |
requested commits (i.e. shallow clone, single branch, or fetch '<refspec>'). | |
Design Overview | |
--------------- | |
Partial clone logically consists of the following parts: | |
- A mechanism for the client to describe unneeded or unwanted objects to | |
the server. | |
- A mechanism for the server to omit such unwanted objects from packfiles | |
sent to the client. | |
- A mechanism for the client to gracefully handle missing objects (that | |
were previously omitted by the server). | |
- A mechanism for the client to backfill missing objects as needed. | |
Design Details | |
-------------- | |
- A new pack-protocol capability "filter" is added to the fetch-pack and | |
upload-pack negotiation. | |
+ | |
This uses the existing capability discovery mechanism. | |
See "filter" in linkgit:gitprotocol-pack[5]. | |
- Clients pass a "filter-spec" to clone and fetch which is passed to the | |
server to request filtering during packfile construction. | |
+ | |
There are various filters available to accommodate different situations. | |
See "--filter=<filter-spec>" in Documentation/rev-list-options.txt. | |
- On the server pack-objects applies the requested filter-spec as it | |
creates "filtered" packfiles for the client. | |
+ | |
These filtered packfiles are *incomplete* in the traditional sense because | |
they may contain objects that reference objects not contained in the | |
packfile and that the client doesn't already have. For example, the | |
filtered packfile may contain trees or tags that reference missing blobs | |
or commits that reference missing trees. | |
- On the client these incomplete packfiles are marked as "promisor packfiles" | |
and treated differently by various commands. | |
- On the client a repository extension is added to the local config to | |
prevent older versions of git from failing mid-operation because of | |
missing objects that they cannot handle. | |
See "extensions.partialClone" in Documentation/technical/repository-version.txt" | |
Handling Missing Objects | |
------------------------ | |
- An object may be missing due to a partial clone or fetch, or missing | |
due to repository corruption. To differentiate these cases, the | |
local repository specially indicates such filtered packfiles | |
obtained from promisor remotes as "promisor packfiles". | |
+ | |
These promisor packfiles consist of a "<name>.promisor" file with | |
arbitrary contents (like the "<name>.keep" files), in addition to | |
their "<name>.pack" and "<name>.idx" files. | |
- The local repository considers a "promisor object" to be an object that | |
it knows (to the best of its ability) that promisor remotes have promised | |
that they have, either because the local repository has that object in one of | |
its promisor packfiles, or because another promisor object refers to it. | |
+ | |
When Git encounters a missing object, Git can see if it is a promisor object | |
and handle it appropriately. If not, Git can report a corruption. | |
+ | |
This means that there is no need for the client to explicitly maintain an | |
expensive-to-modify list of missing objects.[a] | |
- Since almost all Git code currently expects any referenced object to be | |
present locally and because we do not want to force every command to do | |
a dry-run first, a fallback mechanism is added to allow Git to attempt | |
to dynamically fetch missing objects from promisor remotes. | |
+ | |
When the normal object lookup fails to find an object, Git invokes | |
promisor_remote_get_direct() to try to get the object from a promisor | |
remote and then retry the object lookup. This allows objects to be | |
"faulted in" without complicated prediction algorithms. | |
+ | |
For efficiency reasons, no check as to whether the missing object is | |
actually a promisor object is performed. | |
+ | |
Dynamic object fetching tends to be slow as objects are fetched one at | |
a time. | |
- `checkout` (and any other command using `unpack-trees`) has been taught | |
to bulk pre-fetch all required missing blobs in a single batch. | |
- `rev-list` has been taught to print missing objects. | |
+ | |
This can be used by other commands to bulk prefetch objects. | |
For example, a "git log -p A..B" may internally want to first do | |
something like "git rev-list --objects --quiet --missing=print A..B" | |
and prefetch those objects in bulk. | |
- `fsck` has been updated to be fully aware of promisor objects. | |
- `repack` in GC has been updated to not touch promisor packfiles at all, | |
and to only repack other objects. | |
- The global variable "fetch_if_missing" is used to control whether an | |
object lookup will attempt to dynamically fetch a missing object or | |
report an error. | |
+ | |
We are not happy with this global variable and would like to remove it, | |
but that requires significant refactoring of the object code to pass an | |
additional flag. | |
Fetching Missing Objects | |
------------------------ | |
- Fetching of objects is done by invoking a "git fetch" subprocess. | |
- The local repository sends a request with the hashes of all requested | |
objects, and does not perform any packfile negotiation. | |
It then receives a packfile. | |
- Because we are reusing the existing fetch mechanism, fetching | |
currently fetches all objects referred to by the requested objects, even | |
though they are not necessary. | |
- Fetching with `--refetch` will request a complete new filtered packfile from | |
the remote, which can be used to change a filter without needing to | |
dynamically fetch missing objects. | |
Using many promisor remotes | |
--------------------------- | |
Many promisor remotes can be configured and used. | |
This allows for example a user to have multiple geographically-close | |
cache servers for fetching missing blobs while continuing to do | |
filtered `git-fetch` commands from the central server. | |
When fetching objects, promisor remotes are tried one after the other | |
until all the objects have been fetched. | |
Remotes that are considered "promisor" remotes are those specified by | |
the following configuration variables: | |
- `extensions.partialClone = <name>` | |
- `remote.<name>.promisor = true` | |
- `remote.<name>.partialCloneFilter = ...` | |
Only one promisor remote can be configured using the | |
`extensions.partialClone` config variable. This promisor remote will | |
be the last one tried when fetching objects. | |
We decided to make it the last one we try, because it is likely that | |
someone using many promisor remotes is doing so because the other | |
promisor remotes are better for some reason (maybe they are closer or | |
faster for some kind of objects) than the origin, and the origin is | |
likely to be the remote specified by extensions.partialClone. | |
This justification is not very strong, but one choice had to be made, | |
and anyway the long term plan should be to make the order somehow | |
fully configurable. | |
For now though the other promisor remotes will be tried in the order | |
they appear in the config file. | |
Current Limitations | |
------------------- | |
- It is not possible to specify the order in which the promisor | |
remotes are tried in other ways than the order in which they appear | |
in the config file. | |
+ | |
It is also not possible to specify an order to be used when fetching | |
from one remote and a different order when fetching from another | |
remote. | |
- It is not possible to push only specific objects to a promisor | |
remote. | |
+ | |
It is not possible to push at the same time to multiple promisor | |
remote in a specific order. | |
- Dynamic object fetching will only ask promisor remotes for missing | |
objects. We assume that promisor remotes have a complete view of the | |
repository and can satisfy all such requests. | |
- Repack essentially treats promisor and non-promisor packfiles as 2 | |
distinct partitions and does not mix them. | |
- Dynamic object fetching invokes fetch-pack once *for each item* | |
because most algorithms stumble upon a missing object and need to have | |
it resolved before continuing their work. This may incur significant | |
overhead -- and multiple authentication requests -- if many objects are | |
needed. | |
- Dynamic object fetching currently uses the existing pack protocol V0 | |
which means that each object is requested via fetch-pack. The server | |
will send a full set of info/refs when the connection is established. | |
If there are large number of refs, this may incur significant overhead. | |
Future Work | |
----------- | |
- Improve the way to specify the order in which promisor remotes are | |
tried. | |
+ | |
For example this could allow to specify explicitly something like: | |
"When fetching from this remote, I want to use these promisor remotes | |
in this order, though, when pushing or fetching to that remote, I want | |
to use those promisor remotes in that order." | |
- Allow pushing to promisor remotes. | |
+ | |
The user might want to work in a triangular work flow with multiple | |
promisor remotes that each have an incomplete view of the repository. | |
- Allow non-pathname-based filters to make use of packfile bitmaps (when | |
present). This was just an omission during the initial implementation. | |
- Investigate use of a long-running process to dynamically fetch a series | |
of objects, such as proposed in [5,6] to reduce process startup and | |
overhead costs. | |
+ | |
It would be nice if pack protocol V2 could allow that long-running | |
process to make a series of requests over a single long-running | |
connection. | |
- Investigate pack protocol V2 to avoid the info/refs broadcast on | |
each connection with the server to dynamically fetch missing objects. | |
- Investigate the need to handle loose promisor objects. | |
+ | |
Objects in promisor packfiles are allowed to reference missing objects | |
that can be dynamically fetched from the server. An assumption was | |
made that loose objects are only created locally and therefore should | |
not reference a missing object. We may need to revisit that assumption | |
if, for example, we dynamically fetch a missing tree and store it as a | |
loose object rather than a single object packfile. | |
+ | |
This does not necessarily mean we need to mark loose objects as promisor; | |
it may be sufficient to relax the object lookup or is-promisor functions. | |
Non-Tasks | |
--------- | |
- Every time the subject of "demand loading blobs" comes up it seems | |
that someone suggests that the server be allowed to "guess" and send | |
additional objects that may be related to the requested objects. | |
+ | |
No work has gone into actually doing that; we're just documenting that | |
it is a common suggestion. We're not sure how it would work and have | |
no plans to work on it. | |
+ | |
It is valid for the server to send more objects than requested (even | |
for a dynamic object fetch), but we are not building on that. | |
Footnotes | |
--------- | |
[a] expensive-to-modify list of missing objects: Earlier in the design of | |
partial clone we discussed the need for a single list of missing objects. | |
This would essentially be a sorted linear list of OIDs that the were | |
omitted by the server during a clone or subsequent fetches. | |
This file would need to be loaded into memory on every object lookup. | |
It would need to be read, updated, and re-written (like the .git/index) | |
on every explicit "git fetch" command *and* on any dynamic object fetch. | |
The cost to read, update, and write this file could add significant | |
overhead to every command if there are many missing objects. For example, | |
if there are 100M missing blobs, this file would be at least 2GiB on disk. | |
With the "promisor" concept, we *infer* a missing object based upon the | |
type of packfile that references it. | |
Related Links | |
------------- | |
[0] https://crbug.com/git/2 | |
Bug#2: Partial Clone | |
[1] https://lore.kernel.org/git/[email protected]/ + | |
Subject: [RFC] Add support for downloading blobs on demand + | |
Date: Fri, 13 Jan 2017 10:52:53 -0500 | |
[2] https://lore.kernel.org/git/[email protected]/ + | |
Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) + | |
Date: Fri, 29 Sep 2017 13:11:36 -0700 | |
[3] https://lore.kernel.org/git/[email protected]/ + | |
Subject: Proposal for missing blob support in Git repos + | |
Date: Wed, 26 Apr 2017 15:13:46 -0700 | |
[4] https://lore.kernel.org/git/[email protected]/ + | |
Subject: [PATCH 00/10] RFC Partial Clone and Fetch + | |
Date: Wed, 8 Mar 2017 18:50:29 +0000 | |
[5] https://lore.kernel.org/git/[email protected]/ + | |
Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module + | |
Date: Fri, 5 May 2017 11:27:52 -0400 | |
[6] https://lore.kernel.org/git/[email protected]/ + | |
Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand + | |
Date: Fri, 14 Jul 2017 09:26:50 -0400 | |