Git hash function transition
============================
Objective | |
--------- | |
Migrate Git from SHA-1 to a stronger hash function. | |
Background | |
---------- | |
At its core, the Git version control system is a content addressable | |
filesystem. It uses the SHA-1 hash function to name content. For | |
example, files, directories, and revisions are referred to by hash | |
values unlike in other traditional version control systems where files | |
or versions are referred to via sequential numbers. The use of a hash | |
function to address its content delivers a few advantages: | |
* Integrity checking is easy. Bit flips, for example, are easily | |
detected, as the hash of corrupted content does not match its name. | |
* Lookup of objects is fast. | |
Using a cryptographically secure hash function brings additional | |
advantages: | |
* Object names can be signed and third parties can trust the hash to | |
address the signed object and all objects it references. | |
* Communication using Git protocol and out of band communication | |
methods have a short reliable string that can be used to reliably | |
address stored content. | |
Over time some flaws in SHA-1 have been discovered by security | |
researchers. On 23 February 2017 the SHAttered attack | |
(https://shattered.io) demonstrated a practical SHA-1 hash collision. | |
Git v2.13.0 and later subsequently moved to a hardened SHA-1 | |
implementation by default, which isn't vulnerable to the SHAttered | |
attack, but SHA-1 is still weak. | |
Thus it's considered prudent to move past any variant of SHA-1
to a new hash. There's no guarantee that new attacks on SHA-1 won't
be published, and those attacks may not have viable mitigations.
If SHA-1 and its variants were to be truly broken, Git's hash function | |
could not be considered cryptographically secure any more. This would | |
impact the communication of hash values because we could not trust | |
that a given hash value represented the known good version of content | |
that the speaker intended. | |
SHA-1 still possesses the other properties such as fast object lookup | |
and safe error checking, but other hash functions are equally suitable | |
that are believed to be cryptographically secure. | |
Choice of Hash | |
-------------- | |
The hash to replace the hardened SHA-1 should be stronger than SHA-1 | |
was: we would like it to be trustworthy and useful in practice for at | |
least 10 years. | |
Some other relevant properties: | |
1. A 256-bit hash (long enough to match common security practice; not | |
excessively long to hurt performance and disk usage). | |
2. High quality implementations should be widely available (e.g., in | |
OpenSSL and Apple CommonCrypto). | |
3. The hash function's properties should match Git's needs (e.g. Git | |
requires collision and 2nd preimage resistance and does not require | |
length extension resistance). | |
4. As a tiebreaker, the hash should be fast to compute (fortunately | |
many contenders are faster than SHA-1). | |
There were several contenders for a successor hash to SHA-1, including | |
SHA-256, SHA-512/256, SHA-256x16, K12, and BLAKE2bp-256. | |
In late 2018 the project picked SHA-256 as its successor hash. | |
See 0ed8d8da374 (doc hash-function-transition: pick SHA-256 as | |
NewHash, 2018-08-04) and numerous mailing list threads at the time, | |
particularly the one starting at | |
https://lore.kernel.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/ | |
for more information. | |
Goals | |
----- | |
1. The transition to SHA-256 can be done one local repository at a time. | |
a. Requiring no action by any other party. | |
b. A SHA-256 repository can communicate with SHA-1 Git servers | |
(push/fetch). | |
c. Users can use SHA-1 and SHA-256 identifiers for objects | |
interchangeably (see "Object names on the command line", below). | |
d. New signed objects make use of a stronger hash function than | |
SHA-1 for their security guarantees. | |
2. Allow a complete transition away from SHA-1. | |
a. Local metadata for SHA-1 compatibility can be removed from a | |
repository if compatibility with SHA-1 is no longer needed. | |
3. Maintainability throughout the process. | |
a. The object format is kept simple and consistent. | |
b. Creation of a generalized repository conversion tool. | |
Non-Goals | |
--------- | |
1. Add SHA-256 support to Git protocol. This is valuable and the | |
logical next step but it is out of scope for this initial design. | |
2. Transparently improving the security of existing SHA-1 signed | |
objects. | |
3. Intermixing objects using multiple hash functions in a single | |
repository. | |
4. Taking the opportunity to fix other bugs in Git's formats and | |
protocols. | |
5. Shallow clones and fetches into a SHA-256 repository. (This will | |
change when we add SHA-256 support to Git protocol.) | |
6. Skip fetching some submodules of a project into a SHA-256 | |
repository. (This also depends on SHA-256 support in Git | |
protocol.) | |
Overview | |
-------- | |
We introduce a new repository format extension. Repositories with this | |
extension enabled use SHA-256 instead of SHA-1 to name their objects. | |
This affects both object names and object content -- both the names | |
of objects and all references to other objects within an object are | |
switched to the new hash function. | |
SHA-256 repositories cannot be read by older versions of Git. | |
Alongside the packfile, a SHA-256 repository stores a bidirectional | |
mapping between SHA-256 and SHA-1 object names. The mapping is generated | |
locally and can be verified using "git fsck". Object lookups use this
mapping to allow naming objects using either their SHA-1 or SHA-256
names interchangeably.
"git cat-file" and "git hash-object" gain options to display an object | |
in its SHA-1 form and write an object given its SHA-1 form. This | |
requires all objects referenced by that object to be present in the | |
object database so that they can be named using the appropriate name | |
(using the bidirectional hash mapping). | |
Fetches from a SHA-1 based server convert the fetched objects into | |
SHA-256 form and record the mapping in the bidirectional mapping table | |
(see below for details). Pushes to a SHA-1 based server convert the | |
objects being pushed into SHA-1 form so the server does not have to be | |
aware of the hash function the client is using. | |
Detailed Design | |
--------------- | |
Repository format extension | |
~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
A SHA-256 repository uses repository format version `1` (see | |
Documentation/technical/repository-version.txt) with extensions | |
`objectFormat` and `compatObjectFormat`: | |
	[core]
		repositoryFormatVersion = 1
	[extensions]
		objectFormat = sha256
		compatObjectFormat = sha1
The combination of setting `core.repositoryFormatVersion=1` and
populating `extensions.*` ensures that all versions of Git later than
`v0.99.9l` will die with an error message instead of trying to operate
on the SHA-256 repository.
	# Between v0.99.9l and v2.7.0
	$ git status
	fatal: Expected git repo version <= 0, found 1
	# After v2.7.0
	$ git status
	fatal: unknown repository extensions found:
		objectformat
		compatobjectformat
See the "Transition plan" section below for more details on these | |
repository extensions. | |
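The gatekeeping behavior described above can be sketched roughly as follows. This is an illustration only, not Git's implementation: the function name, the `known` parameter, and the exact error wording are assumptions made for the example, and it assumes extensions are honored only when the format version is 1.

```python
# Hypothetical set of extensions that this (imaginary) reader understands.
KNOWN_EXTENSIONS = frozenset({"objectformat", "compatobjectformat"})


def check_repository_format(version, extensions, known=KNOWN_EXTENSIONS):
    """Return the object format to use, or raise if the repository is too new.

    `extensions` maps extension names (case-insensitive) to their values.
    """
    if version > 1:
        raise RuntimeError(f"Expected git repo version <= 1, found {version}")
    if version == 0:
        # Assumption: a version 0 repository carries no extensions that matter.
        return "sha1"
    lowered = {name.lower(): value for name, value in extensions.items()}
    unknown = sorted(name for name in lowered if name not in known)
    if unknown:
        # Refuse to operate rather than misread the object store.
        raise RuntimeError(
            "unknown repository extensions found:\n\t" + "\n\t".join(unknown)
        )
    return lowered.get("objectformat", "sha1")
```

A reader that knows both extensions gets "sha256" back for the configuration shown above. A reader whose `known` set is empty raises instead of touching the repository.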
Object names | |
~~~~~~~~~~~~ | |
Objects can be named by their 40 hexadecimal digit SHA-1 name or 64 | |
hexadecimal digit SHA-256 name, plus names derived from those (see | |
gitrevisions(7)). | |
The SHA-1 name of an object is the SHA-1 of the concatenation of its | |
type, length, a nul byte, and the object's SHA-1 content. This is the | |
traditional <sha1> used in Git to name objects. | |
The SHA-256 name of an object is the SHA-256 of the concatenation of its | |
type, length, a nul byte, and the object's SHA-256 content. | |
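As a minimal sketch of the two naming rules, the following computes either name for a blob, whose content is identical in both formats. The helper name `object_name` is made up for this example. The header layout (type, a space, the decimal length, then a NUL byte) follows what Git does today.

```python
import hashlib


def object_name(obj_type: str, content: bytes, algo: str) -> str:
    """Hash "<type> <length>\\0<content>" with the given algorithm.

    `algo` is "sha1" or "sha256". For trees, commits, and tags the caller
    must pass the content in the matching format (SHA-1 content for "sha1",
    SHA-256 content for "sha256").
    """
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


empty_blob_sha1 = object_name("blob", b"", "sha1")
empty_blob_sha256 = object_name("blob", b"", "sha256")
```

The SHA-1 name is 40 hexadecimal digits long and the SHA-256 name is 64, which is what lets the two be told apart by length alone.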
Object format | |
~~~~~~~~~~~~~ | |
The content as a byte sequence of a tag, commit, or tree object named
by SHA-1 differs from that of the same object named by SHA-256, because
an object named by SHA-256 name refers to other objects by their
SHA-256 names and an object named by SHA-1 name refers to other objects
by their SHA-1 names.
The SHA-256 content of an object is the same as its SHA-1 content, except | |
that objects referenced by the object are named using their SHA-256 names | |
instead of SHA-1 names. Because a blob object does not refer to any | |
other object, its SHA-1 content and SHA-256 content are the same. | |
The format allows round-trip conversion between SHA-256 content and | |
SHA-1 content. | |
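To make the round-trip property concrete, here is a simplified sketch that rewrites only the "tree" and "parent" header lines of a commit. The function name and the plain dictionary standing in for the translation table are assumptions for the example. A real converter would also have to handle tree entries, tags, and embedded mergetag fields.

```python
def convert_commit(content: bytes, mapping: dict) -> bytes:
    """Rewrite object names in a commit's "tree" and "parent" headers.

    `mapping` maps hex names in the source format to hex names in the
    target format. Everything else is copied byte for byte.
    """
    header, sep, body = content.partition(b"\n\n")
    out = []
    for line in header.split(b"\n"):
        key, _, value = line.partition(b" ")
        if key in (b"tree", b"parent"):
            line = key + b" " + mapping[value.decode("ascii")].encode("ascii")
        out.append(line)
    return b"\n".join(out) + sep + body
```

Converting with a SHA-1 to SHA-256 mapping and then with its inverse yields the original bytes again, which is the property the format relies on.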
Object storage | |
~~~~~~~~~~~~~~ | |
Loose objects use zlib compression and packed objects use the packed | |
format described in linkgit:gitformat-pack[5], just like | |
today. The content that is compressed and stored uses SHA-256 content | |
instead of SHA-1 content. | |
Pack index | |
~~~~~~~~~~ | |
Pack index (.idx) files use a new v3 format that supports multiple | |
hash functions. They have the following format (all integers are in | |
network byte order): | |
- A header appears at the beginning and consists of the following: | |
* The 4-byte pack index signature: '\377tOc'
* 4-byte version number: 3 | |
* 4-byte length of the header section, including the signature and | |
version number | |
* 4-byte number of objects contained in the pack | |
* 4-byte number of object formats in this pack index: 2 | |
* For each object format: | |
** 4-byte format identifier (e.g., 'sha1' for SHA-1) | |
** 4-byte length in bytes of shortened object names. This is the | |
shortest possible length needed to make names in the shortened | |
object name table unambiguous. | |
** 4-byte integer, recording where tables relating to this format | |
are stored in this index file, as an offset from the beginning. | |
* 4-byte offset to the trailer from the beginning of this file. | |
* Zero or more additional key/value pairs (4-byte key, 4-byte | |
value). Only one key is supported: 'PSRC'. See the "Loose objects | |
and unreachable objects" section for supported values and how this | |
is used. All other keys are reserved. Readers must ignore | |
unrecognized keys. | |
- Zero or more NUL bytes. This can optionally be used to improve the | |
alignment of the full object name table below. | |
- Tables for the first object format: | |
* A sorted table of shortened object names. These are prefixes of | |
the names of all objects in this pack file, packed together | |
without offset values to reduce the cache footprint of the binary | |
search for a specific object name. | |
* A table of full object names in pack order. This allows resolving | |
a reference to "the nth object in the pack file" (from a | |
reachability bitmap or from the next table of another object | |
format) to its object name. | |
* A table of 4-byte values mapping object name order to pack order. | |
For an object in the table of sorted shortened object names, the | |
value at the corresponding index in this table is the index in the | |
previous table for that same object. | |
This can be used to look up the object in reachability bitmaps or | |
to look up its name in another object format. | |
* A table of 4-byte CRC32 values of the packed object data, in the | |
order that the objects appear in the pack file. This is to allow | |
compressed data to be copied directly from pack to pack during | |
repacking without undetected data corruption. | |
* A table of 4-byte offset values. For an object in the table of | |
sorted shortened object names, the value at the corresponding | |
index in this table indicates where that object can be found in | |
the pack file. These are usually 31-bit pack file offsets, but | |
large offsets are encoded as an index into the next table with the | |
most significant bit set. | |
* A table of 8-byte offset entries (empty for pack files less than | |
2 GiB). Pack files are organized with heavily used objects toward | |
the front, so most object references should not need to refer to | |
this table. | |
- Zero or more NUL bytes. | |
- Tables for the second object format, with the same layout as above, | |
up to and not including the table of CRC32 values. | |
- Zero or more NUL bytes. | |
- The trailer consists of the following: | |
* A copy of the 20-byte SHA-256 checksum at the end of the | |
corresponding packfile. | |
* 20-byte SHA-256 checksum of all of the above. | |
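The header layout above can be read with a few fixed-size unpack calls. The sketch below is only an illustration of the described layout: the function name and the returned dictionary shape are invented here, and the signature bytes are assumed to be the same `\377tOc` magic that existing pack index files use.

```python
import struct

IDX_SIGNATURE = b"\xfftOc"  # assumed: same magic as the existing idx format


def parse_idx_v3_header(buf: bytes) -> dict:
    """Parse the header section of a v3 pack index held in `buf`."""
    # All integers are in network byte order, hence the ">" prefix.
    sig, version, header_len, nr_objects, nr_formats = struct.unpack_from(
        ">4sIIII", buf, 0)
    if sig != IDX_SIGNATURE or version != 3:
        raise ValueError("not a v3 pack index")
    pos = 20
    formats = []
    for _ in range(nr_formats):
        format_id, short_len, tables_offset = struct.unpack_from(">4sII", buf, pos)
        formats.append({"id": format_id,
                        "short_name_len": short_len,
                        "tables_offset": tables_offset})
        pos += 12
    (trailer_offset,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    # Remaining header bytes are 4-byte key / 4-byte value pairs.
    # Callers should ignore keys they do not recognize.
    pairs = {}
    while pos + 8 <= header_len:
        key, value = struct.unpack_from(">4sI", buf, pos)
        pairs[key] = value
        pos += 8
    return {"nr_objects": nr_objects, "formats": formats,
            "trailer_offset": trailer_offset, "pairs": pairs}
```

Because the header records its own length, a reader can skip key/value pairs it does not understand and still find the first table.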
Loose object index | |
~~~~~~~~~~~~~~~~~~ | |
A new file $GIT_OBJECT_DIR/loose-object-idx contains information about | |
all loose objects. Its format is | |
	# loose-object-idx
	(sha256-name SP sha1-name LF)*
where the object names are in hexadecimal format. The file is not | |
sorted. | |
The loose object index is protected against concurrent writes by a | |
lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose | |
object: | |
1. Write the loose object to a temporary file, like today. | |
2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock. | |
3. Rename the loose object into place. | |
4. Open loose-object-idx with O_APPEND and write the new object's entry.
5. Unlink loose-object-idx.lock to release the lock. | |
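The five steps for adding a loose object might look like this in outline. The function and its parameters are hypothetical, error handling is minimal, and writing the "# loose-object-idx" header line when the file is first created is an assumption of this sketch.

```python
import os


def add_loose_object(objdir, tmp_path, final_path, sha256_name, sha1_name):
    """Move a freshly written loose object into place and index it."""
    idx = os.path.join(objdir, "loose-object-idx")
    lock = idx + ".lock"
    # Step 2: O_CREAT | O_EXCL fails if another writer holds the lock.
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o666)
    os.close(fd)
    try:
        # Step 3: rename the temporary file (written in step 1) into place.
        os.makedirs(os.path.dirname(final_path), exist_ok=True)
        os.rename(tmp_path, final_path)
        # Step 4: append the mapping entry ("ab" opens with O_APPEND).
        is_new = not os.path.exists(idx)
        with open(idx, "ab") as f:
            if is_new:
                f.write(b"# loose-object-idx\n")
            f.write(f"{sha256_name} {sha1_name}\n".encode("ascii"))
    finally:
        # Step 5: release the lock.
        os.unlink(lock)
```

A second writer that arrives while the lock file exists gets `FileExistsError` from `os.open` and can retry later.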
To remove entries (e.g. in "git pack-refs" or "git prune"):
1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the | |
lock. | |
2. Write the new content to loose-object-idx.lock. | |
3. Unlink any loose objects being removed. | |
4. Rename to replace loose-object-idx, releasing the lock. | |
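Removal follows the same locking idea, except that the lock file doubles as the new index. A rough sketch follows; the function name and the `doomed` argument are invented for illustration.

```python
import os


def remove_loose_entries(objdir, doomed):
    """Drop entries from loose-object-idx and delete the loose objects.

    `doomed` maps SHA-256 hex names to the paths of the loose objects
    that should be removed.
    """
    idx = os.path.join(objdir, "loose-object-idx")
    lock = idx + ".lock"
    # Step 1: acquire the lock.
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o666)
    try:
        # Step 2: write the surviving entries into the lock file.
        with os.fdopen(fd, "wb") as out, open(idx, "rb") as src:
            for line in src:
                name = line.split(b" ", 1)[0].decode("ascii")
                if line.startswith(b"#") or name not in doomed:
                    out.write(line)
        # Step 3: unlink the loose objects being removed.
        for path in doomed.values():
            os.unlink(path)
        # Step 4: renaming over the index also releases the lock.
        os.replace(lock, idx)
    except BaseException:
        if os.path.exists(lock):
            os.unlink(lock)
        raise
```

Readers never see a half-written index: they see either the old file or the fully written replacement.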
Translation table | |
~~~~~~~~~~~~~~~~~ | |
The index files support a bidirectional mapping between SHA-1 names | |
and SHA-256 names. The lookup proceeds similarly to ordinary object | |
lookups. For example, to convert a SHA-1 name to a SHA-256 name: | |
1. Look for the object in idx files. If a match is present in the | |
idx's sorted list of truncated SHA-1 names, then: | |
a. Read the corresponding entry in the SHA-1 name order to pack | |
name order mapping. | |
b. Read the corresponding entry in the full SHA-1 name table to | |
verify we found the right object. If it is, then | |
c. Read the corresponding entry in the full SHA-256 name table. | |
That is the object's SHA-256 name. | |
2. Check for a loose object. Read lines from loose-object-idx until | |
we find a match. | |
Step (1) takes the same amount of time as an ordinary object lookup: | |
O(number of packs * log(objects per pack)). Step (2) takes O(number of | |
loose objects) time. To maintain good performance it will be necessary | |
to keep the number of loose objects low. See the "Loose objects and | |
unreachable objects" section below for more details. | |
Since all operations that make new objects (e.g., "git commit") add | |
the new objects to the corresponding index, this mapping is possible | |
for all objects in the object store. | |
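The lookup steps above can be modeled in memory as follows. This is a sketch under simplifying assumptions: names are hex strings rather than raw bytes, each pack index is a small dataclass invented for the example, and the loose object index is passed in as a list of lines.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PackIdx:
    short_len: int                 # hex digits kept in the shortened table
    sha1_short: List[str]          # sorted shortened SHA-1 names
    sha1_to_pack_order: List[int]  # name order -> pack order
    sha1_full: List[str]           # full SHA-1 names in pack order
    sha256_full: List[str]         # full SHA-256 names in pack order


def sha1_to_sha256(sha1: str, packs: List[PackIdx],
                   loose_idx_lines: List[str]) -> Optional[str]:
    # Step 1: binary search each pack's sorted shortened-name table.
    for idx in packs:
        prefix = sha1[:idx.short_len]
        i = bisect_left(idx.sha1_short, prefix)
        if i < len(idx.sha1_short) and idx.sha1_short[i] == prefix:
            pos = idx.sha1_to_pack_order[i]   # step 1a
            if idx.sha1_full[pos] == sha1:    # step 1b
                return idx.sha256_full[pos]   # step 1c
    # Step 2: linear scan of the loose object index.
    for line in loose_idx_lines:
        if line.startswith("#"):
            continue
        sha256_name, _, sha1_name = line.strip().partition(" ")
        if sha1_name == sha1:
            return sha256_name
    return None
```

The pack lookup is logarithmic per pack while the loose scan is linear, which is why the number of loose objects has to stay small.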
Reading an object's SHA-1 content | |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
The SHA-1 content of an object can be read by converting all SHA-256 names | |
of its SHA-256 content references to SHA-1 names using the translation table. | |
Fetch | |
~~~~~ | |
Fetching from a SHA-1 based server requires translating between SHA-1 | |
and SHA-256 based representations on the fly. | |
SHA-1s named in the ref advertisement that are present on the client | |
can be translated to SHA-256 and looked up as local objects using the | |
translation table. | |
Negotiation proceeds as today. Any "have"s generated locally are | |
converted to SHA-1 before being sent to the server, and SHA-1s | |
mentioned by the server are converted to SHA-256 when looking them up | |
locally. | |
After negotiation, the server sends a packfile containing the | |
requested objects. We convert the packfile to SHA-256 format using | |
the following steps: | |
1. index-pack: inflate each object in the packfile and compute its | |
SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against | |
objects the client has locally. These objects can be looked up | |
using the translation table and their SHA-1 content read as | |
described above to resolve the deltas. | |
2. topological sort: starting at the "want"s from the negotiation | |
phase, walk through objects in the pack and emit a list of them, | |
excluding blobs, in reverse topologically sorted order, with each | |
object coming later in the list than all objects it references. | |
(This list only contains objects reachable from the "wants". If the | |
pack from the server contained additional extraneous objects, then | |
they will be discarded.) | |
3. convert to SHA-256: open a new SHA-256 packfile. Read the topologically | |
sorted list just generated. For each object, inflate its | |
SHA-1 content, convert to SHA-256 content, and write it to the SHA-256 | |
pack. Record the new SHA-1<-->SHA-256 mapping entry for use in the idx. | |
4. sort: reorder entries in the new pack to match the order of objects
in the pack the server generated and include blobs. Write a SHA-256 idx
file.
5. clean up: remove the SHA-1 based pack file, index, and | |
topologically sorted list obtained from the server in steps 1 | |
and 2. | |
Step 3 requires every object referenced by the new object to be in the | |
translation table. This is why the topological sort step is necessary. | |
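The ordering requirement of step 2 amounts to a post-order walk of the object graph. A minimal sketch, assuming the references of each object in the pack have already been extracted into a dictionary (the data shape and function name are made up for this example):

```python
def topo_order(wants, objects):
    """List non-blob objects so that each comes after all it references.

    `objects` maps a name to (type, [referenced names]) for every object
    in the received pack. Names missing from `objects` are assumed to be
    present locally already and are skipped.
    """
    order, seen = [], set()
    for want in wants:
        stack = [(want, False)]
        while stack:
            name, expanded = stack.pop()
            if expanded:
                # All references were emitted before this point.
                order.append(name)
                continue
            if name in seen or name not in objects:
                continue
            seen.add(name)
            obj_type, refs = objects[name]
            if obj_type == "blob":
                continue  # blobs reference nothing and are excluded
            stack.append((name, True))
            for ref in refs:
                stack.append((ref, False))
    return order
```

Objects in the pack that are not reachable from the wants never enter the list, matching the note that extraneous objects are discarded.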
As an optimization, step 1 could write a file describing what non-blob | |
objects each object it has inflated from the packfile references. This | |
makes the topological sort in step 2 possible without inflating the | |
objects in the packfile for a second time. The objects need to be | |
inflated again in step 3, for a total of two inflations. | |
Step 4 is probably necessary for good read-time performance. "git | |
pack-objects" on the server optimizes the pack file for good data | |
locality (see Documentation/technical/pack-heuristics.txt). | |
Details of this process are likely to change. It will take some | |
experimenting to get this to perform well. | |
Push | |
~~~~ | |
Push is simpler than fetch because the objects referenced by the | |
pushed objects are already in the translation table. The SHA-1 content | |
of each object being pushed can be read as described in the "Reading | |
an object's SHA-1 content" section to generate the pack written by git | |
send-pack. | |
Signed Commits | |
~~~~~~~~~~~~~~ | |
We add a new field "gpgsig-sha256" to the commit object format to allow | |
signing commits without relying on SHA-1. It is similar to the | |
existing "gpgsig" field. Its signed payload is the SHA-256 content of the | |
commit object with any "gpgsig" and "gpgsig-sha256" fields removed. | |
This means commits can be signed | |
1. using SHA-1 only, as in existing signed commit objects | |
2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig | |
fields. | |
3. using only SHA-256, by only using the gpgsig-sha256 field. | |
Old versions of "git verify-commit" can verify the gpgsig signature in | |
cases (1) and (2) without modifications and view case (3) as an | |
ordinary unsigned commit. | |
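Deriving the signed payload means dropping both signature fields, including their continuation lines, from the commit header. The sketch below shows one way to do that. The function name is an assumption, and it relies on continuation lines of a multi-line header starting with a space, as in existing commit objects.

```python
SIGNATURE_FIELDS = (b"gpgsig", b"gpgsig-sha256")


def signed_payload(commit: bytes) -> bytes:
    """Return the commit content with gpgsig and gpgsig-sha256 removed."""
    header, sep, body = commit.partition(b"\n\n")
    kept, skipping = [], False
    for line in header.split(b"\n"):
        if line.startswith(b" "):
            # Continuation line: belongs to the most recent header field.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.split(b" ", 1)[0] in SIGNATURE_FIELDS
        if not skipping:
            kept.append(line)
    return b"\n".join(kept) + sep + body
```

For the gpgsig-sha256 signature, the function would be applied to the SHA-256 content of the commit.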
Signed Tags | |
~~~~~~~~~~~ | |
We add a new field "gpgsig-sha256" to the tag object format to allow | |
signing tags without relying on SHA-1. Its signed payload is the | |
SHA-256 content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP | |
SIGNATURE-----" delimited in-body signature removed. | |
This means tags can be signed | |
1. using SHA-1 only, as in existing signed tag objects | |
2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body | |
signature. | |
3. using only SHA-256, by only using the gpgsig-sha256 field. | |
Mergetag embedding | |
~~~~~~~~~~~~~~~~~~ | |
The mergetag field in the SHA-1 content of a commit contains the | |
SHA-1 content of a tag that was merged by that commit. | |
The mergetag field in the SHA-256 content of the same commit contains the | |
SHA-256 content of the same tag. | |
Submodules | |
~~~~~~~~~~ | |
To convert recorded submodule pointers, you need to have the converted | |
submodule repository in place. The translation table of the submodule | |
can be used to look up the new hash. | |
Loose objects and unreachable objects | |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
Fast lookups in the loose-object-idx require that the number of loose | |
objects not grow too high. | |
"git gc --auto" currently waits for there to be 6700 loose objects | |
present before consolidating them into a packfile. We will need to | |
measure to find a more appropriate threshold for it to use. | |
"git gc --auto" currently waits for there to be 50 packs present | |
before combining packfiles. Packing loose objects more aggressively | |
may cause the number of pack files to grow too quickly. This can be | |
mitigated by using a strategy similar to Martin Fick's exponential | |
rolling garbage collection script: | |
https://gerrit-review.googlesource.com/c/gerrit/+/35215 | |
"git gc" currently expels any unreachable objects it encounters in | |
pack files to loose objects in an attempt to prevent a race when | |
pruning them (in case another process is simultaneously writing a new | |
object that refers to the about-to-be-deleted object). This leads to | |
an explosion in the number of loose objects present and disk space | |
usage due to the objects in delta form being replaced with independent | |
loose objects. Worse, the race is still present for loose objects. | |
Instead, "git gc" will need to move unreachable objects to a new
packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
below). To avoid the race when writing new objects referring to an
about-to-be-deleted object, code paths that write new objects will
need to copy any objects they refer to from UNREACHABLE_GARBAGE packs
into new, non-UNREACHABLE_GARBAGE packs (or loose objects).
UNREACHABLE_GARBAGE packs are then safe to delete if their creation
time (as indicated by the file's mtime) is long enough ago.
To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be | |
combined under certain circumstances. If "gc.garbageTtl" is set to | |
greater than one day, then packs created within a single calendar day, | |
UTC, can be coalesced together. The resulting packfile would have an | |
mtime before midnight on that day, so this makes the effective maximum | |
ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, | |
then we divide the calendar day into intervals one-third of that ttl | |
in duration. Packs created within the same interval can be coalesced | |
together. The resulting packfile would have an mtime before the end of | |
the interval, so this makes the effective maximum ttl equal to the | |
garbageTtl * 4/3. | |
This rule comes from Thirumala Reddy Mutchukota's JGit change | |
https://git.eclipse.org/r/90465. | |
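The coalescing rule can be expressed as a function that maps a pack's mtime to the interval it may be merged within. This is a sketch of the arithmetic only. The function name is invented, times are seconds since the epoch in UTC, and treating a ttl of exactly one day like the "greater than one day" case is an assumption, since the text leaves that boundary open.

```python
DAY = 24 * 60 * 60


def coalesce_bucket(mtime: int, ttl: int):
    """Return (start, end) of the interval whose garbage packs may be merged."""
    day_start = mtime - mtime % DAY
    if ttl >= DAY:
        # One bucket per UTC calendar day: effective maximum ttl is ttl + 1 day.
        return day_start, day_start + DAY
    # Intervals of ttl / 3 within the day: effective maximum ttl is ttl * 4/3.
    interval = max(1, ttl // 3)
    start = day_start + ((mtime - day_start) // interval) * interval
    return start, min(start + interval, day_start + DAY)
```

Two garbage packs may be combined when they fall into the same bucket. The merged pack's mtime stays below the bucket's end, which is where the bounds on the effective ttl come from.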
The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack | |
index. More generally, that field indicates where a pack came from: | |
- 1 (PACK_SOURCE_RECEIVE) for a pack received over the network | |
- 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight | |
"gc --auto" operation | |
- 3 (PACK_SOURCE_GC) for a pack created by a full gc | |
- 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage | |
discovered by gc | |
- 5 (PACK_SOURCE_INSERT) for locally created objects that were | |
written directly to a pack file, e.g. from "git add ." | |
This information can be useful for debugging and for "gc --auto" to | |
make appropriate choices about which packs to coalesce. | |
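For illustration, the PSRC values listed above map naturally onto an enumeration. The class name and the `safe_to_delete` helper are assumptions of this sketch. The helper encodes only the rule stated earlier, that garbage packs may be deleted once their mtime is long enough ago.

```python
from enum import IntEnum


class PackSource(IntEnum):
    RECEIVE = 1              # received over the network
    AUTO = 2                 # created by a lightweight "gc --auto"
    GC = 3                   # created by a full gc
    UNREACHABLE_GARBAGE = 4  # potential garbage discovered by gc
    INSERT = 5               # locally created objects written to a pack


def safe_to_delete(source: int, mtime: int, now: int, ttl: int) -> bool:
    """True if a pack is expired garbage; all times are in seconds."""
    return source == PackSource.UNREACHABLE_GARBAGE and now - mtime > ttl
```

Packs from any other source are never candidates for deletion under this rule, regardless of age.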
Caveats | |
------- | |
Invalid objects | |
~~~~~~~~~~~~~~~ | |
The conversion from SHA-1 content to SHA-256 content retains any | |
brokenness in the original object (e.g., tree entry modes encoded with | |
leading 0, tree objects whose paths are not sorted correctly, and | |
commit objects without an author or committer). This is a deliberate | |
feature of the design to allow the conversion to round-trip. | |
More profoundly broken objects (e.g., a commit with a truncated "tree" | |
header line) cannot be converted but were not usable by current Git | |
anyway. | |
Shallow clone and submodules | |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
Because it requires all referenced objects to be available in the | |
locally generated translation table, this design does not support | |
shallow clone or unfetched submodules. Protocol improvements might | |
allow lifting this restriction. | |
Alternates | |
~~~~~~~~~~ | |
For the same reason, a SHA-256 repository cannot borrow objects from a | |
SHA-1 repository using objects/info/alternates or | |
$GIT_ALTERNATE_OBJECT_REPOSITORIES. | |
git notes | |
~~~~~~~~~ | |
The "git notes" tool annotates objects using their SHA-1 name as key. | |
This design does not describe a way to migrate notes trees to use | |
SHA-256 names. That migration is expected to happen separately (for | |
example using a file at the root of the notes tree to describe which | |
hash it uses). | |
Server-side cost | |
~~~~~~~~~~~~~~~~ | |
Until Git protocol gains SHA-256 support, using SHA-256 based storage | |
on public-facing Git servers is strongly discouraged. Once Git | |
protocol gains SHA-256 support, SHA-256 based servers are likely not | |
to support SHA-1 compatibility, to avoid what may be a very expensive | |
hash re-encode during clone and to encourage peers to modernize. | |
The design described here allows fetches by SHA-1 clients of a | |
personal SHA-256 repository because it's not much more difficult than | |
allowing pushes from that repository. This support needs to be guarded | |
by a configuration option -- servers like git.kernel.org that serve a | |
large number of clients would not be expected to bear that cost. | |
Meaning of signatures | |
~~~~~~~~~~~~~~~~~~~~~ | |
The signed payload for signed commits and tags does not explicitly | |
name the hash used to identify objects. If some day Git adopts a new | |
hash function with the same length as the current SHA-1 (40 | |
hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the | |
intent behind the PGP signed payload in an object signature is | |
unclear: | |
	object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
	type commit
	tag v2.12.0
	tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800

	Git 2.12
Does this mean Git v2.12.0 is the commit with SHA-1 name | |
e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with | |
new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? | |
Fortunately SHA-256 and SHA-1 have different lengths. If Git starts | |
using another hash with the same length to name objects, then it will | |
need to change the format of signed payloads using that hash to | |
address this issue. | |
Object names on the command line | |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
To support the transition (see Transition plan below), this design | |
supports four different modes of operation: | |
1. ("dark launch") Treat object names input by the user as SHA-1 and | |
convert any object names written to output to SHA-1, but store | |
objects using SHA-256. This allows users to test the code with no | |
visible behavior change except for performance. This allows | |
running even tests that assume the SHA-1 hash function, to | |
sanity-check the behavior of the new mode. | |
2. ("early transition") Allow both SHA-1 and SHA-256 object names in | |
input. Any object names written to output use SHA-1. This allows | |
users to continue to make use of SHA-1 to communicate with peers | |
(e.g. by email) that have not migrated yet and prepares for mode 3. | |
3. ("late transition") Allow both SHA-1 and SHA-256 object names in | |
input. Any object names written to output use SHA-256. In this | |
mode, users are using a more secure object naming method by | |
default. The disruption is minimal as long as most of their peers | |
are in mode 2 or mode 3. | |
4. ("post-transition") Treat object names input by the user as | |
SHA-256 and write output using SHA-256. This is safer than mode 3 | |
because there is less risk that input is incorrectly interpreted | |
using the wrong hash function. | |
The mode is specified in configuration. | |
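The four modes differ only in which formats they accept as input and which one they emit. A small table makes that explicit. The mode keys and helper below are hypothetical names chosen for this sketch, and only full-length hexadecimal names are classified.

```python
# Hypothetical mode names; the design only gives them informal labels.
MODES = {
    "dark-launch":      {"input": {"sha1"},           "output": "sha1"},
    "early-transition": {"input": {"sha1", "sha256"}, "output": "sha1"},
    "late-transition":  {"input": {"sha1", "sha256"}, "output": "sha256"},
    "post-transition":  {"input": {"sha256"},         "output": "sha256"},
}

HEX_LENGTHS = {40: "sha1", 64: "sha256"}


def accepts(mode: str, hex_name: str) -> bool:
    """True if a full-length object name is valid input in the given mode."""
    fmt = HEX_LENGTHS.get(len(hex_name))
    return fmt is not None and fmt in MODES[mode]["input"]
```

In every mode the objects themselves are stored using SHA-256. Only the names shown to and accepted from the user change.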
The user can also explicitly specify which format to use for a | |
particular revision specifier and for output, overriding the mode. For | |
example: | |
	git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}
Transition plan | |
--------------- | |
Some initial steps can be implemented independently of one another: | |
- adding a hash function API (vtable) | |
- teaching fsck to tolerate the gpgsig-sha256 field | |
- excluding gpgsig-* from the fields copied by "git commit --amend" | |
- annotating tests that depend on SHA-1 values with a SHA1 test | |
prerequisite | |
- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ | |
consistently instead of "unsigned char *" and the hardcoded | |
constants 20 and 40. | |
- introducing index v3 | |
- adding support for the PSRC field and safer object pruning | |
The first user-visible change is the introduction of the objectFormat | |
extension (without compatObjectFormat). This requires: | |
- teaching fsck about this mode of operation | |
- using the hash function API (vtable) when computing object names | |
- signing objects and verifying signatures | |
- rejecting attempts to fetch from or push to an incompatible | |
repository | |
Next comes introduction of compatObjectFormat: | |
- implementing the loose-object-idx | |
- translating object names between object formats | |
- translating object content between object formats | |
- generating and verifying signatures in the compat format | |
- adding appropriate index entries when adding a new object to the | |
object store | |
- --output-format option | |
- ^{sha1} and ^{sha256} revision notation | |
- configuration to specify default input and output format (see | |
"Object names on the command line" above) | |
The next step is supporting fetches and pushes to SHA-1 repositories: | |
- allow pushes to a repository using the compat format | |
- generate a topologically sorted list of the SHA-1 names of fetched | |
objects | |
- convert the fetched packfile to SHA-256 format and generate an idx | |
file | |
- re-sort to match the order of objects in the fetched packfile | |
The infrastructure supporting fetch also allows converting an existing | |
repository. In converted repositories and new clones, end users can | |
gain support for the new hash function without any visible change in | |
behavior (see "dark launch" in the "Object names on the command line" | |
section). In particular this allows users to verify SHA-256 signatures | |
on objects in the repository, and it should ensure the transition code | |
is stable in production in preparation for using it more widely. | |
Over time projects would encourage their users to adopt the "early | |
transition" and then "late transition" modes to take advantage of the | |
new, more futureproof SHA-256 object names. | |
When objectFormat and compatObjectFormat are both set, commands | |
generating signatures would generate both SHA-1 and SHA-256 signatures | |
by default to support both new and old users. | |
In projects using SHA-256 heavily, users could be encouraged to adopt
the "post-transition" mode to avoid accidentally making implicit use
of SHA-1 object names.

Once a critical mass of users have upgraded to a version of Git that
can verify SHA-256 signatures and have converted their existing
repositories to support verifying them, we can add support for a
setting to generate only SHA-256 signatures. This is expected to be at
least a year later.

That is also a good moment to advertise the ability to convert
repositories to use SHA-256 only, stripping out all SHA-1 related
metadata. This improves performance by eliminating translation
overhead and security by avoiding the possibility of accidentally
relying on the safety of SHA-1.

Updating Git's protocols to allow a server to specify which hash
functions it supports is also an important part of this transition. It
is not discussed in detail in this document but this transition plan
assumes it happens. :)

Alternatives considered
-----------------------
Upgrading everyone working on a particular project on a flag day
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Projects like the Linux kernel are large and complex enough that
flipping the switch for all projects based on the repository at once
is infeasible.

Not only would all developers and server operators supporting
developers have to switch on the same flag day, but supporting tooling
(continuous integration, code review, bug trackers, etc.) would have to
be adapted as well. This also makes it difficult to get early feedback
from some project participants testing before it is time for mass
adoption.

Using hash functions in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(e.g. https://lore.kernel.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
Objects newly created would be addressed by the new hash, but inside
such an object (e.g. commit) it is still possible to address objects
using the old hash function.

* You cannot trust its history (needed for bisectability) in the
  future without further work
* Maintenance burden as the number of supported hash functions grows
  (they will never go away, so they accumulate). In this proposal, by
  comparison, converted objects lose all references to SHA-1.

Signed objects with multiple hashes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Instead of introducing the gpgsig-sha256 field in commit and tag objects
for SHA-256 content based signatures, an earlier version of this design
added "hash sha256 <SHA-256 name>" fields to strengthen the existing
SHA-1 content based signatures.

In other words, a single signature was used to attest to the object
content using both hash functions. This had some advantages:

* Using one signature instead of two speeds up the signing process.
* Having one signed payload with both hashes allows the signer to
  attest to the SHA-1 name and SHA-256 name referring to the same object.
* All users consume the same signature. Broken signatures are likely
  to be detected quickly using current versions of Git.

However, it also came with disadvantages:

* Verifying a signed object requires access to the SHA-1 names of all
  objects it references, even after the transition is complete and
  the translation table is no longer needed for anything else. To
  support this, the design added fields such as "hash sha1 tree <SHA-1
  name>" and "hash sha1 parent <SHA-1 name>" to the SHA-256 content of
  a signed commit, complicating the conversion process.
* Allowing signed objects without a SHA-1 (for after the transition is
  complete) complicated the design further, requiring a "nohash sha1"
  field to suppress including "hash sha1" fields in the SHA-256 content
  and signed payload.

Lazily populated translation table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some of the work of building the translation table could be deferred to
push time, but that would significantly complicate and slow down pushes.
Calculating the SHA-1 name at object creation time, while the object is
being streamed to disk and its SHA-256 name is being calculated, should
be an acceptable cost.

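The cost argument can be made concrete with a small Python sketch that
computes both names in a single pass over the data while writing it out.
It is a simplified illustration, not Git's code: the function name and
parameters are hypothetical, the output is written uncompressed, and it
only covers blobs, whose content is identical under both formats. Trees,
commits, and tags would first need their references translated before
the SHA-1 name could be computed. The "<type> <size>\0" header prepended
before hashing is the standard Git object header.

```python
import hashlib
import io


def write_blob(size, stream, out, chunk_size=65536):
    """Copy a blob from stream to out and return (SHA-1 name, SHA-256 name).

    Hypothetical helper: both hashes are fed from the same chunks, so the
    data is read only once.
    """
    header = f"blob {size}\0".encode()
    sha1 = hashlib.sha1(header)
    sha256 = hashlib.sha256(header)
    written = 0
    while chunk := stream.read(chunk_size):
        sha1.update(chunk)
        sha256.update(chunk)
        out.write(chunk)  # a real implementation would also compress
        written += len(chunk)
    if written != size:
        raise ValueError("stream length does not match the declared size")
    return sha1.hexdigest(), sha256.hexdigest()


data = b"hello\n"
sha1_name, sha256_name = write_blob(len(data), io.BytesIO(data), io.BytesIO())
```

The two returned names are what a translation table entry for the new
object would pair together, so the table never has to be filled in
later at push time.
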
Document History
----------------

2017-03-03
bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com,
sbeller@google.com

* Initial version sent to https://lore.kernel.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com

2017-03-03 jrnieder@gmail.com
Incorporated suggestions from jonathantanmy and sbeller:

* Describe purpose of signed objects with each hash type
* Redefine signed object verification using object content under the
  first hash function

2017-03-06 jrnieder@gmail.com

* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2]
* Make SHA3-based signatures a separate field, avoiding the need for
  "hash" and "nohash" fields (thanks to peff[3]).
* Add a sorting phase to fetch (thanks to Junio for noticing the need
  for this).
* Omit blobs from the topological sort during fetch (thanks to peff).
* Discuss alternates, git notes, and git servers in the caveats
  section (thanks to Junio Hamano, brian m. carlson[4], and Shawn
  Pearce).
* Clarify language throughout (thanks to various commenters,
  especially Junio).

2017-09-27 jrnieder@gmail.com, sbeller@google.com

* Use placeholder NewHash instead of SHA3-256
* Describe criteria for picking a hash function.
* Include a transition plan (thanks especially to Brandon Williams
  for fleshing these ideas out)
* Define the translation table (thanks, Shawn Pearce[5], Jonathan
  Tan, and Masaya Suzuki)
* Avoid loose object overhead by packing more aggressively in
  "git gc --auto"

Later history:

* See the history of this file in git.git for the history of subsequent
  edits. This document history is no longer being maintained as it
  would now be superfluous to the commit log

References:

 [1] https://lore.kernel.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
 [2] https://lore.kernel.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
 [3] https://lore.kernel.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
 [4] https://lore.kernel.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net
 [5] https://lore.kernel.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/