Spaces:
Running
Running
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> | |
<head> | |
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" /> | |
<meta name="generator" content="AsciiDoc 10.2.0" /> | |
<title>Concerning Git’s Packing Heuristics</title> | |
<style type="text/css"> | |
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */ | |
/* Default font. */ | |
body { | |
font-family: Georgia,serif; | |
} | |
/* Title font. */ | |
h1, h2, h3, h4, h5, h6, | |
div.title, caption.title, | |
thead, p.table.header, | |
#toctitle, | |
#author, #revnumber, #revdate, #revremark, | |
#footer { | |
font-family: Arial,Helvetica,sans-serif; | |
} | |
body { | |
margin: 1em 5% 1em 5%; | |
} | |
a { | |
color: blue; | |
text-decoration: underline; | |
} | |
a:visited { | |
color: fuchsia; | |
} | |
em { | |
font-style: italic; | |
color: navy; | |
} | |
strong { | |
font-weight: bold; | |
color: #083194; | |
} | |
h1, h2, h3, h4, h5, h6 { | |
color: #527bbd; | |
margin-top: 1.2em; | |
margin-bottom: 0.5em; | |
line-height: 1.3; | |
} | |
h1, h2, h3 { | |
border-bottom: 2px solid silver; | |
} | |
h2 { | |
padding-top: 0.5em; | |
} | |
h3 { | |
float: left; | |
} | |
h3 + * { | |
clear: left; | |
} | |
h5 { | |
font-size: 1.0em; | |
} | |
div.sectionbody { | |
margin-left: 0; | |
} | |
hr { | |
border: 1px solid silver; | |
} | |
p { | |
margin-top: 0.5em; | |
margin-bottom: 0.5em; | |
} | |
ul, ol, li > p { | |
margin-top: 0; | |
} | |
ul > li { color: #aaa; } | |
ul > li > * { color: black; } | |
.monospaced, code, pre { | |
font-family: "Courier New", Courier, monospace; | |
font-size: inherit; | |
color: navy; | |
padding: 0; | |
margin: 0; | |
} | |
pre { | |
white-space: pre-wrap; | |
} | |
#author { | |
color: #527bbd; | |
font-weight: bold; | |
font-size: 1.1em; | |
} | |
#email { | |
} | |
#revnumber, #revdate, #revremark { | |
} | |
#footer { | |
font-size: small; | |
border-top: 2px solid silver; | |
padding-top: 0.5em; | |
margin-top: 4.0em; | |
} | |
#footer-text { | |
float: left; | |
padding-bottom: 0.5em; | |
} | |
#footer-badges { | |
float: right; | |
padding-bottom: 0.5em; | |
} | |
#preamble { | |
margin-top: 1.5em; | |
margin-bottom: 1.5em; | |
} | |
div.imageblock, div.exampleblock, div.verseblock, | |
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock, | |
div.admonitionblock { | |
margin-top: 1.0em; | |
margin-bottom: 1.5em; | |
} | |
div.admonitionblock { | |
margin-top: 2.0em; | |
margin-bottom: 2.0em; | |
margin-right: 10%; | |
color: #606060; | |
} | |
div.content { /* Block element content. */ | |
padding: 0; | |
} | |
/* Block element titles. */ | |
div.title, caption.title { | |
color: #527bbd; | |
font-weight: bold; | |
text-align: left; | |
margin-top: 1.0em; | |
margin-bottom: 0.5em; | |
} | |
div.title + * { | |
margin-top: 0; | |
} | |
td div.title:first-child { | |
margin-top: 0.0em; | |
} | |
div.content div.title:first-child { | |
margin-top: 0.0em; | |
} | |
div.content + div.title { | |
margin-top: 0.0em; | |
} | |
div.sidebarblock > div.content { | |
background: #ffffee; | |
border: 1px solid #dddddd; | |
border-left: 4px solid #f0f0f0; | |
padding: 0.5em; | |
} | |
div.listingblock > div.content { | |
border: 1px solid #dddddd; | |
border-left: 5px solid #f0f0f0; | |
background: #f8f8f8; | |
padding: 0.5em; | |
} | |
div.quoteblock, div.verseblock { | |
padding-left: 1.0em; | |
margin-left: 1.0em; | |
margin-right: 10%; | |
border-left: 5px solid #f0f0f0; | |
color: #888; | |
} | |
div.quoteblock > div.attribution { | |
padding-top: 0.5em; | |
text-align: right; | |
} | |
div.verseblock > pre.content { | |
font-family: inherit; | |
font-size: inherit; | |
} | |
div.verseblock > div.attribution { | |
padding-top: 0.75em; | |
text-align: left; | |
} | |
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */ | |
div.verseblock + div.attribution { | |
text-align: left; | |
} | |
div.admonitionblock .icon { | |
vertical-align: top; | |
font-size: 1.1em; | |
font-weight: bold; | |
text-decoration: underline; | |
color: #527bbd; | |
padding-right: 0.5em; | |
} | |
div.admonitionblock td.content { | |
padding-left: 0.5em; | |
border-left: 3px solid #dddddd; | |
} | |
div.exampleblock > div.content { | |
border-left: 3px solid #dddddd; | |
padding-left: 0.5em; | |
} | |
div.imageblock div.content { padding-left: 0; } | |
span.image img { border-style: none; vertical-align: text-bottom; } | |
a.image:visited { color: white; } | |
dl { | |
margin-top: 0.8em; | |
margin-bottom: 0.8em; | |
} | |
dt { | |
margin-top: 0.5em; | |
margin-bottom: 0; | |
font-style: normal; | |
color: navy; | |
} | |
dd > *:first-child { | |
margin-top: 0.1em; | |
} | |
ul, ol { | |
list-style-position: outside; | |
} | |
ol.arabic { | |
list-style-type: decimal; | |
} | |
ol.loweralpha { | |
list-style-type: lower-alpha; | |
} | |
ol.upperalpha { | |
list-style-type: upper-alpha; | |
} | |
ol.lowerroman { | |
list-style-type: lower-roman; | |
} | |
ol.upperroman { | |
list-style-type: upper-roman; | |
} | |
div.compact ul, div.compact ol, | |
div.compact p, div.compact p, | |
div.compact div, div.compact div { | |
margin-top: 0.1em; | |
margin-bottom: 0.1em; | |
} | |
tfoot { | |
font-weight: bold; | |
} | |
td > div.verse { | |
white-space: pre; | |
} | |
div.hdlist { | |
margin-top: 0.8em; | |
margin-bottom: 0.8em; | |
} | |
div.hdlist tr { | |
padding-bottom: 15px; | |
} | |
dt.hdlist1.strong, td.hdlist1.strong { | |
font-weight: bold; | |
} | |
td.hdlist1 { | |
vertical-align: top; | |
font-style: normal; | |
padding-right: 0.8em; | |
color: navy; | |
} | |
td.hdlist2 { | |
vertical-align: top; | |
} | |
div.hdlist.compact tr { | |
margin: 0; | |
padding-bottom: 0; | |
} | |
.comment { | |
background: yellow; | |
} | |
.footnote, .footnoteref { | |
font-size: 0.8em; | |
} | |
span.footnote, span.footnoteref { | |
vertical-align: super; | |
} | |
#footnotes { | |
margin: 20px 0 20px 0; | |
padding: 7px 0 0 0; | |
} | |
#footnotes div.footnote { | |
margin: 0 0 5px 0; | |
} | |
#footnotes hr { | |
border: none; | |
border-top: 1px solid silver; | |
height: 1px; | |
text-align: left; | |
margin-left: 0; | |
width: 20%; | |
min-width: 100px; | |
} | |
div.colist td { | |
padding-right: 0.5em; | |
padding-bottom: 0.3em; | |
vertical-align: top; | |
} | |
div.colist td img { | |
margin-top: 0.3em; | |
} | |
@media print { | |
#footer-badges { display: none; } | |
} | |
#toc { | |
margin-bottom: 2.5em; | |
} | |
#toctitle { | |
color: #527bbd; | |
font-size: 1.1em; | |
font-weight: bold; | |
margin-top: 1.0em; | |
margin-bottom: 0.1em; | |
} | |
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 { | |
margin-top: 0; | |
margin-bottom: 0; | |
} | |
div.toclevel2 { | |
margin-left: 2em; | |
font-size: 0.9em; | |
} | |
div.toclevel3 { | |
margin-left: 4em; | |
font-size: 0.9em; | |
} | |
div.toclevel4 { | |
margin-left: 6em; | |
font-size: 0.9em; | |
} | |
span.aqua { color: aqua; } | |
span.black { color: black; } | |
span.blue { color: blue; } | |
span.fuchsia { color: fuchsia; } | |
span.gray { color: gray; } | |
span.green { color: green; } | |
span.lime { color: lime; } | |
span.maroon { color: maroon; } | |
span.navy { color: navy; } | |
span.olive { color: olive; } | |
span.purple { color: purple; } | |
span.red { color: red; } | |
span.silver { color: silver; } | |
span.teal { color: teal; } | |
span.white { color: white; } | |
span.yellow { color: yellow; } | |
span.aqua-background { background: aqua; } | |
span.black-background { background: black; } | |
span.blue-background { background: blue; } | |
span.fuchsia-background { background: fuchsia; } | |
span.gray-background { background: gray; } | |
span.green-background { background: green; } | |
span.lime-background { background: lime; } | |
span.maroon-background { background: maroon; } | |
span.navy-background { background: navy; } | |
span.olive-background { background: olive; } | |
span.purple-background { background: purple; } | |
span.red-background { background: red; } | |
span.silver-background { background: silver; } | |
span.teal-background { background: teal; } | |
span.white-background { background: white; } | |
span.yellow-background { background: yellow; } | |
span.big { font-size: 2em; } | |
span.small { font-size: 0.6em; } | |
span.underline { text-decoration: underline; } | |
span.overline { text-decoration: overline; } | |
span.line-through { text-decoration: line-through; } | |
div.unbreakable { page-break-inside: avoid; } | |
/* | |
* xhtml11 specific | |
* | |
* */ | |
div.tableblock { | |
margin-top: 1.0em; | |
margin-bottom: 1.5em; | |
} | |
div.tableblock > table { | |
border: 3px solid #527bbd; | |
} | |
thead, p.table.header { | |
font-weight: bold; | |
color: #527bbd; | |
} | |
p.table { | |
margin-top: 0; | |
} | |
/* Because the table frame attribute is overridden by CSS in most browsers. */ | |
div.tableblock > table[frame="void"] { | |
border-style: none; | |
} | |
div.tableblock > table[frame="hsides"] { | |
border-left-style: none; | |
border-right-style: none; | |
} | |
div.tableblock > table[frame="vsides"] { | |
border-top-style: none; | |
border-bottom-style: none; | |
} | |
/* | |
* html5 specific | |
* | |
* */ | |
table.tableblock { | |
margin-top: 1.0em; | |
margin-bottom: 1.5em; | |
} | |
thead, p.tableblock.header { | |
font-weight: bold; | |
color: #527bbd; | |
} | |
p.tableblock { | |
margin-top: 0; | |
} | |
table.tableblock { | |
border-width: 3px; | |
border-spacing: 0px; | |
border-style: solid; | |
border-color: #527bbd; | |
border-collapse: collapse; | |
} | |
th.tableblock, td.tableblock { | |
border-width: 1px; | |
padding: 4px; | |
border-style: solid; | |
border-color: #527bbd; | |
} | |
table.tableblock.frame-topbot { | |
border-left-style: hidden; | |
border-right-style: hidden; | |
} | |
table.tableblock.frame-sides { | |
border-top-style: hidden; | |
border-bottom-style: hidden; | |
} | |
table.tableblock.frame-none { | |
border-style: hidden; | |
} | |
th.tableblock.halign-left, td.tableblock.halign-left { | |
text-align: left; | |
} | |
th.tableblock.halign-center, td.tableblock.halign-center { | |
text-align: center; | |
} | |
th.tableblock.halign-right, td.tableblock.halign-right { | |
text-align: right; | |
} | |
th.tableblock.valign-top, td.tableblock.valign-top { | |
vertical-align: top; | |
} | |
th.tableblock.valign-middle, td.tableblock.valign-middle { | |
vertical-align: middle; | |
} | |
th.tableblock.valign-bottom, td.tableblock.valign-bottom { | |
vertical-align: bottom; | |
} | |
/* | |
* manpage specific | |
* | |
* */ | |
body.manpage h1 { | |
padding-top: 0.5em; | |
padding-bottom: 0.5em; | |
border-top: 2px solid silver; | |
border-bottom: 2px solid silver; | |
} | |
body.manpage h2 { | |
border-style: none; | |
} | |
body.manpage div.sectionbody { | |
margin-left: 3em; | |
} | |
@media print { | |
body.manpage div#toc { display: none; } | |
} | |
</style> | |
<script type="text/javascript"> | |
/*<+'])'); | |
// Function that scans the DOM tree for header elements (the DOM2 | |
// nodeIterator API would be a better technique but not supported by all | |
// browsers). | |
var iterate = function (el) { | |
for (var i = el.firstChild; i != null; i = i.nextSibling) { | |
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) { | |
var mo = re.exec(i.tagName); | |
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") { | |
result[result.length] = new TocEntry(i, getText(i), mo[1]-1); | |
} | |
iterate(i); | |
} | |
} | |
} | |
iterate(el); | |
return result; | |
} | |
var toc = document.getElementById("toc"); | |
if (!toc) { | |
return; | |
} | |
// Delete existing TOC entries in case we're reloading the TOC. | |
var tocEntriesToRemove = []; | |
var i; | |
for (i = 0; i < toc.childNodes.length; i++) { | |
var entry = toc.childNodes[i]; | |
if (entry.nodeName.toLowerCase() == 'div' | |
&& entry.getAttribute("class") | |
&& entry.getAttribute("class").match(/^toclevel/)) | |
tocEntriesToRemove.push(entry); | |
} | |
for (i = 0; i < tocEntriesToRemove.length; i++) { | |
toc.removeChild(tocEntriesToRemove[i]); | |
} | |
// Rebuild TOC entries. | |
var entries = tocEntries(document.getElementById("content"), toclevels); | |
for (var i = 0; i < entries.length; ++i) { | |
var entry = entries[i]; | |
if (entry.element.id == "") | |
entry.element.id = "_toc_" + i; | |
var a = document.createElement("a"); | |
a.href = "#" + entry.element.id; | |
a.appendChild(document.createTextNode(entry.text)); | |
var div = document.createElement("div"); | |
div.appendChild(a); | |
div.className = "toclevel" + entry.toclevel; | |
toc.appendChild(div); | |
} | |
if (entries.length == 0) | |
toc.parentNode.removeChild(toc); | |
}, | |
///////////////////////////////////////////////////////////////////// | |
// Footnotes generator | |
///////////////////////////////////////////////////////////////////// | |
/* Based on footnote generation code from: | |
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html | |
*/ | |
footnotes: function () { | |
// Delete existing footnote entries in case we're reloading the footnodes. | |
var i; | |
var noteholder = document.getElementById("footnotes"); | |
if (!noteholder) { | |
return; | |
} | |
var entriesToRemove = []; | |
for (i = 0; i < noteholder.childNodes.length; i++) { | |
var entry = noteholder.childNodes[i]; | |
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote") | |
entriesToRemove.push(entry); | |
} | |
for (i = 0; i < entriesToRemove.length; i++) { | |
noteholder.removeChild(entriesToRemove[i]); | |
} | |
// Rebuild footnote entries. | |
var cont = document.getElementById("content"); | |
var spans = cont.getElementsByTagName("span"); | |
var refs = {}; | |
var n = 0; | |
for (i=0; i<spans.length; i++) { | |
if (spans[i].className == "footnote") { | |
n++; | |
var note = spans[i].getAttribute("data-note"); | |
if (!note) { | |
// Use [\s\S] in place of . so multi-line matches work. | |
// Because JavaScript has no s (dotall) regex flag. | |
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1]; | |
spans[i].innerHTML = | |
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n + | |
"' title='View footnote' class='footnote'>" + n + "</a>]"; | |
spans[i].setAttribute("data-note", note); | |
} | |
noteholder.innerHTML += | |
"<div class='footnote' id='_footnote_" + n + "'>" + | |
"<a href='#_footnoteref_" + n + "' title='Return to text'>" + | |
n + "</a>. " + note + "</div>"; | |
var id =spans[i].getAttribute("id"); | |
if (id != null) refs["#"+id] = n; | |
} | |
} | |
if (n == 0) | |
noteholder.parentNode.removeChild(noteholder); | |
else { | |
// Process footnoterefs. | |
for (i=0; i<spans.length; i++) { | |
if (spans[i].className == "footnoteref") { | |
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href"); | |
href = href.match(/#.*/)[0]; // Because IE return full URL. | |
n = refs[href]; | |
spans[i].innerHTML = | |
"[<a href='#_footnote_" + n + | |
"' title='View footnote' class='footnote'>" + n + "</a>]"; | |
} | |
} | |
} | |
}, | |
install: function(toclevels) { | |
var timerId; | |
function reinstall() { | |
asciidoc.footnotes(); | |
if (toclevels) { | |
asciidoc.toc(toclevels); | |
} | |
} | |
function reinstallAndRemoveTimer() { | |
clearInterval(timerId); | |
reinstall(); | |
} | |
timerId = setInterval(reinstall, 500); | |
if (document.addEventListener) | |
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false); | |
else | |
window.onload = reinstallAndRemoveTimer; | |
} | |
} | |
asciidoc.install(); | |
/*]]>*/ | |
</script> | |
</head> | |
<body class="article"> | |
<div id="header"> | |
<h1>Concerning Git’s Packing Heuristics</h1> | |
</div> | |
<div id="content"> | |
<div id="preamble"> | |
<div class="sectionbody"> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>Oh, here's a really stupid question:</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code> Where do I go | |
to learn the details | |
of Git's packing heuristics?</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Be careful what you ask!</p></div> | |
<div class="paragraph"><p>Followers of the Git, please open the Git IRC Log and turn to | |
February 10, 2006.</p></div> | |
<div class="paragraph"><p>It’s a rare occasion, and we are joined by the King Git Himself, | |
Linus Torvalds (linus). Nathaniel Smith, (njs`), has the floor | |
and seeks enlightenment. Others are present, but silent.</p></div> | |
<div class="paragraph"><p>Let’s listen in!</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> Oh, here's a really stupid question -- where do I go to | |
learn the details of Git's packing heuristics? google avails | |
me not, reading the source didn't help a lot, and wading | |
through the whole mailing list seems less efficient than any | |
of that.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>It is a bold start! A plea for help combined with a simultaneous | |
tri-part attack on some of the tried and true mainstays in the quest | |
for enlightenment. Brash accusations of google being useless. Hubris! | |
Maligning the source. Heresy! Disdain for the mailing list archives. | |
Woe.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><pasky> yes, the packing-related delta stuff is somewhat | |
mysterious even for me ;)</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Ah! Modesty after all.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> njs, I don't think the docs exist. That's something where | |
I don't think anybody else than me even really got involved. | |
Most of the rest of Git others have been busy with (especially | |
Junio), but packing nobody touched after I did it.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>It’s cryptic, yet vague. Linus in style for sure. Wise men | |
interpret this as an apology. A few argue it is merely a | |
statement of fact.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> I guess the next step is "read the source again", but I | |
have to build up a certain level of gumption first :-)</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Indeed! On both points.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> The packing heuristic is actually really really simple.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Bait…</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> But strange.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>And switch. That ought to do it!</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> Remember: Git really doesn't follow files. So what it does is | |
- generate a list of all objects | |
- sort the list according to magic heuristics | |
- walk the list, using a sliding window, seeing if an object | |
can be diffed against another object in the window | |
- write out the list in recency order</code></pre> | |
</div></div> | |
<div class="paragraph"><p>The traditional understatement:</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> I suspect that what I'm missing is the precise definition of | |
the word "magic"</code></pre> | |
</div></div> | |
<div class="paragraph"><p>The traditional insight:</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><pasky> yes</code></pre> | |
</div></div> | |
<div class="paragraph"><p>And Babel-like confusion flowed.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> oh, hmm, and I'm not sure what this sliding window means either</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><pasky> iirc, it appeared to me to be just the sha1 of the object | |
when reading the code casually ...</code></pre> | |
</div></div> | |
<div class="olist lowerroman"><ol class="lowerroman"> | |
<li> | |
<p> | |
which simply doesn’t sound as a very good heuristics, though ;) | |
</p> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> .....and recency order. okay, I think it's clear I didn't | |
even realize how much I wasn't realizing :-)</code></pre> | |
</div></div> | |
</li> | |
</ol></div> | |
<div class="paragraph"><p>Ah, grasshopper! And thus the enlightenment begins anew.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> The "magic" is actually in theory totally arbitrary. | |
ANY order will give you a working pack, but no, it's not | |
ordered by SHA-1.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>Before talking about the ordering for the sliding delta | |
window, let's talk about the recency order. That's more | |
important in one way.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> Right, but if all you want is a working way to pack things | |
together, you could just use cat and save yourself some | |
trouble...</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Waaait for it….</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> The recency ordering (which is basically: put objects | |
_physically_ into the pack in the order that they are | |
"reachable" from the head) is important.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> okay</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> It's important because that's the thing that gives packs | |
good locality. It keeps the objects close to the head (whether | |
they are old or new, but they are _reachable_ from the head) | |
at the head of the pack. So packs actually have absolutely | |
_wonderful_ IO patterns.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Read that again, because it is important.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> But recency ordering is totally useless for deciding how | |
to actually generate the deltas, so the delta ordering is | |
something else.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>The delta ordering is (wait for it): | |
- first sort by the "basename" of the object, as defined by | |
the name the object was _first_ reached through when | |
generating the object list | |
- within the same basename, sort by size of the object | |
- but always sort different types separately (commits first).</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>That's not exactly it, but it's very close.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> The "_first_ reached" thing is not too important, just you | |
need some way to break ties since the same objects may be | |
reachable many ways, yes?</code></pre> | |
</div></div> | |
<div class="paragraph"><p>And as if to clarify:</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> The point is that it's all really just any random | |
heuristic, and the ordering is totally unimportant for | |
correctness, but it helps a lot if the heuristic gives | |
"clumping" for things that are likely to delta well against | |
each other.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>It is an important point, so secretly, I did my own research and have | |
included my results below. To be fair, it has changed some over time. | |
And through the magic of Revisionistic History, I draw upon this entry | |
from The Git IRC Logs on my father’s birthday, March 1:</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><gitster> The quote from the above linus should be rewritten a | |
bit (wait for it): | |
- first sort by type. Different objects never delta with | |
each other. | |
- then sort by filename/dirname. hash of the basename | |
occupies the top BITS_PER_INT-DIR_BITS bits, and bottom | |
DIR_BITS are for the hash of leading path elements. | |
- then if we are doing "thin" pack, the objects we are _not_ | |
going to pack but we know about are sorted earlier than | |
other objects. | |
- and finally sort by size, larger to smaller.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>In one swell-foop, clarification and obscurification! Nonetheless, | |
authoritative. Cryptic, yet concise. It even solicits notions of | |
quotes from The Source Code. Clearly, more study is needed.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><gitster> That's the sort order. What this means is: | |
- we do not delta different object types. | |
- we prefer to delta the objects with the same full path, but | |
allow files with the same name from different directories. | |
- we always prefer to delta against objects we are not going | |
to send, if there are some. | |
- we prefer to delta against larger objects, so that we have | |
lots of removals.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>The penultimate rule is for "thin" packs. It is used when | |
the other side is known to have such objects.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>There it is again. "Thin" packs. I’m thinking to myself, "What | |
is a <em>thin</em> pack?" So I ask:</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><jdl> What is a "thin" pack?</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><gitster> Use of --objects-edge to rev-list as the upstream of | |
pack-objects. The pack transfer protocol negotiates that.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Woo hoo! Cleared that <em>right</em> up!</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><gitster> There are two directions - push and fetch.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>There! Did you see it? It is not <em>"push" and "pull"</em>! How often the | |
confusion has started here. So casually mentioned, too!</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><gitster> For push, git-send-pack invokes git-receive-pack on the | |
other end. The receive-pack says "I have up to these commits". | |
send-pack looks at them, and computes what are missing from | |
the other end. So "thin" could be the default there.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>In the other direction, fetch, git-fetch-pack and | |
git-clone-pack invokes git-upload-pack on the other end | |
(via ssh or by talking to the daemon).</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>There are two cases: fetch-pack with -k and clone-pack is one, | |
fetch-pack without -k is the other. clone-pack and fetch-pack | |
with -k will keep the downloaded packfile without expanded, so | |
we do not use thin pack transfer. Otherwise, the generated | |
pack will have delta without base object in the same pack.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>But fetch-pack without -k will explode the received pack into | |
individual objects, so we automatically ask upload-pack to | |
give us a thin pack if upload-pack supports it.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>OK then.</p></div> | |
<div class="paragraph"><p>Uh.</p></div> | |
<div class="paragraph"><p>Let’s return to the previous conversation still in progress.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> and "basename" means something like "the tail of end of | |
path of file objects and dir objects, as per basename(3), and | |
we just declare all commit and tag objects to have the same | |
basename" or something?</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Luckily, that too is a point that gitster clarified for us!</p></div> | |
<div class="paragraph"><p>If I might add, the trick is to make files that <em>might</em> be similar be | |
located close to each other in the hash buckets based on their file | |
names. It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and | |
"Makefile" all landed in the same bucket due to their common basename, | |
"Makefile". However, now they land in "close" buckets.</p></div> | |
<div class="paragraph"><p>The algorithm allows not just for the <em>same</em> bucket, but for <em>close</em> | |
buckets to be considered delta candidates. The rationale is | |
essentially that files, like Makefiles, often have very similar | |
content no matter what directory they live in.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> I played around with different delta algorithms, and with | |
making the "delta window" bigger, but having too big of a | |
sliding window makes it very expensive to generate the pack: | |
you need to compare every object with a _ton_ of other objects.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>There are a number of other trivial heuristics too, which | |
basically boil down to "don't bother even trying to delta this | |
pair" if we can tell before-hand that the delta isn't worth it | |
(due to size differences, where we can take a previous delta | |
result into account to decide that "ok, no point in trying | |
that one, it will be worse").</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>End result: packing is actually very size efficient. It's | |
somewhat CPU-wasteful, but on the other hand, since you're | |
really only supposed to do it maybe once a month (and you can | |
do it during the night), nobody really seems to care.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Nice Engineering Touch, there. Find when it doesn’t matter, and | |
proclaim it a non-issue. Good style too!</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> So, just to repeat to see if I'm following, we start by | |
getting a list of the objects we want to pack, we sort it by | |
this heuristic (basically lexicographically on the tuple | |
(type, basename, size)).</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>Then we walk through this list, and calculate a delta of | |
each object against the last n (tunable parameter) objects, | |
and pick the smallest of these deltas.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Vastly simplified, but the essence is there!</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> Correct.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> And then once we have picked a delta or fulltext to | |
represent each object, we re-sort by recency, and write them | |
out in that order.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> Yup. Some other small details:</code></pre> | |
</div></div> | |
<div class="paragraph"><p>And of course there is the "Other Shoe" Factor too.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> - We limit the delta depth to another magic value (right | |
now both the window and delta depth magic values are just "10")</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> Hrm, my intuition is that you'd end up with really _bad_ IO | |
patterns, because the things you want are near by, but to | |
actually reconstruct them you may have to jump all over in | |
random ways.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> - When we write out a delta, and we haven't yet written | |
out the object it is a delta against, we write out the base | |
object first. And no, when we reconstruct them, we actually | |
get nice IO patterns, because: | |
- larger objects tend to be "more recent" (Linus' law: files grow) | |
- we actively try to generate deltas from a larger object to a | |
smaller one | |
- this means that the top-of-tree very seldom has deltas | |
(i.e. deltas in _practice_ are "backwards deltas")</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Again, we should reread that whole paragraph. Not just because | |
Linus has slipped Linus’s Law in there on us, but because it is | |
important. Let’s make sure we clarify some of the points here:</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> So the point is just that in practice, delta order and | |
recency order match each other quite well.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> Yes. There's another nice side to this (and yes, it was | |
designed that way ;): | |
- the reason we generate deltas against the larger object is | |
actually a big space saver too!</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> Hmm, but your last comment (if "we haven't yet written out | |
the object it is a delta against, we write out the base object | |
first"), seems like it would make these facts mostly | |
irrelevant because even if in practice you would not have to | |
wander around much, in fact you just brute-force say that in | |
the cases where you might have to wander, don't do that :-)</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> Yes and no. Notice the rule: we only write out the base | |
object first if the delta against it was more recent. That | |
means that you can actually have deltas that refer to a base | |
object that is _not_ close to the delta object, but that only | |
happens when the delta is needed to generate an _old_ object.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> See?</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Yeah, no. I missed that on the first two or three readings myself.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> This keeps the front of the pack dense. The front of the | |
pack never contains data that isn't relevant to a "recent" | |
object. The size optimization comes from our use of xdelta | |
(but is true for many other delta algorithms): removing data | |
is cheaper (in size) than adding data.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>When you remove data, you only need to say "copy bytes n--m". | |
In contrast, in a delta that _adds_ data, you have to say "add | |
these bytes: 'actual data goes here'"</code></pre> | |
</div></div> | |
<div class="ulist"><ul> | |
<li> | |
<p> | |
njs` has quit: Read error: 104 (Connection reset by peer) | |
</p> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> Uhhuh. I hope I didn't blow njs` mind.</code></pre> | |
</div></div> | |
</li> | |
<li> | |
<p> | |
njs` has joined channel #git | |
</p> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><pasky> :)</code></pre> | |
</div></div> | |
</li> | |
</ul></div> | |
<div class="paragraph"><p>The silent observers are amused. Of course.</p></div> | |
<div class="paragraph"><p>And as if njs` was expected to be omniscient:</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> njs - did you miss anything?</code></pre> | |
</div></div> | |
<div class="paragraph"><p>OK, I’ll spell it out. That’s Geek Humor. If njs` was not actually | |
connected for a little bit there, how would he know if missed anything | |
while he was disconnected? He’s a benevolent dictator with a sense of | |
humor! Well noted!</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> Stupid router. Or gremlins, or whatever.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>It’s a cheap shot at Cisco. Take 'em when you can.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> Yes and no. Notice the rule: we only write out the base | |
object first if the delta against it was more recent.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>I'm getting lost in all these orders, let me re-read :-) | |
So the write-out order is from most recent to least recent? | |
(Conceivably it could be the opposite way too, I'm not sure if | |
we've said) though my connection back at home is logging, so I | |
can just read what you said there :-)</code></pre> | |
</div></div> | |
<div class="paragraph"><p>And for those of you paying attention, the Omniscient Trick has just | |
been detailed!</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> Yes, we always write out most recent first</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> And, yeah, I got the part about deeper-in-history stuff | |
having worse IO characteristics, one sort of doesn't care.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> With the caveat that if the "most recent" needs an older | |
object to delta against (hey, shrinking sometimes does | |
happen), we write out the old object with the delta.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> (if only it happened more...)</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> Anyway, the pack-file could easily be denser still, but | |
because it's used both for streaming (the Git protocol) and | |
for on-disk, it has a few pessimizations.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Actually, it is a made-up word. But it is a made-up word being | |
used as setup for a later optimization, which is a real word:</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> In particular, while the pack-file is then compressed, | |
it's compressed just one object at a time, so the actual | |
compression factor is less than it could be in theory. But it | |
means that it's all nice random-access with a simple index to | |
do "object name->location in packfile" translation.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> I'm assuming the real win for delta-ing large->small is | |
more homogeneous statistics for gzip to run over?</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>(You have to put the bytes in one place or another, but | |
putting them in a larger blob wins on compression)</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>Actually, what is the compression strategy -- each delta | |
individually gzipped, the whole file gzipped, somewhere in | |
between, no compression at all, ....?</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>Right.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Reality IRC sets in. For example:</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><pasky> I'll read the rest in the morning, I really have to go | |
sleep or there's no hope whatsoever for me at the today's | |
exam... g'nite all.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Heh.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> pasky: g'nite</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> pasky: 'luck</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> Right: large->small matters exactly because of compression | |
behaviour. If it was non-compressed, it probably wouldn't make | |
any difference.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> yeah</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> Anyway: I'm not even trying to claim that the pack-files | |
are perfect, but they do tend to have a nice balance of | |
density vs ease-of use.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Gasp! OK, saved. That’s a fair Engineering trade off. Close call! | |
In fact, Linus reflects on some Basic Engineering Fundamentals, | |
design options, etc.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> More importantly, they allow Git to still _conceptually_ | |
never deal with deltas at all, and be a "whole object" store.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>Which has some problems (we discussed bad huge-file | |
behaviour on the Git lists the other day), but it does mean | |
that the basic Git concepts are really really simple and | |
straightforward.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>It's all been quite stable.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>Which I think is very much a result of having very simple | |
basic ideas, so that there's never any confusion about what's | |
going on.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code>Bugs happen, but they are "simple" bugs. And bugs that | |
actually get some object store detail wrong are almost always | |
so obvious that they never go anywhere.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> Yeah.</code></pre> | |
</div></div> | |
<div class="paragraph"><p>Nuff said.</p></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><linus> Anyway. I'm off for bed. It's not 6AM here, but I've got | |
three kids, and have to get up early in the morning to send | |
them off. I need my beauty sleep.</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> :-)</code></pre> | |
</div></div> | |
<div class="literalblock"> | |
<div class="content"> | |
<pre><code><njs`> appreciate the infodump, I really was failing to find the | |
details on Git packs :-)</code></pre> | |
</div></div> | |
<div class="paragraph"><p>And now you know the rest of the story.</p></div> | |
</div> | |
</div> | |
</div> | |
<div id="footnotes"><hr /></div> | |
<div id="footer"> | |
<div id="footer-text"> | |
Last updated | |
2017-05-06 21:05:08 PDT | |
</div> | |
</div> | |
</body> | |
</html> | |