Undefined Behavior :: git commit --vandalism

This is not a vulnerability report. I intentionally disabled safeguards in the code in order to be able to observe this behavior. However, it's a nice example of why you shouldn't use SHA-1 for anything security related and a good opportunity to learn more about the inner workings of git.

In 2017, the Cryptology Group at Centrum Wiskunde & Informatica (CWI) and the Google Research Security, Privacy and Anti-abuse Group announced the first (public) SHA-1 hash collision. They generated 2 PDF files with the same hash in an attack they called SHAttered. While it still required a lot of processing power (the equivalent of 6,500 years of single-CPU computations and 110 years of single-GPU computations), it showed that such attacks are not only theoretically possible - they are technically feasible. ¹

IDs in git, for example for commits, are generated using SHA-1 by default. As explained by git's creator, Linus Torvalds, the hash function is not used for security. It's simply used to generate a checksum, much like a CRC. The quote "We check a checksum that's cryptographically secure. Nobody has been able to break SHA-1" didn't hold up that well, but in general, Linus' point still stands: git uses SHA-1 for consistency checks, not for security. ² But what happens if a collision occurs by accident, or if someone causes one intentionally? Let's try it out.

So if the SHAttered files already have a hash collision, they could be used to cause a collision in git, right? Let's put them in a repository and find out.

$ wget https://shattered.io/static/shattered-1.pdf -O shattered-1.pdf -q
$ sha1sum shattered-1.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-1.pdf
$ wget https://shattered.io/static/shattered-2.pdf -O shattered-2.pdf -q
$ sha1sum shattered-2.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-2.pdf
$ diff shattered-1.pdf shattered-2.pdf
Binary files shattered-1.pdf and shattered-2.pdf differ

As promised, these files have different contents, but the same SHA-1 hash. The next step is to add them to a git repository and observe what happens.

$ cp shattered-1.pdf shattered.pdf
$ git add shattered.pdf
$ git commit -m 'shattered'
[main (root-commit) 0aa8c5a] shattered
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 shattered.pdf
$ cp -f shattered-2.pdf shattered.pdf
$ git add shattered.pdf
$ git commit -m 'shattered'
[main f1906a7] shattered
1 file changed, 0 insertions(+), 0 deletions(-)

Looking at the result, this did not cause any issues. If IDs in git are SHA-1 hashes, shouldn't two files with the same SHA-1 hash cause a collision? No such collision happened. The file was updated successfully in the second commit, and checking out the first commit restores the original version. A closer look shows that git calculates different hashes for these files.

$ git hash-object shattered-1.pdf
b621eeccd5c7edac9b7dcba35a8d5afd075e24f2
$ git hash-object shattered-2.pdf
ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0

Internally, git is a key-value store. It generates an ID for each object that it stores, which can later be used to identify and retrieve the object. By default, this ID is a SHA-1 hash. The input given to the hash function depends on the object type. For files, which are stored as so-called blobs, the ID is generated by hashing the string blob, a space, the length of the file in bytes as a decimal number, a null byte and the contents of the file. ³

$ echo "so long and thanks for all the fish" > test.txt
$ git hash-object test.txt
8b86cb67f1f19db567a100b55edb5466a33e7fb7
$ { printf 'blob %d\0' "$(wc -c < test.txt)"; cat test.txt; } | sha1sum
8b86cb67f1f19db567a100b55edb5466a33e7fb7  -
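
The same calculation can be sketched in a few lines of Python. This is only an illustration of the header format described above, not git's actual code. The function name git_blob_id is made up for this example.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Header: object type, a space, the content length in decimal, a null byte.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    # Same content as test.txt above (echo appends the trailing newline).
    print(git_blob_id(b"so long and thanks for all the fish\n"))
```

If the description of the header is right, this should print the same ID as git hash-object test.txt.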

Blobs are given names by tree objects. A tree can group multiple blobs and other trees, which it references by their IDs. ³ For example, a tree object could look like this:

$ git cat-file -p 8004c8a7b6fce1452539556bb4c4c91b92b5c2bc
100644 blob ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0    shattered.pdf
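
Note that git cat-file -p pretty-prints the tree. As far as I know, the data that actually gets hashed is a binary format in which each entry consists of the mode, a space, the file name, a null byte and the 20 raw bytes of the referenced ID. The following Python sketch builds a tree ID under that assumption. The function name git_tree_id is made up, and the sorting is simplified: git treats directory names slightly differently, which is ignored here.

```python
import hashlib


def git_tree_id(entries) -> str:
    """Hash a tree given (mode, name, hex_object_id) tuples.

    Assumed entry layout: b"<mode> <name>\\0" followed by the 20 raw ID bytes.
    """
    body = b""
    # Entries are stored sorted by name; plain byte-wise sorting is a
    # simplification that is only exact for trees without subdirectories.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_id)
    header = b"tree %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    # The single entry of the tree shown above.
    print(git_tree_id([
        ("100644", "shattered.pdf", "ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0"),
    ]))
```

If the assumed layout is correct, the printed ID should match the tree ID passed to git cat-file above.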

However, one does not usually work directly with blobs and trees. Git uses them internally, but from a user's perspective it is rarely necessary to interact with git on that level. The usual workflow consists of editing files, staging the changes and then committing them. Commits are just another object type in git. A commit object associates a tree with an author, a committer, a parent commit (if one exists) and a commit message that describes the changes included in the commit. The author and the committer can be different: the author is who wrote the code and the committer is who added it to the repository. ³ The resulting object can once again be viewed with git cat-file.

$ git cat-file -p f1906a7bf1c4c8ce49f264e2f3ba313c305b7ede
tree 8004c8a7b6fce1452539556bb4c4c91b92b5c2bc
parent 0aa8c5ad0be568d680d5f613807bde46a39424e3
author error <redacted> 1683729684 +0200
committer error <redacted> 1683729684 +0200

shattered

The commit ID is generated by hashing this information, prepended by the same header that is used for the other object types: $object-type $length\0. You can find a nice example of how the commit hash is constructed here.
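
As a rough sketch, the construction of a commit ID could look like this in Python. The helper names git_object_id and build_commit_body are made up for this example, and the e-mail address is a placeholder because the real one is redacted above, so the result will not match the commit ID from the listing.

```python
import hashlib


def git_object_id(object_type: str, body: bytes) -> str:
    """Hash "<type> <length>\\0" followed by the object body."""
    header = f"{object_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def build_commit_body(tree, parent, author, committer, message) -> bytes:
    """Assemble the text that git cat-file -p prints for a commit."""
    lines = [f"tree {tree}"]
    if parent:  # root commits have no parent line
        lines.append(f"parent {parent}")
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # A blank line separates the header fields from the commit message.
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


if __name__ == "__main__":
    # Placeholder identity: the real address is redacted in the listing above.
    who = "error <error@example.invalid> 1683729684 +0200"
    body = build_commit_body(
        tree="8004c8a7b6fce1452539556bb4c4c91b92b5c2bc",
        parent="0aa8c5ad0be568d680d5f613807bde46a39424e3",
        author=who,
        committer=who,
        message="shattered",
    )
    print(git_object_id("commit", body))
```

Since the timestamps and identities are part of the hashed data, two commits with the same tree and message still get different IDs as long as any of these fields differ.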

OK, so that's the reason why git hash-object returns different values for the two SHAttered files. Try it yourself if you want: prepending any data to these files will make their SHA-1 hashes differ. echo test | cat - shattered-1.pdf | sha1sum results in a different hash than echo test | cat - shattered-2.pdf | sha1sum. This also explains why the commit hashes didn't collide. The existing SHA-1 collision from SHAttered does not get us any further, but collisions can still happen. It's unlikely to find one by accident and it's expensive to cause one intentionally, but it's not impossible. How would git behave in this case? Let's vandalize the code a little bit and find out.

Git can be built with several different SHA-1 backends. By default, it uses an implementation with a collision attack detection mechanism, which can be found in the sha1dc directory. The detection itself is not relevant for our experiment for now. We'll get back to it later. The function that writes out the final hash was modified as follows:

int SHA1DCFinal(unsigned char output[20], SHA1_CTX *ctx)
{
    uint32_t last = ctx->total & 63;
    uint32_t padn = (last < 56) ? (56 - last) : (120 - last);
    uint64_t total;
    SHA1DCUpdate(ctx, (const char*)(sha1_padding), padn);

    total = ctx->total - padn;
    total <<= 3;
    ctx->buffer[56] = (unsigned char)(total >> 56);
    ctx->buffer[57] = (unsigned char)(total >> 48);
    ctx->buffer[58] = (unsigned char)(total >> 40);
    ctx->buffer[59] = (unsigned char)(total >> 32);
    ctx->buffer[60] = (unsigned char)(total >> 24);
    ctx->buffer[61] = (unsigned char)(total >> 16);
    ctx->buffer[62] = (unsigned char)(total >> 8);
    ctx->buffer[63] = (unsigned char)(total);
    sha1_process(ctx, (uint32_t*)(ctx->buffer));
    output[0] = (unsigned char)(ctx->ihv[0] >> 24);
    output[1] = 1;  //(unsigned char)(ctx->ihv[0] >> 16);
    output[2] = 1;  //(unsigned char)(ctx->ihv[0] >> 8);
    output[3] = 1;  //(unsigned char)(ctx->ihv[0]);
    output[4] = 1;  //(unsigned char)(ctx->ihv[1] >> 24);
    output[5] = 1;  //(unsigned char)(ctx->ihv[1] >> 16);
    output[6] = 1;  //(unsigned char)(ctx->ihv[1] >> 8);
    output[7] = 1;  //(unsigned char)(ctx->ihv[1]);
    output[8] = 1;  //(unsigned char)(ctx->ihv[2] >> 24);
    output[9] = 1;  //(unsigned char)(ctx->ihv[2] >> 16);
    output[10] = 1; //(unsigned char)(ctx->ihv[2] >> 8);
    output[11] = 1; //(unsigned char)(ctx->ihv[2]);
    output[12] = 1; //(unsigned char)(ctx->ihv[3] >> 24);
    output[13] = 1; //(unsigned char)(ctx->ihv[3] >> 16);
    output[14] = 1; //(unsigned char)(ctx->ihv[3] >> 8);
    output[15] = 1; //(unsigned char)(ctx->ihv[3]);
    output[16] = 1; //(unsigned char)(ctx->ihv[4] >> 24);
    output[17] = 1; //(unsigned char)(ctx->ihv[4] >> 16);
    output[18] = 1; //(unsigned char)(ctx->ihv[4] >> 8);
    output[19] = 1; //(unsigned char)(ctx->ihv[4]);
    return ctx->found_collision;
}

In order to cause collisions between object IDs in git, it helps to reduce the size of the identifier. To achieve this, every byte of the hash after the first one was set to 1, which reduces the effective length of the identifier to 1 byte. 256 values may still sound like a lot, but there's no need to check every possible value. We're not looking for a specific hash; any collision will do. If the value range is not too large, collisions become very likely after a surprisingly small number of attempts, so the first ones should occur after a few tries. Look up the birthday problem if you don't know it.
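
To get a feeling for the numbers, the birthday problem can be evaluated for 256 possible IDs with a short Python sketch. It assumes that the first byte of the hash is uniformly distributed, which is an idealization. The function name collision_probability is made up for this example.

```python
def collision_probability(n: int, space: int = 256) -> float:
    """Probability that n uniformly random IDs out of `space` contain a duplicate."""
    if n > space:
        return 1.0  # pigeonhole principle: more objects than possible IDs
    p_unique = 1.0
    for i in range(n):
        # The (i+1)-th ID must avoid the i IDs that are already taken.
        p_unique *= (space - i) / space
    return 1.0 - p_unique


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(f"{n:3d} objects: {collision_probability(n):.1%}")
```

Keep in mind that n counts objects, not commits. Every commit usually adds a commit object and a tree object, so the number of objects grows faster than the number of commits.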

So let's compile git with our "improvement" and try to cause some collisions.

$ touch test
$ ../git/git add test
$ ../git/git commit -m 'test'
[main (root-commit) c101010] test
1 file changed, 1 insertion(+)
create mode 100644 test
$ touch test2
$ ../git/git add test2
$ ../git/git commit -m 'test2'
[main db01010] test2
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 test2
$ ../git/git log
commit db01010101010101010101010101010101010101 (HEAD -> main)
Author: error <redacted>
Date:   Sun May 7 20:32:54 2023 +0200

    test2

commit c101010101010101010101010101010101010101
Author: error <redacted>
Date:   Sun May 7 20:32:00 2023 +0200

    test

The first two commits may not collide, but it's already obvious what our little modification to the source code did: only the first 2 characters of the commit hash differ, and everything after them is filled with the repeating characters 01.
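
The effect of the modification on an object ID can be imitated in Python. This is a sketch that mirrors the changed output bytes, not the patched C code itself, and the function name vandalized_id is made up for this example.

```python
import hashlib


def vandalized_id(object_type: str, body: bytes) -> str:
    """Object ID as the modified hash function would report it."""
    header = f"{object_type} {len(body)}".encode() + b"\x00"
    digest = hashlib.sha1(header + body).digest()
    # Keep the first byte of the real SHA-1 and overwrite the other
    # 19 bytes with 0x01, like the modified output[1] .. output[19].
    return (digest[:1] + b"\x01" * 19).hex()


if __name__ == "__main__":
    # An empty file, like the ones created with touch in the experiment.
    print(vandalized_id("blob", b""))
```

Only the first two hex characters still depend on the hashed data, which leaves 256 possible IDs for all objects in the repository.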

While I continued to add and commit files to this repository, the first collision occurred on the 7th attempt.

$ touch test7
$ ../git/git add test7
$ ../git/git commit -m 'test7'
fatal: 0f01010101010101010101010101010101010101 is not a valid 'tree' object
$ ../git/git cat-file -p 0f01010101010101010101010101010101010101
tree 8501010101010101010101010101010101010101
parent e801010101010101010101010101010101010101
author error <redacted> 1683484473 +0200
committer error <redacted> 1683484473 +0200

test5

The collision happened between the tree object for the new commit and an already existing commit. Interestingly, git only noticed the problem when it tried to add the tree to the commit and found an object of the wrong type instead. The original object was not modified in the process. It looks like the attempt to add the new object failed silently.

On the 15th attempt, another interesting collision occurred:

$ touch test15
$ ../git/git add test15
$ ../git/git commit -m 'test15'
[main e201010] test12
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 test12
$ ../git/git log
commit e201010101010101010101010101010101010101 (HEAD -> main)
Author: error <redacted>
Date:   Sun May 7 21:12:38 2023 +0200

    test12

commit 3b01010101010101010101010101010101010101
Author: error <redacted>
Date:   Sun May 7 21:12:06 2023 +0200

    test11

commit 6001010101010101010101010101010101010101
Author: error <redacted>
Date:   Sun May 7 21:11:40 2023 +0200

    test10

In this case both of the colliding objects are commits. This time, git did not even print an error message. After the creation of the new commit object once again failed silently, git proceeded to check out the existing commit with the same hash. This confirms the assumption that git will keep the existing object in case of a collision. The other commits that got rolled back are still present as objects in git and can be checked out, but especially if the user does not notice what happened here and continues working, this can seriously mess up the commit history.

People assume that a commit history is a strict progression of cause to effect. But actually from a non-linear, non-subjective viewpoint it's more like a big ball of wibbly-wobbly timey-wimey... stuff.

Those tests were continued until the following collisions occurred and git's behavior could be observed:

Collisions between two blobs: Creating the new blob object fails silently. The existing blob object remains unchanged.
Collisions between blobs and trees: Creating the new blob object fails silently. The existing tree object remains unchanged.
Collisions between blobs and commits: Creating the new blob object fails silently. The existing commit object remains unchanged.
Collisions between blobs and tags: Creating the new blob object fails silently. The existing tag object remains unchanged.
Collisions between trees and blobs: Creating the new tree object fails silently. The existing blob object remains unchanged. Trying to commit this results in an error message complaining that the object is not a valid tree object.
Collisions between two trees: Creating the new tree object fails silently. The existing tree object remains unchanged. Trying to commit this, git commits the old tree again. This effectively means a rollback to the commit associated with the old tree while keeping the commit history.
Collisions between trees and commits: Creating the new tree object fails silently. The existing commit object remains unchanged. Trying to commit this results in an error message complaining that the object is not a valid tree object.
Collisions between trees and tags: Creating the new tree object fails silently. The existing tag object remains unchanged. Trying to commit this results in an error message complaining that the object is not a valid tree object.
Collisions between commits and blobs: Creating the new commit object fails. The existing blob object remains unchanged. Git attempts to check out the new commit (that wasn't created) and fails with the error message fatal: cannot update ref 'refs/heads/main': trying to write non-commit object $HASH to branch 'refs/heads/main'.
Collisions between commits and trees: Creating the new commit object fails. The existing tree object remains unchanged. Git attempts to check out the new commit (that wasn't created) and fails with the error message fatal: cannot update ref 'refs/heads/main': trying to write non-commit object $HASH to branch 'refs/heads/main'.
Collisions between two commits: Creating the new commit object fails silently. The existing commit remains unchanged and is checked out.
Collisions between commits and tags: Creating the new commit object fails. The existing tag object remains unchanged. Git attempts to check out the new commit (that wasn't created) and fails with the error message fatal: cannot update ref 'refs/heads/main': trying to write non-commit object $HASH to branch 'refs/heads/main'.
Collisions between tags and blobs: A new file is created under .git/refs/tags pointing at the hash, but the creation of the new tag object under .git/objects fails silently. The existing blob object remains unchanged.
Collisions between tags and trees: A new file is created under .git/refs/tags pointing at the hash, but the creation of the new tag object under .git/objects fails silently. The existing tree object remains unchanged. The tag is displayed by git tag -l, but attempts at checking it out fail with the error message fatal: Cannot switch branch to a non-commit.
Collisions between tags and commits: A new file is created under .git/refs/tags pointing at the hash, but the creation of this new tag object under .git/objects fails silently. The existing commit object remains unchanged. The tag is displayed by git tag -l, but is interpreted as a lightweight tag for the colliding commit.
Collisions between two tags: A new file is created under .git/refs/tags pointing at the hash, but the creation of this new tag object under .git/objects fails silently. The existing tag object remains unchanged. Since the new tag reference object points at the old tag object, the new tag will be an alias for the old tag, with the same message and object reference.

Git does not overwrite existing objects. Creating a new object with a hash that's already associated with another object always fails silently. However, depending on the object type, git shows some interesting behavior. If the object type fits, git will just continue with it as if nothing is wrong, e.g. it will commit an old tree or check out an old commit. Git only responds with an error if the object types are incompatible, e.g. if the collision causes a reference to a tree to point at a tag object instead.
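
The observed behavior can be summarized in a toy model. The Python sketch below is not how git is implemented. It is a minimal key-value store with the two properties seen in the experiment: writes never replace an existing object, and a type mismatch is only detected when the object is read back. The class ToyObjectStore and its methods are made up for this example.

```python
import hashlib


class ToyObjectStore:
    """Toy content-addressed store with IDs truncated to `id_bytes` bytes."""

    def __init__(self, id_bytes: int = 1):
        self.id_bytes = id_bytes
        self.objects = {}  # object ID -> (type, body)

    def write(self, object_type: str, body: bytes) -> str:
        header = f"{object_type} {len(body)}".encode() + b"\x00"
        oid = hashlib.sha1(header + body).digest()[: self.id_bytes].hex()
        # An existing object is never overwritten: on a collision the
        # write is skipped silently and the old object stays in place.
        self.objects.setdefault(oid, (object_type, body))
        return oid

    def read(self, oid: str, expected_type: str) -> bytes:
        object_type, body = self.objects[oid]
        # The type is only checked here, when the object is used.
        if object_type != expected_type:
            raise TypeError(f"{oid} is not a valid '{expected_type}' object")
        return body


if __name__ == "__main__":
    store = ToyObjectStore()
    first_writer = {}
    for i in range(257):  # 257 writes into 256 possible IDs must collide
        kind = "tree" if i % 2 else "blob"
        oid = store.write(kind, f"object {i}".encode())
        if oid in first_writer:
            print(f"write {i}: new {kind} collides with existing {first_writer[oid]} {oid}")
            try:
                store.read(oid, kind)
                print("types match: the old object is used as if nothing happened")
            except TypeError as err:
                print(f"fatal: {err}")
            break
        first_writer[oid] = kind
```

In this model, a collision between objects of the same type goes unnoticed, while a collision between different types surfaces as an error on the next read. That matches the pattern in the list above.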

Hash collisions, intentional or not, can become a problem for git. To mitigate this risk, git includes a mechanism that detects SHA-1 collision attacks and reacts by hashing the suspected block 3 times, extending SHA-1 from 80 to 240 steps. This ensures that different hashes are generated in these cases. ⁴

Further, git does not only support SHA-1. It supports SHA-256, too. Unlike SHA-1, SHA-256 is considered cryptographically secure.

Actual SHA-1 collisions are very unlikely to occur as a coincidence. The collisions in this experiment could only be observed after the code was modified to limit the effective hash size to 1 byte. Otherwise, it would not have been possible to create collisions with the available resources. However, SHAttered shows that it is feasible to intentionally cause such a collision. Whether or not attackers with the resources required to do this are part of your threat model is up to you to decide.

In case of a collision, git shows some interesting behavior that may not be noticed immediately. This may leave the git repository in an unintended state. Further, attackers could modify files and commits by replacing objects with specifically crafted objects that produce the same hash. This risk is mitigated by the use of the collision attack detection mechanism.

Even though the risk is already mitigated, this example shows why SHA-1 should not be used for cryptographic hashing and was a good opportunity to learn more about git itself. If you want to try it out yourself, you can have a look at the code with my modification here.

¹ M. Stevens, E. Bursztein, P. Karpman, A. Albertini and Y. Markov (2017, February) "The first collision for full SHA-1", https://shattered.io/static/shattered.pdf.
² L. Torvalds (2007, May) "Tech Talk: Linus Torvalds on git", https://www.youtube.com/watch?v=4XpnKHJAok8&t=56m20s.
³ S. Chacon, B. Straub et al. (2014) "10.2 Git Internals - Git Objects" in Pro Git, https://git-scm.com/book/en/v2/Git-Internals-Git-Objects.
⁴ M. Stevens, D. Shumow (2017) "sha1dc/sha1.h", https://github.com/git/git/blob/master/sha1dc/sha1.h.