Inside Git: How It Works

As software professionals, we use Git every day. Terms like fetch, merge, pull, and commit are etched into our minds. But how many of us actually know how Git works behind the scenes? How is it able to keep track of every change and give us the exact code whenever we want?

What is Git?

Git is a Version Control System. If you want to read more on version control systems and why we need them, visit my other article. In short, it tracks changes to any kind of content. It does not even have to be a software project, since Git can track changes to any kind of file. According to Git’s own home page:

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

That was basic stuff, but if we want to be a bit more technical, we could say that Git is a content-addressable filesystem. In essence, Git works as a simple key-value data store: the key is the content address (a hash derived from the content), and the value is the content itself. If you want to read more on this, see the Git Internals chapter of Git’s own book, Pro Git.

How does Git track stuff?

Git assigns a unique key to each value (whatever content we want to track). The content is treated as a plain series of bytes, and the key is the SHA-1 hash of those bytes, prefixed with a short header. Whenever the content changes, Git stores the new value under a new hash key. These keys let Git identify each specific value. If the content is unchanged, the same hash key is generated, so nothing new needs to be stored.
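
To make the key-value idea concrete, here is a minimal sketch of a content-addressable store. It is a toy model, not Git's implementation: the `ContentStore` class and its method names are made up for illustration, and it hashes the raw bytes only, without the header Git adds.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is derived from the value."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data  # storing identical content again changes nothing
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
k1 = store.put(b"console.log('v1');\n")
k2 = store.put(b"console.log('v2');\n")  # changed content -> new key
k3 = store.put(b"console.log('v1');\n")  # same content -> same key as k1

print(k1 == k3, k1 != k2, len(store))  # True True 2
```

Because the key depends only on the content, identical content is stored once, and any change to the content produces a different key.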

Understanding the .git folder

The .git folder provides all of the information that Git needs to track changes to your codebase and is a key component of the Git workflow.

Let's begin with a summary of the contents of the .git folder before getting into the specifics. The .git folder is created in your project's root directory when you set up a new Git repository. Inside it are various files and folders containing details about your codebase.

| Content | Description |
| --- | --- |
| `HEAD` file | Keeps track of the current branch |
| `refs` folder | Stores references to commits and branches |
| `objects` folder | Stores codebase as a series of snapshots |
| `config` file | Stores configuration information for Git |
| `hooks` folder | Runs scripts at specific points in the Git workflow |

The `HEAD` file: Keeps track of the current branch

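The HEAD file is a plain text file that records what is currently checked out. Normally it holds a symbolic reference to the current branch, such as ref: refs/heads/main. Only in a "detached HEAD" state (for example, after checking out a specific commit) does it contain that commit's SHA-1 hash directly. Git updates this file automatically each time you switch branches or check out a specific commit.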
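
The two forms HEAD can take are easy to tell apart in code. The sketch below builds a throwaway .git-like folder in a temporary directory and resolves HEAD by hand. The function name `resolve_head` and the placeholder hash are assumptions for illustration, and the sketch only reads loose ref files.

```python
import tempfile
from pathlib import Path

FAKE_COMMIT = "1" * 40  # placeholder hash, not a real commit


def resolve_head(git_dir: Path):
    """Return (branch, commit_hash); branch is None when HEAD is detached."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]                  # e.g. refs/heads/main
        ref_file = git_dir / ref
        # A freshly created branch with no commits has no ref file yet.
        commit = ref_file.read_text().strip() if ref_file.exists() else None
        return ref[len("refs/heads/"):], commit
    return None, head                              # detached: HEAD is the hash


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "refs" / "heads" / "main").write_text(FAKE_COMMIT + "\n")

    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    on_branch = resolve_head(git_dir)

    (git_dir / "HEAD").write_text(FAKE_COMMIT + "\n")
    detached = resolve_head(git_dir)

print(on_branch)  # ('main', '1111...') while on a branch
print(detached)   # (None, '1111...') with a detached HEAD
```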

The `refs` folder: Stores references to commits and branches

Git keeps references to the repository's commits and branches in the refs folder. This folder has several subfolders, each representing a different kind of reference. For example, references to the heads of the repository's branches are stored in the heads subfolder, and references to any tags you create are stored in the tags subfolder.
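
Since each reference is just a small file containing a hash, listing them takes only a few lines. This is a rough sketch over a fake folder layout: the branch names, tag name, and hashes are invented, and `list_refs` is a hypothetical helper. Git can also pack references into a single .git/packed-refs file, which this sketch ignores.

```python
import tempfile
from pathlib import Path


def list_refs(git_dir: Path) -> dict:
    """Map each loose ref name (e.g. refs/heads/main) to the hash it stores."""
    refs = {}
    for path in sorted((git_dir / "refs").rglob("*")):
        if path.is_file():
            name = path.relative_to(git_dir).as_posix()
            refs[name] = path.read_text().strip()
    return refs


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    # Placeholder hashes; a branch name containing "/" becomes a nested folder.
    samples = {
        "refs/heads/main": "a" * 40,
        "refs/heads/feature/login": "b" * 40,
        "refs/tags/v1.0": "c" * 40,
    }
    for name, sha in samples.items():
        ref_file = git_dir / name
        ref_file.parent.mkdir(parents=True, exist_ok=True)
        ref_file.write_text(sha + "\n")

    refs = list_refs(git_dir)

for name, sha in refs.items():
    print(sha[:7], name)
```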

The `objects` folder: stores codebase as a series of snapshots

The objects folder is where the codebase itself is stored as a series of snapshots (more on objects later). Each snapshot represents the state of the codebase at a specific point in time, and Git uses these to track changes to the code over time. Individual ("loose") objects live in subfolders named after the first two characters of their hash. There are also two special subfolders, info and pack: pack contains packfiles, which bundle many objects together in compressed form, and info contains additional metadata about the object store.
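
The folder layout for loose objects follows directly from the hash: the first two hexadecimal characters name the subfolder and the remaining 38 name the file. A small sketch of that mapping, with hypothetical helper names (the hash used is the one Git gives an empty file):

```python
from pathlib import PurePosixPath


def loose_object_path(sha: str) -> PurePosixPath:
    """Path of a loose object, relative to the repository root."""
    return PurePosixPath(".git/objects") / sha[:2] / sha[2:]


def hash_from_path(path: str) -> str:
    """Inverse mapping: rebuild the 40-character hash from an object path."""
    parts = PurePosixPath(path).parts
    return parts[-2] + parts[-1]


EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # hash of an empty file

p = loose_object_path(EMPTY_BLOB)
print(p)                       # .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(hash_from_path(str(p)))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```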

The `config` file: stores configuration information for Git

In the config file, Git stores settings that apply to this repository only. These control how Git behaves here: for example, the user name and email address used for commits (when set per repository), the URLs of remotes, the upstream branch that each local branch tracks, and Git’s behaviour for merge and diff tools.
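
The config file uses an INI-like format, so for simple cases it can be read with Python's standard `configparser`. The sample below is invented (name, email, and URL are placeholders), and it is written without the tab indentation that real .git/config files use. Treat this as a sketch only: the `git config` command is the reliable way to read and write these settings.

```python
import configparser

SAMPLE_CONFIG = """\
[core]
repositoryformatversion = 0
bare = false
[user]
name = Jane Doe
email = jane@example.com
[remote "origin"]
url = https://example.com/demo.git
[branch "main"]
remote = origin
merge = refs/heads/main
"""

# Only "=" separates keys from values, and "%" gets no special treatment.
parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
parser.read_string(SAMPLE_CONFIG)

user_name = parser["user"]["name"]
origin_url = parser['remote "origin"']["url"]
upstream = parser['branch "main"']["merge"]

print(user_name)   # Jane Doe
print(origin_url)  # https://example.com/demo.git
print(upstream)    # refs/heads/main
```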

The `hooks` folder: runs scripts at specific points in the Git workflow

The hooks folder is where you can add custom scripts that run at specific points in the Git workflow. For example, you can add a script that runs before each commit to make sure that your code meets certain quality standards, or a script that runs after each checkout to set up your development environment.
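
As an illustration of the "before each commit" case, here is a sketch of what a pre-commit hook written in Python might look like. The marker text and function names are assumptions chosen for this example. The check itself is a pure function so it can be tried without a repository, while `staged_files` relies on the real commands `git diff --cached --name-only` and `git show :<path>` to read what is staged.

```python
import subprocess

MARKER = "DO NOT COMMIT"  # hypothetical marker chosen for this example


def find_violations(files: dict) -> list:
    """Return the names of files whose text contains the marker."""
    return sorted(name for name, text in files.items() if MARKER in text)


def staged_files() -> dict:
    """Read the staged version of every added, copied, or modified file."""
    names = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return {
        name: subprocess.run(
            ["git", "show", f":{name}"],
            capture_output=True, text=True, errors="replace", check=True,
        ).stdout
        for name in names
    }


def main() -> int:
    bad = find_violations(staged_files())
    for name in bad:
        print(f"{name}: contains '{MARKER}'")
    return 1 if bad else 0  # a non-zero exit status aborts the commit


# Trying the check on in-memory data, without touching a repository:
result = find_violations({
    "server.js": "console.log('ok');\n",
    "notes.txt": "DO NOT COMMIT: temporary credentials\n",
})
print(result)  # ['notes.txt']
```

To use something like this as a real hook, the script would be saved as .git/hooks/pre-commit with a Python shebang line, made executable, and ended with `raise SystemExit(main())`.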

Git Objects: Blob, Tree, and Commit

Git objects are the fundamental building blocks of a Git repository, used to store data such as commits, files, and directories.

Blob objects

A blob (binary large object) is a type of Git object that stores the contents of a file as a snapshot in the Git object database. To generate the hash for a blob object, Git calculates the SHA-1 hash of the file contents prefixed with a short header (the object type and the content length). That hash is later used to locate the blob in the Git object database (.git/objects). When you make changes to a file and commit them, Git stores the new content of the file (fully, not just the diff) as a new blob object with a new SHA-1 hash. This allows Git to track the changes to the file over time and manage the file’s history. One thing to keep in mind is that blobs do not store the name of the file, only its contents. Then a question arises: how can we know which file has what content? Here’s where tree objects come in.
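
The blob hash can be reproduced outside Git in a few lines. This sketch assumes the default SHA-1 object format, and `git_blob_hash` is a name made up for this example. Note that the function never sees a file name, which is exactly why two files with identical contents share one blob.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """SHA-1 of the header 'blob <size>' + NUL byte, followed by the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_hash(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_hash(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

These should match what `git hash-object` reports for files with the same contents.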

Tree objects

A tree object represents the contents of a directory in a Git repository. It enables Git to store a collection of files and directories together in a single object. It does this by listing the files and subdirectories the directory contains, recording for each entry its name, its file mode, and the SHA-1 object name (a unique identifier) of the corresponding blob or subtree. This creates a mapping from the names of the files and directories to their corresponding hashes.
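
A tree's contents can be sketched as a sequence of entries, each made of a mode, a name, a NUL byte, and the 20 raw bytes of the entry's hash. The code below is a simplified illustration that handles regular files only (mode 100644) and sorts names as plain strings. Real trees also hold subdirectories and other modes, and the helper names here are made up.

```python
import hashlib


def blob_hash(content: bytes) -> str:
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


def tree_body(entries: dict) -> bytes:
    """Serialize {filename: blob hash} into the body of a tree object."""
    body = b""
    for name in sorted(entries):          # entries are kept sorted by name
        mode = b"100644"                  # regular, non-executable file
        raw_hash = bytes.fromhex(entries[name])
        body += mode + b" " + name.encode() + b"\x00" + raw_hash
    return body


def tree_hash(entries: dict) -> str:
    body = tree_body(entries)
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()


entries = {"server.js": blob_hash(b"console.log('hi');\n")}
print(len(tree_body(entries)))  # 37 bytes: 6 + 1 + 9 + 1 + 20
print(tree_hash({}))            # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (empty tree)
```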

Commit objects

A commit object represents a snapshot of the repository at a particular point in time. It stores a reference to a tree object, which represents the state of the repository’s directories and their respective files at the time of the commit, and references to its parent commit objects: none for the very first commit, one for an ordinary commit, and two or more for a merge commit. Commit objects also include metadata such as the commit message, author, and timestamp. Commit objects are created when you run git commit. Each commit is linked to its parent(s) through these references, so the history forms a graph of commits: often a simple chain, but one that forks and joins wherever branches diverge and merge.
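
The body of a commit object is plain text, which makes it easy to sketch. The author details and timestamp below are invented, the time zone is fixed at +0000 for simplicity, and the helper names are hypothetical. The tree hash used is the well-known hash of an empty tree.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
AUTHOR = "Jane Doe <jane@example.com>"  # placeholder identity
WHEN = 1700000000                       # placeholder Unix timestamp


def commit_body(tree: str, parents: list, message: str) -> bytes:
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, or several parents
    lines.append(f"author {AUTHOR} {WHEN} +0000")
    lines.append(f"committer {AUTHOR} {WHEN} +0000")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


def commit_hash(body: bytes) -> str:
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()


root = commit_body(EMPTY_TREE, [], "Initial commit")
root_id = commit_hash(root)

child = commit_body(EMPTY_TREE, [root_id], "Second commit")
child_id = commit_hash(child)

print(child.decode())
```

Because the parent's hash is part of the child's content, changing any earlier commit would change the hash of every commit that descends from it.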

In summary:

Commit points to one tree (the root directory snapshot) and to its parent commit(s).
Tree represents a directory and contains:
- Filenames
- Pointers to blobs (files)
- Pointers to other trees (subfolders)
Blob stores only file content, not filenames.
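
The relationships in this summary can be walked in code. The following is a toy model only: objects are stored as Python values in a dictionary and hashed through JSON, which is not Git's real format, and the file names are invented. It does show how a path is resolved by following commit, then tree, then subtree, then blob.

```python
import hashlib
import json

objects = {}  # toy object database: hash -> (kind, payload)


def put(kind: str, payload) -> str:
    """Store an object under the SHA-1 of a simplified serialized form."""
    raw = json.dumps([kind, payload], sort_keys=True).encode()
    key = hashlib.sha1(raw).hexdigest()
    objects[key] = (kind, payload)
    return key


def read_file(commit_id: str, path: str) -> str:
    """Follow commit -> tree -> subtrees -> blob for a path like 'src/app.js'."""
    kind, commit = objects[commit_id]
    assert kind == "commit"
    current = commit["tree"]
    for part in path.split("/"):
        kind, tree_entries = objects[current]
        assert kind == "tree"
        current = tree_entries[part]  # name -> hash of a blob or a subtree
    kind, content = objects[current]
    assert kind == "blob"
    return content


app = put("blob", "console.log('hi');\n")
readme = put("blob", "# demo\n")
src = put("tree", {"app.js": app})
root_tree = put("tree", {"src": src, "README.md": readme})
first = put("commit", {"tree": root_tree, "parents": [], "message": "Initial commit"})

print(read_file(first, "src/app.js"), end="")  # console.log('hi');
```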

What happens internally with `git add` and `git commit`

Let us try to understand the entire process, from adding a file to the staging area to committing it, through an example.

When we add a file (here, server.js) to the staging area with git add, we essentially take a snapshot of its contents. To take this snapshot, Git creates a blob object and uses the SHA-1 hash generated from the contents of server.js as its key.

Now, if you run `find .git/objects -type f` in the command line, it’ll come up with a file path that resembles a SHA-1 hash, because that’s exactly what it is: the first two characters of the hash form the folder name, and the remaining 38 form the file name.

(One thing I discovered while doing this: if the file has no content, Git still stores a blob for it, but there is nothing in it to display. This confused me at first, so I looked it up and then added some content to the file. The hash e6/9de2… is the one with no content, and the hash b9/de1626… is the one with the updated content. Below, I check the latter.)

To check the value (the content of the object in .git/objects that we just found), we can use `git cat-file -p <HASH>`. The `git cat-file` command displays the contents of a Git object, such as the blob in this example.
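
What these commands do for a blob can be approximated by hand: a loose object is the header plus the content, compressed with zlib and saved under a path derived from its hash. The sketch below writes a blob into a temporary folder and reads it back. The function names are made up, and the two functions only roughly mirror `git hash-object -w` and `git cat-file`.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_blob(objects_dir: Path, content: bytes) -> str:
    """Roughly what `git hash-object -w` does with a file's contents."""
    raw = b"blob %d\x00" % len(content) + content
    sha = hashlib.sha1(raw).hexdigest()
    path = objects_dir / sha[:2] / sha[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))
    return sha


def cat_file(objects_dir: Path, sha: str):
    """Return (type, content), roughly `git cat-file -t` and `-p` combined."""
    raw = zlib.decompress((objects_dir / sha[:2] / sha[2:]).read_bytes())
    header, _, content = raw.partition(b"\x00")
    kind, size = header.split()
    assert int(size) == len(content)  # the header records the content length
    return kind.decode(), content


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / ".git" / "objects"
    sha = write_blob(objects_dir, b"hello world\n")
    kind, content = cat_file(objects_dir, sha)

print(sha)            # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(kind, content)  # blob b'hello world\n'
```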

In this example, we are operating at the lowest level of abstraction (tracking only files, not directories), so we are just hashing file contents and creating blob objects. Now, we can also create a tree object. How? We know that a commit object holds a reference to a tree object. So, if we now run git commit on our staged changes, Git creates not only a tree object, but also a commit object.

Here, we re-use the git cat-file command to explore the tree object that the latest commit on the main branch points to. Then, to explore the commit object that points to this tree, we first use git rev-parse to find the hash of the latest commit on the main branch.

Then we can use the git cat-file command again to display the contents of the commit object, using the SHA-1 object name obtained in the previous step.

Finally, let’s list all of the Git objects in our .git/objects folder by using `find .git/objects -type f`.

We can see the 3 Git objects we’ve explored using Git plumbing commands: b9de16… (blob), bf066cb… (tree), and 097ed… (commit). We can double-check these against the hashes we found in the previous steps.

Conclusion

I hope this article helped you understand what Git actually is and how it works internally. We learned that Git is a version control system, so it can track changes to any kind of content over time. Internally, it is a content-addressable filesystem, which means it works as a key-value store, with the key being the content address and the value being the content itself. We learned the structure of the .git folder, and about blob, tree, and commit objects. We also learned how these objects are created internally when we run git add and git commit.
