
Clean up your Git repository with this Linux tool

Bloated Git repositories may contain sensitive files and can slow your pipeline. Try git-filter-repo to eliminate the mess.
[Image: Washing windows. Photo by Nathan Cowley from Pexels]

Git is an amazing tool for tracking all your changes and reverting them if necessary. While Git is perfect, people are not. So, if you commit something to your repo by mistake, like a build file, a temporary folder, or your cache, Git will store it because it can't predict when you make mistakes.

You can, of course, remove files with the git rm command. Git removes the file from the current commit but keeps every earlier version in history in case you need it later. If this happens often enough, you end up with what I call a bloated Git repository. Size isn't the only concern, either: sensitive files committed by mistake also stay in history, where anyone who clones the repo can retrieve them.
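To see why git rm alone doesn't shrink anything, here is a minimal sketch in a throwaway repo. The file name, commit messages, and identity are made up for the demo; the only point is that the previous commit still holds the full file content.

```shell
#!/bin/sh
# Sketch: show that `git rm` deletes a file from the latest commit only.
# The temp directory, file name, and identity below are made up for the demo.
set -e

demo=$(mktemp -d)            # throwaway directory, safe to delete afterwards
cd "$demo"
git init -q .
# Set a local identity so the commits work on a machine without global config.
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'pretend this is a big build artifact\n' > app.bin
git add app.bin
git commit -q -m "Added binary"

git rm -q app.bin
git commit -q -m "Removed binary"

# The file is gone from the working tree...
[ ! -e app.bin ] && echo "working tree: app.bin is gone"

# ...but the previous commit still carries its full content.
git show HEAD~1:app.bin
```

The last command prints the original file content, which is exactly the data that keeps taking up space in every clone.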

An example

Walking through an example might make this easier to understand. I'll begin by cloning my repository using --bare and --mirror (strictly speaking, --mirror already implies --bare):

git clone --bare --mirror

Next, let's take a look at what is in the repository. Pay special attention to each commit and the size of the repo:

$ git log --oneline
b3e1a9d (HEAD -> main) Removed v2
e6f3491 Added code + binary v2
ab86d2e Removed v1
6ac1e2c Added binary v1
250986f Added code v1
# Let's check the .git folder size
$ du -hs .git
2.3M .git
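du measures everything in the folder, including hooks and configuration. As a complementary check, Git can report the size of its own packed objects. This is a small sketch using git count-objects; the helper name pack_size_kib is my own invention, not part of Git.

```shell
#!/bin/sh
# Sketch: read the packed size of the current repo from Git itself.
# `pack_size_kib` is a made-up helper name.

pack_size_kib() {
    # `git count-objects -v` prints a line "size-pack: <n>", with n in KiB.
    git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}

# Usage: run inside any repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
    echo "packed objects: $(pack_size_kib) KiB"
fi
```

Objects that have not been packed yet are reported separately on the "size:" line of the same command, so a freshly created repo can show a pack size of 0.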

I can ask Git what a given commit added. For example, commit 6ac1e2c introduced a binary in the build folder:

$ git show --pretty="" 6ac1e2c
diff --git a/build/app.v1 b/build/app.v1
new file mode 100755
index 0000000..15cee01
Binary files /dev/null and b/build/app.v1 differ

Before I can fix this, I need a way to see what is in my Git history. I can do this using Git alone, but it isn't an easy job. Instead, I use a fantastic tool called git-filter-repo. It's available to install on Linux, as well as Windows (through Scoop) and macOS (through Homebrew).


To analyze a repo using git-filter-repo, I use the following command:

$ git filter-repo --analyze

This command creates a .git/filter-repo/analysis folder; here's what is in mine:

$ ls -1 .git/filter-repo/analysis

Looking in path-all-sizes.txt, I see the files along with their path and size:

$ cat .git/filter-repo/analysis/path-all-sizes.txt
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
1869456 1122070 2021-09-17 build/app.v2
1869456 1122070 2021-09-17 build/app.v1
    140 162 <present> main.go

This output shows that the binaries are my largest files. I can remove them individually, file by file. However, I know the build folder isn't necessary, so I'll clean it up:

$ git filter-repo --path build --invert-paths --force
Parsed 5 commits
New history written in 0.03 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at 3c37484 Added code + binary v2
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0), pack-reused 0
Completely finished after 0.14 seconds.

Then I'll recheck the repo size:

$ du -hs .git
108K .git

Nice improvement, don't you think? This didn't just remove the build directory, though. It also modified the history of the repository, as seen here:

$ git log --oneline
3c37484 (HEAD -> main) Added code + binary v2
adb0a6e Added code v1
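Plain Git can also confirm that no remaining commit touches the removed folder. The sketch below counts commits on any ref that modify a path; commits_touching is a made-up helper name, and build is the folder from this example.

```shell
#!/bin/sh
# Sketch: count the commits, on any ref, that touch a given path.
# `commits_touching` is a made-up helper name.

commits_touching() {
    # $1: path to look for, relative to the repo root
    git log --all --oneline -- "$1" | wc -l | tr -d ' '
}

# Usage: after the cleanup, this should print 0 for the removed folder.
if git rev-parse --git-dir >/dev/null 2>&1; then
    commits_touching build
fi
```

A result other than 0 means some branch or tag still references the path, so the data is still in the repository.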

To confirm the operation, I can rerun the analysis and check path-all-sizes.txt again. I see nothing about the build folder or binaries:

$ cat .git/filter-repo/analysis/path-all-sizes.txt
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
   280 141 <present> main.go
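On a repo with a longer history, this report can run to many lines. One way to narrow it down, sketched here, is to keep only the paths that have been deleted yet still occupy space in history. The helper name deleted_paths is hypothetical, and the sketch assumes the four-column format shown above and path names without spaces.

```shell
#!/bin/sh
# Sketch: list paths that were deleted but still occupy space in history,
# based on the path-all-sizes.txt format shown in this article.
# Assumes path names contain no spaces; `deleted_paths` is a made-up helper name.

deleted_paths() {
    # $1: path to path-all-sizes.txt
    # Skip the two header lines, keep rows whose "date deleted" column
    # is a date rather than <present>, and print "packed-size path".
    awk 'NR > 2 && $3 != "<present>" { print $2, $4 }' "$1"
}

# Usage: run from the top of a repo that has already been analyzed.
report=.git/filter-repo/analysis/path-all-sizes.txt
if [ -f "$report" ]; then
    deleted_paths "$report"
fi
```

An empty result after the cleanup means nothing deleted is still taking up room.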

Amazing! Now I need to publish the change by running git push -f, which forces the remote to accept the rewritten history. There are some caveats about this method (recall that I used --bare and --mirror to clone the repo):

  • The tool ignores all branches, so if you work with branches (such as production, development, or feature), they will be lost.
  • You will lose any opened pull request (PR) or merge request (MR).
  • Uncommitted changes on a developer's machine can't be merged, because every rewritten commit gets a new hash and the old clone no longer shares history with the cleaned repo.

An unfortunate side effect of cleaning up the repo is that you lose the things I listed above. My recommendations are:

  • Visit the branches, PRs, and MRs first to see what can be removed.
  • Maintain the current repo and create a new one to clean up so that you can maintain the branches, PRs, MRs, and your history.

If you need something like a branch or uncommitted changes on a developer's machine, you must manually copy them and apply them to the new code.
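One way to keep the current repo around before rewriting anything is a mirror clone, sketched below. The helper name backup_mirror and the paths in the usage comment are hypothetical. A mirror keeps Git refs such as branches and tags; PRs and MRs live on the hosting platform and are not part of this copy.

```shell
#!/bin/sh
# Sketch: keep an untouched mirror of a repo before rewriting its history.
# `backup_mirror` and the paths in the usage comment are made-up names.

backup_mirror() {
    # $1: source repo (path or URL), $2: destination directory for the backup
    # --mirror copies every ref (branches, tags, notes), not just the default branch.
    git clone -q --mirror "$1" "$2"
}

# Usage (hypothetical paths):
#   backup_mirror /srv/git/myapp.git /srv/backup/myapp-before-cleanup.git
```

With the backup in place, a branch that turns out to be needed can be fetched from the mirror and reapplied by hand to the cleaned repo.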

This can be annoying at first, but soon people will be able to clone their repo without getting messages like "Clone the repo and grab a coffee because it will take a long time." It will also improve your development pipeline when your repo is smaller.

Revised Git

This tool isn't for every repository or everyone's workflow. You have to feel comfortable with rewriting history, which for some repositories is a much-needed correction. If your repository's history is more like baggage than backstory, try git-filter-repo.


Renato Suero

Renato Suero is a software engineer who is getting paid to participate in his hobby (aka he loves his profession). He is a lifelong tech geek and is currently learning backend and servers.
