Git is an amazing tool for tracking all your changes and reverting them if necessary. While Git is perfect, people are not. So, if you send something by mistake to your repo, like a build file, temporary folder, your cache, and so forth, Git will store it because it can't predict when you make mistakes.
You can, of course, remove files with the git rm command. Git will remove the file but keep it available in the history in case you need it later. However, if you do this often enough, you end up with what I call a bloated Git repository. Size isn't the only concern: sensitive files committed by accident also stay in the history, where anyone who clones the repo can retrieve them.
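To make that concrete, here is a minimal sketch that builds a throwaway repository and shows the removed file is still reachable from history. The temporary directory, the placeholder file contents, and the demo identity are assumptions for illustration only.

```shell
# Sketch: `git rm` removes a file from the working tree, not from history.
# Everything below runs in a throwaway directory with a hypothetical identity.
set -eu

demo=$(mktemp -d)
cd "$demo"
git init -q .

# Helper that commits with an inline identity, so no global config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

mkdir build
printf 'binary placeholder\n' > build/app.v1
git add build/app.v1
commit "Added binary v1"

git rm -q build/app.v1
commit "Removed v1"

# The working tree no longer contains the file...
in_worktree=$(ls build 2>/dev/null | wc -l | tr -d ' ')

# ...but its blob is still listed among the objects reachable from history.
in_history=$(git rev-list --objects --all | grep -c 'build/app.v1$' || true)

echo "files in working tree: $in_worktree"
echo "matching objects in history: $in_history"
```

The second number stays above zero for as long as any commit still references the file, which is why a history rewrite is needed to reclaim the space.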
Walking through an example might make this easier to understand. I'll begin by cloning my repository:
$ git clone --bare --mirror email@example.com:renatosuero/bloated-repo.git
Next, let's take a look at what is in the repository. Pay special attention to each commit and the size of the repo:
$ git log --oneline
b3e1a9d (HEAD -> main) Removed v2
e6f3491 Added code + binary v2
ab86d2e Removed v1
6ac1e2c Added binary v1
250986f Added code v1

# Let's check the .git folder size
$ du -hs .git
2.3M .git
Git can show me which file a given commit added to the build folder:
$ git show --pretty="" 6ac1e2c
diff --git a/build/app.v1 b/build/app.v1
new file mode 100755
index 0000000..15cee01
Binary files /dev/null and b/build/app.v1 differ
Before I can fix this, I need a way to see what is in my Git history. I can do this using Git but it isn't an easy job. Instead, I use a fantastic tool called git-filter-repo. It's available to install on Linux, as well as Windows (through Scoop) and macOS (through Homebrew).
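For comparison, this is roughly what the plain-Git route looks like. It is only a sketch: largest_blobs is a hypothetical helper name, and the pipeline simply chains standard plumbing commands to list the biggest blobs in the whole history.

```shell
# Sketch: list the largest blobs across a repository's entire history
# using only Git plumbing. `largest_blobs` is a hypothetical helper name.
# Usage: largest_blobs <repo-dir> [count]
largest_blobs() {
    repo=$1
    count=${2:-10}
    # rev-list prints "<sha> <path>" for every object reachable from any ref.
    # cat-file looks up each sha and reports its type and size; %(rest)
    # carries the path through. awk keeps blobs and prints "<size> <path>".
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -k1,1nr |
        head -n "$count"
}
```

Calling largest_blobs . 5 inside a clone prints the five largest blobs as size and path pairs. It works, but it reports raw blob sizes only, which is part of why I prefer the analysis report from git-filter-repo.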
To analyze a repo using git-filter-repo, I use the following command:
$ git filter-repo --analyze
This command creates a .git/filter-repo/analysis folder; here's what is in mine:
$ ls -1 .git/filter-repo/analysis
README
blob-shas-and-paths.txt
directories-all-sizes.txt
directories-deleted-sizes.txt
extensions-all-sizes.txt
extensions-deleted-sizes.txt
path-all-sizes.txt
path-deleted-sizes.txt
renames.txt
In path-all-sizes.txt, I see the files along with their path and size:
$ cat .git/filter-repo/analysis/path-all-sizes.txt
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
1869456 1122070 2021-09-17 build/app.v2
1869456 1122070 2021-09-17 build/app.v1
140 162 <present> main.go
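On a larger repository this report can get long, so a small filter helps. The following sketch pulls out only the paths that have a deletion date, meaning they are gone from the working tree but still cost space in history. deleted_paths is a hypothetical helper name, and the script assumes the four-column layout shown above and paths without spaces.

```shell
# Sketch: print "<packed size> <path>" for every path that was deleted
# but still lives in history. `deleted_paths` is a hypothetical helper.
# Assumes the four-column report layout and paths without spaces.
deleted_paths() {
    # Data rows start with a number; header rows do not. The third column
    # is either a deletion date or the literal marker <present>.
    awk '$1 ~ /^[0-9]+$/ && $3 != "<present>" { print $2, $4 }' "$1"
}

# Sample data copied from the report in this article.
report=$(mktemp)
cat > "$report" <<'EOF'
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
1869456 1122070 2021-09-17 build/app.v2
1869456 1122070 2021-09-17 build/app.v1
140 162 <present> main.go
EOF

deleted_paths "$report"
```

With the sample data, only the two build binaries are printed, which matches what I concluded by reading the file.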
This output shows that the binaries are my largest files. I can remove them individually, file by file. However, I know the build folder isn't necessary, so I'll clean it up:
$ git filter-repo --path build --invert-paths --force
Parsed 5 commits
New history written in 0.03 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at 3c37484 Added code + binary v2
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0), pack-reused 0
Completely finished after 0.14 seconds.
Then I'll recheck the repo size:
$ du -hs .git
108K .git
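The du figure covers everything under .git, including hooks and configuration. If you want Git's own view of how much space the objects take, a sketch along these lines may help. repo_size is a hypothetical helper name wrapped around git count-objects.

```shell
# Sketch: report Git's own object-size figures for a repository.
# `repo_size` is a hypothetical helper name.
# Usage: repo_size <repo-dir>
repo_size() {
    # -v prints detailed counters; -H prints sizes in human-readable units.
    # "size" is the loose-object total, "size-pack" the packfile total.
    git -C "$1" count-objects -v -H |
        awk -F': ' '$1 == "size" || $1 == "size-pack" { print }'
}
```

Running repo_size . before and after the cleanup gives a second measurement to compare with du.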
Nice improvement, don't you think? This didn't just remove the build directory, though. It also modified the history of the repository, as seen here:
$ git log --oneline
3c37484 (HEAD -> main) Added code + binary v2
adb0a6e Added code v1
To confirm the operation, I can rerun the steps to analyze and check the file. I see nothing about the build folder or binaries:
$ cat .git/filter-repo/analysis/path-all-sizes.txt
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
280 141 <present> main.go
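Another way to double-check, without rerunning the analysis, is to ask plain Git how many commits on any ref still touch a path. This is a sketch under the assumption that a count of zero is good enough evidence for your purposes. path_commit_count is a hypothetical helper name.

```shell
# Sketch: count the commits, across all refs, that touch a given path.
# `path_commit_count` is a hypothetical helper name.
# Usage: path_commit_count <repo-dir> <path>
path_commit_count() {
    # --all walks every ref; the "--" separates the path from revisions,
    # so a path that no longer exists on disk is still accepted.
    git -C "$1" log --all --oneline -- "$2" | wc -l | tr -d ' '
}
```

After a cleanup like mine, path_commit_count . build should print 0, while path_commit_count . main.go should still report the commits that changed the code.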
Amazing! But now I need to apply the change by running git push -f, which forces the rewritten history onto the remote. There are some caveats about this method (recall that I used --mirror to clone the repo):
- The tool ignores all branches, so if you work with branches (such as production, development, or feature), they will be lost.
- You will lose any opened pull request (PR) or merge request (MR).
- Uncommitted changes on a developer's machine can't be merged, because the rewritten commits have different hashes from the ones the developer's clone is based on.
An unfortunate side effect of cleaning up the repo is you lose the things I listed above. My recommendations are:
- Visit the branches, PRs, and MRs first to see what can be removed.
- Maintain the current repo and create a new one to clean up so that you can maintain the branches, PRs, MRs, and your history.
If you need something like a branch or uncommitted changes on a developer's machine, you must manually copy them and apply them to the new code.
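To take stock before the cleanup, it can help to list every branch with the date of its last commit, so stale ones stand out. The sketch below assumes branches live under refs/heads, as they do in a mirror clone. list_branches is a hypothetical helper name.

```shell
# Sketch: list local branches, most recently committed first, so stale
# branches are easy to spot. `list_branches` is a hypothetical helper name.
# Usage: list_branches <repo-dir>
list_branches() {
    git -C "$1" for-each-ref --sort=-committerdate \
        --format='%(committerdate:short) %(refname:short)' refs/heads
}
```

Each output line is a date followed by a branch name. Open PRs and MRs are not visible this way; those still need a look in your hosting service.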
This can be annoying at first, but soon people will be able to clone their repo without getting messages like "Clone the repo and grab a coffee because it will take a long time." It will also improve your development pipeline when your repo is smaller.
This tool isn't for every repository or everyone's workflow. You have to feel comfortable with rewriting history, which for some repositories is a much-needed correction. If your repository's history is more like baggage than backstory, try git-filter-repo.