Git is an amazing tool for tracking all your changes and reverting them if necessary. While Git is perfect, people are not. So, if you commit something to your repo by mistake, like a build file, a temporary folder, or your cache, Git stores it, because it can't predict when you make mistakes.
You can, of course, remove files with the git rm command. Git removes the file from the working tree and from future commits, but every earlier commit that contained it still stores a copy in case you need it later. If you do this often enough, you end up with what I call a bloated Git repository. Size isn't the only reason to clean up your repo; you also need to remove sensitive files that you committed by mistake and could accidentally send to someone.
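To make the point about git rm concrete, here is a minimal sketch in a throwaway repository. The temporary directory, the demo identity, and the file name big.bin are assumptions for illustration only; they are not part of my real repo.

```shell
set -e
# Throwaway repository in a temporary directory (hypothetical demo setup).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit a 1 MiB file of random bytes, then remove it with git rm.
head -c 1048576 /dev/urandom > big.bin
git add big.bin
git commit -q -m "Add binary"
blob=$(git rev-parse HEAD:big.bin)
git rm -q big.bin
git commit -q -m "Remove binary"

# The file is gone from the working tree...
if [ ! -e big.bin ]; then
    echo "big.bin is no longer in the working tree"
fi

# ...but the object is still stored and reachable from the first commit.
git cat-file -t "$blob"    # prints: blob
git cat-file -s "$blob"    # prints: 1048576
```

Every clone of this repository would still download that 1 MiB object, which is exactly the bloat I'm describing.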
An example
Walking through an example might make this easier to understand. I'll begin by cloning my repository using --bare and --mirror (--mirror already implies --bare, so passing both is redundant but harmless):
$ git clone --bare --mirror git@github.com:renatosuero/bloated-repo.git
Next, let's take a look at what is in the repository. Pay special attention to each commit and the size of the repo:
$ git log --oneline
b3e1a9d (HEAD -> main) Removed v2
e6f3491 Added code + binary v2
ab86d2e Removed v1
6ac1e2c Added binary v1
250986f Added code v1
# Let's check the .git folder size
$ du -hs .git
2.3M .git
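Besides du, Git can report its own object storage with git count-objects. The sketch below runs it in a throwaway repository; the temporary directory, the demo identity, and the single main.go commit are assumptions for illustration.

```shell
set -e
# Throwaway repository with a single commit (hypothetical demo setup).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
echo "package main" > main.go
git add main.go
git commit -q -m "Added code v1"

# Pack loose objects so the "size-pack" figure is populated.
git gc --quiet

# -v prints a detailed breakdown; -H prints sizes in human-readable units.
report=$(git count-objects -vH)
echo "$report"
```

The size-pack line is the one to watch before and after a cleanup, since it covers the packed objects that make up most of a repository's history.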
Git can show me which files a given commit added. Here, commit 6ac1e2c added a binary to the build folder:
$ git show --pretty="" 6ac1e2c
diff --git a/build/app.v1 b/build/app.v1
new file mode 100755
index 0000000..15cee01
Binary files /dev/null and b/build/app.v1 differ
Before I can fix this, I need a way to see what is in my Git history. I can do this with plain Git, but it isn't an easy job. Instead, I use a fantastic tool called git-filter-repo. It's available to install on Linux, as well as Windows (through Scoop) and macOS (through Homebrew).
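For comparison, this is roughly what the plain Git route looks like. It is a sketch of a commonly used pipeline, not something git-filter-repo runs internally, and the demo repository, identity, and file sizes are assumptions for illustration.

```shell
set -e
# Throwaway repository with a small source file and a larger binary (hypothetical).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
echo "package main" > main.go
mkdir build
head -c 200000 /dev/urandom > build/app.v1
git add .
git commit -q -m "Added code + binary v1"
git rm -q -r build
git commit -q -m "Removed v1"

# List the blobs reachable from any ref, largest first:
#   rev-list --objects      prints each object ID with the path it was found at
#   cat-file --batch-check  adds the object type and size for each ID
largest=$(
    git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob" { print $3, $4 }' |
    sort -n -r |
    head -n 5
)
echo "$largest"
```

It works, but it only gives raw sizes per blob. The reports from git-filter-repo also group by directory and extension and tell me when a path was deleted.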
To analyze a repo using git-filter-repo, I use the following command:
$ git filter-repo --analyze
This command creates a .git/filter-repo/analysis folder; here's what is in mine:
$ ls -1 .git/filter-repo/analysis
README
blob-shas-and-paths.txt
directories-all-sizes.txt
directories-deleted-sizes.txt
extensions-all-sizes.txt
extensions-deleted-sizes.txt
path-all-sizes.txt
path-deleted-sizes.txt
renames.txt
Looking in path-all-sizes.txt, I see each file's path and size:
$ cat .git/filter-repo/analysis/path-all-sizes.txt
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
1869456 1122070 2021-09-17 build/app.v2
1869456 1122070 2021-09-17 build/app.v1
140 162 <present> main.go
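Because the report is plain text, it is easy to post-process. The sketch below totals the packed size of paths that have already been deleted from the working tree. It reads a sample file that I pasted from the report in this article, so the temporary file and the exact column spacing are assumptions; point awk at your own path-all-sizes.txt instead.

```shell
set -e
# Sample report copied from this article (a stand-in for the real file).
report=$(mktemp)
cat > "$report" <<'EOF'
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
     1869456    1122070 2021-09-17 build/app.v2
     1869456    1122070 2021-09-17 build/app.v1
         140        162 <present>  main.go
EOF

# Skip the two header lines, keep rows whose third column is a deletion date
# rather than "<present>", and add up the packed size (second column).
deleted_bytes=$(awk 'NR > 2 && $3 != "<present>" { total += $2 } END { print total + 0 }' "$report")
echo "packed bytes held by deleted paths: $deleted_bytes"
```

For the sample above, the total is 2244140 bytes, which is close to the whole 2.3M that du reported for my .git folder.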
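The path-all-sizes.txt report shows that the binaries are my largest files. I could remove them individually, file by file. However, I know the build folder isn't necessary, so I'll remove the whole folder from history. Here, --path build selects the folder, and --invert-paths flips the selection so that everything except build is kept: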
$ git filter-repo --path build --invert-paths --force
Parsed 5 commits
New history written in 0.03 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at 3c37484 Added code + binary v2
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0), pack-reused 0
Completely finished after 0.14 seconds.
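If you want to try the same command without touching a real project, the following sketch rebuilds a tiny version of my scenario in a throwaway repository. The directory, identity, and file contents are assumptions for illustration, and the rewrite step is skipped when git-filter-repo isn't found on the PATH.

```shell
set -e
# Throwaway repository: one source commit, then a binary added and removed.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

echo "package main" > main.go
git add main.go
git commit -q -m "Added code v1"
mkdir build
head -c 200000 /dev/urandom > build/app.v1
git add build
git commit -q -m "Added binary v1"
git rm -q -r build
git commit -q -m "Removed v1"

before=$(git rev-list --count --all)

if command -v git-filter-repo >/dev/null 2>&1; then
    # Same flags as above: keep everything except the build path.
    git filter-repo --path build --invert-paths --force >/dev/null 2>&1
    after=$(git rev-list --count --all)
    echo "commits before: $before, after: $after"
else
    after="skipped"
    echo "git-filter-repo is not installed; skipping the rewrite"
fi
```

In this sketch, I would expect the two commits that only touched build to disappear, leaving a single commit, which mirrors what happens in my repo.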
Then I'll recheck the repo size:
$ du -hs .git
108K .git
Nice improvement, don't you think? This didn't just remove the build directory, though. It also rewrote the history of the repository: the commits that only touched build are gone, and the remaining commits have new hashes, as seen here:
$ git log --oneline
3c37484 (HEAD -> main) Added code + binary v2
adb0a6e Added code v1
To confirm the operation, I can rerun the analysis and check the same file. I see nothing about the build folder or binaries:
$ cat .git/filter-repo/analysis/path-all-sizes.txt
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
280 141 <present> main.go
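A second check that needs only Git is to ask whether any commit on any ref still touches the path. The helper name path_is_purged is hypothetical, and the demo repository around it is an assumption for illustration.

```shell
set -e
# Hypothetical helper: succeeds only if no commit on any ref touches the given path.
path_is_purged() {
    [ -z "$(git log --all --oneline -- "$1")" ]
}

# Throwaway repository whose history still contains a build folder.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
echo "package main" > main.go
mkdir build
echo "binary placeholder" > build/app.v1
git add .
git commit -q -m "Added code + binary v1"
git rm -q -r build
git commit -q -m "Removed v1"

if path_is_purged build; then
    build_state="purged"
else
    build_state="still in history"
fi
echo "build: $build_state"

if path_is_purged never-existed; then
    other_state="purged"
else
    other_state="still in history"
fi
echo "never-existed: $other_state"
```

In a repository that has been cleaned successfully, the same call for build should report it as purged.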
Amazing! But now I need to apply the change by running git push -f to force the update to the remote. There are some caveats about this method (recall that I used --bare and --mirror to clone the repo):
- The tool ignores all branches, so if you work with branches (such as production, development, or feature), they will be lost.
- You will lose any open pull requests (PRs) or merge requests (MRs).
- Work that exists only on a developer's machine, such as uncommitted changes, can't be merged directly, because every rewritten commit has a different hash.
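To show why the push has to be forced, here is a small sketch that uses a local bare repository as a stand-in for the remote. Amending a commit plays the role of the history rewrite; the directories, identity, and file names are assumptions for illustration.

```shell
set -e
# A local bare repository stands in for the remote (hypothetical setup).
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git config user.email "demo@example.com"
git config user.name "Demo"
git remote add origin "$work/remote.git"

echo "v1" > main.txt
git add main.txt
git commit -q -m "Added code v1"
git push -q origin HEAD

# Rewrite the commit: same content, new message, therefore a new hash.
git commit -q --amend -m "Added code v1 (rewritten)"

# A plain push is rejected because the remote commit is not an ancestor.
if git push -q origin HEAD 2>/dev/null; then
    plain_push="accepted"
else
    plain_push="rejected"
fi
echo "plain push: $plain_push"

# A forced push replaces the remote history with the rewritten one.
git push -q --force origin HEAD
branch=$(git symbolic-ref --short HEAD)
remote_head=$(git --git-dir="$work/remote.git" rev-parse "refs/heads/$branch")
local_head=$(git rev-parse HEAD)
echo "remote now matches local: $remote_head"
```

One thing to check in your own setup: git-filter-repo is designed to remove the origin remote after a rewrite, so you may need to add the remote back before you can push.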
An unfortunate side effect of cleaning up the repo is that you lose the things I listed above. My recommendations are:
- Visit the branches, PRs, and MRs first to see what can be removed.
- Maintain the current repo and create a new one to clean up so that you can maintain the branches, PRs, MRs, and your history.
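If keeping the old repository online isn't an option, one way to preserve it is a bundle file that holds every ref. This is a sketch of that idea rather than part of my original workflow; the directories, the development branch, and the backup.bundle name are assumptions for illustration.

```shell
set -e
# Throwaway repository with two branches (hypothetical setup).
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo"
echo "v1" > main.txt
git add main.txt
git commit -q -m "Added code v1"
git branch development

# Write every ref and its full history into a single backup file.
git bundle create "$work/backup.bundle" --all 2>/dev/null

# Check that the bundle is valid and complete.
git bundle verify "$work/backup.bundle" >/dev/null 2>&1

# The bundle can be cloned like any other remote.
git clone -q "$work/backup.bundle" "$work/restored"
original=$(git rev-parse development)
restored=$(git -C "$work/restored" rev-parse origin/development)
echo "development branch restored at: $restored"
```

The bundle doesn't carry PRs or MRs, since those live on the hosting service, but it does keep every branch and commit as they were before the rewrite.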
If you need something like a branch or uncommitted changes on a developer's machine, you must manually copy them and apply them to the new code.
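For uncommitted changes, a patch file is one way to do that manual copy, because a patch describes file content rather than commit hashes. In this sketch, the old and new directories stand in for the original and the cleaned repository; all names are assumptions for illustration.

```shell
set -e
work=$(mktemp -d)

# "old" stands in for the original repo, "new" for the cleaned one (hypothetical names).
for name in old new; do
    git init -q "$work/$name"
    git -C "$work/$name" config user.email "demo@example.com"
    git -C "$work/$name" config user.name "Demo"
    printf 'line one\n' > "$work/$name/main.txt"
    git -C "$work/$name" add main.txt
    git -C "$work/$name" commit -q -m "Added code v1"
done

# A developer has an uncommitted edit in the old repo.
printf 'line one\nline two\n' > "$work/old/main.txt"

# Export the edit as a patch file.
git -C "$work/old" diff > "$work/change.patch"

# Apply it to the working tree of the cleaned repo.
git -C "$work/new" apply "$work/change.patch"
result=$(cat "$work/new/main.txt")
echo "$result"
```

For work that was committed locally but never pushed, git format-patch and git am do the same job while keeping the commit messages.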
This can be annoying at first, but soon people will be able to clone the repo without getting messages like "Clone the repo and grab a coffee because it will take a long time." A smaller repo also speeds up every step of your development pipeline that clones or fetches it.
Revised Git
This tool isn't for every repository or everyone's workflow. You have to feel comfortable with rewriting history, which for some repositories is a much-needed correction. If your repository's history is more like baggage than backstory, try git-filter-repo.
About the author
Renato Suero is a software engineer who is getting paid to participate in his hobby (aka he loves his profession). He is a lifelong tech geek and is currently learning backend and servers. When he isn't coding, studying, or binging a new series, he is spending time with his son Enrico and his wife Marina.