Whether you're a sysadmin by career or you're just dealing with a home business server in your broom closet that's running out of space, you probably need to find out where disk space is going to waste. Tools like
df will get you part of the way there. But when you need to clean up space quickly, you need a different type of utility to fill the gap.
This is where my Top Disk Consumer Report Generator utility comes in handy. This open source tool (GPLv3) allows you to clear space with minimal detective work. Run
topdiskconsumer from any directory on the filesystem, and it reports the overall free space, the largest files and directories, the largest files older than 30 days, and deleted files whose space is still in use.
This utility can be used for routine housekeeping, troubleshooting disk usage on production servers, and as a training tool for junior sysadmins.
A few years ago, I created this script, and I have maintained (and rewritten) it since. It was initially poorly coded because it grew organically: I tacked on each feature as I needed to gather information for ticket updates justifying actions taken on servers. My little one-liner eventually grew into a 2.5KB behemoth that was widely used by customers of a large hosting company. However, apart from one variable (
intNumFiles) that could be assigned at the beginning of the script, it was entirely hard-coded, and it awkwardly parsed the output of utilities like
df to extract specific stats in a very inefficient way.
I had always intended to rewrite it as a full command-line utility, including switches to turn off specific reports, timeouts, formatting options other than bbcode, and a help screen.
I still have my original version of this script, and if it weren't for how it grew organically and as a teaching tool for my mentees, I'd feel embarrassed at how bad the code is. What makes me proud of it is that I have been able to teach junior engineers how to take information and tools they have and parse and reuse output to drive other tools.
Install Top Disk Consumer Report Generator
The installation process is straightforward. Here are the basic steps:
1. Open the storagetoolkit repository in a web browser.
2. Click the green CODE button above the file listings.
3. Click the COPY button.
4. Open a terminal and change to the directory where you want to download the files.
5. Run git clone https://github.com/klazarsk/storagetoolkit.git to clone the project.
6. Run cd storagetoolkit to change to the subdirectory Git created for you.
7. Run sudo chmod a+x topdiskconsumer to ensure the script has execute privileges.
8. List the files; your username will own them:
$ sudo chmod a+x topdiskconsumer
[klazarsk@klazarsk storagetoolkit]$ ls -lh
total 16K
-rw-rw-r--. 1 klazarsk klazarsk 3.4K Jan  6 15:29 README.md
-rwxrwxr-x. 1 klazarsk klazarsk  11K Jan  6 15:29 topdiskconsumer
9. Use sudo or root privileges to copy the file to /usr/bin or another system directory in the search path (or to ~/.local/bin of the account you wish to run it from, if you do not want it in your system directory):

$ sudo cp topdiskconsumer /usr/bin -v
[sudo] password for klazarsk:
'topdiskconsumer' -> '/usr/bin/topdiskconsumer'
[klazarsk@klazarsk storagetoolkit]$ ls -lh /usr/bin/topdiskconsumer
-rwxr-xr-x. 1 root root 11K Jan  6 15:32 /usr/bin/topdiskconsumer
Now you can execute the file from anywhere on the system.
Find which files and directories are using storage
To use this tool, place the
topdiskconsumer file into a directory in your search path,
chmod it as executable, and run it from any directory on the filesystem you want to analyze.
The script will output a report of overall disk usage on the filesystem, list the top 20 largest files, the top 20 largest directories, and the top 20 largest files older than 30 days. It will also find "ghosts" by identifying unlinked (deleted) files that are still consuming space due to open file handles. It even shows the file handle so you can easily reclaim that space.
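The technique behind reclaiming that space can be sketched in a few lines of Bash. The following is an illustration of the general Linux mechanism, not the script's own code; the file and process involved are throwaway values created just for the demo:

```shell
#!/usr/bin/env bash
# Demonstration of the general technique (not topdiskconsumer's own code):
# create a deleted-but-still-open file, locate its handle under /proc,
# and truncate through the handle to reclaim the space.
set -euo pipefail

tmp=$(mktemp)
head -c 10485760 /dev/zero > "$tmp"   # 10 MB of throwaway data
tail -f "$tmp" > /dev/null &          # background process holds a handle
pid=$!
sleep 1
rm "$tmp"                             # unlinked, but the space is still in use

# Find the file descriptor in /proc that still points at the deleted file
fd=$(ls -l "/proc/$pid/fd" | awk -v f="$tmp" '$0 ~ f {print $9; exit}')
echo "Deleted file held open as /proc/$pid/fd/$fd"

# Truncating through the handle releases the blocks without killing the process
: > "/proc/$pid/fd/$fd"
kill "$pid"
```

Truncating through the /proc handle is often preferable to restarting the offending service, because the space comes back immediately and the process keeps running.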
I've incorporated HTML, ANSI, and bbcode formatting for bold-face headers, with ANSI as the default formatting. I've also provided command-line options, including timeouts, omitting metadata, and the ability to skip reports to save time. There is a list of command-line options below.
If you want to run it on enormous filesystems, I recommend leveraging the command-line switches to turn off reports you do not care about and running it in a screen or tmux session. Alternatively, you can set a timeout so each report will die after the specified time (it can take days to run on mounts with an enormous number of files).
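A per-report timeout like this can be built on coreutils' timeout command. Here is a minimal sketch of that pattern; it is an illustration of the approach, not the script's actual code, and run_report is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Sketch of a per-section timeout (illustrative, not the script's own code):
# run each report under coreutils' timeout and substitute a note on expiry.
set -euo pipefail

run_report() {
  local seconds=$1; shift
  timeout "$seconds" "$@" || echo "[report timed out after ${seconds}s]"
}

# A fast "report" finishes normally...
fast=$(run_report 5 echo "report data")
echo "$fast"

# ...while a slow one is killed when the limit expires.
slow=$(run_report 1 sleep 10)
echo "$slow"
```

This is also why a timed-out report can be misleading: whatever the command printed before the limit expired is all you get, with no guarantee the true top consumers were reached.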
Try a sample run to view the top five disk consumers on your system. Following is the output, limited to five items per section. (Note: The
5 Largest Directories result displays "total" as an additional entry above the limit count. This is intentional.)
$ sudo ./topdiskconsumer --number 5
#_# BEGIN REPORT
== Server Time at start: ==
Wed Jan  4 13:53:08 EST 2023
== Filesystem Utilization on [ /home ]: ==
Filesystem            Type Size Used Avail Use% Mounted on
/dev/mapper/RHEL-Home xfs  100G  45G   56G  45% /home
== Inode Information for [ /home ]: ==
Filesystem            Type   Inodes  IUsed    IFree IUse% Mounted on
/dev/mapper/RHEL-Home xfs  52428800 483742 51945058    1% /home
== Storage device behind directory [ /home ]: ==
/dev/mapper/RHEL-Home
== 5 Largest Files on [ /home ]: ==
21G  /home/user/VirtualMachines/rhel8.qcow2
618M /home/user/ceph-common.tar
500M /home/user/scratch/bigFile
405M /home/user/Downloads/working/backup.tar
281M /home/user/.local/share/aaaaaaa1f.file
== 5 Largest Directories on [ /home ]: ==
45G total
45G /home
44G /home/user
21G /home/user/VirtualMachines
18G /home/user/.local/share
18G /home/user/.local
== 5 Largest Files on [ /home ] Older Than 30 Days: ==
21G  /home/user/VirtualMachines/rhel8.qcow2
618M /home/user/ceph-common.tar
500M /home/user/scratch/bigFile
405M /home/user/Downloads/working/backup.tar
281M /home/user/abc123.file
== 5 Largest Deleted Files on [ /home ] With Open Handles: ==
Size COMMAND File Handle         Filename
4MB  chrome  /proc/2728808/fd/14 /home/user/.config/foo.pma
== Elapsed Time: ==
0h:0m:30s
== Server Time at completion: ==
Wed Jan  4 13:53:38 EST 2023
#_# END REPORT
Refine the output with flags
When executed with no arguments, the script identifies the mount point containing the current working directory. It then runs the report from that mount point, identifying and listing the top 20 largest files, the top 20 largest directories, and the top 20 largest files aged over 30 days.
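Under the hood, reports like these can be assembled from standard tools such as find, du, and sort. The following is a simplified sketch of that approach (not the actual script), demonstrated on a throwaway directory so the expected winners are known:

```shell
#!/usr/bin/env bash
# Simplified sketch (not topdiskconsumer itself) of the "largest files" and
# "largest directories" reports, built from find, du, and sort.
set -euo pipefail

demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b"
head -c 300000 /dev/zero > "$demo/a/big"     # ~300 KB
head -c 20000  /dev/zero > "$demo/b/small"   # ~20 KB

echo "== Largest Files under [ $demo ]: =="
largest_files=$(find "$demo" -xdev -type f -printf '%s\t%p\n' | sort -rn | head -n 5)
echo "$largest_files"

echo "== Largest Directories under [ $demo ]: =="
du -x "$demo" | sort -rn | head -n 5

# The "older than 30 days" report would add -mtime +30 to the find command.
rm -rf "$demo"
```

The -xdev switch keeps find from crossing into other mounted filesystems, which is why the real report stays scoped to a single mount point.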
You can refine the output further with command-line arguments, which you can read more about in the documentation (also available from the script's help screen):

-f) [format]: Format headings as html, bbcode, or ansi (for terminals and rich text). ANSI is the default.
-p) [path]: Set a path, and topdiskconsumer will run from that directory's parent mount point.
-l) [number]: Limit each report section to the specified number of largest entries. The default is 20.
-t) [duration]: Set a timeout for each section of the report. Please note that specifying a timeout may result in incomplete and misleading results.
-o): Skip files more than 30 days old.
-d): Skip the largest directories.
-m): Omit metadata such as reserve blocks, start and end time, or duration.
-u): Skip deleted files with open handles.
-f): Skip the largest files.
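A common way a Bash script exposes switches like these is the getopts builtin. The following is a hypothetical sketch of that pattern handling a format, a limit, and one skip flag; it is not topdiskconsumer's actual parser:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of Bash option parsing with getopts; this is NOT
# topdiskconsumer's actual parser, only an illustration of the pattern.
set -euo pipefail

fmt=ansi     # default format, per the -f description above
limit=20     # default report length
skip_dirs=no

parse_args() {
  local opt OPTIND=1
  while getopts "f:l:d" opt; do
    case "$opt" in
      f) fmt=$OPTARG ;;       # -f [format]
      l) limit=$OPTARG ;;     # -l [number]
      d) skip_dirs=yes ;;     # -d: skip the directory report
    esac
  done
}

parse_args -f html -l 5 -d
echo "format=$fmt limit=$limit skip_dirs=$skip_dirs"
```

Options that take a value are marked with a trailing colon in the getopts string, which is how -f and -l receive their arguments while -d stays a bare flag.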
I like to use Hungarian notation for my variable and function names; I prefer
grep -E over
egrep (which is just a small shell script wrapper around
grep -E); and I usually end Bash statements with semicolons because that makes my scripts easier to collapse and adapt into a one-liner.
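The semicolon habit pays off when collapsing a block: a multi-line construct whose statements already end in semicolons can be joined onto one line without edits. A small example:

```shell
#!/usr/bin/env bash
# Why trailing semicolons help: the multi-line block and the one-liner
# below are the same statements, joined without any edits.
set -euo pipefail

# Multi-line form, with each statement terminated by a semicolon:
for f in a b c; do
  echo "file: $f";
done;

# The same loop collapsed into a single line:
oneliner=$(for f in a b c; do echo "file: $f"; done;)
echo "$oneliner"
```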
You can access the complete source code in my repository.