Duplicated files in the pristine FC4t2 installation

Roland McGrath roland at redhat.com
Mon May 2 20:27:08 UTC 2005


> But I think the whole problem is silly as well, FWIW.

When Warren brought this up on IRC a while back, I wrote the following
script and ran it on a rawhide everything install.  It fails to take
into account files that are already hardlinked, so its results might
well be significantly inflated.  (Someone who cares could hack it further
to check installed names of a duplicate file for being the same inode.)
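
The same-inode check suggested above might look roughly like the sketch
below. It is not part of the original script: distinct_inodes is a
hypothetical helper name, and it assumes GNU coreutils stat, whose
-c '%d:%i' format prints the device and inode numbers. Given the
installed names of one duplicate group on stdin, it prints how many
distinct inodes they occupy; a group of n+1 names on k inodes wastes only
k-1 copies rather than n.

```shell
#!/bin/sh
# Sketch only: count the distinct inodes behind a list of paths.
# Assumes GNU stat; distinct_inodes is a hypothetical helper name.
distinct_inodes() {
  # Read one path per line; skip names that are not installed as regular files.
  while IFS= read -r path; do
    if [ -f "$path" ]; then
      stat -c '%d:%i' -- "$path"   # device:inode identifies the file body
    fi
  done | sort -u | awk 'END { print NR }'
}

# Demonstration in a scratch directory: a and b are hard links, c is a copy.
tmp=$(mktemp -d)
echo data > "$tmp/a"
ln "$tmp/a" "$tmp/b"
cp "$tmp/a" "$tmp/c"
printf '%s\n' "$tmp/a" "$tmp/b" "$tmp/c" | distinct_inodes   # prints 2
rm -rf "$tmp"
```

Paths are read one per line, so names containing newlines are not
handled.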

Total 408578931 bytes in 43107 inodes

That's a max of < 400M on an install that is something like 8.5-9G.
So the issue is worth at most on the order of 5% of disk space,
and that is probably a very high estimate.
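
The percentage can be checked with a short calculation. This is a sketch
that assumes "G" above means GiB (2^30 bytes); dup_share is a
hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: duplicate bytes as a percentage of the install size.
# Assumes sizes in GiB (1073741824 bytes); dup_share is a hypothetical name.
dup_share() {
  # $1 = duplicate bytes, $2 = install size in GiB
  awk -v dup="$1" -v gib="$2" \
    'BEGIN { printf "%.1f\n", 100 * dup / (gib * 1073741824) }'
}

dup_share 408578931 8.5   # prints 4.5
dup_share 408578931 9     # prints 4.2
```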


rpm -qa --qf '[%{FILEMD5S}  %{FILENAMES} %{FILESIZES} %{SOURCERPM}\n]' |
awk '
NF < 4 { next } # directory
{
  md5_name[$1] = $2;
  md5_srpm[$1] = $4;
  info = $2 " " $4;
  if ($1 in sizes) {
    # Same checksum but different size: report both names (md5 is keyed by checksum).
    if ($3 != sizes[$1]) print "!!!", $1 ":", info, "VS", md5[$1]
  } else {
    sizes[$1] = $3;
  }
  if ($1 in md5) {
    if (info == md5[$1]) next;
    for (i = 1; i <= dups[$1]; ++i)  # check every recorded duplicate, including the last
      if (dupinfo[$1 "," i] == info)
        next;
    dups[$1]++;
    dupinfo[$1 "," dups[$1]] = info;
  } else {
    md5[$1] = info;
  }
}
END {
  dupsize = dupcount = 0;
  for (sum in dups) {
    n = dups[sum];
    dupcount += n;
    dupsize += n * sizes[sum];
    print n, "dups:", sum, " ==> ", (n * sizes[sum]);
    print "\t" md5[sum];
    for (i = 1; i <= n; ++i)
      print "\t" dupinfo[sum "," i];
  }
  print "Total", dupsize, "bytes in", dupcount, "inodes";
}
'



