Eliminating duplicate photos

Tue Sep 30 10:56:28 UTC 2008

Nifty Fedora Mitch wrote:
> On Mon, Sep 29, 2008 at 02:09:05PM -0430, Patrick O'Callaghan wrote:
>   
>> On Mon, 2008-09-29 at 14:00 -0400, Trapper wrote:
>>     
>>> Itamar - IspBrasil wrote:
>>>       
>>>> Create a list of the MD5 sums of all the files.
>>>>
>>>> Matching sums will show you the duplicated files.
>>>>
>>>> On 9/29/2008 9:04 AM, Timothy Murphy wrote:
>>>>         
>>>>> What is the best way of eliminating duplicate photos
>>>>> on a number of machines, all running Fedora or CentOS?
>>>>>
>>>>> I suppose one could ask the same question about files generally;
>>>>> how to tag or delete duplicates.
>>>>>
>>>>>    
>>>>>           
>>> I have a problem similar to Timothy's. If I run "md5sum *" on a folder
>>> in a terminal, it lists all the sums. My problem is that I have several
>>> thousand files. Is there some way I can output the results to a text 
>>> file? Can't copy and paste unless there's some way for me to adjust the 
>>> terminal to allow the last several thousand lines to display. Then I'm 
>>> also going to have to sort all those lines into some alphabetical order 
>>> to reasonably detect duplicate sums. Any ideas?
>>>       
>> You're using Linux here. Anything that outputs text to a terminal can
>> send it to a file or to another program. You need to read up on Shell
>> redirection and filters, e.g.:
>>
>> md5sum * > sums
>>
>> or
>>
>> md5sum * | sort > sorted_sums
>>
>>     
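As a rough sketch of the redirection-and-filter idea above, the function below saves the sorted sums to a file and then prints only the lines whose checksum shows up more than once. The name `list_dupes` and the default output path are made up for illustration, and the `uniq` options `-w` and `-D` assume GNU coreutils.

```shell
#!/bin/bash
# list_dupes DIR [OUTFILE]  (hypothetical helper, not from the thread)
# Checksums every regular file directly inside DIR, saves the sorted list
# to OUTFILE, then prints the lines whose MD5 sum appears more than once.
list_dupes() {
    local dir=${1:-.}
    # Keep OUTFILE outside DIR so that it does not get checksummed itself.
    local out=${2:-/tmp/sorted_sums}
    # find ... -exec md5sum {} + copes with more files than one command line holds.
    find "$dir" -maxdepth 1 -type f -exec md5sum {} + | sort > "$out"
    # -w32: compare only the 32-character checksum; -D: print all repeated lines.
    uniq -w32 -D "$out"
}
```

Calling `list_dupes ~/Pictures` would then leave the full sorted list in /tmp/sorted_sums and show only the suspected duplicates on the terminal.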
>
> The script below is not very general, but it can be edited to suit
> your needs. The SIZER value makes it easy to restrict the search to
> large files such as duplicate ISO images. The md5sum value
> d41d8cd98f00b204e9800998ecf8427e is the checksum of an empty file, so
> it pops up often and is excluded.
>
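For anyone curious where the excluded md5sum value comes from, a quick check is to checksum zero bytes of input. This is only a sketch and the variable name `empty_sum` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical check, not part of the original script:
# compute the MD5 sum of empty input and keep only the 32-character sum.
empty_sum=$(md5sum < /dev/null | cut -c1-32)
echo "$empty_sum"
```

Every zero-length file produces this same sum, which is why such files would otherwise all be reported as duplicates of each other.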
> ============================================================
> #!/bin/bash
> # Copyright (C) 1985-2008 by Tom Mitchell 
> #
> # This program is free software, licensed under the GNU GPL, >=2.0. http://www.gnu.org/.
> # This software comes with absolutely NO WARRANTY. Use at your own risk!
> #
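> # Size filter passed to find. Uncomment the first line to look only at
> # files larger than 10 MiB (ISO images, for example).
> #SIZER=' -size +10240k'
> SIZER=' -size +0'    # default: every non-empty file
> #
> # Directories to search, separated by spaces.
> DIRLIST=". "
> # $DIRLIST and $SIZER are left unquoted on purpose so they split into words.
> find $DIRLIST -type f $SIZER -print0 | xargs -0 md5sum |
>     grep -Ev "d41d8cd98f00b204e9800998ecf8427e|LemonGrassWigs" |
>     sort > /tmp/looking4duplicates
> # Ring the terminal bell three times to signal that the checksums are done.
> tput bel; sleep 2
> tput bel; sleep 2
> tput bel; sleep 2
> # Compare only the 32-character checksum and print each group of repeats.
> uniq --check-chars=32 --all-repeated=prepend /tmp/looking4duplicates | less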
>
>
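Once a sorted list of sums exists, one possible next step is to turn it into a list of the extra copies, that is, every file after the first in each group of identical sums. The sketch below is only an illustration: `extra_copies` is a made-up name, it assumes the usual md5sum line layout (32-character sum, two separator characters, then the file name), and it does not handle names containing newlines or backslashes, which md5sum escapes.

```shell
#!/bin/bash
# extra_copies (hypothetical helper): reads sorted md5sum output on stdin
# and prints the file name of every copy after the first in each group.
extra_copies() {
    awk '{
        sum  = substr($0, 1, 32)   # the 32-character MD5 sum
        name = substr($0, 35)      # file name, after the sum and two separator characters
        if (seen[sum]++) print name
    }'
}
```

Something like `extra_copies < /tmp/looking4duplicates` would print the candidates for removal. It only prints names and deletes nothing, so the list can be reviewed first; comparing a pair with `cmp` before removing anything is cheap insurance.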
>   
My thanks to those that provided me with some suggestions, direction and 
study hall tips. Either of the procedures listed above does the trick 
for me, as does fslint.

Trapper