I have an "images" directory with several subdirectories (and sub-sub-directories), one of them called "products".
Pretty much every image in the "products" directory is a duplicate of an image in the parent "images" directory or one of its other subdirectories, although with a different filename (binary contents will be the same, and there's a good chance that the timestamp will also be the same although this is not guaranteed).
I want to locate the duplicates and remove them - the ones I must keep are the ones in the "products" subdirectory. Ideally I'd like to move the duplicates into a temporary directory (retaining their sub-directory paths) rather than delete them, in case I need to restore them, but deletion wouldn't be too bad. I can take backups first, although it's on a hosted server and the "images" directory is 1.1G, with free space on the server a little less than that, so just duplicating the directory won't be as easy as it might sound - the lack of free space is one of the main reasons I need to do this!
Oh, and there are some sub-directories I'd like to exclude.
So: Which tools should I look at? fslint springs to mind but I've never really used it in anger and the documentation seems a bit vague. The GUI option is out because there's no X on the server. Are there other, better (more appropriate) tools?
Mark Rogers wrote:
I want to locate the duplicates, and remove them - [...] Oh, and there are some sub-directories I'd like to exclude.
So: Which tools should I look at? fslint springs to mind but I've never really used it in anger and the documentation seems a bit vague. The GUI option is out because there's no X on the server. Are there other, better (more appropriate) tools?
I'd be looking at scripting something with find, its -exec option, md5sum, sort, mv and maybe ln if I want to keep the filenames, but I've not looked at fslint.
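A very rough sketch of that approach (untested; assumes GNU coreutils, no newlines in filenames, that it's run from inside the images directory, and that ./dupes and the /tmp file names are just placeholders for a holding area and working files):

# checksum everything once; identical files share a checksum
find . -type f -exec md5sum {} + > /tmp/sums.txt

# checksums of the copies that must be kept (the ones under ./products)
grep -F ' ./products/' /tmp/sums.txt | awk '{print $1}' | sort -u > /tmp/keep.txt

# move every file outside ./products whose checksum matches a kept file,
# preserving its sub-directory path under ./dupes
while read -r sum path; do
    case "$path" in ./products/*|./dupes/*) continue ;; esac
    if grep -qxF "$sum" /tmp/keep.txt; then
        mkdir -p "./dupes/$(dirname "$path")"
        mv "$path" "./dupes/$path"
    fi
done < /tmp/sums.txt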
Hope that helps,
On 18/06/10 10:36, MJ Ray wrote:
I'd be looking at scripting something with find, its -exec option, md5sum, sort, mv and maybe ln if I want to keep the filenames, but I've not looked at fslint.
The more I think about it the more I am thinking that I should go down this route. It will be a one-off process, so if it takes overnight to run, it doesn't really matter, and that way I can control what it does.
However, if a standard tool (like fslint) *can* do it, then I think it's the kind of tool I should learn to use. After all, I could write a script to find the files without using "find", but it's a good tool for that job and I'm glad I have learnt to use it.
On 18-Jun-10 09:42:47, Mark Rogers wrote:
The more I think about it the more I am thinking that I should go down this route. It will be a one-off process, so if it takes overnight to run, it doesn't really matter, and that way I can control what it does.
However, if a standard tool (like fslint) *can* do it, then I think it's the kind of tool I should learn to use. After all, I could write a script to find the files without using "find", but it's a good tool for that job and I'm glad I have learnt to use it.
--
Possibly useful in locating the duplicates (indeed using 'find') may be:
for i in `find . -type f -print` ; do ls -lgG --block-size=1 $i | awk -v F=$i '{S=$3};{print S " " $5 " " F}' ; done | sort -n
(note the back-quotes around the "find ... ").
This will produce a listing of all the files in or below the current directory, with their size in bytes in first position, followed by timestamp (hh:mm) in case that is useful, followed by the full pathname of the file, and sorted by increasing order of file size.
Thus duplicate files (i.e. with identical binary content) will have identical file sizes and so will be listed adjacent to each other (with the possible exception of other files which happen to have exactly the same file size -- though this is unlikely for image files).
If you know that all the image files have the same (e.g. ".jpg") extension, then the 'find' part could be replaced by
`find . -name '*.jpg' -print`
(or equivalent for other extensions). Note the necessity to put the "*.jpg" within ordinary single quotes (in addition to the back-quotes round the whole thing), since this is a wildcard pattern to be passed as-is to find without being interpreted by the shell.
On 18/06/10 11:31, (Ted Harding) wrote:
Possibly useful in locating the duplicates (indeed using 'find') may be:
for i in `find . -type f -print` ; do ls -lgG --block-size=1 $i | awk -v F=$i '{S=$3};{print S " " $5 " " F}' ; done | sort -n
(note the back-quotes around the "find ... ").
Thanks for this (and the detailed explanation).
One thing it fails on is filenames with spaces in them, of which (unfortunately) it seems I have rather a lot....
I can make the "find" handle them using -print0 instead of -print, but I don't see how to make "for" cope with the resulting filelist. However, I assume I can go with something along the lines of:
find . -type f -print0 | xargs -0 -l ls -lgG --block-size=1 "{}" | awk -v F=$i '{S=$3};{print S " " $5 " " F}' | sort -n
.. but that doesn't work as it stands. Indeed, even this bit:
find . -type f -print0 | xargs -0 -l ls -lgG --block-size=1 "{}"
.. fails (ls: cannot access "./foo.png"), although I can't see why (ls "./foo.png" works fine).
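For reference, xargs only substitutes {} into the command when given -I (or the old -i); without that, the literal {} just gets passed to ls as an extra argument, which may be part of the problem here. Since xargs appends the filenames itself, a null-safe version can simply drop the {} and the awk (untested):

find . -type f -print0 | xargs -0 ls -lgG --block-size=1 | sort -k3,3n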
On 18 Jun 11:31, Ted Harding wrote:
Possibly useful in locating the duplicates (indeed using 'find') may be:
for i in `find . -type f -print` ; do ls -lgG --block-size=1 $i | awk -v F=$i '{S=$3};{print S " " $5 " " F}' ; done | sort -n
Why not:
find . -type f -printf "%10s %AY-%Am-%Ad %AH:%AM %p\n" | sort -n
Less pipes, less to go wrong, no need for the awk...
Ta,
On 18/06/10 12:05, Brett Parker wrote:
Why not:
find . -type f -printf "%10s %AY-%Am-%Ad %AH:%AM %p\n" | sort -n
Less pipes, less to go wrong, no need for the awk...
Ah, much cleaner, thanks!
I'm actually now using:
find . -type f -printf "%10s %AY-%Am-%Ad %AH:%AM:%AS %p\n" \
    | grep -vE " ./(foo|bar|etc)/" | sort -n
.. where foo, bar and etc are three directories I want to exclude from the results.
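(As an aside, and untested: find can do the exclusion itself with -prune, along the lines of
find . \( -path ./foo -o -path ./bar -o -path ./etc \) -prune -o -type f -printf "%10s %AY-%Am-%Ad %AH:%AM:%AS %p\n" | sort -n
though the grep approach works just as well here.)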
Next question: if I want to move "./foo/bar.png" into an "old" directory, retaining the path, is there a good way to do this (assuming the paths may not already exist)?
Ie, I want ./foo/bar.png to end up as ./old/foo/bar.png
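For reference, one way that should work (untested; assumes the duplicate paths are listed one per line, relative to the current directory, in a hypothetical file dupes.txt) is just mkdir -p plus mv for each file:

while IFS= read -r f; do
    mkdir -p "old/$(dirname "$f")"   # recreate the sub-directory path under ./old
    mv "$f" "old/$f"
done < dupes.txt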
Mark Rogers wrote:
I'm actually now using: find . -type f -printf "%10s %AY-%Am-%Ad %AH:%AM:%AS %p\n" \ | grep -vE " ./(foo|bar|etc)/" | sort -n .. where foo, bar and etc are three directories I want to exclude from the results.
Next question: if I want to move "./foo/bar.png" into an "old" directory, retaining the path, is there a good way to do this (assuming the paths may not already exist)?
Ie, I want ./foo/bar.png to end up as ./old/foo/bar.png
Something like cpio -p --make-directories --preserve-modification-time old < name-list perhaps?
Hope that helps,
On 18/06/10 14:34, MJ Ray wrote:
Something like cpio -p --make-directories --preserve-modification-time old < name-list perhaps?
A useful one to know about certainly (never used cpio) but as far as I can tell this only allows me to duplicate the files in the new location, not move them there?
One thought that this prompts though, in talking about archives, is that I can generate a file list and use this with tar to archive the files with --remove-files to take them away from the old location.
Mark Rogers wrote:
On 18/06/10 14:34, MJ Ray wrote:
Something like cpio -p --make-directories --preserve-modification-time old < name-list perhaps?
A useful one to know about certainly (never used cpio) but as far as I can tell this only allows me to duplicate the files in the new location, not move them there?
Yes, true, you'd need to feed name-list to xargs rm afterwards.
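Spelled out (untested; assumes GNU cpio and xargs, and that name-list holds one duplicate path per line with no embedded newlines), the copy-then-delete sequence would be roughly:

mkdir -p old
cpio -p --make-directories --preserve-modification-time old < name-list
xargs -d '\n' rm -- < name-list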
One thought that this prompts though, in talking about archives, is that I can generate a file list and use this with tar to archive the files with --remove-files to take them away from the old location.
My first thought was to reach for tar, but that wasn't quite what was asked.
Hope that helps,
On 19/06/10 03:03, MJ Ray wrote:
My first thought was to reach for tar, but that wasn't quite what was asked.
I would certainly prefer the images to be moved to a new directory, but I haven't found a good way to do that yet!
Something along the lines of:
tar -c --remove-files some/file/with/path | tar -x
.. would probably do what I need (doing it one file at a time avoids the need to duplicate several hundred MB of files during the process). (Not tested, and I'd need some additional options to put the destination path in there somewhere.)
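For what it's worth, one way to fill in those missing options (untested; assumes GNU tar and the duplicate paths listed one per line in a hypothetical dupes.txt) is to let -C pick the destination on the extract side:

mkdir -p old
while IFS= read -r f; do
    # --remove-files deletes the source as soon as it has been archived,
    # so try this on a copy of a few files first
    tar -c --remove-files -f - "$f" | tar -x -f - -C old
done < dupes.txt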
On 18-Jun-10 11:05:45, Brett Parker wrote:
Why not:
find . -type f -printf "%10s %AY-%Am-%Ad %AH:%AM %p\n" | sort -n
Less pipes, less to go wrong, no need for the awk...
Ta,
Brett Parker
That's neat! However, it seems to depend on whatever "%p" comes out as, and I can't find any documentation about it! Where should I look?
Ted.
See at end!
On 18-Jun-10 11:49:16, Ted Harding wrote:
That's neat! However, it seems to depend on whatever "%p" comes out as, and I can't find any documentation about it! Where should I look?
Ted.
Ah, found it!!! See the option "-printf" in 'man find'.
(bangs head on desk). Ted.
iPhone - sorry about the top posting.
Did the find work well for finding duplicate files? I have a lot of picture files that I need to sort out and delete duplicates from.
Regards Ian Porter
www : www.codingfriends.com
On 18/06/10 14:23, Ian Porter wrote:
Did the find work well for finding duplicate files? I have a lot of picture files that I need to sort out and delete duplicates from.
The find approach as described just gets me a list of files to consider, sorted by filesize. I still need to write a script to go through each group of files that have the same filesize, and use md5sum to compare them, and then to decide which of the set of files I want to keep and which to remove....
For what you're describing, I think that fslint will actually be better; it's closer to what it's designed for.
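A rough outline of that script (untested; it only md5sums files whose size matches something under ./products, which keeps the hashing down; assumes GNU findutils/coreutils and no newlines in filenames; product-sizes.txt, product-sums.txt, others.txt and dupes.txt are just working-file names, and the extra excluded directories from earlier are left out of the sketch):

# sizes and checksums of the files that must be kept
find ./products -type f -printf "%s\n" | sort -un > product-sizes.txt
find ./products -type f -exec md5sum {} + | awk '{print $1}' | sort -u > product-sums.txt

# every other file, smallest first
find . -type f -not -path "./products/*" -printf "%s %p\n" | sort -n > others.txt

> dupes.txt
while read -r size path; do
    # only bother hashing files whose size matches a products file
    if grep -qxF "$size" product-sizes.txt; then
        sum=$(md5sum "$path" | awk '{print $1}')
        grep -qxF "$sum" product-sums.txt && echo "$path" >> dupes.txt
    fi
done < others.txt

The resulting dupes.txt can then be checked by eye before anything is moved or deleted.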