Does anyone make use of any tools which can search for duplicate files and (assuming they're on the same filesystem) replace the duplicates with hard links to the first copy of the file?
There seem to be a few scripts around but since it'll be messing with my files I'd rather go by recommendation if I can. (The plan is to free up some space on a server which has lots of duplicate images.)
Also, any "gotchas" that should put me off trying this would be appreciated.
On Tue, 2009-11-24 at 17:09 +0000, Mark Rogers wrote:
> Does anyone make use of any tools which can search for duplicate files and (assuming they're on the same filesystem) replace the duplicates with hard links to the first copy of the file?
> There seem to be a few scripts around but since it'll be messing with my files I'd rather go by recommendation if I can. (The plan is to free up some space on a server which has lots of duplicate images.)
> Also, any "gotchas" that should put me off trying this would be appreciated.
The 'gotcha' is that hard-linked files are not copy-on-write, i.e. if a program opens the file under one of its names and re-writes the contents (or appends to the file) the change happens to the single underlying file under all of its names. As long as that is what you are expecting to happen then all is well.
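To make the point above concrete, here is a minimal Python sketch (the file names and the `hard_link_demo` helper are made up for illustration). It shows an append through one name appearing under the other name. It also shows a related case: a program that saves by writing a new file and renaming it over the old name does not change the shared file, it silently breaks the link instead.

```python
import os
import tempfile


def hard_link_demo():
    """Show what each name sees after an in-place write and after a rename-style save."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")  # hypothetical names
        linked = os.path.join(tmp, "linked.txt")

        with open(original, "w") as f:
            f.write("version 1\n")
        os.link(original, linked)  # second name for the same underlying file

        # In-place append through one name: visible through both names.
        with open(linked, "a") as f:
            f.write("appended via linked\n")
        with open(original) as f:
            after_append = f.read()

        # "Write a new file, then rename it over the old name": the name now
        # points at a new file, and the other name keeps the old contents.
        scratch = linked + ".tmp"
        with open(scratch, "w") as f:
            f.write("version 2\n")
        os.replace(scratch, linked)

        with open(original) as f:
            original_after_rename = f.read()
        with open(linked) as f:
            linked_after_rename = f.read()
        still_linked = os.path.samefile(original, linked)

    return after_append, original_after_rename, linked_after_rename, still_linked


if __name__ == "__main__":
    for value in hard_link_demo():
        print(repr(value))
```

Which of the two behaviours you get depends on how each program saves files, so it is worth checking the tools that actually touch the duplicated files.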
When I am happy this will not cause a problem, I use a home-grown program. I tried attaching it, but this message then got held for approval, so instead, if you are interested, it is available at: http://pelvoux.gotadsl.co.uk/dupfind.c
Regards, Steve.
Mark Rogers wrote:
> Does anyone make use of any tools which can search for duplicate files and (assuming they're on the same filesystem) replace the duplicates with hard links to the first copy of the file?
> There seem to be a few scripts around but since it'll be messing with my files I'd rather go by recommendation if I can. (The plan is to free up some space on a server which has lots of duplicate images.)
> Also, any "gotchas" that should put me off trying this would be appreciated.
fdupes will do the "find the duplicate files" bit. It doesn't even care if the filenames are different, as it checksums the contents. That is better than a straight search on filename, as you may have multiple versions with the same name but different contents.
I know fdupes has a mode where it can delete duplicates, but I'd imagine it is a fairly trivial exercise in scripting to get it to produce links to a master instead.
Do take heed of Steve's comment, though: if duplicates exist you have to ask yourself why. Perhaps, if we are talking about a fileserver, it is because someone wanted a local copy they can edit without impacting everyone else. Start replacing these with hard links and you are going to upset people.
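As a rough idea of what such a script might look like, here is a sketch in Python rather than a wrapper around fdupes. It is not how fdupes or Steve's dupfind.c work internally; the function names, the SHA-256 choice and the `.dedup-tmp` suffix are all assumptions made for illustration. It groups files by device and size, then by checksum, confirms a byte-for-byte match, and only then replaces each duplicate with a hard link to the first path in the group. It defaults to a dry run.

```python
import filecmp
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root sharing device, size and checksum."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Regular, non-empty files only; symlinks are skipped.
            if stat.S_ISREG(st.st_mode) and st.st_size > 0:
                by_size[(st.st_dev, st.st_size)].append(path)

    groups = defaultdict(list)
    for key, paths in by_size.items():
        if len(paths) > 1:  # only hash files that share a size
            for path in paths:
                groups[key + (file_digest(path),)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


def link_duplicates(root, dry_run=True):
    """Hard-link duplicates to the first path in each group.

    Returns the (master, duplicate) pairs that were, or would be, linked.
    """
    actions = []
    for paths in find_duplicates(root):
        master = paths[0]
        for dup in paths[1:]:
            if os.path.samefile(master, dup):
                continue  # already the same underlying file
            if not filecmp.cmp(master, dup, shallow=False):
                continue  # checksum matched but bytes differ; leave alone
            actions.append((master, dup))
            if not dry_run:
                tmp = dup + ".dedup-tmp"  # hypothetical temporary name
                os.link(master, tmp)      # fails across filesystems
                os.replace(tmp, dup)      # swap the link in over the duplicate
    return actions
```

Calling `link_duplicates("/path/to/images")` lists what would be linked, and `dry_run=False` actually does it. Note that the duplicate takes on the master's owner, permissions and timestamps, since those belong to the underlying file and not to the name.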
Wayne Stallwood wrote:
> Although take heed of Steve's comment, if duplicates exist you have to ask yourself why...perhaps if we are talking on a fileserver it is because someone wanted a local copy they can edit without impacting everyone else...start replacing these for hard links and you are going to upset people.
Steve's comment was very relevant. I did know that but had largely forgotten it, so part of my reason for wanting to do this is shot to pieces!
That said, it's not uncommon around these parts for someone to duplicate a website to do work on it, where the copy includes a large volume of product images that are never edited, and that directory can represent 90+% of the total size of the copy. There are better alternatives to doing the copy, of course, and I think maybe I should look at those before I start linking copies of images.