Does anyone have a suggestion for a quick way to process several hundred files, truncating each one where the ASCII text ends and the binary data begins?
I'm in the process of recovering data from a corrupt drive, and I've recovered loads of would-be Thunderbird mailboxes (i.e. mbox files). However, since there is no good way to determine the end of an mbox file from its raw data (that I know of anyway!), what I have is loads of 10MB files which start with "From - " and end 10MB later. So a 10k mailbox is 10k of text followed by nearly 10MB of rubbish that was next on the drive, which is typically binary data.
So I need to scan the files, find where the ASCII stops and the binary begins, and truncate them at that point.
(Ideally it will be clever enough to allow a few binary chars through in case there are any in the emails, for example unencoded pound signs. But something far more basic would be a good start!)
I'm guessing that awk ought to be able to do it, something along the lines of reading line by line, and if a line has more than (say) 5% non-ASCII data in it then stop, otherwise print that line and continue. I'm just not sure how to go from pseudo-code to real code (nor whether that is the best place to start anyway).
Mark Rogers
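The line-by-line idea in the post above can be sketched as follows. This is only a minimal illustration in Python rather than awk, and it rests on some assumptions that are not in the original post: "text" is taken to mean printable ASCII plus tab, LF and CR, the 5% threshold is applied per line, and the names `text_end_offset` and `truncate_at_binary` are made up for the example.

```python
# Assumption: "text" means printable ASCII plus tab, LF and CR.
TEXT_BYTES = bytes(range(0x20, 0x7F)) + b"\t\n\r"


def text_end_offset(data, max_binary_ratio=0.05):
    """Return the offset of the first line that looks binary (len(data) if none)."""
    offset = 0
    for line in data.splitlines(keepends=True):
        # Deleting every text byte leaves only the non-text bytes to count.
        binary_count = len(line.translate(None, TEXT_BYTES))
        if binary_count > max_binary_ratio * len(line):
            return offset
        offset += len(line)
    return offset


def truncate_at_binary(path, max_binary_ratio=0.05):
    """Truncate the file at `path` in place and return its new length."""
    with open(path, "r+b") as f:
        end = text_end_offset(f.read(), max_binary_ratio)
        f.truncate(end)
    return end
```

Calling `truncate_at_binary(path)` on each recovered file would cut it at the first line that is more than 5% non-text. It modifies files in place, so it is probably best tried on copies first. Note that a short line containing a single stray byte (a pound sign in a ten-character line, say) already exceeds 5%, so the threshold may need raising, or a minimum count adding, for real mailboxes.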
On Thu, Sep 27, 2007 at 07:58:27PM +0100, Mark Rogers wrote:
> Does anyone have a suggestion for a quick way to process several hundred files, truncating each one where the ASCII text ends and the binary data begins?
> I'm in the process of recovering data from a corrupt drive, and I've recovered loads of would-be Thunderbird mailboxes (i.e. mbox files). However, since there is no good way to determine the end of an mbox file from its raw data (that I know of anyway!), what I have is loads of 10MB files which start with "From - " and end 10MB later. So a 10k mailbox is 10k of text followed by nearly 10MB of rubbish that was next on the drive, which is typically binary data.
> So I need to scan the files, find where the ASCII stops and the binary begins, and truncate them at that point.
> (Ideally it will be clever enough to allow a few binary chars through in case there are any in the emails, for example unencoded pound signs. But something far more basic would be a good start!)
> I'm guessing that awk ought to be able to do it, something along the lines of reading line by line, and if a line has more than (say) 5% non-ASCII data in it then stop, otherwise print that line and continue. I'm just not sure how to go from pseudo-code to real code (nor whether that is the best place to start anyway).
Won't formail (part of procmail) do this for you by splitting the file into separate emails? It's trivial to put them back together afterwards, and I suspect that formail will either raise an error on the 'binary' junk at the end or simply ignore it.
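How formail actually treats trailing binary data is not confirmed here. As a rough stand-in (this is not formail, just a sketch of the split-and-rejoin idea in Python), the file can be cut at every line that starts with "From ". The function name `split_mbox` is hypothetical, and treating every such line as a separator is a simplifying assumption.

```python
def split_mbox(data):
    """Split raw mbox bytes into chunks, one per line starting with b'From '."""
    messages = []
    current = []
    for line in data.splitlines(keepends=True):
        # Assumption: any line beginning with "From " starts a new message.
        if line.startswith(b"From ") and current:
            messages.append(b"".join(current))
            current = []
        current.append(line)
    if current:
        messages.append(b"".join(current))
    return messages
```

Joining the chunks with `b"".join(...)` reproduces the input exactly. With this naive split, any junk after the last email ends up inside the final chunk, so only that chunk would need trimming, unless the junk happens to contain its own "From " lines.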