Truncating text files at non-ASCII - main

27 Sep 2007


Does anyone have a suggestion for a quick way to process several hundred
files, truncating each one where the ASCII text ends and binary data begins?
I'm recovering data from a corrupt drive, and I've recovered loads of
would-be Thunderbird mailboxes (i.e. mbox files). However, since there is
no good way to determine the end of an mbox file from its raw data (that
I know of, anyway!), what I have is loads of 10MB files which start with
"From - " and end 10MB later. So a 10k mailbox is 10k of text followed by
nearly 10MB of rubbish that was next on the drive, which is typically
binary data.
So I need to scan the files, find where the ASCII stops and the binary
begins, and truncate them at that point.
(Ideally it would be clever enough to let a few non-ASCII characters
through in case the emails themselves contain any, for example unencoded
pound signs. But something far more basic would be a good start!)
I'm guessing that awk ought to be able to do it, something along the
lines of reading line by line: if a line contains more than (say) 5%
non-ASCII data then stop, otherwise print that line and continue.
I'm just not sure how to go from pseudo-code to real code (nor whether
awk is the best place to start anyway).
Mark Rogers
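
For illustration, the line-by-line idea described above could be sketched roughly as follows. This is in Python rather than awk, and it is only a sketch under some assumptions: "text" is taken to mean printable ASCII plus tab, CR and LF, the 5% threshold is the figure suggested in the post, and the names text_length and truncate_file are made up for this example. It truncates files in place, so it should only be run on copies of the recovered files.

```python
import sys

# Assumption: "text" means printable ASCII plus tab, LF and CR.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def text_length(data: bytes, max_ratio: float = 0.05) -> int:
    """Return the number of leading bytes of data that look like text.

    The data is scanned line by line; scanning stops at the first line
    in which more than max_ratio of the bytes are outside TEXT_BYTES.
    """
    offset = 0
    for line in data.splitlines(keepends=True):
        bad = sum(1 for b in line if b not in TEXT_BYTES)
        if bad > max_ratio * len(line):
            break  # this line looks binary: cut the file here
        offset += len(line)
    return offset


def truncate_file(path: str, max_ratio: float = 0.05) -> int:
    """Truncate the file at path in place; return the bytes kept."""
    with open(path, "r+b") as f:
        data = f.read()  # 10MB files fit comfortably in memory
        keep = text_length(data, max_ratio)
        f.truncate(keep)
    return keep


if __name__ == "__main__":
    # Hypothetical usage: python truncate_ascii.py file1 file2 ...
    for path in sys.argv[1:]:
        kept = truncate_file(path)
        print(f"{path}: kept {kept} bytes")
```

One limitation of the per-line percentage: a very short line containing a single pound sign exceeds 5% and would end the file early. Requiring several consecutive bad lines before stopping, or raising max_ratio, may be worth trying if genuine mail gets cut off.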