Re: [ALUG] Text munging tool...

15 Sep 2008

On 12 Sep 16:12, Jonathan McDowell wrote:
...
On Fri, Sep 12, 2008 at 03:26:05PM +0100, Ruth Bygrave wrote:
...
The Mac, and other Unix systems, have fairly good text-munging things
for mapping typos to corrected words etc. I use TypeIt4Me for
on-the-fly corrections of mistakes I make all the time.
I'm having a go at OCRing a book that fell apart: what's the best
'black box' for regex text munging (i.e. pour file in, get corrected
file out) for patterns that turn up again and again?
Obvious examples from the book include: digits in whitespace between
lower-case alpha characters (meaning page numbers); the string
'ddThe' (goodness knows how this happens, but it seems to be a
scanning error for ". The"); lower-case letters with a spurious
line-break between them; and so on. These are obvious simple regexes
without false positives.
Do I:
  1) try to use sed?
  2) use something someone has already written in Perl or Python?
  3) or try to write something in AppleScript?
(posted to ALUG because it's much more of a Unix question than a
Mac-specific question)
Sounds like you don't need anything more complicated than sed really.
Why use a sledgehammer to crack a nut? (Though I can't, OTTOMH, remember
how you do multiline matching in sed. I'm sure Brett will though. ;)
Multiline matching (in sed) is a PITA, because you basically have to use
a sliding window approach to the problem - it's in the documented
examples though!
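
For anyone who wants to see the sliding-window idea without sed syntax
in the way, here is a minimal sketch in Python rather than sed. It is
only an illustration: the name `join_broken_lines`, the `joiner`
parameter and the exact rule (previous line ends in a lower-case letter,
next line starts with one) are assumptions based on the "spurious
line-break" case described in the original question.

```python
import re


def join_broken_lines(lines, joiner=""):
    """Merge a line into the previous one when the line break falls
    between two lower-case letters (assumed rule, see lead-in)."""
    window = None  # the line currently held back, a one-line "window"
    for line in lines:
        line = line.rstrip("\n")
        if window is None:
            # First line: nothing to compare against yet.
            window = line
        elif re.search(r"[a-z]$", window) and re.match(r"[a-z]", line):
            # Break sits between two lower-case letters: glue the lines.
            window = window + joiner + line
        else:
            # No match: emit the held line and slide the window forward.
            yield window
            window = line
    if window is not None:
        yield window
```

Passing `joiner=" "` instead would keep a space where the break was,
which may suit text where the break fell between words rather than
inside one.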

Also, assuming that this is Mac OS X, I wouldn't bet that the version of
sed actually takes the options that I'd expect - BSD userland is a right
pain.
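
If the differences between sed versions turn out to be a problem, one
way round them is to read the whole file into memory and apply the
substitutions with Python's `re` module, which also makes the multiline
case trivial. The following is a rough sketch only: the three patterns
are my reading of the examples in the original question, the names
`RULES` and `munge` are made up for illustration, and the rule order
(page numbers first, line breaks last) is an assumption.

```python
import re

# (pattern, replacement) pairs, applied in order. All three are guesses
# at the patterns described in the question, not tested on the real book.
RULES = [
    # Digits surrounded by whitespace between lower-case letters
    # (stray page numbers): collapse to a single space.
    (re.compile(r"(?<=[a-z])\s+\d+\s+(?=[a-z])"), " "),
    # The scanning error 'ddThe' standing in for '. The'.
    (re.compile(r"ddThe"), ". The"),
    # A line break directly between two lower-case letters: remove it.
    (re.compile(r"(?<=[a-z])\n(?=[a-z])"), ""),
]


def munge(text):
    """Apply every rule in RULES to the whole text and return the result."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Used as a filter, something like `sys.stdout.write(munge(sys.stdin.read()))`
would give the "pour file in, get corrected file out" behaviour, and new
patterns only need another entry in the list.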

Cheers,
-- 
Brett Parker