The Mac, like other Unix systems, has fairly good text-munging tools for mapping typos to corrected words and so on. I use TypeIt4Me for on-the-fly corrections of mistakes I make all the time.
I'm having a go at OCRing a book that fell apart. What's the best 'black box' for regex text munging (i.e. pour a file in, get a corrected file out) for patterns that turn up again and again?
Obvious examples from the book include: digits in whitespace between lower-case alpha characters (meaning page numbers); the string 'ddThe' (goodness knows how this happens, but it seems to be a scanning error for ". The"); lower-case letters with a spurious line-break between them; and so on. These are obvious, simple regexes without false positives.
Do I: 1) try to use sed? 2) use something someone has already written in Perl or Python? or 3) try to write something in AppleScript?
(posted to Alug because it's much more of a Unix question than a Mac-specific question)
Regards, Ruth