On 12 Sep 16:12, Jonathan McDowell wrote:
On Fri, Sep 12, 2008 at 03:26:05PM +0100, Ruth Bygrave wrote:
The Mac, and other Unix systems, have fairly good text-munging things for mapping typos to corrected words etc. I use TypeIt4Me for on-the-fly corrections of mistakes I make all the time.
I'm having a go at OCRing a book that fell apart: what's the best 'black box' for regex text munging (i.e. pour file in, get corrected file out?) for patterns that turn up again and again?
Obvious examples from the book include: digits in whitespace between lower-case alpha characters (meaning page numbers); the string 'ddThe' (goodness knows how this happens, but it seems to be a scanning error for ". The"); lower-case letters with a spurious line-break between them; and so on. These are obvious simple regexes without false positives.
Do I: 1) try to use sed? 2) use something someone has already written in Perl or Python? or 3) try to write something in AppleScript?
(posted to Alug because it's much more of a Unix question than a Mac-specific question)
Sounds like you don't need anything more complicated than sed really. Why use a sledgehammer to crack a nut? (Though I can't, OTTOMH, remember how you do multiline matching in sed. I'm sure Brett will though. ;)
Multiline matching (in sed) is a PITA, because you basically have to use a sliding window approach to the problem - it's in the documented examples though! Also, assuming that this is Mac OS X, I wouldn't bet that the version of sed actually takes the options that I'd expect - BSD userland is a right pain.
Cheers,
--
Brett Parker
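For comparison with the sed sliding-window approach, here is a rough Python sketch that sidesteps the multiline problem by reading the whole file into one string before substituting. It is only an illustration of the three patterns Ruth describes, not a tested tool: the names RULES, clean and clean_file are made up for this example, and replacing a spurious line break with a single space (rather than joining the letters directly) is an assumption about what the OCR output needs.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement), applied in order.
RULES = [
    # Digits surrounded by whitespace, sitting between two lower-case
    # letters: treated as a stray page number and collapsed to one space.
    (re.compile(r"(?<=[a-z])\s+\d+\s+(?=[a-z])"), " "),
    # 'ddThe' appears to be a mis-scan of ". The".
    (re.compile(r"ddThe"), ". The"),
    # A single line break directly between two lower-case letters.
    # Assumption: the break should become a space, not be removed outright.
    (re.compile(r"(?<=[a-z])\n(?=[a-z])"), " "),
]


def clean(text):
    """Apply every rule in RULES to the whole text, in order."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def clean_file(in_path, out_path):
    """Pour a file in, get a corrected file out."""
    # Reading the whole file at once lets patterns match across line breaks.
    with open(in_path, encoding="utf-8") as src:
        text = src.read()
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(clean(text))
```

Calling clean_file("book.txt", "book-clean.txt") (file names invented here) would write the corrected copy. Note that the page-number rule would also eat a genuine number in mid-sentence, such as a year between two lower-case words, so it is worth checking the output against the scan before trusting it.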