On 12 Sep 16:12, Jonathan McDowell wrote:
On Fri, Sep 12, 2008 at 03:26:05PM +0100, Ruth Bygrave wrote:
The Mac, and other Unix systems, have fairly good text-munging things for mapping typos to corrected words etc. I use TypeIt4Me for on-the-fly corrections of mistakes I make all the time.
I'm having a go at OCRing a book that fell apart: what's the best 'black box' for regex text munging (i.e. pour file in, get corrected file out) for patterns that turn up again and again?
Obvious examples from the book include: digits in whitespace between lower-case alpha characters (meaning page numbers); the string 'ddThe' (goodness knows how this happens, but it seems to be a scanning error for ". The"); lower-case letters with a spurious line-break between them; and so on. These are obvious simple regexes without false positives.
Do I:
1) try to use sed?
2) use something someone has already written in Perl or Python?
or 3) try to write something in AppleScript?
(posted to ALUG because it's much more of a Unix question than a Mac-specific question)
Sounds like you don't need anything more complicated than sed really. Why use a sledgehammer to crack a nut? (Though I can't, OTTOMH, remember how you do multiline matching in sed. I'm sure Brett will though. ;)
Multiline matching (in sed) is a PITA, because you basically have to use a sliding window approach to the problem - it's in the documented examples though!
Also, assuming that this is Mac OS X, I wouldn't bet that the version of sed actually takes the options that I'd expect - BSD userland is a right pain.
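If the sed route gets awkward, one way round the multiline problem is to read the whole file in one go rather than line by line, for example in Python. What follows is only a rough sketch of the three patterns mentioned above, not something tested against the actual book. The names `RULES` and `clean` are made up for illustration, and joining a broken line with a single space (rather than with nothing) is an assumption about what the OCR output looks like.

```python
import re

# (pattern, replacement) pairs, applied in order.
# These are guesses at the patterns described in the thread.
RULES = [
    # Stray page number: digits surrounded by whitespace, sitting
    # between two lower-case letters. \s also matches newlines.
    (re.compile(r"(?<=[a-z])\s+\d+\s+(?=[a-z])"), " "),
    # The 'ddThe' scanning error, assumed to stand for ". The".
    (re.compile(r"ddThe"), ". The"),
    # A line break directly between two lower-case letters.
    # Replacing it with a space is an assumption; use "" instead
    # if the breaks fall in the middle of words.
    (re.compile(r"(?<=[a-z])\n(?=[a-z])"), " "),
]


def clean(text):
    """Apply every rule in turn to the whole text and return the result."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

To make it a pour-file-in, get-file-out filter, something like `sys.stdout.write(clean(sys.stdin.read()))` at the bottom (plus `import sys`) should do. Note that the page-number rule will also eat a genuine number in running text (as in "had 3 apples"), so it is only safe if the book really has none of those.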
Cheers,