The Mac, and other Unix systems, have fairly good text-munging tools for mapping typos to corrected words and so on. I use TypeIt4Me for on-the-fly corrections of mistakes I make all the time.
I'm having a go at OCRing a book that fell apart: what's the best 'black box' for regex text munging (i.e. pour file in, get corrected file out) for patterns that turn up again and again?
Obvious examples from the book include: digits in whitespace between lower-case alpha characters (meaning page numbers); the string 'ddThe' (goodness knows how this happens, but it seems to be a scanning error for ". The"); lower-case letters with a spurious line-break between them; and so on. These are obvious simple regexes without false positives.
Do I: 1) try to use sed? 2) use something someone has already written in Perl or Python? or 3) try to write something in AppleScript?
(posted to Alug because it's much more of a Unix question than a Mac-specific question)
Regards, Ruth
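For reference, the three patterns described in the post above could be written as a small 'pour file in, get corrected file out' filter. This is only a sketch in Python (one of the options mentioned in the question), not something anyone in the thread wrote: the names FIXES and clean are made up, and replacing a spurious line-break with a single space is an assumption about what the corrected text should look like.

```python
import re

# Hypothetical fix-up table: (pattern, replacement) pairs, applied in order.
FIXES = [
    # 'ddThe' appears to be a misread of ". The"
    (re.compile(r"ddThe"), ". The"),
    # digits surrounded by whitespace between two lower-case letters
    # (a stray page number); assumed safe for this particular book
    (re.compile(r"(?<=[a-z])\s+\d+\s+(?=[a-z])"), " "),
    # a line-break directly between two lower-case letters;
    # joining with a space is an assumption
    (re.compile(r"(?<=[a-z])\n(?=[a-z])"), " "),
]


def clean(text: str) -> str:
    """Apply every fix in FIXES to the whole text and return the result."""
    for pattern, replacement in FIXES:
        text = pattern.sub(replacement, text)
    return text


sample = "walked homeddThe dog sat on 12 the\nmat"
print(clean(sample))
```

Because the whole file is held in one string, the line-break case needs no special multiline handling. The page-number rule would also eat genuine numbers in running text ("had 12 apples"), so it only suits a text where that never happens.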
On Fri, Sep 12, 2008 at 03:26:05PM +0100, Ruth Bygrave wrote:
> The Mac, and other Unix systems, have fairly good text-munging tools for mapping typos to corrected words and so on. I use TypeIt4Me for on-the-fly corrections of mistakes I make all the time.
> I'm having a go at OCRing a book that fell apart: what's the best 'black box' for regex text munging (i.e. pour file in, get corrected file out) for patterns that turn up again and again?
> Obvious examples from the book include: digits in whitespace between lower-case alpha characters (meaning page numbers); the string 'ddThe' (goodness knows how this happens, but it seems to be a scanning error for ". The"); lower-case letters with a spurious line-break between them; and so on. These are obvious simple regexes without false positives.
> Do I: 1) try to use sed? 2) use something someone has already written in Perl or Python? or 3) try to write something in AppleScript?
> (posted to Alug because it's much more of a Unix question than a Mac-specific question)
Sounds like you don't need anything more complicated than sed really. Why use a sledgehammer to crack a nut? (Though I can't, OTTOMH, remember how you do multiline matching in sed. I'm sure Brett will though. ;)
J.
On 12 Sep 16:12, Jonathan McDowell wrote:
> Sounds like you don't need anything more complicated than sed really. Why use a sledgehammer to crack a nut? (Though I can't, OTTOMH, remember how you do multiline matching in sed. I'm sure Brett will though. ;)
Multiline matching (in sed) is a PITA, because you basically have to use a sliding window approach to the problem - it's in the documented examples though!
Also, assuming that this is Mac OS X, I wouldn't bet that the version of sed actually takes the options that I'd expect - BSD userland is a right pain.
Cheers,
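To illustrate the sliding-window idea mentioned above, here is a rough sketch in Python rather than sed. It is only loosely analogous to what sed does with its pattern space: one line is held back and compared with the next before anything is emitted. The function name join_broken_lines is hypothetical, and joining with a single space is an assumption.

```python
import re
from typing import Iterable, Iterator


def join_broken_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines, merging a line into the previous one when the
    break falls between two lower-case letters."""
    held = None  # the window: a line read but not yet emitted
    for line in lines:
        line = line.rstrip("\n")
        if held is None:
            held = line
        elif re.search(r"[a-z]$", held) and re.match(r"[a-z]", line):
            # spurious break: glue the new line onto the held one
            held = held + " " + line
        else:
            # genuine break: emit the held line and slide the window on
            yield held
            held = line
    if held is not None:
        yield held


for out in join_broken_lines(["the quick\n", "brown fox.\n", "Next line\n"]):
    print(out)
```

Holding back one line is what makes this awkward in a strictly line-at-a-time tool; reading the whole file into memory sidesteps it at the cost of memory.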
On Friday 12 September 2008 15:26:05 Ruth Bygrave wrote:
> I'm having a go at OCRing a book that fell apart:
This is a bit tangential, but it might be worth having a look at tools like unpaper http://unpaper.berlios.de/ which tidies up page scans.
What OCR software are you using? I've heard that Tesseract http://code.google.com/p/tesseract-ocr/ is supposed to be quite good for Mac. (Never tried it myself). Though the installation may be a bit involved http://littlefixes.blogspot.com/2008/06/open-source-ocr-on-mac.html.
Cheers, Richard
On 12 Sep 2008, at 16:13, Richard Lewis wrote:
> This is a bit tangential, but it might be worth having a look at tools like unpaper http://unpaper.berlios.de/ which tidies up page scans.
Not really what I was looking for as I have the text already, but might well be useful in future...
> What OCR software are you using? I've heard that Tesseract http://code.google.com/p/tesseract-ocr/ is supposed to be quite good for Mac. (Never tried it myself). Though the installation may be a bit involved http://littlefixes.blogspot.com/2008/06/open-source-ocr-on-mac.html.
I might have a pry at that, but I'm using OmniPage which came with my scanner (kudos to Canon for supplying Mac-appropriate tools).
Am using sed -- the first regex I came up with did not work at all, so I posted it on the LiveJournal community for shell scripting, and they pointed out I was using Perl regular expression syntax, which sed does not understand...
Tried invoking Perl, but as usual I do not have the requisite quantity of dead chickens to coax it with, because it sulked and refused to do magic... (which is the normal result of my invoking perl, and why I distrust the damn thing).
Following which I changed the '\s' whitespace character into '[[:space:]]' (I think) as the other suggestion went, and sed began to munge the text as desired.
Brett: luckily I don't have to do multiline matching for this. I suspect if I ever have that problem I'll just pour it through sed twice rather than having to pry at the syntax :-)
Regards, Ruth
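The switch from '\s' to a POSIX character class described above can be sketched as a tiny translation step. This is a naive illustration in Python, not a real converter: the mapping PERL_TO_POSIX and the function to_posix are hypothetical, only two shorthands are covered, and a shorthand that already sits inside a bracket expression or follows an escaped backslash would be mangled.

```python
# Hypothetical mapping from Perl-style shorthands to POSIX bracket
# expressions of the kind sed accepts.
PERL_TO_POSIX = {
    r"\s": "[[:space:]]",
    r"\d": "[[:digit:]]",
}


def to_posix(pattern: str) -> str:
    """Naively replace Perl-style shorthands with POSIX classes."""
    for perl, posix in PERL_TO_POSIX.items():
        pattern = pattern.replace(perl, posix)
    return pattern


print(to_posix(r"[a-z]\s+\d+\s+[a-z]"))
```

Quantifiers such as '+' are left alone here; whether sed treats them as operators still depends on whether it is running with basic or extended regular expressions.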