On 12 Sep 2008, at 16:13, Richard Lewis wrote:
On Friday 12 September 2008 15:26:05 Ruth Bygrave wrote:
I'm having a go at OCRing a book that fell apart:
This is a bit tangential, but it might be worth having a look at tools like unpaper http://unpaper.berlios.de/ which tidies up page scans.
Not really what I was looking for as I have the text already, but might well be useful in future...
What OCR software are you using? I've heard that Tesseract http://code.google.com/p/tesseract-ocr/ is supposed to be quite good on the Mac (never tried it myself), though the installation may be a bit involved: http://littlefixes.blogspot.com/2008/06/open-source-ocr-on-mac.html
I might have a pry at that, but I'm using Omnipage which came with my scanner (kudos to Canon for supplying Mac-appropriate tools).
Am using sed -- the first regex I came up with did not work at all, so I posted it on the Livejournal community for shell scripting, and they pointed out I was using Perl regular expressions instead of sed...
Tried invoking Perl, but as usual I do not have the requisite quantity of dead chickens to coax it with, because it sulked and refused to do magic... (which is the normal result of my invoking Perl, and why I distrust the damn thing).
Following which I changed the '\s' whitespace character into '[[:space:]]' (I think), as the other suggestion went, and sed began to munge the text as desired.
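For anyone following along, here is a minimal sketch of that kind of change. The actual expression isn't quoted in this thread, so the substitution below (squeezing runs of whitespace down to one space) and the function name squeeze_spaces are made up purely for illustration. The only real point is that the POSIX class '[[:space:]]' stands in where a Perl-style '\s' would have gone.

```shell
#!/bin/sh
# Hypothetical clean-up step, NOT the expression from the original post:
# collapse each run of whitespace within a line into a single space.
# '\s' is Perl syntax (also accepted by GNU sed as an extension);
# '[[:space:]]' is the POSIX character class and is the portable spelling.
squeeze_spaces() {
  # 'X X*' is the portable basic-regex way of writing "one or more X".
  sed 's/[[:space:]][[:space:]]*/ /g'
}

# Example run on a line containing double spaces and a tab.
printf 'the  quick\tbrown   fox\n' | squeeze_spaces
```

Writing "one or more" as the class followed by the class with a star avoids '\+', which is another extension that not every sed accepts.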
Brett: luckily I don't have to do multiline matching for this. I suspect if I ever have that problem I'll just pour it through sed twice rather than having to pry at the syntax :-)
Regards, Ruth