Re: [ALUG] Text munging tool...

16 Sep 2008

      On 12 Sep 2008, at 16:13, Richard Lewis wrote:
...
> On Friday 12 September 2008 15:26:05 Ruth Bygrave wrote:
...
>> I'm having a go at OCRing a book that fell apart:
>
> This is a bit tangential, but it might be worth having a look at
> tools like unpaper <http://unpaper.berlios.de/> which does tidying
> up of page scans.

Not really what I was looking for as I have the text already, but
might well be useful in future...

> What OCR software are you using? I've heard that Tesseract
> <http://code.google.com/p/tesseract-ocr/> is supposed to be quite
> good for Mac. (Never tried it myself). Though the installation may
> be a bit involved
> <http://littlefixes.blogspot.com/2008/06/open-source-ocr-on-mac.html>.

I might have a pry at that, but I'm using Omnipage which came with my  
scanner (kudos to Canon for supplying Mac-appropriate tools).
Am using sed -- the first regex I came up with did not work at all, so
I posted it on the Livejournal community for shell scripting, and they
pointed out I was using Perl-style regular expressions instead of sed's...

Tried invoking Perl, but as usual I do not have the requisite quantity  
of dead chickens to coax it with, because it sulked and refused to do  
magic... (which is the normal result of my invoking perl, and why I  
distrust the damn thing).

Following which I changed the '\s' whitespace character into
'[[:space:]]' (I think), as the other suggestion went, and sed began to
munge the text as desired.
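
For illustration, here is a minimal sketch of that kind of substitution.
The actual expression from this thread is not shown, so the
whitespace-collapsing rule and the name squeeze_spaces below are
assumptions; the point is only that the POSIX character class
[[:space:]] can stand in where sed does not recognise a Perl-style '\s':

```shell
# Hypothetical helper (not the expression from the thread): collapse each
# run of spaces and tabs into a single space, using only POSIX sed syntax.
squeeze_spaces() {
    # [[:space:]][[:space:]]* means "one or more whitespace characters";
    # it avoids both '\s' and '+', which not every sed understands.
    sed -e 's/[[:space:]][[:space:]]*/ /g'
}

# Example: tidy a line of OCR output that has stray tabs and double spaces.
printf 'some   OCR\ttext\n' | squeeze_spaces
```

The class name is written with the colons and sits inside an outer
bracket expression, which is why the brackets are doubled.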

Brett: luckily I don't have to do multiline matching for this. I  
suspect if I ever have that problem I'll just pour it through sed  
twice rather than having to pry at the syntax :-)

Regards, Ruth