[ALUG] Text munging tool...

12 Sep 2008


The Mac, like other Unix systems, has fairly good text-munging tools
for mapping typos to corrected words. I use TypeIt4Me for on-the-fly
corrections of mistakes I make all the time.
I'm having a go at OCRing a book that fell apart. What's the best
'black box' for regex text munging (i.e. pour file in, get corrected
file out) to fix patterns that turn up again and again?
Obvious examples from the book include: digits in whitespace between
lower-case alpha characters (meaning page numbers); the string
'ddThe' (goodness knows how this happens, but it seems to be a
scanning error for ". The"); lower-case letters with a spurious
line-break between them; and so on. These are obvious, simple regexes
without false positives.
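The three patterns above could be sketched in Python roughly as follows.
This is only an illustration under stated assumptions: the rule table
RULES and the function munge() are hypothetical names, the regexes are
one guess at what the OCR output looks like, and the spurious line-break
is assumed to stand for a space rather than a break in the middle of a
word.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement), applied in order.
RULES = [
    # Digits stranded in whitespace between lower-case letters
    # (assumed to be page numbers): collapse to a single space.
    (re.compile(r"(?<=[a-z])\s+\d+\s+(?=[a-z])"), " "),
    # The literal string 'ddThe', taken to be a mis-scan of ". The".
    (re.compile(r"ddThe"), ". The"),
    # A single line break directly between two lower-case letters
    # (assumption: it should have been a space).
    (re.compile(r"(?<=[a-z])\n(?=[a-z])"), " "),
]


def munge(text):
    """Apply every rule in RULES to text and return the corrected text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

To use it as a 'pour file in, get corrected file out' filter, munge()
could be fed sys.stdin.read() and its result written to sys.stdout; new
corrections then only need another entry in the rule table.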
Do I:
    1)	try to use sed?
    2)	use something someone has already written in Perl or Python?
or	3)	try to write something in AppleScript?
(Posted to ALUG because it's much more of a Unix question than a
Mac-specific question.)
Regards, Ruth
