Mark Rogers wrote:
As I mentioned elsewhere, OCR accuracy isn't as important as OCR repeatability.
FWIW I just ran some quick tests using gocr and ImageMagick:
convert image001.gif pnm:- | gocr -
.. where image001.gif is a sample spam advert.
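For anyone who would rather script this than run it by hand, here is a minimal Python sketch of the same pipeline. The function names (build_pipeline, ocr_image) are my own and not part of either tool, and it assumes both convert and gocr are on the PATH.

```python
import subprocess

def build_pipeline(image_path):
    """Return the two argument lists for: convert IMAGE pnm:- | gocr -"""
    convert_cmd = ["convert", image_path, "pnm:-"]  # ImageMagick: write PNM to stdout
    gocr_cmd = ["gocr", "-"]                        # gocr: read the image from stdin
    return convert_cmd, gocr_cmd

def ocr_image(image_path):
    """Run the pipeline and return gocr's text output (hypothetical helper)."""
    convert_cmd, gocr_cmd = build_pipeline(image_path)
    # Step 1: convert the image to PNM and capture the bytes.
    pnm = subprocess.run(convert_cmd, stdout=subprocess.PIPE, check=True).stdout
    # Step 2: feed those bytes to gocr on stdin and capture its text.
    result = subprocess.run(gocr_cmd, input=pnm, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling ocr_image("image001.gif") should give the same text as the command line above, provided both tools are installed.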
When comparing the results from a wide sample[*] of images, the OCR output was relatively poor in itself, but there was very little variation between successive tests. So if I teach SpamAssassin that the first is spam, it should work out that the next is spam too.
[*] OK it was two images, it's all I had to hand. But they were different; one had a completely different background colour, for a start.
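One way to put a number on "very little variation" is to compare two OCR outputs directly. The sketch below uses Python's difflib for that; it is only a rough measure of repeatability, not how SpamAssassin itself scores mail, and the helper names and sample strings are invented for illustration.

```python
from difflib import SequenceMatcher

def normalise(text):
    """Lower-case and collapse whitespace so layout noise is ignored."""
    return " ".join(text.lower().split())

def ocr_similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalised outputs are identical."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

# Made-up strings standing in for two noisy OCR runs (not real gocr output).
run_1 = "BUY CHE4P  M3DS\nN0W"
run_2 = "buy che4p m3ds n0w"
print(ocr_similarity(run_1, run_2))
```

The OCR can misread characters as badly as it likes; as long as it misreads them the same way each time, the ratio stays high, which is the repeatability that matters here.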
CPU overhead is by far the biggest issue, but this is probably OK at the client end (not sure it would scale well to an ISP implementation).
On the low-spec Win2K test PC I had to hand (so I'm pretty sure this is worst case) I could process between 2 and 3 images per second.
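To give a feel for what that rate means in volume terms, here is a trivial back-of-envelope calculation. It assumes the 2 to 3 images per second holds steady on a single machine doing nothing else, which is my simplification rather than something I measured.

```python
def images_per_hour(rate_per_second):
    """Convert a sustained per-second rate into images per hour."""
    return rate_per_second * 3600

for rate in (2, 3):
    print(rate, "images/s ->", images_per_hour(rate), "images/hour")
```

That is plenty for a single mailbox, but it is a hard ceiling per CPU, which is why I doubt it would scale to an ISP.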