Mark Rogers wrote:
As I mentioned elsewhere, OCR accuracy isn't as important as OCR repeatability.
FWIW I just ran some quick tests using gocr and ImageMagick:

    convert image001.gif pnm:- | gocr - ..

where image001.gif is a sample spam advert. When comparing the results from a wide sample[*] of images, the OCR results were relatively poor in themselves, but there was very little variation between successive tests. So if I teach SpamAssassin that the first is spam, it should work out that the next is spam too.

[*] OK, it was two images; it's all I had to hand. But they were different: one had a completely different background colour, for a start.
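For anyone wanting to repeat the comparison, here is a rough Python sketch of the idea: OCR each image through the same convert/gocr pipe, then score how alike the two outputs are. The function names, the whitespace/case normalisation, and the use of difflib for scoring are my own assumptions for illustration, not how the test above was actually scored, and it assumes convert and gocr are both on the PATH.

```python
import difflib
import subprocess


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial OCR jitter is ignored."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 score of how alike two OCR outputs are (1.0 = identical)."""
    return difflib.SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def ocr_image(path: str) -> str:
    """Run `convert <path> pnm:- | gocr -` and return whatever gocr prints."""
    # Assumption: both tools are installed and on the PATH.
    convert = subprocess.Popen(["convert", path, "pnm:-"], stdout=subprocess.PIPE)
    gocr = subprocess.run(
        ["gocr", "-"], stdin=convert.stdout, capture_output=True, check=True
    )
    convert.stdout.close()
    convert.wait()
    return gocr.stdout.decode("utf-8", errors="replace")
```

Calling similarity(ocr_image("image001.gif"), ocr_image("image002.gif")) (the second filename is hypothetical) gives a single number to track; a score that stays high across variants of the same advert is the repeatability that matters here, regardless of how readable the raw OCR text is.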
CPU overhead is by far the biggest issue, but this is probably OK at the client end (not sure it would scale well to an ISP implementation).
On the low-spec Win2K test PC I had to hand (so I'm pretty sure this is worst case) I could process between 2 and 3 images per second.

-- 
Mark Rogers
More Solutions Ltd :: 0845 45 89 555