Has anyone here experimented with defeating image-based spam using something like gocr (jocr.sf.net)?
I've seen one attempt (http://wiki.apache.org/spamassassin/OcrPlugin), but I haven't tried it: I'm not convinced by its method, since it relies on a fixed word list.
My thought was that it ought to be simple to automatically OCR any image attachment and add headers containing the image text to the email, then have SA check those headers for content (does SA check headers?). That way, any words appearing in the image that have previously been seen in spam as plain text will get caught.
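
To make that concrete, here is a rough Python sketch of the pre-filter step, not a tested implementation. It walks the MIME parts of a message, runs each image attachment through an OCR function, and adds the recognised text as an extra header. The header name X-OCR-Text and both function names are made up for illustration. The gocr call assumes that gocr is on the PATH, that it accepts the image file name as its only argument, and that the attachment is already in a format gocr can read (most mail images would need converting first).

```python
import email
import subprocess
import tempfile
from email import policy


def ocr_with_gocr(image_bytes):
    """Hypothetical OCR backend: write the image to a temp file and run gocr.

    Assumes `gocr <file>` prints the recognised text to stdout and that the
    image is already in a format gocr understands (e.g. PNM).
    """
    with tempfile.NamedTemporaryFile(suffix=".pnm") as tmp:
        tmp.write(image_bytes)
        tmp.flush()
        result = subprocess.run(
            ["gocr", tmp.name], capture_output=True, text=True, check=False
        )
    return result.stdout


def add_ocr_headers(raw_message, ocr=ocr_with_gocr, header="X-OCR-Text"):
    """Return the message bytes with one extra header per image attachment."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        image_bytes = part.get_payload(decode=True)
        if not image_bytes:
            continue
        # Collapse newlines and runs of spaces so the text fits in a header.
        text = " ".join(ocr(image_bytes).split())
        if text:
            msg[header] = text
    return msg.as_bytes()
```

The OCR function is passed in as a parameter so a different engine can be swapped in, and so the header logic can be tried out without gocr installed. The output could then be handed to SA as usual.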
It also occurred to me that OCR accuracy might not be too important. After all, if the same image spam is seen multiple times and marked as spam, then the OCR misinterpretations of the words will be common to each scan even if the words found are wrong.
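
A toy illustration of that point, again in Python: the class below is a deliberately crude token counter, not how SA's Bayes engine works, and its name and scoring rule are invented for this example. It only shows that a misread token such as "V1AGRA" is as useful as the correct word, provided the OCR produces the same misreading each time the image is scanned (which is itself an assumption).

```python
from collections import Counter


class TokenCounter:
    """Hypothetical toy learner: counts which tokens were seen in spam or ham."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text, is_spam):
        # Count each distinct token once per message.
        target = self.spam if is_spam else self.ham
        target.update(set(text.lower().split()))

    def spam_fraction(self, text):
        """Fraction of distinct tokens seen more often in spam than in ham."""
        tokens = set(text.lower().split())
        if not tokens:
            return 0.0
        spammy = [t for t in tokens if self.spam[t] > self.ham[t]]
        return len(spammy) / len(tokens)


counter = TokenCounter()
# Garbled OCR output from an image that a user marked as spam.
counter.train("V1AGRA ch3ap p1lls", is_spam=True)
counter.train("meeting agenda attached", is_spam=False)

# The same image arrives again and is misread in the same way.
repeat_score = counter.spam_fraction("V1AGRA ch3ap p1lls")
```

Here repeat_score comes out as 1.0 because every garbled token was already seen in spam, even though none of them is a real word.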
However, I have no real idea how to go about implementing this.