Tim Green wrote:
I have found image spam cuts up the image into blocks, so OCRing the text will produce odd fragments. OCR will also consume more CPU than just filtering on the other parts of the message.
Actually, the more I think about this, the more I think that OCR may have some big benefits. Lots of spam techniques use tricks to embed a message amongst lots of rubbish while hiding the rubbish from the viewer, e.g. "mess<b></b>age".
If there were a way to render the entire email as it would be seen in an email client, then OCR that and use the result as the starting point of a filtering mechanism, most of those techniques would become irrelevant.
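To illustrate why looking at what the reader sees helps, here is a rough Python sketch. It does not render or OCR anything: it only strips HTML tags with the standard library, which is a much cheaper approximation that a full render-then-OCR step would go well beyond (hidden text via styling, images and so on). The names VisibleText and visible_text are made up for this example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects only the text nodes, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with the tags removed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.parts)


raw = "Read this mess<b></b>age now"

# A filter looking at the raw source never sees the word "message"...
found_in_raw = "message" in raw
# ...but one looking at the visible text does.
found_in_visible = "message" in visible_text(raw)

print(found_in_raw, found_in_visible)
```

The empty <b></b> pair splits the word in the source but contributes nothing to what is displayed, so any filter working from the displayed form sees the word whole.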
As I mentioned elsewhere, OCR accuracy isn't as important as OCR repeatability.
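A small Python sketch of the repeatability point, under loud assumptions: simulated_ocr is a hypothetical stand-in for a real OCR engine, and it deliberately misreads some letters, but always in the same way. The "filter" is just a set of tokens learned from one known spam message.

```python
def simulated_ocr(text):
    """Hypothetical OCR stand-in: inaccurate, but consistently so."""
    misreads = str.maketrans({"l": "1", "o": "0", "i": "!"})
    return text.translate(misreads)


def tokens(text):
    """Split OCR output into a set of lower-cased tokens."""
    return set(text.lower().split())


# Learn tokens from the OCR output of a known spam message.
learned = tokens(simulated_ocr("cheap pills online"))

# A later message passes through the same OCR and gets the same misreads.
incoming = tokens(simulated_ocr("order cheap pills today"))

overlap = learned & incoming
print(sorted(overlap))
```

The OCR gets "pills" wrong both times, yet the garbled token still matches, because the filter only ever compares OCR output with OCR output. If the errors varied from run to run, the match would be lost even with a more accurate engine.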
CPU overhead is by far the biggest issue, but this is probably OK at the client end (not sure it would scale well to an ISP implementation). Obviously existing techniques would be used first to weed out obvious spam, and whitelists could be used to skip the checks applied to frequent senders of images (I get a lot of images by email, but very few from people I haven't dealt with before, so only checking the exceptions would be worthwhile.)
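The ordering described above could be sketched like this in Python. Everything here is hypothetical (make_pipeline, cheap_filter, ocr_filter and the example addresses); the point is only that the expensive OCR step runs last, and only for image mail from senders not on the whitelist.

```python
def make_pipeline(whitelist, cheap_filter, ocr_filter):
    """Build a classifier that keeps count of how often OCR is needed."""
    stats = {"ocr_runs": 0}

    def classify(sender, body, has_image):
        # 1. Existing cheap techniques weed out the obvious spam first.
        if cheap_filter(body):
            return "spam"
        # 2. No image, or a known sender of images: skip the OCR check.
        if not has_image or sender in whitelist:
            return "ok"
        # 3. Only the exceptions pay the CPU cost of OCR.
        stats["ocr_runs"] += 1
        return "spam" if ocr_filter(body) else "ok"

    return classify, stats


# Placeholder filters, assumed purely for the example.
def cheap_filter(body):
    return "cheap pills" in body.lower()


def ocr_filter(body):
    return True  # pretend the OCR step found spam text in the image


classify, stats = make_pipeline({"colleague@example.com"}, cheap_filter, ocr_filter)

results = [
    classify("stranger@example.net", "CHEAP PILLS here", False),
    classify("colleague@example.com", "holiday photos", True),
    classify("stranger@example.net", "see attached", True),
]
print(results, stats["ocr_runs"])
```

Of the three messages, only the last one (an image from an unknown sender that passed the cheap checks) reaches the OCR step.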
As OCR improved (and it may well gain some development resource off the back of something like this), the technique would give better results, and the improved OCR would be useful elsewhere too.