Re: [ALUG] SpamAssasin and gocr/jocr

28 Jul 2006


Tim Green wrote:
...
I have found that image spam cuts the image up into blocks, so OCRing
the text will produce odd fragments. OCR will also consume more CPU than
just filtering on the other parts of the message.
Actually, the more I think about this the more I think that OCR may have
some big benefits. Lots of spam techniques involve various tricks to
embed a message amongst lots of rubbish, but to hide the rubbish from
the viewer, e.g. "mess<b></b>age".
If there was a way to render the entire email as it would be seen in an
email client, then OCR that and use that as the starting point of a
filtering mechanism, then most of those techniques would become irrelevant.
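To illustrate why working from what the reader sees helps, here is a
minimal sketch that is much weaker than the render-then-OCR idea: it only
strips HTML tags using Python's standard html.parser module. The names
VisibleText and visible_text are made up for this example, and tag
stripping would not catch tricks such as white-on-white text, which is
exactly where rendering and OCR would go further.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect only text nodes, dropping all tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with the markup removed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# The empty <b></b> pair splits the word for a naive filter,
# but not for anything that looks at the visible text.
print(visible_text("mess<b></b>age"))
```

A filter matching on the raw source would see "mess" and "age" as
separate fragments, while the visible text is a single word.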
As I mentioned elsewhere, OCR accuracy isn't as important as OCR
repeatability.
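One way to read the repeatability point, sketched under my own
assumptions: if the OCR step misreads a spam image the same way every
time, the misread text can still be matched or learned from. The
fingerprint function below is hypothetical and simply hashes the
normalised OCR output; the sample strings stand in for OCR results and
are not real data.

```python
import hashlib


def fingerprint(ocr_text):
    """Hash OCR output after folding case and whitespace (illustrative only)."""
    normalized = " ".join(ocr_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two runs that produce the same (wrong) reading still match each other...
first_run = fingerprint("Cheap  P1LLS\nnow")
second_run = fingerprint("cheap p1lls now")
print(first_run == second_run)

# ...whereas a more accurate but different reading would not match them.
print(first_run == fingerprint("cheap pills now"))
```

The reading "p1lls" is inaccurate, but because it repeats, a filter can
still key on it.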
CPU overhead is by far the biggest issue, but this is probably OK at the
client end (not sure it would scale well to an ISP implementation).
Obviously existing techniques would be used first to weed out obvious
spam, and whitelists could be used to skip the checks applied to
frequent senders of images (I get a lot of images by email, but very few
from people I haven't dealt with before, so only checking the exceptions
would be worthwhile.)
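The ordering described above (cheap checks first, whitelist next, OCR
only for the exceptions) could be sketched roughly as follows. Everything
here is an assumption for illustration: the function name needs_ocr, its
parameters, and the threshold value are not taken from SpamAssassin or
any other real filter.

```python
def needs_ocr(sender, has_images, cheap_score, whitelist, spam_threshold=5.0):
    """Decide whether a message is worth the CPU cost of OCR (hypothetical)."""
    if cheap_score >= spam_threshold:
        return False  # existing techniques already flagged it as spam
    if not has_images:
        return False  # nothing for the image check to look at
    if sender.lower() in whitelist:
        return False  # known sender of images, skip the expensive check
    return True       # unknown sender with images: the exception worth checking


# Assumed whitelist of people who regularly send images.
known_senders = {"colleague@example.org"}

print(needs_ocr("colleague@example.org", True, 0.5, known_senders))
print(needs_ocr("stranger@example.net", True, 0.5, known_senders))
```

Only the second message would be passed to OCR, which keeps the CPU cost
limited to mail from senders who have not been seen before.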
As OCR improved (and it may well gain some development resource off the
back of something like this) the technique would have better results,
and the improved OCR would be useful elsewhere too.
-- 
Mark Rogers
More Solutions Ltd :: 0845 45 89 555
