SpamAssassin and gocr/jocr


Mark Rogers

26 Jul 2006

10:36 p.m.

Has anyone here experimented with defeating image-based spam using something like gocr (jocr.sf.net)?

I've seen one attempt (http://wiki.apache.org/spamassassin/OcrPlugin) but I've not tried it (I'm not convinced by the method it uses, since it uses a fixed word list).

My thought was that it ought to be simple to automatically OCR any image attachment and add some headers to the email containing the image text, then have SA check those headers for content (does SA check headers?). That way any words appearing in the image that have previously been seen in spam as plain text will get caught.
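A minimal sketch of that idea in Python, using only the standard `email` package. The OCR step is passed in as a callable (`ocr`) because the thread has not settled on a tool at this point; the header name `X-Image-Text` and the function name `annotate_with_image_text` are made up for illustration and are not anything SpamAssassin defines.

```python
from email import message_from_bytes
from email.message import Message
from typing import Callable


def annotate_with_image_text(raw: bytes, ocr: Callable[[bytes], str]) -> Message:
    """Add one X-Image-Text header per image part (header name is hypothetical)."""
    msg = message_from_bytes(raw)
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding.
        image_bytes = part.get_payload(decode=True) or b""
        # Collapse whitespace so the OCR text fits on a single header line.
        text = " ".join(ocr(image_bytes).split())
        if text:
            msg["X-Image-Text"] = text
    return msg
```

The annotated message would then be handed to the spam filter as usual; whether the filter actually scores such a header depends on how it is configured.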

It also occurred to me that OCR accuracy might not be too important. After all, if the same image spam is seen multiple times and marked as spam, then the OCR misinterpretations of the words will be common to each scan even if the words found are wrong.

However, I have no real idea how to go about implementing this.

-- Mark Rogers More Solutions Ltd :: 0845 45 89 555


Brett Parker

27 Jul

8:46 a.m.

On Wed, Jul 26, 2006 at 10:36:28PM +0100, Mark Rogers wrote:

...

Has anyone here experimented with defeating image-based spam using something like gocr (jocr.sf.net)?

Far easier solution: don't accept e-mail with attachments

So, no HTML e-mail, no image attachments, sorted! (OK, so you lose a few useful e-mails, but they were from morons that couldn't send plain text mail, so they can sod off, right?).

*grin*,

-- Brett Parker

Mark Rogers

9:04 a.m.

Brett Parker wrote:

...

Far easier solution: don't accept e-mail with attachments

So, no HTML e-mail, no image attachments, sorted! (OK, so you lose a few useful e-mails, but they were from morons that couldn't send plain text mail, so they can sod off, right?).

If only! Unfortunately if I stopped doing business with morons I'd have no customers left :-)

-- Mark Rogers More Solutions Ltd :: 0845 45 89 555

Wayne Stallwood

9:16 a.m.

On Thu, 2006-07-27 at 08:46 +0100, Brett Parker wrote:

...

Far easier solution: don't accept e-mail with attachments

Wouldn't work for me, there are too many legitimate reasons why somebody would need to send me an e-mail with an attachment.

Ted.Harding@nessie.mcc.ac.uk

9:33 a.m.

On 27-Jul-06 Wayne Stallwood wrote:

...

On Thu, 2006-07-27 at 08:46 +0100, Brett Parker wrote:

...
Far easier solution: don't accept e-mail with attachments

Wouldn't work for me, there are too many legitimate reasons why somebody would need to send me an e-mail with an attachment.

I have to agree with this too! The whole point of MIME and attachments is to provide a mechanism for transmitting information of kinds which plain-text mail is not suitable for. People need to be able to send and receive such information.

I'm wondering a bit why Mark (OP) is seeking such a solution (interesting idea though it is). I receive on average over 1000 emails a day, about 2/3 of which are pure spam (and at least half the rest are mailing-list postings which I'm not interested in reading). I have no particular difficulty, and very close to 100 per cent success, in deleting these on the receiving server before I download the rest, purely on the basis of apparent sender, and subject line. Total time for this: 15-20min/day.

Best wishes to all, Ted.

-------------------------------------------------------------------- E-Mail: (Ted Harding) Ted.Harding@nessie.mcc.ac.uk Fax-to-email: +44 (0)870 094 0861 Date: 27-Jul-06 Time: 09:33:34 ------------------------------ XFMail ------------------------------

Brett Parker

9:48 a.m.

On Thu, Jul 27, 2006 at 09:33:37AM +0100, Ted Harding wrote:

...

On 27-Jul-06 Wayne Stallwood wrote:

...
On Thu, 2006-07-27 at 08:46 +0100, Brett Parker wrote:

...
Far easier solution: don't accept e-mail with attachments

Wouldn't work for me, there are too many legitimate reasons why somebody would need to send me an e-mail with an attachment.

I have to agree with this too! The whole point of MIME and attachments is to provide a mechanism for transmitting information of kinds which plain-text mail is not suitable for. People need to be able to send and receive such information.

Fine back when webspace wasn't readily available and hosting was extortionate. If you're sending the same e-mail with the same attachment to multiple people, though, and some of those just aren't going to read it, including a URL to the relevant document is much nicer.

...

I'm wondering a bit why Mark (OP) is seeking such a solution (interesting idea though it is). I receive on average over 1000 emails a day, about 2/3 of which are pure spam (and at least half the rest are mailing-list postings which I'm not interested in reading). I have no particular difficulty, and very close to 100 per cent success, in deleting these on the receiving server before I download the rest, purely on the basis of apparent sender, and subject line. Total time for this: 15-20min/day.

That's 10 to 15 mins more than I spend on spam a day ;) Cron mails, now that's a different story (yay for a whole set of network booting workstations :).

Cheers,

-- Brett Parker

Mark Rogers

3:38 p.m.

Brett Parker wrote:

...

Fine back when webspace wasn't readily available and hosting was extortionate. If you're sending the same e-mail with the same attachment to multiple people, though, and some of those just aren't going to read it, including a URL to the relevant document is much nicer.

Often it's not one email to lots of people, it's one email to one person. And my customers would have no idea how to upload to web/ftp/etc.

...

That's 10 to 15 mins more than I spend on spam a day ;) Cron mails, now that's a different story (yay for a whole set of network booting workstations :).

Just to be clear: This is not about trying to save time. I'm sure I'm no different from many here in that I'd rather spend a day saving an hour if the day is more interesting.

-- Mark Rogers More Solutions Ltd :: 0845 45 89 555

Mark Rogers

2:07 p.m.

(Ted Harding) wrote:

...

On 27-Jul-06 Wayne Stallwood wrote:

I'm wondering a bit why Mark (OP) is seeking such a solution (interesting idea though it is). I receive on average over 1000 emails a day, about 2/3 of which are pure spam (and at least half the rest are mailing-list postings which I'm not interested in reading). I have no particular difficulty, and very close to 100 per cent success, in deleting these on the receiving server before I download the rest, purely on the basis of apparent sender, and subject line. Total time for this: 15-20min/day.

Two reasons. One is pure annoyance factor. Second is that I'm interested in the technical challenge. Third is that I have customers who would pay for it.

OK that's three reasons. And I could probably think of some more!

It is annoying getting the exact same spam several times a day and having anti-spam miss it completely.

It did cross my mind that at the email level the MIME encoding is just lots of strings of text, which if the image is identical each time should be sufficient (I assume SA doesn't look at it though). But that does assume the images are identical; one pixel change might be enough to break it.
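To illustrate how brittle exact matching is, here is a small Python sketch that fingerprints attachment bytes with SHA-256 (the choice of hash and the name `attachment_digest` are assumptions for the example, not something proposed in the thread). Flipping a single bit in the data produces a completely different fingerprint.

```python
import hashlib


def attachment_digest(payload: bytes) -> str:
    """Return a SHA-256 hex digest of the decoded attachment bytes."""
    return hashlib.sha256(payload).hexdigest()


original = bytes(range(256)) * 4          # stand-in for image data
tweaked = bytearray(original)
tweaked[100] ^= 0x01                      # flip one bit, e.g. a one-pixel change

same = attachment_digest(original) == attachment_digest(bytes(original))
changed = attachment_digest(original) != attachment_digest(bytes(tweaked))
print(same, changed)                      # True True
```

So an exact-match check only catches byte-for-byte repeats; any per-message variation in the image defeats it.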

-- Mark Rogers More Solutions Ltd :: 0845 45 89 555

David Reynolds

2:16 p.m.

On 27 Jul 2006, at 2:07 pm, Mark Rogers wrote:

...

It is annoying getting the exact same spam several times a day and having anti-spam miss it completely.

Have you tried using spamassassin and sa-learn to train it what is spam?

Regards,

David

-- David Reynolds david@reynoldsfamily.org.uk

Mark Rogers

3:40 p.m.

David Reynolds wrote:

...

Have you tried using spamassassin and sa-learn to train it what is spam?

Unless I'm missing something (which is quite possible) this won't help with the image spam that I'm trying to work on, unless I can convert the image text to real text.

-- Mark Rogers More Solutions Ltd :: 0845 45 89 555

Tim Green

6:30 p.m.

On 7/27/06, Mark Rogers mark@quarella.co.uk wrote:

...

David Reynolds wrote:

...
Have you tried using spamassassin and sa-learn to train it what is spam?

Unless I'm missing something (which is quite possible) this won't help with the image spam that I'm trying to work on, unless I can convert the image text to real text.

I have found image spam cuts up the image into blocks, so OCRing the text will produce odd fragments. OCR will also consume more CPU than just filtering on the other parts of the message.

Tim.

Mark Rogers

28 Jul

11:58 a.m.

Tim Green wrote:

...

I have found image spam cuts up the image into blocks, so OCRing the text will produce odd fragments. OCR will also consume more CPU than just filtering on the other parts of the message.

Actually, the more I think about this the more I think that OCR may have some big benefits. Lots of spam techniques involve various tricks to embed a message amongst lots of rubbish, but to hide the rubbish from the viewer. Eg "mess<b></b>age".
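For the specific "mess<b></b>age" trick, a much cheaper stand-in for rendering is simply to drop the tags and join the remaining character data. The sketch below does that with Python's standard `html.parser`; the names `VisibleText` and `visible_text` are invented for the example. It is not the render-then-OCR approach, and it will not catch text hidden by styling (tiny fonts, white-on-white, and so on).

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect character data and ignore all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the text of an HTML fragment with the markup removed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()                      # flush any buffered trailing text
    return "".join(parser.chunks)


print(visible_text("mess<b></b>age"))   # message
```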

If there was a way to render the entire email as it would be seen in an email client, then OCR that and use that as the starting point of a filtering mechanism, then most of those techniques would become irrelevant.

As I mentioned elsewhere, OCR accuracy isn't as important as OCR repeatability.

CPU overhead is by far the biggest issue, but this is probably OK at the client end (not sure it would scale well to an ISP implementation). Obviously existing techniques would be used first to weed out obvious spam, and whitelists could be used to skip the checks applied to frequent senders of images (I get a lot of images by email, but very few from people I haven't dealt with before, so only checking the exceptions would be worthwhile.)
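The ordering described above (cheap checks first, whitelist next, OCR only for what is left) might look like this in Python. The function name `needs_ocr`, the `already_spam` flag and the whitelist of plain addresses are all assumptions made for the sketch.

```python
from email import message_from_string
from email.message import Message
from email.utils import parseaddr


def needs_ocr(msg: Message, whitelist: set, already_spam: bool) -> bool:
    """Decide whether the expensive OCR step is worth running for this message."""
    if already_spam:
        return False  # existing, cheaper checks have already caught it
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in whitelist:
        return False  # known correspondent who regularly sends images
    # Only messages that actually carry an image are candidates.
    return any(part.get_content_maintype() == "image" for part in msg.walk())


demo = message_from_string("From: Bob <bob@example.com>\nContent-Type: image/gif\n\n")
print(needs_ocr(demo, {"bob@example.com"}, already_spam=False))  # False
```

The `From` header is trivially forged, so a whitelist like this saves CPU but should not be treated as proof that a message is legitimate.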

As OCR improved (and it may well gain some development resource off the back of something like this) the technique would have better results, and the improved OCR would be useful elsewhere too.

-- Mark Rogers More Solutions Ltd :: 0845 45 89 555

Mark Rogers

31 Jul

11:20 p.m.

Mark Rogers wrote:

...

As I mentioned elsewhere, OCR accuracy isn't as important as OCR repeatability.

FWIW I just ran some quick tests using gocr and Imagemagick:

convert image001.gif pnm:- | gocr -

.. where image001.gif is a sample spam advert.
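The same pipeline can be driven from a script. This Python sketch runs the two commands exactly as given above, passing the PNM data from one to the other without a shell; it assumes `convert` and `gocr` are on the PATH, and the function names and the 30-second timeout are arbitrary choices for the example.

```python
import subprocess


def ocr_commands(image_path: str):
    """Return the two argv lists for: convert IMAGE pnm:- | gocr -"""
    return ["convert", image_path, "pnm:-"], ["gocr", "-"]


def ocr_image(image_path: str, timeout: float = 30.0) -> str:
    """Run ImageMagick and gocr as a pipeline and return gocr's text output."""
    convert_cmd, gocr_cmd = ocr_commands(image_path)
    # Step 1: convert the image to PNM on stdout.
    pnm = subprocess.run(convert_cmd, capture_output=True, check=True,
                         timeout=timeout)
    # Step 2: feed the PNM bytes to gocr on stdin.
    result = subprocess.run(gocr_cmd, input=pnm.stdout, capture_output=True,
                            check=True, timeout=timeout)
    return result.stdout.decode("utf-8", errors="replace")
```

Using argument lists rather than a shell string means an odd attachment filename cannot inject extra commands, and the timeout stops one pathological image from stalling mail delivery.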

When comparing the results from a wide sample[*] of images, despite the ocr results being relatively poor in themselves, there was very little variation between successive tests. So if I teach spamassassin that the first is spam it should work out that the next is spam too.

[*] OK it was two images, it's all I had to hand. But they were different; one had a completely different background colour, for a start.
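One rough way to put a number on "very little variation" is to compare the sets of tokens that two OCR runs produce. The Python sketch below uses Jaccard similarity; the sample strings are invented, not real gocr output, and this is not how SpamAssassin's learner works internally. It only illustrates why stable misreadings can still be useful features.

```python
def tokens(ocr_text: str) -> set:
    """Lower-case whitespace-separated tokens from one OCR run."""
    return set(ocr_text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Share of tokens two OCR outputs have in common (1.0 = identical sets)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Made-up OCR output for two renderings of the same advert.
first = "ch3ap w4tches 0rder n0w _ l1mited"
second = "ch3ap w4tches 0rder n0w l1mited ,"
print(jaccard(first, second))
```

A score close to 1.0 means the misread "words" recur from one copy to the next, which is the property a token-based learner needs.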

...

CPU overhead is by far the biggest issue, but this is probably OK at the client end (not sure it would scale well to an ISP implementation).

On the low-spec Win2K test PC I had to hand (so I'm pretty sure this is worst case) I could process between 2 and 3 images per second.

-- Mark Rogers More Solutions Ltd :: 0845 45 89 555

MJ Ray

28 Jul

7:27 a.m.

(Ted Harding) Ted.Harding@nessie.mcc.ac.uk

...

I have to agree with this too! The whole point of MIME and attachments is to provide a mechanism for transmitting information of kinds which plain-text mail is not suitable for. People need to be able to send and receive such information.

In most cases, the 33% size increase, the possibilities for interception and mail munging make it a fairly poor mechanism for file transfer. A useful last resort, though.
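The 33% figure comes from base64, which emits 4 characters for every 3 bytes of input. A quick check in Python (the 3000-byte payload is an arbitrary example size):

```python
import base64

raw = bytes(3000)                       # 3000 bytes of attachment data
encoded = base64.b64encode(raw)         # 4 output characters per 3 input bytes
print(len(raw), len(encoded))           # 3000 4000

# Mail bodies are also wrapped at 76 characters per line, adding a little more.
wrapped = base64.encodebytes(raw)
print(len(wrapped) > len(encoded))      # True
```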

I can see the point of MIME for one-off rich content, but that doesn't seem the most common legitimate use and the illegitimate uses dwarf legitimate ones. On balance, MIME seems to have turned out to be evil.

ObTopic: I'd be surprised if ocr'ing spam images worked well and what would happen when it saw one of the worms with an image that tries to exploit an error in a common graphics library?

Hope that helps,

-- MJ Ray - personal email, see http://mjr.towers.org.uk/email.html Work: http://www.ttllp.co.uk/ irc.oftc.net/slef Jabber/SIP ask

Dave

9:40 a.m.

MJ Ray mjr@phonecoop.coop wrote:

...

In most cases, the 33% size increase, the possibilities for interception and mail munging make it a fairly poor mechanism for file transfer. A useful last resort, though.

I can see the point of MIME for one-off rich content, but that doesn't seem the most common legitimate use and the illegitimate uses dwarf legitimate ones. On balance, MIME seems to have turned out to be evil.

Doesn't non-mime mean just us-ascii, ie just Roman alphabet, no accents? Wouldn't that be a bit limiting for non-English speaking email users?

Dave Cooper

Mark Rogers

9:59 a.m.

Dave wrote:

...

Doesn't non-mime mean just us-ascii, ie just Roman alphabet, no accents? Wouldn't that be a bit limiting for non-English speaking email users?

It would also mean no rich text (fonts, colours, etc). That wouldn't bother me and I'm pretty sure MJR would love it, but it's unrealistic to think that if MIME hadn't been invented then something else wouldn't have. (And we could UUencode attachments long before MIME was everywhere; all MIME really added to the party was rich text and internationalisation.)

If somebody had invented a rich-text alternative to email, along the lines of which you could argue that WWW is a rich-text alternative to gopher, then we'd all be stuck with that instead. At least email as it stands can swing both ways.

The nostalgia part of my brain longs for the days of multi-part UU-encoded files downloaded over a 9600 modem (my 2400 modem couldn't cope so I upgraded it), but I'm not sure the Internet would be so useful today had it not moved on.

-- Mark Rogers More Solutions Ltd :: 0845 45 89 555

MJ Ray

2:57 p.m.

Dave wrote:

...

Doesn't non-mime mean just us-ascii, ie just Roman alphabet, no accents? Wouldn't that be a bit limiting for non-English speaking email users?

I'm pretty sure that 8-bit transfer predates MIME. MIME is the most common way of specifying it now, but it's only a small part of it.
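What MIME contributes here is mostly labelling: the headers tell the receiver which character set the bytes are in and how they were encoded for transport. A small Python illustration with the standard `email` package (the subject and body text are made up):

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Grüße"
msg.set_content("Ça va très bien, merci.\n")

# The Content-Type header records the character set of the body.
print(msg.get_content_charset())          # utf-8
# The transfer encoding says how the bytes travel (8bit, quoted-printable, ...).
print(msg["Content-Transfer-Encoding"])
```

Without such labels a receiver has to guess the character set, which is how accented text ended up garbled between systems with different defaults.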

Mark Rogers mark@quarella.co.uk wrote:

...

It would also mean no rich text (fonts, colours, etc). That wouldn't bother me and I'm pretty sure MJR would love it, but it's unrealistic to

I'd love stopping the idiots who specify things like white background with no specified text colour, so it goes wrong unless everyone has a default text colour similar to theirs.

If they aren't competent to design rich emails, they should stick to plain text ones rather than causing everyone pain. What's that saying about DTP? Something like "the good thing about DTP is that every man and his dog can design publications now; the bad thing about DTP is that most publications look like dogs designed them."

However, I did note that one-off rich content is a worthwhile use of MIME. I'm surprised that two of the apparent rebuttals use it as an example - rich emails and the PDF quotes - when I already said it was fine (although most people prefer a text quote to a PDF attachment, in my experience). Let me be clear:

- One-off rich mails are a fine use of MIME.
- Most MIME use is not one-off rich mail.
- MIME has turned out to be mostly evil.

All I'm asking for is responsible use of it, then we can change the balance back. Stuff like mass-mailing Word documents, letterhead graphics and video clips must stop, or per-byte mailserver charges will become reality sooner than we'd like.

ObOrigTopic: it seems like it would be far cheaper to switch on detection of repeat attachments and reject them at SMTP time with an appropriate "no bulk attachments" message. I like the idea of ocr'ing one-off attachments, for accessibility and searching, but that may be best done in the mail client.
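Detection of repeat attachments could be as simple as counting fingerprints. The Python sketch below is one possible shape for it; the class name, the threshold of 3 and the exempt-sender list are assumptions added for illustration, and it only catches byte-for-byte identical attachments.

```python
import hashlib
from collections import Counter


class RepeatAttachmentDetector:
    """Count identical attachments and flag ones seen too often."""

    def __init__(self, threshold: int = 3, exempt_senders=()):
        self.threshold = threshold
        self.exempt = {s.lower() for s in exempt_senders}
        self.seen = Counter()

    def is_bulk(self, sender: str, payload: bytes) -> bool:
        """Record one sighting and report whether it reaches the threshold."""
        if sender.lower() in self.exempt:
            return False
        digest = hashlib.sha256(payload).hexdigest()
        self.seen[digest] += 1
        return self.seen[digest] >= self.threshold


detector = RepeatAttachmentDetector(threshold=3)
results = [detector.is_bulk("x@example.com", b"same image") for _ in range(3)]
print(results)                            # [False, False, True]
```

A real deployment would also need to expire old counts and decide what rejection message to give at SMTP time; neither is covered here.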

Best wishes,

-- MJ Ray - personal email, see http://mjr.towers.org.uk/email.html Work: http://www.ttllp.co.uk/ irc.oftc.net/slef Jabber/SIP ask

Mark Rogers

4:57 p.m.

MJ Ray wrote:

...

I'd love stopping the idiots who specify things like white background with no specified text colour, so it goes wrong unless everyone has a default text colour similar to theirs.

It also tends to go horribly horribly wrong when quoted.

...

If they aren't competent to design rich emails, they should stick to plain text ones rather than causing everyone pain.

I do use richtext a lot (when I know the recipient can cope with it) - stuff like bullet points can help to make an email clearer. Shame that most email clients make bullets hard work.

...

[...] (although most people prefer a text quote to a PDF attachment, in my experience).

IME, most people prefer sufficient content in the email text to avoid having to open the attachment for the important points, but want the content in a form they can easily print (which for most clients means an attachment - otherwise you are forced to print email details like From/Subject/etc).

Personally, I always send quotes as PDFs. It's pretty much what PDFs were designed for - I know how it'll look when they get it, it's designed for printing, and (without some messing around) they can't edit it.

...

Stuff like mass-mailing Word documents, letterhead graphics and video clips must stop, or per-byte mailserver charges will become reality sooner than we'd like.

I'm not convinced that (on average) capacity isn't keeping pace with usage. The problem occurs for non-average (when you're stuck on dial-up, for example).

If mail clients just told the user what they were doing and helped them get it right that would go a long way. The most popular ones do not make the difference between mailing a 5k attachment and a 5M attachment apparent. When sending a photo, would it be that hard to offer a resize/convert to jpeg option?

The idea of forwarding "funny" images is not going to go away.

...

ObOrigTopic: it seems like it would be far cheaper to switch on detection of repeat attachments and reject them at SMTP time with an appropriate "no bulk attachments" message.

I'm not sure the attachments (at a binary level) are repeats. I haven't checked, but even if they are all the same then this would very quickly be changed (making a one or two pixel change on sending would be pretty simple and would change the characteristics of the attachment sufficiently). It would be necessary to allow my web designer to email me a few iterations of a logo, for example, without having it blocked.

OCR-ing takes the step of making qualitative decisions, not just quantitative ones.

...

I like the idea of ocr'ing one-off attachments, for accessibility and searching, but that may be best done in the mail client.

I think office server is the place, not mail client (but also not ISP mail server). I.e. the same place spamassassin sits. But if Thunderbird had an extension to do this I'd use it!

Of course the email client has the code to render the email, so that may be the easiest place. On the other hand, the joy of FOSS is that you can take code from one place and put it somewhere else....

I did look for libraries for converting an email to an image, but without success (although I didn't try that hard).

-- Mark Rogers More Solutions Ltd :: 0845 45 89 555

Tim Green

5:04 p.m.

On 7/28/06, Mark Rogers mark@quarella.co.uk wrote:

...

When sending a photo, would it be that hard to offer a resize/convert to jpeg option?

My mobile phone gets this right! If I try to send a large image, it offers the choice of full size (2 megapixels) or a 640x480 copy of the image (leaving the original intact in the phone).

Regards, Tim.

Wayne Stallwood

6:13 p.m.

On Fri, 2006-07-28 at 17:04 +0100, Tim Green wrote:

...

On 7/28/06, Mark Rogers mark@quarella.co.uk wrote:

...
When sending a photo, would it be that hard to offer a resize/convert to jpeg option?

My mobile phone gets this right! If I try to send a large image, it offers the choice of full size (2 megapixels) or a 640x480 copy of the image (leaving the original intact in the phone).

Windows XP also does this: if you right-click an image and choose "Send To", then "Mail Recipient", it prompts you to make the image smaller.

Ted.Harding@nessie.mcc.ac.uk

6:43 p.m.

On 28-Jul-06 Wayne Stallwood wrote:

...

On Fri, 2006-07-28 at 17:04 +0100, Tim Green wrote:

...
On 7/28/06, Mark Rogers mark@quarella.co.uk wrote:

...
When sending a photo, would it be that hard to offer a resize/convert to jpeg option?

My mobile phone gets this right! If I try to send a large image, it offers the choice of full size (2 megapixels) or a 640x480 copy of the image (leaving the original intact in the phone).

Windows XP also does this: if you right-click an image and choose "Send To", then "Mail Recipient", it prompts you to make the image smaller.

And you can even do this in Linux -- if you use one of the many image applications which have an option to re-size the image (and save it under a different name).

Mind you, I don't know of a Linux email application that offers you the choice, but you can always make a conscious decision to do it before sending the email so that the reduced image is ready and waiting.

I like using the 'display' program from ImageMagick for this (and for other things). You can choose the new pixel dimensions as you like (though of course it's usually wise to preserve the original aspect ratio, which you can determine from the "Image Info").
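The arithmetic for keeping the aspect ratio is short enough to show. This Python sketch (the function name `fit_within` is made up) scales both dimensions by the same factor so the result fits inside a target box and is never enlarged. ImageMagick's own `-resize 640x480` geometry does this kind of fitting by default, so the helper is only needed when computing sizes yourself.

```python
def fit_within(width: int, height: int, max_width: int, max_height: int):
    """Scale (width, height) down to fit the box, keeping the aspect ratio."""
    scale = min(max_width / width, max_height / height, 1.0)  # never enlarge
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1600, 1200, 640, 480))   # (640, 480)
print(fit_within(1200, 1600, 640, 480))   # (360, 480)
```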

Best wishes, Ted.

-------------------------------------------------------------------- E-Mail: (Ted Harding) Ted.Harding@nessie.mcc.ac.uk Fax-to-email: +44 (0)870 094 0861 Date: 28-Jul-06 Time: 18:43:02 ------------------------------ XFMail ------------------------------

MJ Ray

9:42 p.m.

Ted Harding Ted.Harding@nessie.mcc.ac.uk wrote:

...

Mind you, I don't know of a Linux email application that offers you the choice, [...]

GNUMail labels the image with its size, so it's pretty obvious. Sadly, there's not yet a 'New mail with resized image attachment' Service available.

Hope that helps,

-- MJ Ray - see http://mjr.towers.org.uk/email.html North End, Lynn, Norfolk, England Work: http://www.ttllp.co.uk/ IRC/Jabber/SIP: on request

Wayne Stallwood

9:54 a.m.

Almost every business I deal with has a good case for using attachments. It doesn't matter if they aren't the most efficient or most secure way of transferring files, the functionality is there and if the person is sending an email anyway it is the most convenient way.

As an example, when we send out quotes we generally send them as an email with attached PDF. The email body contains a covering description and the PDF contains the quote itself.

To even consider a system whereby we have some hosted service and say generate a unique URL for each hosted quotation really adds nothing for us or our customers (well apart from being able to look at a log and see when they downloaded the quote, which is pretty useless information)

Other times we may be collaborating on a document with a client (perhaps it's a spec that requires amendments at either end). The easiest way of doing this is to mail the document backward and forward with amendments attached. Yes, we could have some fancy version control system, but most of our clients would be put off by the complexity. So now by offering URLs we have to support upload as well...hmmm

Now in some cases I agree. If I want to share my holiday photos, mailing them to 20 friends is a silly way of doing it. In a case like this hosting them and sending everyone a link would make much more sense.

It's a case of right tool for the right job. If I want to collaborate on a document or send a small file within the context of an email then I want to use email to do it. If I want to share a number of files with a number of people then there are better ways.

Mark Rogers

11:26 a.m.

MJ Ray wrote:

...

ObTopic: I'd be surprised if ocr'ing spam images worked well and what would happen when it saw one of the worms with an image that tries to exploit an error in a common graphics library?

I'm not convinced it would be so bad (and, from looking at the spam I've had recently the image is a single image, not a composite of multiple images, although I appreciate that this would be the spammer's next step).

Assume the OCR makes mistakes reading the text and comes out with some (to a human reading it) garbage. So what? The next time it sees the same image it'll see the same garbage and if it learnt from the first one it'll recognise the second one. This is no different from the previous spammer trick of mixing numbers and letters and misspellings. I'm sure I'm not alone in having seen one or two image spams repeated ad nauseam recently.

As to whether a worm can exploit an error in a graphics library - well what if it exploits an error in spamassassin, or exim/qmail, or something else?

There are also some non-spam benefits: I could search for text within images in archived emails. It wouldn't help me much but a lot of our clients are insurance brokers who seem to email lots of scanned documents around as .tiff or .jpeg (or more usually as a bitmap embedded in a .doc, but that's another issue entirely...)

-- Mark Rogers More Solutions Ltd :: 0845 45 89 555


main@lists.alug.org.uk
