I've been in contact with a website which hosts a large range of images (www.bioimages.org.uk). The images are "a large selection of pictures of Natural History objects, mostly British in origin", which currently amount to around 400MB of HTML (autogenerated from a database) on IIS, and 3-4GB of photographs hosted separately.
The site is currently offline due to attempted site rips causing load and bandwidth problems. I've offered to help look at the problem, but due to the capacity required I don't think I'm in a position to help with the hosting, although I may be able to help technically.
Several issues come to mind on which I'd like advice. I think there are Apache modules which can help to control the abuse aspect, but I'm not sure where to start; any suggestions? There are options like scripting the site (e.g. using PHP) and limiting the number of downloads per session. I could probably come up with a solution, but it's not something I have experience in managing.
If anyone has suggestions for a generous host that would be useful too. What he's currently looking for is hosting for the HTML but not the image library, but he cites bandwidth as the biggest problem. I need to work out why that is. I'd be surprised if the bandwidth for the HTML itself is a huge problem (although I think the way the pages are generated means that they are static pages with new timestamps after each generation, so search engine traffic alone could be massive).
I have no direct connection with the site; it's a friend who manages a site which makes reference to the images that brought the problem to my attention. I am inclined to help because I think this sort of thing is what the web should be about (i.e. decent libraries of decent information).
In my opinion the only way you can prevent site rips causing bandwidth problems for this sort of content is to have either thumbnail or very highly compressed images, perhaps watermarked as well and then hack up some PHP to do a download cart style system with a login and quotas.
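As a rough illustration of the quota half of that idea, here is a minimal sketch in Python (rather than PHP) of a per-session download counter with a rolling time window. The class name, the limit of 20 downloads and the one-hour window are all assumptions made for the example; nothing here reflects how the site actually works, and a real version would keep its state in the session store or database rather than in memory.

```python
import time


class DownloadQuota:
    """Track full-size downloads per session within a rolling time window."""

    def __init__(self, max_downloads=20, window_seconds=3600):
        # Both limits are illustrative assumptions, not values from the site.
        self.max_downloads = max_downloads
        self.window_seconds = window_seconds
        self._history = {}  # session id -> list of download timestamps

    def allow(self, session_id, now=None):
        """Return True and record the download if the session is under quota."""
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        # Keep only the downloads that still fall inside the window.
        recent = [t for t in self._history.get(session_id, []) if t > cutoff]
        allowed = len(recent) < self.max_downloads
        if allowed:
            recent.append(now)
        self._history[session_id] = recent
        return allowed
```

The page that serves a full-size image would call `allow()` with the logged-in session's id and fall back to the thumbnail (or an error page) when it returns False.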
If the browser is going to display the image then by definition there will be a way to rip it, so the best you can do is make the image viewed in the browser either so small it doesn't matter or undesirable due to aggressive compression and/or watermarking.
It might be worth looking at one of the CMS systems (Joomla/Mambo or others) with one of the photo library plugins or one of the dedicated PHP photo libraries (which may be a better choice if you never need the other CMS functions)
As for hosting, I am a bit confused as to how he has sufficient space and bandwidth to put the images somewhere and yet not have capacity for the pages?
Wayne Stallwood wrote:
It might be worth looking at one of the CMS systems (Joomla/Mambo or others) with one of the photo library plugins or one of the dedicated PHP photo libraries (which may be a better choice if you never need the other CMS functions)
We do have some experience with "gallery" (http://gallery.menalto.com) and using that had crossed my mind. Limiting the site rip simply by making it too cumbersome is always an option :-)
As for hosting, I am a bit confused as to how he has sufficient space and bandwidth to put the images somewhere and yet not have capacity for the pages?
I'm trying to get to the bottom of this too. It looks like he's found someone to host all the images (not worked out where yet; since the site is currently down I'm waiting on answers instead of working it out myself), and it is just the fact that the HTML is auto-generated from a database (and thus static content always looks "new") that is causing a massive bandwidth headache. If so, then this can almost certainly be massively simplified very easily. It would also mean that the 400MB+ of static HTML could become a few hundred kB of templates populated on the fly from the database, with suitable headers generated so that old content isn't seen as new. If this is all that's required I can do this myself (including hosting it).
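To make the "suitable headers" part concrete, here is a minimal Python sketch of conditional GET handling. It assumes each database record carries a last-modified timestamp (I don't know whether the real database does), and the function names are invented for the example. The point is that the validators come from the content and the record's own modification time, not from the moment the page happened to be generated, so an unchanged page can be answered with a bodiless 304.

```python
import hashlib
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime


def make_validators(body, last_modified_ts):
    """Build ETag and Last-Modified from the content, not the generation time."""
    etag = '"' + hashlib.sha1(body).hexdigest() + '"'
    return {"ETag": etag,
            "Last-Modified": formatdate(last_modified_ts, usegmt=True)}


def respond(request_headers, body, last_modified_ts):
    """Return (status, headers, body); 304 with no body if the client is current."""
    headers = make_validators(body, last_modified_ts)
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When an ETag is supplied it takes precedence over the date check.
        if if_none_match == headers["ETag"]:
            return 304, headers, b""
        return 200, headers, body
    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            dt = parsedate_to_datetime(since)
            if dt.tzinfo is None:
                dt = dt.replace(tzinfo=timezone.utc)
            since_ts = dt.timestamp()
        except (TypeError, ValueError):
            since_ts = None  # unparseable date: fall through to a full response
        if since_ts is not None and int(last_modified_ts) <= since_ts:
            return 304, headers, b""
    return 200, headers, body
```

A crawler that revisits an unchanged page and sends back the ETag or date it was given then costs a few hundred bytes of headers instead of the whole page.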
However, if my understanding of the setup is correct, then the image hosting will be a problem because at some level all the main site is doing is giving out URLs to the images on the image host, which means he will lose control of those images completely. Unless he's pulling them through code on the local site (unlikely as I don't think there's any script code there), in which case his bandwidth requirements would be huge! So medium to long term I think he does need to get everything onto one host, which is where I will come unstuck as I can't realistically allocate 4GB of disk space, regardless of bandwidth requirements.
Incidentally, I've been directed towards http://www.coralcdn.org/ as an option for assisting on the bandwidth front, which is something I hadn't heard of until now. From the site: "Coral is peer-to-peer content distribution network, comprised of a world-wide network of web proxies and nameservers. It allows a user to run a web site that offers high performance and meets huge demand, all for the price of a $50/month cable modem."
If you contact me off list I'll see what I can do to help with the hosting of the HTML and possibly the images.
I'd second the "use thumbnails" suggestion, along with a few other solutions:
1) make use of the HTTP_REFERER variable (not perfect but useful)
2) store the images outside any public folder, then hardlink them into place when the correct index page is hit, and use a cronjob to clean up the hardlinks if the images haven't been accessed in, say, 10 mins, using find's access-time test (e.g. -amin +10). I've used this method quite successfully many times.
3) possibly use mod_bandwidth with Apache (not tried it yet)
4) possibly use mod_security with Apache (again not tried it yet)
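For anyone who wants to see the shape of option 2, here is a minimal Python sketch of the hardlink trick. The directory layout and function names are assumptions for illustration only; the cleanup function stands in for the cron job that would normally call find. Resetting the access time when a link is created is my own addition, so that a freshly exposed image is not swept away before the browser has fetched it.

```python
import os
import time


def expose_images(store_dir, public_dir, filenames, now=None):
    """Hard-link the named images from the private store into the public folder."""
    now = time.time() if now is None else now
    for name in filenames:
        src = os.path.join(store_dir, name)
        dst = os.path.join(public_dir, name)
        if not os.path.exists(dst):
            os.link(src, dst)  # second directory entry for the same file data
        # Assumption: restart the idle timer, keeping the modification time.
        os.utime(dst, (now, os.stat(dst).st_mtime))


def clean_stale_links(public_dir, max_idle_seconds=600, now=None):
    """Remove public links not accessed within max_idle_seconds (run from cron)."""
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(public_dir):
        path = os.path.join(public_dir, name)
        if now - os.stat(path).st_atime > max_idle_seconds:
            os.unlink(path)  # the entry in the private store is untouched
            removed.append(name)
    return sorted(removed)
```

Two caveats: hard links only work when both folders are on the same filesystem, and the idle test relies on access times being updated, which mount options such as noatime will prevent.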
Cheers
Stuart
Stuart Fox wrote:
If you contact me off list I'll see what I can do to help with the hosting of the HTML and possibly the images.
Email sent off-list - thanks for the offer of possible help.
I'd second the "use thumbnails" suggestion, along with a few other solutions:
- make use of the HTTP_REFERER variable (not perfect but useful)
I am told he has tried this, but as I haven't seen the code I'm not sure why this wasn't helpful. He also used JavaScript hacks but I can't see how they would be helpful. At the end of the day it's not easy to protect images on a third party site (but that shouldn't affect his bandwidth on the main site), and if the problem is HTML that's not being cached then no amount of JavaScript is going to help.
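For reference, a Referer check usually amounts to something like the following Python sketch. The allowed-host set and the function name are assumptions for illustration, and I have no idea what his actual code looked like. It also shows why the check is "not perfect": the header is supplied by the client, so it can be missing or forged, and you have to decide what to do when it is absent.

```python
from urllib.parse import urlsplit

# Assumption: the hosts allowed to embed or link to the images. A site that
# links with permission would be added to this set.
ALLOWED_HOSTS = {"www.bioimages.org.uk"}


def referer_allowed(referer, allowed_hosts=ALLOWED_HOSTS, allow_missing=True):
    """Decide whether an image request should be served, based on its Referer."""
    if not referer:
        # Some browsers and proxies strip the header, so rejecting requests
        # without one also blocks some legitimate visitors.
        return allow_missing
    host = urlsplit(referer).hostname  # None if the value is not a URL
    return host is not None and host in allowed_hosts
```

This stops casual hotlinking from other sites, but a ripping tool that sends a plausible Referer passes straight through.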
- store the images outside any public folder, then hardlink them into place when the correct index page is hit, and use a cronjob to clean up the hardlinks if the images haven't been accessed in, say, 10 mins, using find's access-time test (e.g. -amin +10). I've used this method quite successfully many times.
Interesting hack; yes I can see that working quite well.
I need to find out (and have asked but not had a reply yet) what he actually considers as "abuse". The reason I know about this site is through another site which links (with permission) to his images.
- possibly use mod_bandwidth with apache (not tried it yet)
- possibly use mod_security with apache (again not tried it yet)
I'll look at mod_bandwidth and mod_security but I think that while they probably go some way to solving the problem I started the thread with, they may not be so relevant the more I come to understand about what the problem really is.
Stuart, if you see this, can you contact me ASAP please? If anyone is in contact with Stuart, can you pass a message on please, as his email address is not working.
Thanks Tony