[ALUG] Globbed search for filenames in website

Richard Lewis richardlewis at fastmail.co.uk
Sun Jun 7 12:39:27 BST 2009


At Sun, 07 Jun 2009 12:05:29 +0100 (BST),
(Ted Harding) wrote:
> 
> On 07-Jun-09 10:36:02, mick wrote:
> > On Sun, 07 Jun 2009 08:41:36 +0100 (BST)
> > (Ted Harding) <Ted.Harding at manchester.ac.uk> allegedly wrote:
> > 
> >> Greetings!
> >> I'm looking for a way to find out what files for a particular
> >> wild-card form exist at a certain directory depth on a website.
> >> 
> >> For example:
> >> 
> >>   www.some.web.page/*/*.png
> >> 
> >> for all PNG files 1 below the top level. The results to be listed
> >> (stored in a file) along the lines of what you would get from
> >> 
> >>   ls */*.png
> >> 
> >>if you were at the top level on the server.
> >> 
> >> A browser won't do it (won't accept wild-cards). I've looked at wget,
> >> but this doesn't seem to have a simple listing option (except under
> >> ftp mode, which the remote site won't respond to).
> >> 
> >> Any suggestions?
> > 
> > Ted
> > If I've understood you correctly then find should do it. Try
> > find . -maxdepth 2 -name "*.png" -print
> > (You can redirect output to a file of course)
> > Mick
> 
> Thanks, Mick, but I can't execute 'find' on a remote website!
> This can only be accessed by http.
> 
> Specifically: Go to
>   http://journal.sjdm.org/
> and then click on, say, "1" in "Vol. 4 (2009): 1"
> You will see a list of articles, of which the first has URL:
>   http://journal.sjdm.org/8816/jdm8816.pdf
> 
> Note the "/8816/" -- this is the number of the article. Now
> go to that directory:
>   http://journal.sjdm.org/8816
> and note the listing of files with extensions .html, .pdf, .tex, .gif.
> 
> Now do similarly with the second article, which has URL:
>   http://journal.sjdm.org/81125/jdm81125.pdf
> and its directory is:
>   http://journal.sjdm.org/81125
> in which, as well as .html, .pdf, .tex and .gif files, there is
> a file: fig1.R
> 
> It is similar throughout: the directories for the articles are
> 
>   http://journal.sjdm.org/*
> 
> where * is a 4- or 5-digit number (of the article), and some of
> these directories have one or more ".R" files in them, others
> have none.
> 
> What I want is to list (with directory paths) all ".R" files on
> that web-page. In other words, in regexp language,
> 
>   http://journal.sjdm.org/[0-9]+/*.R
> 
> However, it seems that the HTTP protocol won't accept "wild cards"
> (or so wget tells me), and the website won't accept FTP access
> (where I could use wild cards).
> 
Er, yes. You can't do anything equivalent to a kind of 'httpsh', a
'shell over HTTP', unless someone actually implements an HTTP-speaking
server which accepts UNIX shell-like URIs and returns UNIX shell-like
responses.

Please take a moment to consider the security implications of http
allowing users to do things like directory listing, file renaming,
indiscriminate PUTting, etc. ...

The closest you might get is to use wget recursively to download a
complete copy of the web site in question, then run the shell
commands of interest against your local copy:

$ mkdir sjdm
$ cd sjdm
$ wget -r -k -np -nH http://journal.sjdm.org/

should do it. Please see your man wget for details.
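
(Untested aside: if all you actually want are the .R files rather than
a complete mirror, wget's accept list might cut the download down:

$ wget -r -np -nH -A '*.R' http://journal.sjdm.org/

wget still has to fetch the HTML index pages in order to follow the
links, but removes anything that doesn't match the accept list
afterwards. Check man wget before relying on this.)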

Then you can use find in your new sjdm directory.
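
For example, adapting mick's earlier command (with -nH the article
directories land directly under sjdm, so a depth of 2 covers paths
like ./8816/fig1.R; Rfiles.txt is just an example output name):

$ find . -maxdepth 2 -name '*.R' -print > Rfiles.txt
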
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Richard Lewis
ISMS, Computing
Goldsmiths, University of London
Tel: +44 (0)20 7078 5134
Skype: richardjlewis
JID: ironchicken at jabber.earth.li
http://www.richard-lewis.me.uk/
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
+-------------------------------------------------------+
|Please avoid sending me Word or PowerPoint attachments.|
|http://www.gnu.org/philosophy/no-word-attachments.html |
+-------------------------------------------------------+


