At Sun, 07 Jun 2009 12:05:29 +0100 (BST), (Ted Harding) wrote:
On 07-Jun-09 10:36:02, mick wrote:
On Sun, 07 Jun 2009 08:41:36 +0100 (BST) (Ted Harding) Ted.Harding@manchester.ac.uk allegedly wrote:
Greetings! I'm looking for a way to find out what files for a particular wild-card form exist at a certain directory depth on a website.
For example:
www.some.web.page/*/*.png
for all PNG files one level below the top level. The results would be listed (stored in a file) along the lines of what you would get from
ls */*.png
if you were at the top level on the server.
A browser won't do it (won't accept wild-cards). I've looked at wget, but this doesn't seem to have a simple listing option (except under ftp mode, which the remote site won't respond to).
Any suggestions?
Ted

If I've understood you correctly then find should do it. Try

  find . -maxdepth 2 -name "*.png" -print

(You can redirect output to a file, of course.)

Mick
Thanks, Mick, but I can't execute 'find' on a remote website! This can only be accessed by http.
Specifically: Go to http://journal.sjdm.org/ and then click on, say, "1" in "Vol. 4 (2009): 1". You will see a list of articles, of which the first has the URL: http://journal.sjdm.org/8816/jdm8816.pdf
Note the "/8816/" -- this is the number of the article. Now go to that directory: http://journal.sjdm.org/8816 and note the listing of files with extensions .html, .pdf, .tex, .gif.
Now do similarly with the second article, which has the URL: http://journal.sjdm.org/81125/jdm81125.pdf and its directory is: http://journal.sjdm.org/81125 in which, as well as .html, .pdf, .tex and .gif files, there is a file: fig1.R
It is similar throughout: the directories for the articles are of the form http://journal.sjdm.org/*/ where * is a 4- or 5-digit number (the article number), and some of these directories have one or more ".R" files in them, others have none.
What I want is to list (with directory paths) all ".R" files on that site. In other words, in regexp language,
http://journal.sjdm.org/[0-9]+/*.R
However, it seems that the HTTP protocol won't accept "wild cards" (or so wget tells me), and the website won't accept FTP access (where I could use wild cards).
Er, yes. You can't do anything equivalent to a kind of 'httpsh', 'shell over http' unless someone actually implements an http-speaking server which accepts UNIX shell-like URIs and returns UNIX shell-like responses.
Please take a moment to consider the security implications of http allowing users to do things like directory listing, file renaming, indiscriminate PUTting, etc. ...
The closest you might get is to use wget in recursive mode to download a complete copy of the web site in question, then execute the shell commands of interest against your local copy:
$ mkdir sjdm
$ cd sjdm
$ wget -r -k -np -nH http://journal.sjdm.org/
should do it. Please see man wget for details of those options.
Then you can use find in your new sjdm directory.
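For example (a minimal sketch, assuming the mirror completed and you are still inside the sjdm directory; R-files.txt is just an illustrative output name):

  $ find . -maxdepth 2 -name "*.R" -print > R-files.txt

Each line of R-files.txt then names an article directory and the .R file inside it, i.e. the ls */*.R style listing you were after.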