At Sun, 07 Jun 2009 12:05:29 +0100 (BST), (Ted Harding) wrote:
On 07-Jun-09 10:36:02, mick wrote:
On Sun, 07 Jun 2009 08:41:36 +0100 (BST) (Ted Harding) Ted.Harding@manchester.ac.uk allegedly wrote:
Greetings! I'm looking for a way to find out what files for a particular wild-card form exist at a certain directory depth on a website.
For example:
www.some.web.page/*/*.png
for all PNG files one level below the top level. The results would be listed (stored in a file) along the lines of what you would get from
ls */*.png
if you were at the top level on the server.
A browser won't do it (won't accept wild-cards). I've looked at wget, but this doesn't seem to have a simple listing option (except under ftp mode, which the remote site won't respond to).
Any suggestions?
Ted

If I've understood you correctly then find should do it. Try

  find . -maxdepth 2 -name "*.png" -print

(You can redirect output to a file, of course.)

Mick
Thanks, Mick, but I can't execute 'find' on a remote website! This can only be accessed by http.
Specifically: Go to http://journal.sjdm.org/ and then click on, say, "1" in "Vol. 4 (2009): 1". You will see a list of articles, of which the first has the URL: http://journal.sjdm.org/8816/jdm8816.pdf
Note the "/8816/" -- this is the number of the article. Now go to that directory: http://journal.sjdm.org/8816 and note the listing of files with extensions .html, .pdf, .tex, .gif.
Now do similarly with the second article, which has the URL: http://journal.sjdm.org/81125/jdm81125.pdf and its directory is: http://journal.sjdm.org/81125 in which, as well as .html, .pdf, .tex and .gif files, there is a file: fig1.R
It is similar throughout: the directories for the articles are of the form http://journal.sjdm.org/*/ where * is a 4- or 5-digit number (the article number), and some of these directories have one or more ".R" files in them, others have none.
What I want is to list (with directory paths) all ".R" files on that site. In other words, in regexp language,
http://journal.sjdm.org/[0-9]+/*.R
However, it seems that the HTTP protocol won't accept "wild cards" (or so wget tells me), and the website won't accept FTP access (where I could use wild cards).
Er, yes. You can't do anything equivalent to a kind of 'httpsh', 'shell over http' unless someone actually implements an http-speaking server which accepts UNIX shell-like URIs and returns UNIX shell-like responses.
Please take a moment to consider the security implications of http allowing users to do things like directory listing, file renaming, indiscriminate PUTting, etc. ...
The closest you might get is to use wget in recursive mode to download a complete copy of the web site in question, then execute the shell commands of interest against your local copy:
$ mkdir sjdm
$ cd sjdm
$ wget -r -k -np -nH http://journal.sjdm.org/
should do it. Please see man wget for details of those options.
Then you can use find in your new sjdm directory.
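For example (a minimal sketch, assuming the mirror completed and you are still inside the sjdm directory; R-files.txt is just an illustrative output name):

  $ find . -maxdepth 2 -name "*.R" -print > R-files.txt

Each line of R-files.txt then names an article directory and the .R file inside it, i.e. the ls */*.R style listing you were after.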