Greetings! I'm looking for a way to find out what files matching a particular wild-card pattern exist at a certain directory depth on a website.
For example:
www.some.web.page/*/*.png
for all PNG files 1 below the top level. The results to be listed (stored in a file) along the lines of what you would get from
ls */*.png
if you were at the top level on the server.
A browser won't do it (won't accept wild-cards). I've looked at wget, but this doesn't seem to have a simple listing option (except under ftp mode, which the remote site won't respond to).
Any suggestions? With thanks, Ted.
On Sun, 07 Jun 2009 08:41:36 +0100 (BST) (Ted Harding) Ted.Harding@manchester.ac.uk allegedly wrote:
Greetings! I'm looking for a way to find out what files matching a particular wild-card pattern exist at a certain directory depth on a website.
For example:
www.some.web.page/*/*.png
for all PNG files 1 below the top level. The results to be listed (stored in a file) along the lines of what you would get from
ls */*.png
if you were at the top level on the server.
A browser won't do it (won't accept wild-cards). I've looked at wget, but this doesn't seem to have a simple listing option (except under ftp mode, which the remote site won't respond to).
Any suggestions?
Ted
If I've understood you correctly then find should do it. Try
find . -maxdepth 2 -name "*.png" -print
(You can redirect output to a file of course)
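For example, to save the list to a file (the file name here is just an illustration):

find . -maxdepth 2 -name "*.png" -print > png-list.txt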
Mick
On 07-Jun-09 10:36:02, mick wrote:
On Sun, 07 Jun 2009 08:41:36 +0100 (BST) (Ted Harding) Ted.Harding@manchester.ac.uk allegedly wrote:
Greetings! I'm looking for a way to find out what files matching a particular wild-card pattern exist at a certain directory depth on a website.
For example:
www.some.web.page/*/*.png
for all PNG files 1 below the top level. The results to be listed (stored in a file) along the lines of what you would get from
ls */*.png
if you were at the top level on the server.
A browser won't do it (won't accept wild-cards). I've looked at wget, but this doesn't seem to have a simple listing option (except under ftp mode, which the remote site won't respond to).
Any suggestions?
Ted

If I've understood you correctly then find should do it. Try

find . -maxdepth 2 -name "*.png" -print

(You can redirect output to a file of course)

Mick
Thanks, Mick, but I can't execute 'find' on a remote website! This can only be accessed by http.
Specifically: go to http://journal.sjdm.org/ and then click on, say, "1" in "Vol. 4 (2009): 1". You will see a list of articles, of which the first has the URL: http://journal.sjdm.org/8816/jdm8816.pdf
Note the "/8816/" -- this is the number of the article. Now go to that directory: http://journal.sjdm.org/8816 and note the listing of files with extensions .html, .pdf, .tex, .gif.
Now do the same with the second article, which has URL: http://journal.sjdm.org/81125/jdm81125.pdf and whose directory is: http://journal.sjdm.org/81125 in which, as well as .html, .pdf, .tex and .gif files, there is a file: fig1.R
It is similar throughout: the directories for the articles are
http://journal.sjdm.org/*/
where * is a 4- or 5-digit number (the article number), and some of these directories have one or more ".R" files in them while others have none.
What I want is to list (with directory paths) all ".R" files on that site. In other words, in regexp language,
http://journal.sjdm.org/[0-9]+/*.R
However, it seems that the HTTP protocol won't accept "wild cards" (or so wget tells me), and the website won't accept FTP access (where I could use wild cards).
Ted.
At Sun, 07 Jun 2009 12:05:29 +0100 (BST), (Ted Harding) wrote:
On 07-Jun-09 10:36:02, mick wrote:
On Sun, 07 Jun 2009 08:41:36 +0100 (BST) (Ted Harding) Ted.Harding@manchester.ac.uk allegedly wrote:
Greetings! I'm looking for a way to find out what files matching a particular wild-card pattern exist at a certain directory depth on a website.
For example:
www.some.web.page/*/*.png
for all PNG files 1 below the top level. The results to be listed (stored in a file) along the lines of what you would get from
ls */*.png
if you were at the top level on the server.
A browser won't do it (won't accept wild-cards). I've looked at wget, but this doesn't seem to have a simple listing option (except under ftp mode, which the remote site won't respond to).
Any suggestions?
Ted

If I've understood you correctly then find should do it. Try

find . -maxdepth 2 -name "*.png" -print

(You can redirect output to a file of course)

Mick
Thanks, Mick, but I can't execute 'find' on a remote website! This can only be accessed by http.
Specifically: go to http://journal.sjdm.org/ and then click on, say, "1" in "Vol. 4 (2009): 1". You will see a list of articles, of which the first has the URL: http://journal.sjdm.org/8816/jdm8816.pdf
Note the "/8816/" -- this is the number of the article. Now go to that directory: http://journal.sjdm.org/8816 and note the listing of files with extensions .html, .pdf, .tex, .gif.
Now do the same with the second article, which has URL: http://journal.sjdm.org/81125/jdm81125.pdf and whose directory is: http://journal.sjdm.org/81125 in which, as well as .html, .pdf, .tex and .gif files, there is a file: fig1.R
It is similar throughout: the directories for the articles are
http://journal.sjdm.org/*/
where * is a 4- or 5-digit number (the article number), and some of these directories have one or more ".R" files in them while others have none.
What I want is to list (with directory paths) all ".R" files on that site. In other words, in regexp language,
http://journal.sjdm.org/[0-9]+/*.R
However, it seems that the HTTP protocol won't accept "wild cards" (or so wget tells me), and the website won't accept FTP access (where I could use wild cards).
Er, yes. You can't do anything equivalent to a kind of 'httpsh' (a 'shell over HTTP') unless someone actually implements an HTTP-speaking server which accepts UNIX-shell-like URIs and returns UNIX-shell-like responses.
Please take a moment to consider the security implications of http allowing users to do things like directory listing, file renaming, indiscriminate PUTting, etc. ...
The closest you might get is to use wget in recursive (mirroring) mode to download a complete copy of the web site in question, then execute the shell commands of interest against your local copy:
$ mkdir sjdm
$ cd sjdm
$ wget -r -k -np -nH http://journal.sjdm.org/
should do it. Please see man wget for details.
Then you can use find in your new sjdm directory.
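For example, once the mirror has finished, from inside the new sjdm directory (the output file name is just an illustration):

$ find . -maxdepth 2 -name "*.R" -print > R-files.txt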
On Sun, 2009-06-07 at 12:39 +0100, Richard Lewis wrote:
Please take a moment to consider the security implications of http allowing users to do things like directory listing, file renaming, indiscriminate PUTting, etc. ...
There is a tool in BackTrack that does a sort of brute-force search for files over HTTP; I forget its name, but it found a scary amount of "hidden" stuff I had stuck on my hosting at various times for specific people to download.
Although the webmaster of the site in question won't thank you for running it.
Needless to say I am more careful about what I put up now :)
On Sun, 07 Jun 2009 12:05:29 +0100 (BST) (Ted Harding) Ted.Harding@manchester.ac.uk allegedly wrote:
What I want is to list (with directory paths) all ".R" files on that site. In other words, in regexp language,
http://journal.sjdm.org/[0-9]+/*.R
However, it seems that the HTTP protocol won't accept "wild cards" (or so wget tells me), and the website won't accept FTP access (where I could use wild cards).
Ted
Sorry - I hadn't spotted that you had no shell access. But maybe curl can help. You might be able to grab all the files from the directories you want and save them locally so that you can search them later.
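A rough sketch of that idea, assuming you have already gathered the article numbers into a file (say article-numbers.txt, one number per line, collected by hand or from the volume index pages) and that the server returns ordinary directory indexes with relative href links (both of these are assumptions, untested):

# article-numbers.txt is assumed to hold one article number per line
while read dir; do
  # Fetch each article's directory index and print any .R links,
  # prefixed with the full URL of that directory
  curl -s "http://journal.sjdm.org/$dir/" |
    grep -oE 'href="[^"]*\.R"' |
    sed -e 's/^href="//' -e 's/"$//' |
    sed "s|^|http://journal.sjdm.org/$dir/|"
done < article-numbers.txt > R-files.txt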
Mick