http download automation script?

Johannes Franken jfranken at jfranken.de
Sat Dec 10 19:04:59 UTC 2005


* Magnus Andersen <mag.andersen at gmail.com> [2005-12-06 20:37 +0100]:
> I need to download files from a website and I'd like to automate it.  I have
> to login to the website, navigate to the download section and download the
> files.  They do not have an ftp site and I have to do this over http.  The
> system I'll be doing this from is a RHEL 3 As system.

Definitely try "wget".

wget copies files over http, https or ftp from the corresponding
servers.
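
For a single file, a plain call is all it takes (the URL here is only a
placeholder for your download location):

    $ wget http://www.example.com/downloads/report.pdf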

Additionally it can
    * (for http(s):) follow the links contained in HTML files and
    * (for ftp:) grab any subdirectories.
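
Both cases come down to recursive mode (option -r, described below);
the hostnames and paths here are only placeholders:

    $ wget -r -np http://www.example.com/docs/
    $ wget -r ftp://ftp.example.com/pub/project/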

If the server conveys file dates, wget applies them to the files it
receives. This way it can avoid retransmitting files that are already
available locally.
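
You can verify this after a download: the local file carries the
server's modification time rather than the time of the transfer (URL
again a placeholder):

    $ wget http://www.example.com/downloads/report.pdf
    $ ls -l report.pdf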

The copy differs from the original in the following details:
    * Files that have been deleted from the server stay alive in the
      copy.
    * Files not pointed to by any link are missing (for http).
    * No permissions (owner, group, access rights) are transferred.


Usage:
    wget [options] URL

Options of interest are:

-N 
Do not download files that are already available locally and
match the server's file date (timestamping).
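For example (the URL is only a placeholder), running the same command
repeatedly, e.g. from cron, fetches the file again only when the
server's copy is newer:
wget -N http://www.example.com/downloads/report.pdf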

-nH --cut-dirs=2 
In recursive mode, wget normally creates a subdirectory for the hostname
and one for each directory mentioned in the URL. The option -nH suppresses
the creation of the host directory, and --cut-dirs=2 suppresses the first
two directory levels. For example:
wget -r -nH --cut-dirs=2 http://www.jfranken.de/homepages/johannes/vortraege 
will create the directory vortraege.
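For comparison, the same URL with fewer options would (as far as I
recall wget's directory handling) end up in:
wget -r URL                   creates www.jfranken.de/homepages/johannes/vortraege
wget -r -nH URL               creates homepages/johannes/vortraege
wget -r -nH --cut-dirs=2 URL  creates vortraege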

-k 
Turns absolute URLs into relative ones in the downloaded HTML files.
Caution: this does not work in all situations.
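Typical use is a copy you want to browse locally in a web browser,
for example (URL is a placeholder):
wget -r -np -k http://www.example.com/docs/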

-r -np 
(recursive, no-parent): If the given URL points to an HTML file, wget
will also fetch any elements it references (in particular links and
graphics) and repeat this procedure for them. The option -np avoids
ascending to the parent directory. wget ignores references to other
hosts unless you set the parameter -H.
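For example, started deep inside a site (URL taken from the example
above, with a trailing slash so that vortraege/ counts as the start
directory), -np keeps wget from climbing up into /homepages/johannes/
or even /:
wget -r -np http://www.jfranken.de/homepages/johannes/vortraege/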

-p -l 10 
The parameter -l 10 limits the recursion depth for -r to 10 levels. The
default depth is 5. If you set -l 0, it downloads at infinite depth,
which can cause filesystem problems with cyclic links. The parameter -p
additionally fetches everything needed to display a page (inline images,
stylesheets and the like).
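For example, to fetch at most two levels below the start page, plus the
files needed to display them (URL is a placeholder):
wget -r -l 2 -p http://www.example.com/docs/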

-H -Djfranken.de,our-isp.org 
Also follow links to other hosts, provided they belong to the domains
jfranken.de or our-isp.org.
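Used together with -r, e.g. (hosts and domains as in the description
above):
wget -r -H -Djfranken.de,our-isp.org http://www.jfranken.de/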

-nv 
Non-verbose: turns off the detailed progress output while still
reporting errors and basic information. Useful when wget runs from a
script or cron job.
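
Putting the pieces together, a cron-friendly line could look roughly
like this (the log file name is just an example):

    $ wget -nv -N -r -np -nH --cut-dirs=2 \
        http://www.jfranken.de/homepages/johannes/vortraege >>mirror.log 2>&1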


wget will direct its ftp or http requests automatically to your
proxy server if the environment variables http_proxy or ftp_proxy are
set, e.g. by

    $ export http_proxy=http://jfranken:secret@proxy.jfranken.de:3128/
    $ export ftp_proxy=$http_proxy
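
If you prefer a configuration file over environment variables, roughly
the same can go into ~/.wgetrc (settings quoted from memory; see the
wget(1) manpage for the authoritative names):

    use_proxy = on
    http_proxy = http://jfranken:secret@proxy.jfranken.de:3128/
    ftp_proxy = http://jfranken:secret@proxy.jfranken.de:3128/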


Links:

    * wget project page: http://www.gnu.org/software/wget/wget.html
    * wget(1) manpage


Good luck!

-- 
Johannes Franken
 
Professional unix/network development
mailto:jfranken at jfranken.de
http://www.jfranken.de/



