Occasionally you need to mirror a website (or a directory inside one). If you've only got HTTP access, there are tools like httrack which are pretty good (albeit pretty ugly) at doing this. However, as far as I can tell, you can't use httrack on a password-protected website.
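For the simple, unauthenticated case, though, httrack is pretty much a one-liner; something like this should do it (the URL and output directory here are just placeholders):

httrack "http://example.com/" -O ./example-mirror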
curl supports authentication and could probably do this too, but it wasn't obvious to me how.
So I ended up using wget, as it supports both mirroring and credentials. But the issue here is that wget plays nice and respects robots.txt, which can actually prevent you from mirroring a site you own. And nothing in the man page explains how to ignore robots.txt.
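To be fair, fetching a single password-protected page with curl is easy enough; it's the recursive mirroring part that isn't. For reference (credentials and URL are placeholders):

curl --user xxx:xxx -O http://web_site_name_to_mirror/index.html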
Eventually, I came up with this incantation, which works for me (access to password-protected site, full mirror, ignoring robots.txt):
wget -e robots=off --wait 1 -x --user=xxx --password=xxx -m -k http://web_site_name_to_mirror/
where:
- -e robots=off tells wget to ignore robots.txt (-e executes a .wgetrc-style command from the command line, which is why you won't find a dedicated option for it)
- --wait 1 forces a one-second pause between successive requests (so the site doesn't get hammered)
- --user and --password: self-evident (there's a variant after this list if you'd rather not put the password on the command line)
- -x creates a local directory structure which "mirrors" (see what I did there?) the directory structure on the site you're mirroring
- -m turns on mirror mode: "turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings" (from the man page)
- -k converts links after download so that URLs in the mirrored files reference local files
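And if you only want a directory inside the site (as mentioned at the top), --no-parent stops wget climbing above it, and --ask-password prompts for the password instead of leaving it in your shell history. A variant along those lines, which I'd expect to work but haven't tested in exactly this form (the path is a placeholder):

wget -e robots=off --wait 1 -x --user=xxx --ask-password --no-parent -m -k http://web_site_name_to_mirror/some_directory/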
Don't use it carelessly on someone else's website, as they might get angry...