Wednesday, February 6, 2013

Using wget to mirror a website



Occasionally you need to mirror a website (or a directory inside one). If you've only got HTTP access, there are tools like httrack which are pretty good (albeit pretty ugly) at doing this. However, as far as I can tell, you can't use httrack on a password-protected website.

curl also supports authentication, and could probably be made to do this too, but it wasn't obvious how.
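For a one-off page, at least, curl's authentication side is simple; a minimal sketch, using the same placeholder credentials and hostname as the wget command below (index.html is just a hypothetical path):

# fetch a single page with HTTP basic auth; -O keeps the remote filename
curl -u xxx:xxx -O http://web_site_name_to_mirror/index.html

Recursing over a whole site, though, is another matter.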

So I ended up using wget, which supports both mirroring and credentials. The issue here is that wget plays nice and respects robots.txt, which can actually prevent you from mirroring a site you own. And nothing in the man page explains how to ignore robots.txt.

Eventually, I came up with this incantation, which works for me (authenticates against a password-protected site, makes a full mirror, ignores robots.txt):
wget -e robots=off --wait 1 -x --user=xxx --password=xxx -m -k http://web_site_name_to_mirror/

where:

  • -e robots=off obviously disables robots (a permanent .wgetrc equivalent appears after this list)

  • --wait 1 forces a one-second pause between requests (so the site doesn't get hammered)

  • --user and --password: self-evident

  • -x creates a local directory structure which "mirrors" (see what I did there?) the directory structure on the site you're mirroring

  • -m turns on mirror mode: "turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings" (from the man page)

  • -k converts links after download so that URLs in the mirrored files reference local files (a fuller variant that also fetches page assets is sketched after this list)

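Incidentally, -e robots=off works because -e executes a .wgetrc-style command before the run, so if you mirror sites regularly, the same setting can live in your ~/.wgetrc instead; a minimal sketch, using the setting name from wget's own wgetrc command set:

# ~/.wgetrc -- applied to every wget invocation, same effect as -e robots=off
robots = off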

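And if you want the local copy to actually look right in a browser, two more flags are worth knowing: -p (--page-requisites) pulls in the images and CSS each page needs to render, and -E (--adjust-extension; --html-extension on older wgets) renames files so their extensions match their content. Swapping --password for --ask-password also keeps the password out of your shell history. A sketch of the fuller incantation, same placeholders as above:

# fuller mirror: page assets (-p), matching extensions (-E), prompted password
wget -e robots=off --wait 1 -x --user=xxx --ask-password -m -k -p -E http://web_site_name_to_mirror/
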
Don't use it carelessly on someone else's website, as they might get angry...