
Wednesday, February 6, 2013

Using wget to mirror a website



Occasionally you need to mirror a website (or a directory inside one). If you've only got HTTP access, there are tools like httrack which are pretty good (albeit pretty ugly) at doing this. However, as far as I can tell, you can't use httrack on a password-protected website.
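
For a quick sense of the httrack route on a public site, a basic invocation looks something like this (the URL and output directory are placeholders, not from the original post):

httrack http://example.com/ -O ./example-mirror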

curl can probably do this too, and it supports authentication, but how to make it mirror a whole site wasn't obvious.
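
For a single file, at least, curl's authentication is straightforward; it's the recursive part that's missing. A sketch with placeholder credentials and URL:

curl -u username:password -O http://example.com/protected/page.html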

So I ended up using wget, as it supports both mirroring and credentials. But the issue here is that wget plays nice and respects robots.txt, which can actually prevent you from mirroring a site you own. And nothing in the man page explains how to ignore robots.txt.
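
To see why that bites: a robots.txt as strict as the following stops any well-behaved crawler, wget included, from fetching anything at all:

User-agent: *
Disallow: /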

Eventually, I came up with this incantation, which works for me (access to a password-protected site, full mirror, ignoring robots.txt):
wget -e robots=off --wait 1 -x --user=xxx --password=xxx -m -k http://web_site_name_to_mirror/

where:

  • -e robots=off is the part that disables the robots.txt check; the -e flag executes the given string as if it were a .wgetrc directive (see the .wgetrc sketch after this list)

  • --wait 1 forces a one-second pause between requests (so the site doesn't get hammered)

  • --user and --password: self-evident

  • -x creates a local directory structure which "mirrors" (see what I did there?) the directory structure on the site you're mirroring

  • -m turns on mirror mode: "turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings" (from the man page)

  • -k converts links after download so that URLs in the mirrored files reference local files
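
If you mirror sites regularly, most of these options can live in ~/.wgetrc instead of on the command line. A minimal sketch, assuming these defaults suit every site you fetch (directive names are documented in the wgetrc section of the wget manual; the credentials are placeholders):

# ~/.wgetrc - applied to every wget invocation
robots = off
wait = 1
user = xxx
password = xxx

With that in place, wget -m -k -x http://web_site_name_to_mirror/ behaves like the full incantation above.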


Don't use it carelessly on someone else's website, as they might get angry...
