Crawl web pages and images using the wget tool


A strange requirement.

The company wants the server's web pages cached on the router, so that users access the pages straight from the router's cache. I don't really see the point of this requirement, but let's try to implement it anyway.

wget overview

wget is a non-interactive downloader for Linux and UNIX systems, and after getting acquainted with it I found it can do far more than that. This post only covers how to crawl a specified URL and the related content below it (HTML, JS, CSS, images) and how to convert the absolute paths in that content into relative paths. I searched online and found a pile of articles about crawling web pages and their image resources with wget, but none of them actually worked for me; every attempt ended in failure.

This is the content of wget -h > ./help_wget.txt (trimmed here to the sections and options that matter for this post):

GNU Wget 1.16, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit.
  -h,  --help                      print this help.
  -b,  --background                go to background after startup.
  -e,  --execute=COMMAND           execute a `.wgetrc'-style COMMAND.

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE.
  -q,  --quiet                     quiet (no output).
  -v,  --verbose                   be verbose (this is the default).

Download:
  -t,  --tries=NUMBER              set number of retries to NUMBER (0 unlimits).
  -N,  --timestamping              don't re-retrieve files unless newer than local.
  -w,  --wait=SECONDS              wait SECONDS between retrievals.

Directories:
  -nd, --no-directories            don't create directories.
  -x,  --force-directories         force creation of directories.
  -nH, --no-host-directories       don't create host directories.
  -P,  --directory-prefix=PREFIX   save files to PREFIX/...

Recursive download:
  -r,  --recursive                 specify recursive download.
  -l,  --level=NUMBER              maximum recursion depth (inf or 0 for infinite).
  -k,  --convert-links             make links in downloaded HTML or CSS point to local files.
  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing.
  -p,  --page-requisites           get all images, etc. needed to display HTML page.

Recursive accept/reject:
  -np, --no-parent                 don't ascend to the parent directory.

Mail bug reports and suggestions to <[email protected]>.

First try with wget

Based on wget's help documentation, I tried the following command:


wget -r -np -p -k -nH -P ./download http://www.baidu.com



Let me explain these parameters:


-r    recursively download all content

-np   download only the content under the given URL, not its parent

-p    download all resources the page needs, including images and CSS styles

-k    convert absolute paths into relative paths (this is important so that the related resources load locally when the user opens the page)

-nH   do not create a directory named after the host of the given URL (without this option, the command would put the downloaded content under ./download/www.baidu.com/); see the path sketch after this list

-P    the directory to download into, here the download folder under the current directory; wget creates it automatically if it does not exist
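
To make the effect of -nH and -P concrete, here is a rough sketch of where the page would land with and without -nH (illustrative paths, assuming the command above):

# with -P ./download only:         ./download/www.baidu.com/index.html
# with -P ./download and -nH:      ./download/index.html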

These options seem to match the requirement exactly, but the result is quite unexpected: things are not as simple as we thought, and wget does not give us what we want.

If you run this command, you will find that only an index.html and a robots.txt end up in the download folder, and the images that index.html needs are not downloaded at all.

The paths in the page's tags are not converted to relative paths either; they are still plain "http:" URLs.

As to why this happens, read on.


The wget solution that works

Since the command above does not work, it is time to get creative. Let's write a shell script named WGET_CC with the following content:


#!/bin/sh
URL="$2"
PATH="$1"
echo "Download URL: $URL"
echo "Download dir: $PATH"
# wget must be called by its full path here, because the PATH variable
# above has been reassigned to the download directory.
/usr/bin/wget -e robots=off -w 1 -xq -np -nH -pk -m -t 1 -P "$PATH" "$URL"
echo "Success to download"



Note that my wget lives in /usr/bin (it must be written as a full path here, because the script reassigns the PATH variable); you can run which wget to find where your wget is, and substitute that path in the script.
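
For example, on my machine it looks like this (the path is simply whatever which prints on your system):

$ which wget
/usr/bin/wget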


Here are a few more parameters to explain:

-e   usage: -e command

It executes an additional .wgetrc command. Just as vim keeps its configuration in the .vimrc file, wget keeps its configuration in the .wgetrc file, and the configuration commands in .wgetrc are executed before wget starts working. Typical .wgetrc files can be found at:

Http://www.gnu.org/software/wget/manual/html_node/Sample-Wgetrc.html

Http://www.gnu.org/software/wget/manual/html_node/Wgetrc-Commands.html
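
As a rough idea, a minimal .wgetrc might contain lines like these ("name = value" pairs; these particular settings are just a sketch mirroring the options used later in this post):

# sample .wgetrc (sketch)
robots = off
wait = 1
tries = 1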

With the -e option, users can specify additional configuration commands without having to touch the .wgetrc file. To pass several configuration commands, repeat the option: -e command1 -e command2 ... -e commandN. These commands are executed after everything in .wgetrc and therefore override the same configuration items in .wgetrc.
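
For example, the following (illustrative) invocation passes two configuration commands on the command line, overriding whatever .wgetrc says for those items:

wget -e robots=off -e "wait = 1" http://www.baidu.com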

robots=off is needed because, by default, wget honors the site's robots.txt; if robots.txt contains User-agent: * Disallow: /, wget cannot mirror or download anything under that directory.

That is why the images and other resources could not be downloaded in the first attempt: the host we are crawling does not allow spiders to crawl it, and the -e robots=off option tells wget to bypass this restriction.
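
For illustration, a robots.txt that locks out all crawlers looks like this (whether a given site actually serves such a file is a site-specific detail):

User-agent: *
Disallow: /

With -e robots=off, wget ignores these rules, so use it responsibly and only on sites you are allowed to mirror.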

-x    force creation of the mirrored site's directory structure

-q    quiet download, i.e. do not display download information; remove this option if you want to see what wget is currently downloading

-m    turn on mirroring-related options, such as recursion to infinite depth into subdirectories

-t N  number of times to try downloading a resource before giving up

-w N  wait N seconds between requests (eases the load on the server)

If anything is still unclear, dig into the documentation.
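
For readability, here is the wget call from the script written out with the equivalent long options:

/usr/bin/wget --execute robots=off --wait=1 --force-directories --quiet \
    --no-parent --no-host-directories --page-requisites --convert-links \
    --mirror --tries=1 --directory-prefix="$PATH" "$URL"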

Once the script is written, save and exit, then run:


chmod 744 WGET_CC



Now the script can be executed directly, instead of having to prefix every invocation with sh (or /bin/sh) to interpret it.


Now let's run the script!


./WGET_CC ./download http://www.baidu.com



This is the directory structure after the download completes:
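
The original post showed a screenshot here; roughly, the layout looks like the sketch below (the exact files depend on what the page references at crawl time):

download/
├── index.html
├── img/
│   └── bd_logo1.png
└── ... (other CSS/JS/image assets the page references)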



OK, now look at the src attribute in the <img> tag:

src="img/bd_logo1.png"

Sure enough, it has been replaced with a relative path. Done! If this post helped you, please give it a like!


My email is [email protected]; feel free to get in touch.

This is my original work; please credit the source when reposting.
