Web page capture in Linux (curl + wget)

Source: Internet
Author: User
Tags: website, server, ftp, protocol

On Linux, web pages are captured with the curl or wget commands.

Both the curl and wget commands are available on Linux and Windows.

Protocols supported by curl and wget

Curl supports HTTP, HTTPS, FTP, FTPS, SCP, Telnet, and other network protocols. For details, see man curl.

Wget supports the HTTP, HTTPS, and FTP protocols. For details, see man wget.
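As a quick illustration (a sketch added here; example.com is a placeholder host, not from the original article), switching protocols only changes the URL scheme:

curl https://example.com/index.html -o index.html

curl ftp://ftp.example.com/pub/file.txt -o file.txt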

Download and install curl and wget

1. Ubuntu Platform

Install wget: sudo apt-get install wget (a normal user is prompted for a password; root is not).

Install curl: sudo apt-get install curl (same procedure as for wget).
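After installation, a quick way to confirm that both tools are on the PATH (a check added here, not part of the original article):

curl --version

wget --version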

2. Windows Platform

Wget: the "wget for Windows" package

Curl: the curl download page

Bundle: a combined wget and curl toolkit for Windows

On Windows, curl is a standalone curl.exe that can simply be copied into the system command directory C:\Windows\System32.

On Windows, wget is distributed as an installer (wget-1.11.4-1-setup.exe) and must be installed. After installation, add the installation directory to Environment Variables > System variables > Path.
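To confirm that both tools are reachable from a new command prompt (a quick check added here, not from the original article):

where curl

where wget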

Curl and wget capture examples

Webpage capture mainly involves two methods: fetching the URL directly and fetching through a proxy. The following uses the Baidu homepage as an example to introduce both.

1. URL crawling

(1) curl downloads the content of the Baidu homepage and saves it in the file baidu_html:

curl http://www.baidu.com/ -o baidu_html

(2) wget downloads the content of the Baidu homepage and saves it in the file baidu_html2:

wget http://www.baidu.com/ -O baidu_html2
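Both tools return a non-zero exit status on failure, so a script can test whether the download actually succeeded. A minimal sketch (added here for illustration):

if curl http://www.baidu.com/ -o baidu_html; then
    echo "download succeeded"
else
    echo "download failed"
fi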

Sometimes a download fails because of network speed, packet loss, server downtime, or other problems.

In that case, you may need to retry the connection several times before the server responds. If there is still no response after multiple attempts, you can conclude that the server has a problem.

(1) curl retries the connection multiple times:

curl --retry 10 --retry-delay 60 --retry-max-time 60 http://www.baidu.com/ -o baidu_html

Note: --retry sets the number of retries; --retry-delay sets the interval (in seconds) between two retries; --retry-max-time caps the total time during which retries may be attempted (generally set to the same value as --retry-delay).
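The same retry behavior can be approximated with a plain shell loop, which makes the meaning of the options concrete (a simplified sketch added here; the counts and delays are illustrative only):

# up to 10 attempts, 60 s apart, 60 s limit per attempt
for i in $(seq 1 10); do
    curl -m 60 http://www.baidu.com/ -o baidu_html && break
    sleep 60
done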

(2) wget retries the connection multiple times:

wget -t 10 -w 60 -T 30 http://www.baidu.com/ -O baidu_html2

Note: -t (--tries) sets the number of retries; -w (--wait) sets the interval between two retries (in seconds); -T (--timeout) sets the connection timeout. If a connection times out, that attempt is counted as failed and wget proceeds to the next attempt.

Appendix: curl can also judge indirectly whether the server is responding by checking how many bytes are downloaded within a given period of time. The command format is as follows:

curl -y 60 -Y 1 -m 60 http://www.baidu.com/ -o baidu_html

Note: -y (--speed-time) sets the measurement period (in seconds); -Y (--speed-limit) sets the minimum number of bytes per second that must be transferred during that period, otherwise the transfer is aborted; -m (--max-time) sets the maximum time allowed for the whole request, after which the connection is automatically dropped and the download abandoned.

2. Proxy crawling

Proxy downloading fetches the target URL indirectly through an intermediate server, instead of connecting to the website's server directly.

Two well-known free proxy websites:

Freeproxylists.net (free proxies in dozens of countries worldwide, updated daily)

Xroxy.com (proxies can be filtered by port type, proxy type, and country)

The following uses a free proxy server in China, selected on freeproxylists.net, as an example of crawling a webpage through a proxy:

218.107.21.252:8080 (IP 218.107.21.252, port 8080, separated by a colon (:) to form a socket address)
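Before relying on a free proxy, it is worth checking that the socket actually responds. A sketch added here (not in the original article; -w "%{http_code}" makes curl print the HTTP status code returned through the proxy):

curl -x 218.107.21.252:8080 -m 10 -s -o /dev/null -w "%{http_code}\n" http://www.baidu.com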

(1) curl crawls the Baidu homepage through the proxy:

curl -x 218.107.21.252:8080 -o baidu_html http://www.baidu.com (common proxy ports are 80, 8080, 3128, and 8888; if no port is given, curl assumes 1080)

Note: -x specifies the proxy server (IP:port). That is, curl first connects to the proxy server 218.107.21.252:8080, the proxy downloads the Baidu homepage on curl's behalf, and the proxy then forwards the downloaded page to the local machine (curl never connects to the Baidu server directly; the whole transfer is completed through the intermediary proxy).
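curl also honors the http_proxy environment variable, so the same transfer can be written without -x (standard curl behavior, though the original article does not show it):

export http_proxy=http://218.107.21.252:8080/

curl -o baidu_html http://www.baidu.com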

(2) wget crawls the Baidu homepage through the proxy:

Wget handles proxy downloads differently from curl: you must first set http_proxy to the proxy server's IP:port.

Taking Ubuntu as an example, create a wget configuration file (.wgetrc) in the current user's home directory (cd ~) and enter the proxy configuration:

http_proxy = 218.107.21.252:8080

Then enter the wget command to capture the webpage:

wget http://www.baidu.com -O baidu_html2
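For a one-off download, wget can also take the proxy settings on the command line through its -e option instead of .wgetrc (standard wget behavior, not shown in the original article):

wget -e use_proxy=yes -e http_proxy=218.107.21.252:8080 http://www.baidu.com -O baidu_html2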


[Screenshots in the original post showed the proxy download in progress and the captured Baidu homepage data.]

The other command parameters are used exactly as in the URL method above, so they are not repeated here.

For more curl and wget usage, such as the FTP protocol and recursive directory downloads, see the man pages (man curl, man wget).

Further notes:

In China, for various reasons, some foreign websites are generally difficult to reach directly and can only be accessed through a VPN or a proxy server.

If your campus network or CERNET supports IPv6, you can use sixxs.org as a gateway to reach sites such as Facebook, Twitter, and similar services.

In fact, besides a VPN and the IPv6 + sixxs.org route, ordinary users have other ways to reach foreign websites.

Chief among them are the two free proxy websites introduced above: freeproxylists.net and xroxy.com.

Curl project example

Using curl with free proxies from freeproxylists.net, I implemented webpage capture and trend-chart queries for the Google Play game rankings in 12 countries (the crawling modules are written entirely in shell; the core code is about 1,000 lines).
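The core fetch loop of such a crawler can be sketched in a few lines of shell (a simplified illustration, not the author's actual code; proxies.txt and the target URL are hypothetical placeholders):

# try each proxy in proxies.txt until one fetches the page successfully
while read proxy; do
    curl -x "$proxy" -m 30 -s -o ranking.html "http://play.google.com/store/apps" && break
done < proxies.txt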

For the game ranking trend charts, see my previous blog post: jfreechart project instance.
