On Linux, web pages are typically crawled with the curl or wget commands.
Both curl and wget are available on Linux and Windows.
Protocols supported by curl and wget
Curl supports HTTP, HTTPS, FTP, FTPS, SCP, Telnet, and other network protocols. For more information, see man curl.
Wget supports the HTTP, HTTPS, and FTP protocols. For more information, see man wget.
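For example, both tools can fetch files over FTP as well as HTTP; a minimal sketch (ftp.example.com, the paths, and the credentials are placeholders, not real servers):
curl ftp://ftp.example.com/pub/readme.txt -o readme.txt                    # anonymous FTP download with curl
curl -u user:password ftp://ftp.example.com/pub/readme.txt -o readme.txt   # FTP download with credentials
wget ftp://ftp.example.com/pub/readme.txt                                  # FTP download with wget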
Download and install curl and wget
1. Ubuntu Platform
Install wget: sudo apt-get install wget (normal users are prompted for a password; root is not).
Install curl: sudo apt-get install curl (same as wget).
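After installation you can verify that both tools are present (a quick check; the exact output depends on the installed versions):
curl --version   # prints the curl version and the protocols it supports
wget --version   # prints the wget version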
2. Windows Platform
Wget: wget for Windows
Curl: curl download
Combined package: wget and curl toolkit for Windows
On Windows, curl is distributed as curl.exe and can simply be copied to the system command directory C:\Windows\System32.
On Windows, wget is distributed as an installer (wget-1.11.4-1-setup.exe) and needs to be installed. After installation, add the installation directory to the Path system environment variable (Environment Variables > System variables > Path).
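To confirm that Windows can find both tools after copying curl.exe and updating Path, you can run the following in a command prompt (output depends on the installed versions):
where curl
where wget
curl --version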
Curl and wget crawling examples
Web pages can be crawled in two main ways: directly by URL or through a proxy. The following uses the Baidu homepage as an example.
1. URL crawling
(1) curl downloads the Baidu homepage and saves it to the file baidu_html:
curl http://www.baidu.com/ -o baidu_html
(2) wget downloads the Baidu homepage and saves it to the file baidu_html2:
wget http://www.baidu.com/ -O baidu_html2
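If you only need to check whether the page is reachable, without saving its content, a header-only request is enough (a small sketch; the headers returned depend on the server):
curl -I http://www.baidu.com/          # fetch only the HTTP response headers
wget --spider http://www.baidu.com/    # check the page without downloading the body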
Sometimes a page cannot be downloaded successfully because of network speed, packet loss, server downtime, or other problems.
In that case you may need to retry the connection several times to get a response from the server. If the server still does not respond after multiple attempts, you can conclude that it has a problem.
(1) curl retries the connection multiple times:
curl --retry 10 --retry-delay 60 --retry-max-time 60 http://www.baidu.com/ -o baidu_html
Note: --retry sets the number of retries; --retry-delay sets the interval (in seconds) between two retries; --retry-max-time limits the total time during which retries are attempted (generally set to the same value as --retry-delay).
(2) wget retries the connection multiple times:
wget -t 10 -w 60 -T 30 http://www.baidu.com/ -O baidu_html2
Note: -t (--tries) sets the number of retries; -w (--wait) sets the interval (in seconds) between two retries; -T (--timeout) sets the connection timeout; if a connection times out, that attempt fails and the next attempt is made.
Appendix: curl can also determine indirectly whether the server is responding by checking how many bytes are downloaded within a given period. The command format is as follows:
curl -y 60 -Y 1 -m 60 http://www.baidu.com/ -o baidu_html
Note: -y sets the period (in seconds) over which the transfer speed is measured; -Y sets the minimum number of bytes that must be transferred in that period; -m sets the maximum time allowed for the whole request, after which the connection is automatically dropped and the attempt is abandoned.
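A minimal sketch of how this can be scripted: curl exits with status 0 when the transfer succeeds and with a non-zero status (for example 28 on a timeout) otherwise, so the exit code can stand in for "server responded":
curl -y 60 -Y 1 -m 60 http://www.baidu.com/ -o baidu_html
if [ $? -eq 0 ]; then
    echo "server responded"
else
    echo "server did not respond in time"
fi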
2. Proxy crawling
Proxy downloading fetches the target URL indirectly through an intermediate server, rather than connecting to the website's server directly.
Two well-known free proxy websites:
freeproxylists.net (free proxies in dozens of countries around the world, updated daily)
xroxy.com (can be filtered by port, proxy type, and country)
The following uses a free proxy server in China, selected from freeproxylists.net, as an example of crawling a webpage through a proxy:
218.107.21.252:8080 (IP 218.107.21.252 and port 8080, joined by a colon to form a socket address)
(1) curl crawls the Baidu homepage through the proxy:
curl -x 218.107.21.252:8080 -o aaaaa http://www.baidu.com (common proxy ports are 80, 8080, 3128, 8888, and so on; the default is 80)
Note: -x specifies the proxy server (IP:port). curl first connects to the proxy server 218.107.21.252:8080, the proxy downloads the Baidu homepage, and then 218.107.21.252:8080 sends the downloaded page back to the local machine (curl never connects to the Baidu server directly; the download is completed through the intermediary proxy).
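Before crawling through a proxy, it can be worth checking that the proxy itself still works; a header-only request through it is a cheap test (a sketch; the example proxy 218.107.21.252:8080 may no longer be online):
curl -x 218.107.21.252:8080 -I http://www.baidu.com/   # succeeds only if the proxy forwards the request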
(2) wget crawls the Baidu homepage through the proxy:
Downloading through a proxy with wget works differently from curl: you must first set http_proxy to the proxy server's IP:port.
Take Ubuntu as an example. In the current user's home directory (cd ~), create the wget configuration file .wgetrc and add the proxy setting:
http_proxy = 218.107.21.252:8080
Then run the wget command to crawl the page:
wget http://www.baidu.com -O baidu_html2
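Instead of editing .wgetrc, the proxy can also be passed through the environment or on the command line; wget reads the http_proxy environment variable, and -e supplies a .wgetrc-style setting for a single run:
export http_proxy=http://218.107.21.252:8080/
wget http://www.baidu.com -O baidu_html2
wget -e "http_proxy=218.107.21.252:8080" http://www.baidu.com -O baidu_html2   # one-off alternative without exporting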
[Screenshots omitted: the proxy download progress and the crawled Baidu homepage data]
Other command parameters are used in the same way as in the URL method and are not described here.
For more curl and wget usage, such as the FTP protocol and recursive download of subdirectories, see the man pages.
Extended knowledge:
In China, for various reasons, some foreign websites are generally difficult to access directly and can only be reached through a VPN or a proxy server.
If your campus network or CERNET has IPv6, you can use sixxs.org to access websites such as Facebook, Twitter, and livi.
In fact, besides a VPN or an IPv6 + sixxs.org proxy, ordinary users have other ways to access foreign websites.
The following are two well-known free proxy websites:
freeproxylists.net (free proxies in dozens of countries around the world, updated daily)
xroxy.com (can be filtered by port, proxy type, and country)
Curl project example
Using curl with free proxies from freeproxylists.net, I implemented webpage crawling and trend-chart queries for Google Play game rankings in 12 countries (the crawling module is written entirely in shell; the core code is about 1,000 lines).
For the game ranking trend chart, see my earlier blog post: JFreeChart project example.
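The project code itself is not reproduced here, but a minimal sketch of the basic idea, trying each proxy from a hand-collected list until one succeeds, might look like this (proxies.txt, the output file, and the Google Play URL are placeholders for illustration):
#!/bin/bash
# proxies.txt: one IP:port per line, e.g. collected from freeproxylists.net
while read proxy; do
    # try the current proxy with a 30-second limit; stop at the first success
    curl -x "$proxy" -m 30 -o rankings.html "https://play.google.com/store/apps" && break
done < proxies.txt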