When browsing the Web, we often come across good-looking pictures that we would like to download and save, whether to use as desktop wallpaper or as design material. This article introduces how to implement the simplest web crawler in Python; friends who need it can refer to the details below.
Objective
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. I have recently taken a strong interest in Python crawlers, so I am sharing my learning path here; suggestions are welcome, and we can learn from each other and make progress together. Without further ado, let's look at the detailed introduction:
1. Development tools
The tool I use is Sublime Text 3; it is small and nimble (perhaps some of you won't like that phrase), which I find fascinating. I recommend it to everyone; of course, if your computer is well configured, PyCharm may be more suitable for you.
For setting up a Python development environment in Sublime Text 3, this article is recommended:
[Sublime build python development environment]
2. Crawler introduction
A crawler, as the name implies, is like an insect crawling over the big net that is the Internet; in this way, we can grab the content we want.
Since we crawl on the Internet, we need to know about the URL, formally the "Uniform Resource Locator" and nicknamed the "link". Its structure mainly consists of three parts:
(1) Protocol: for example, the HTTP protocol we commonly see in URLs.
(2) Domain name or IP address: a domain name such as www.baidu.com, or the IP address that the domain name resolves to.
(3) Path: the directory or file, etc. A short sketch that splits a URL into these parts follows below.
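To make these three parts concrete, here is a minimal sketch that splits an example URL with urllib.parse (one of the standard-library modules introduced in the next section); the URL itself is only an illustration:

from urllib import parse

# split an example URL into protocol, domain, and path
result = parse.urlparse("http://www.baidu.com/index.html")
print(result.scheme)   # protocol: 'http'
print(result.netloc)   # domain name: 'www.baidu.com'
print(result.path)     # path: '/index.html'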
3. Developing the simplest crawler with urllib
(1) Introduction to urllib
Module | Description
urllib.error | Exception classes raised by urllib.request.
urllib.parse | Parse URLs into components or assemble them from components.
urllib.request | Extensible library for opening URLs.
urllib.response | Response classes used by urllib.
urllib.robotparser | Load a robots.txt file and answer questions about fetchability of other URLs.
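As a quick illustration of the last module in the table, here is a minimal sketch, assuming the site serves a robots.txt file at the standard location, that asks urllib.robotparser whether a URL may be fetched:

from urllib import robotparser

# load the site's robots.txt and query whether crawling is allowed
rp = robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://www.baidu.com/"))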
(2) Developing the simplest crawler
The Baidu homepage is simple and clean, which makes it very suitable for our crawler.
The crawler code is as follows:
from urllib import request

def visit_baidu():
    url = "http://www.baidu.com"
    # open the URL
    req = request.urlopen(url)
    # read the page
    html = req.read()
    # decode the page to utf-8
    html = html.decode("utf-8")
    print(html)

if __name__ == '__main__':
    visit_baidu()
The output is the HTML source of the Baidu homepage. You can verify this by right-clicking on the Baidu homepage, choosing "Inspect element", and comparing it with what the crawler printed.
Of course, request can also construct a Request object, which can then be opened with the urlopen method.
The code is as follows:
from urllib import request

def visits_baidu():
    # create a Request object
    req = request.Request('http://www.baidu.com')
    # open the Request object
    response = request.urlopen(req)
    # read the response
    html = response.read()
    html = html.decode('utf-8')
    print(html)

if __name__ == '__main__':
    visits_baidu()
The result is the same as before.
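A practical benefit of building a Request object, rather than passing the URL straight to urlopen, is that it can carry custom headers; some sites reject urllib's default client, so a browser-like User-Agent can help. A minimal sketch (the User-Agent string is just an illustrative value):

from urllib import request

def visit_baidu_with_headers():
    # attach a browser-like User-Agent header (illustrative value)
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = request.Request('http://www.baidu.com', headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(html)

if __name__ == '__main__':
    visit_baidu_with_headers()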
(3) Error handling
Error handling is done through the urllib.error module, mainly with the URLError and HTTPError exceptions. HTTPError is a subclass of URLError, so an HTTPError can also be caught as a URLError.
An HTTPError can be identified by its code attribute.
The code to handle HTTPError is as follows:
from urllib import request
from urllib import error

def err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        # print the HTTP status code
        print(e.code)

if __name__ == '__main__':
    err()
Run result:
404
The error code is printed; you can look up more details about this status code yourself.
A URLError can be identified by its reason attribute.
The code to handle URLError is as follows:
from urllib import request
from urllib import error

def err():
    url = "https://segmentf.com/"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.URLError as e:
        # print the reason for the failure
        print(e.reason)

if __name__ == '__main__':
    err()
Run result: the reason for the failure is printed (for an unreachable domain, typically a name-resolution error).
To deal with errors thoroughly, it is best to catch both exceptions in the code; after all, the more detail, the clearer. Note that HTTPError is a subclass of URLError, so be sure to put the except clause for HTTPError before the one for URLError; otherwise the URLError branch will catch everything, and a 404, for example, would be output as Not Found rather than as its code.
The code is as follows:
from urllib import request
from urllib import error

# first method: catch both HTTPError and URLError
def err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        print(e.code)
    except error.URLError as e:
        print(e.reason)
You can change the URL to see the various forms of error output.
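For example, here is a minimal sketch that tries both of the URLs used above and reports whichever error occurs (which error the second URL raises depends on your network):

from urllib import request
from urllib import error

def check(url):
    try:
        response = request.urlopen(url)
        print(url, "->", response.getcode())
    except error.HTTPError as e:
        print(url, "-> HTTP error", e.code)
    except error.URLError as e:
        print(url, "-> URL error:", e.reason)

if __name__ == '__main__':
    # one URL should trigger HTTPError (404), the other URLError (bad domain)
    for url in ["https://segmentfault.com/zzz", "https://segmentf.com/"]:
        check(url)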
Summary