Python web crawler: a first look at web crawlers.
My first contact with Python came about almost by accident. I often read serialized novels online, and many of them run to hundreds of installments, so I wondered whether a tool could download them automatically to my computer or phone, letting me read them where there is no network or the signal is poor. At the time I had never heard of the concept of a web crawler. C was the language I used most at work and in study, but for the online world C is not a good choice; it is oriented more toward hardware and the kernel. Chasing the idea of downloading online novels, I discovered Python. After using it I found it well suited to the Internet: countless third-party libraries are available and development is fast. Python also has many strengths in data analysis and natural language processing, of course. This section covers its network applications.
When it comes to networks, the thing closest to us is the Web page. The core technology behind Web pages is HTTP, along with plenty of front-end and back-end machinery such as JavaScript, XML, JSON, and TCP connections. HTTP itself is not covered here; we recommend reading HTTP: The Definitive Guide.
Web pages are written in HTML, and W3Schools has plenty of introductory material on the language. Web crawlers work mainly on HTML. Take the Baidu page below as an example: in Google Chrome press F12 (in IE, right-click and choose "View source") to see the page source. On the left is the Baidu page as we see it online, and on the right is its HTML source; the JavaScript sits inside the script tags. This page is largely dynamically loaded, with the displayed content driven by JavaScript, so it is not very intuitive. Let's look at something simpler next.
Right-click the Baidu page and choose "Inspect element"; the corresponding HTML code is displayed.
Looking at the code, we can see that Baidu's search term lives in an input element, which tells us this is an input box.
Someone may ask what this has to do with web crawlers and downloading novels. Don't worry; the text above is just an introduction to web pages. Next let's look at a novel page. Below is a novel from a fast-read novel site, with the novel text on the left and the corresponding page code on the right. Notice that the entire novel text is contained in the element whose tag is <div> and whose id is "content_1".
If we have a tool that automatically downloads the corresponding HTML elements, we can automatically download the novel. That is what a web crawler does. Put bluntly, a web crawler parses the HTML code, saves it, and then post-processes it. In short there are three steps: 1. parse the web page and get the data, 2. save the data, 3. post-process the data. A rough sketch of the whole pipeline follows; after that we start from the first step, parsing the web page to get data.
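Before diving into the details, here is a rough, non-authoritative sketch of those three steps, assuming a hypothetical chapter URL and assuming the chapter text sits in a <div> with id="content_1" as in the page above (BeautifulSoup, used here for parsing, is introduced later in this section):

import urllib2
from bs4 import BeautifulSoup                          # parser introduced later in this section

url = 'http://www.example.com/novel/chapter1.html'     # hypothetical chapter URL
html = urllib2.urlopen(url).read()                     # step 1: fetch the page and get the data
text = BeautifulSoup(html, 'lxml').find('div', id='content_1').get_text()
with open('chapter1.txt', 'w') as f:                   # step 2: save the data
    f.write(text.encode('utf-8'))
# step 3: post-process the saved text (strip ads, merge chapters, ...) as needed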
To access a web page, you must first request its URL, that is, the web address. Python 2 provides the urllib2 module for making the connection. The details are as follows:
import urllib2

req = urllib2.Request('http://www.baidu.com.cn')   # build the request object
fd = urllib2.urlopen(req)                          # open the URL and get a response
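The object returned by urlopen behaves much like a file. A quick example of what it exposes (the exact output depends on the site):

print fd.getcode()       # HTTP status code, e.g. 200
print fd.geturl()        # the final URL after any redirects
print fd.info()          # the response headers sent by the server
page = fd.read()         # the raw HTML of the page
print len(page)          # size of the downloaded page in bytes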
The first parameter of Request is the URL link; the request can also carry header information and data to pass to the URL. This sounds abstract, so let's capture a live request with Wireshark. Typing www.sina.com.cn into the Chrome browser produces the capture below, which is the request sent from the computer. A few pieces of information matter. Request Method: GET. There are two common methods, GET and POST: GET is mainly used to request data, while POST can be used to submit data (see the sketch after the header fields below).
User-Agent identifies the client. From it the server can tell which operating system and browser the visitor is using, and it is usually how a server recognizes a crawler; more on that shortly.
Referer indicates the page from which the request originated; here we can see the Sina address.
Accept-Encoding: this indicates which data compression methods the client can accept.
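To make the difference between GET and POST concrete, here is a small sketch (the login URL and form fields are placeholders, not a real interface): with urllib2, a request is a GET when no data is supplied and automatically becomes a POST as soon as data is attached.

import urllib
import urllib2

get_req = urllib2.Request('http://www.baidu.com.cn')               # no data: a GET request
print get_req.get_method()                                         # prints "GET"

form = urllib.urlencode({'user': 'test', 'pwd': '123456'})         # placeholder form fields
post_req = urllib2.Request('http://www.example.com/login', form)   # placeholder URL; data attached
print post_req.get_method()                                        # prints "POST"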
That was the capture of a request made from the browser. What happens if we run the program instead? Below is the capture for the Python code above, with Baidu as the target URL. The difference is obvious: most importantly, the User-Agent has become Python-urllib/2.7. This field plainly tells the server that the page link was initiated by a program, that is, by a crawler, rather than by a person sitting in front of a browser. Because every crawler request still opens real underlying connections such as TCP, large-scale simultaneous crawling can overload a site, so servers use the User-Agent to decide: if the request looks like a crawler, the server may simply reject it.
What can we do to keep the server from blocking our program? We construct a User-Agent in the program that is identical to a real browser's.
user_agent="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
headers={'User-Agent':user_agent}
req=urllib2.Request('http://www.baidu.com.cn','',headers)
fd=urllib2.urlopen(req)
With the headers added, Request now takes three parameters: the first is the URL, the second is the data to submit, and the third is the header information. Here the second parameter is None because nothing is being submitted, and the third parameter adds the header information as a dictionary. The packet capture now looks like the one below, the same form as a real browser's, so the server will not treat us as a crawler and we can fetch web page data with confidence.
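As a small variant of the same idea, the header can also be attached with add_header() on the request object instead of passing a dictionary; the server sees the same browser-like User-Agent either way:

req = urllib2.Request('http://www.baidu.com.cn')
req.add_header('User-Agent', user_agent)    # same effect as the headers dictionary above
print req.header_items()                    # headers attached to the request so far
fd = urllib2.urlopen(req)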
Some may ask: what if I accidentally type a wrong URL? That is what Python's exception handling is for. The code can be modified as follows:
import urllib2
from bs4 import BeautifulSoup

try:
    user_agent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
    headers = {'User-Agent': user_agent}
    req = urllib2.Request('http://www.baidu.com.cn', None, headers)
    fd = urllib2.urlopen(req)
    page = fd.read()                                # read the response body once
    print page.decode('utf-8').decode
    html = BeautifulSoup(page, "lxml")              # parse the HTML
    # print html.encode('gbk')
except urllib2.URLError, e:
    print e.reason
This adds a layer of protection. A URLError is raised when there is no network connection or the target server does not exist; in that case the exception usually carries a reason attribute. The common HTTP status codes are listed below; for details, see HTTP: The Definitive Guide.
200: the request succeeded. Handling: read and process the response content.
201: the request completed and a new resource was created; the URI of the new resource is available in the response. Handling: process the response.
202: the request was accepted but processing is not yet finished. Handling: wait for completion (blocking wait).
204: the server fulfilled the request but has nothing new to return; a user agent need not refresh its document view. Handling: discard.
300: not used directly by HTTP/1.0 applications; it serves as the default meaning for 3xx responses. Several representations of the requested resource are available. Handling: process further if the program can choose one, otherwise discard.
301: the requested resource has been assigned a permanent new URL, which should be used for future access. Handling: redirect to the assigned URL.
302: the requested resource temporarily resides at a different URL. Handling: redirect to the temporary URL.
304: the requested resource has not been modified. Handling: discard.
400: bad (illegal) request. Handling: discard.
401: unauthorized. Handling: discard.
403: forbidden. Handling: discard.
404: not found. Handling: discard.
5xx: the server encountered an error and cannot complete the request. Handling: discard.
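As a hedged sketch of how these codes surface in the program: urllib2 raises HTTPError (a subclass of URLError) when the server answers with an error status, and the numeric code is available on the exception. The path below is only a placeholder.

import urllib2

try:
    fd = urllib2.urlopen('http://www.baidu.com.cn/some-missing-page')   # placeholder path
    page = fd.read()
except urllib2.HTTPError, e:            # the server responded, but with an error status
    print 'HTTP error code:', e.code    # e.g. 404
except urllib2.URLError, e:             # no network, unknown host, and so on
    print 'Failed to reach the server:', e.reason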
The figure below shows the web page information we obtained: Request builds the request object, urlopen opens it, and fd.read() returns the page content that is printed.
The code also calls decode and encode. Why? This is mainly for Chinese characters on web pages; printing Chinese before Python 3 is a painful business.
print fd.read().decode('utf-8').encode('GB18030')
The data on a web page carries its own encoding. From the page code below we can see that this page is encoded as UTF-8, whereas the default Chinese encoding on Windows is GBK.
Therefore, if no encoding conversion is performed, the Chinese characters on the page come out garbled:
So can we find out the page's encoding in advance? The following code returns the character set declared by the page:
fd1 = urllib2.urlopen(req).info()      # the response headers
print fd1.getparam('charset')          # charset declared in Content-Type, e.g. utf-8
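A small follow-up sketch: once the charset is known, it can be used to decode the page instead of hard-coding utf-8 (falling back to utf-8 if the page does not declare one):

fd = urllib2.urlopen(req)
charset = fd.info().getparam('charset') or 'utf-8'   # use the declared charset, else assume utf-8
page = fd.read().decode(charset, 'ignore')           # unicode text
print page.encode('GB18030', 'ignore')               # re-encode for display on a GBK console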
At this point we have successfully made a web request and obtained the page content; the next step is to parse it. The sections that follow introduce common crawler tools such as BeautifulSoup, lxml, HTMLParser, Scrapy, and Selenium.