When you talk about web crawling in Python, you have to mention the powerful urllib2 component. urllib2 is the Python (2) component for fetching URLs (Uniform Resource Locators), and it provides a very simple interface in the form of the urlopen function. The following code gives a quick feel for what urllib2 can do:
```python
import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html
```
Run it, then view the source of http://www.baidu.com/: you will find it is exactly the same as what the script printed. Besides http:, the URL scheme here can also be ftp: or file:.
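The file: scheme can be tried without any network access. The article targets Python 2's urllib2; the following sketch uses Python 3, where the same functionality lives in urllib.request, to fetch a local file through a file: URL:

```python
import pathlib
import tempfile
from urllib.request import urlopen  # Python 3 home of urllib2's urlopen

# write a small local file so we have something to fetch
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('hello from a file: URL')
    path = pathlib.Path(f.name)

# urlopen accepts file: URLs as well as http: and ftp:
response = urlopen(path.as_uri())
content = response.read().decode('utf-8')
print(content)

path.unlink()  # clean up the temporary file
```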
urllib2 uses a Request object to represent the HTTP request you want to make. You create a Request object and pass it to urlopen, which returns a response object for the requested URL. The response behaves like a file object, so you can call .read() on it. The code, modified accordingly, is as follows:
```python
import urllib2

req = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(req)
page = response.read()
print page
```
The result turns out to be the same as before the change. Before making an HTTP request you often also need to do two things: 1. send form data; 2. set header information.
1. Sending form data. This is common when simulating a login: during the login operation you generally need to send data to the server, mostly with the POST method. As with an ordinary HTML form, the data needs to be encoded into a standard format and then passed to the Request object as its data argument. The encoding work is done with a function from urllib, not urllib2. The test code is as follows:
```python
import urllib
import urllib2

url = 'http://www.server.com/register.php'
postdata = {'useid': 'user',
            'pwd': '***',
            'language': 'Python'}
data = urllib.urlencode(postdata)  # do the encoding work
req = urllib2.Request(url, data)   # send the request together with the data
response = urllib2.urlopen(req)    # accept the server's feedback
page = response.read()             # read the content of the feedback
```
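For reference, here is a hedged sketch of the same POST construction under Python 3, where urllib2 became urllib.request and the encoded string must additionally be converted to bytes before it can be sent (the URL and field names are the article's placeholder values, and no actual request is made):

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://www.server.com/register.php'  # placeholder URL from the article
postdata = {'useid': 'user', 'pwd': '***', 'language': 'Python'}

# In Python 3 the encoded string must be turned into bytes before sending.
data = urlencode(postdata).encode('utf-8')
req = Request(url, data)

# A Request carrying a data argument is sent with the POST method.
print(req.get_method())  # → POST
```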
urllib2 can also transfer the data with the GET method. The code is as follows:
```python
import urllib
import urllib2

data = {}
data['useid'] = 'user'
data['pwd'] = '***'
data['language'] = 'Python'
url_values = urllib.urlencode(data)
print url_values  # e.g. useid=user&pwd=%2A%2A%2A&language=Python (order may vary)

url = 'http://www.example.com/example.php'
full_url = url + '?' + url_values
response = urllib2.urlopen(full_url)  # note: urlopen, not urllib2.open
```
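The GET pattern above (encode the data, append it after a '?') can be checked offline. In Python 3, urlencode moved to urllib.parse, and parse_qs lets us verify the round trip; the host and field names below are the article's placeholders:

```python
from urllib.parse import urlencode, urlparse, parse_qs

data = {'useid': 'user', 'language': 'Python'}
url_values = urlencode(data)  # 'useid=user&language=Python'
full_url = 'http://www.example.com/example.php' + '?' + url_values

# parse the query string back out to check the round trip
query = parse_qs(urlparse(full_url).query)
print(query)  # {'useid': ['user'], 'language': ['Python']}
```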
2. Setting header information. Some sites restrict access depending on where the request comes from, so here we fake the User-Agent header. The code is as follows:
```python
import urllib
import urllib2

url = 'http://www.server.com/register.php'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
values = {'useid': 'user',
          'pwd': '***',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
page = response.read()
```
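That a custom User-Agent actually ends up on the request can be checked without contacting any server. A hedged Python 3 sketch (Request lives in urllib.request there, and headers can be passed as a keyword argument; the URL is the article's placeholder):

```python
from urllib.request import Request

url = 'http://www.server.com/register.php'  # placeholder URL from the article
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
headers = {'User-Agent': user_agent}

req = Request(url, headers=headers)

# Request stores header names capitalized, so we query 'User-agent'.
print(req.get_header('User-agent'))  # prints the Firefox UA string above
```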
That is all for the introduction to urllib2!
Exception handling
A URLError typically occurs when there is no network connection or the server address cannot be reached. In that case the exception has a reason attribute containing an error number and an error message. The following code tests the effect:
```python
import urllib2

# a request to a server that cannot be reached
req = urllib2.Request('http://www.server.com/register.php')
try:
    urllib2.urlopen(req)
except urllib2.URLError, e:
    print e.reason
```

An errno of 10061 indicates that the server side is actively refusing the connection.
Besides URLError there is HTTPError. Once a normal connection has been established between client and server, urllib2 starts processing the related data. If it runs into a situation it cannot handle, it raises a corresponding HTTPError, such as the common error codes "404" (page not found), "403" (request forbidden) and "401" (authentication required). The HTTP status code indicates the kind of response in the HTTP protocol; the common codes are described in the HTTP status code reference.
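The standard reason phrases for these codes can be looked up programmatically. In Python 3's standard library the mapping from code to reason phrase is exposed as http.client.responses (shown here as a quick reference, not part of the urllib2 API itself):

```python
from http.client import responses

# standard reason phrases for the codes mentioned above
for code in (401, 403, 404):
    print(code, responses[code])  # e.g. 404 Not Found
```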
An HTTPError has a code attribute, which is the error number sent by the server. When an HTTPError is raised, the server returns the related error number and an error page. The following code verifies this:
```python
import urllib2

req = urllib2.Request('http://www.python.org/callmewhy')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
```
The output code 404 indicates that the page could not be found.
Catching the exception and handling it is implemented as follows:
```python
# -*- coding: utf-8 -*-
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request('http://www.python.org/callmewhy')
try:
    response = urlopen(req)
except URLError, e:
    # HTTPError is a subclass of URLError, so both are caught here
    if hasattr(e, 'code'):
        print 'The server could not fulfil the request!'
        print 'Error code:', e.code
    elif hasattr(e, 'reason'):
        print 'Failed to reach the server'
        print 'Reason:', e.reason
else:
    print 'No exception occurred'
```
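The hasattr-based dispatch above can be exercised without any network access. A hedged Python 3 sketch (urllib2's exceptions live in urllib.error there) that builds the two exception types by hand and runs them through the same checks; the describe helper is introduced here for illustration only:

```python
import io
from urllib.error import URLError, HTTPError

def describe(e):
    """Mirror the hasattr checks used in the handler above."""
    if hasattr(e, 'code'):
        return 'Error code: %s' % e.code
    elif hasattr(e, 'reason'):
        return 'Reason: %s' % e.reason
    return 'no exception info'

# construct the exceptions by hand instead of hitting the network;
# HTTPError takes (url, code, msg, hdrs, fp)
http_err = HTTPError('http://www.python.org/callmewhy', 404,
                     'Not Found', None, io.BytesIO(b''))
url_err = URLError('connection refused')

print(describe(http_err))  # Error code: 404
print(describe(url_err))   # Reason: connection refused
```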
The exception is caught successfully!