A simple walkthrough of using the urllib2 module to write crawlers in Python

Source: Internet
Author: User
Any discussion of writing web crawlers in Python has to mention the powerful urllib2 module, which is what Python uses to fetch web pages. urllib2 is the Python 2 standard-library module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function. The following short example gives a feel for what urllib2 does:

    import urllib2

    response = urllib2.urlopen('http://www.baidu.com/')
    html = response.read()
    print html

Running the code prints the HTML source of the page.

If you view the source of http://www.baidu.com/ in a browser, you will find that it is exactly the same as the output of the code above. Besides http:, the URL can also use the ftp: or file: scheme.
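As a small illustration of the file: scheme, the sketch below writes a temporary file and reads it back through urlopen, so it needs no network access. The helper name read_url and the temporary file are assumptions made for this example. The import fallback is there because urllib2 exists only in Python 2; Python 3 provides the same function in urllib.request.

```python
import os
import tempfile

try:  # Python 2
    from urllib2 import urlopen
    from urllib import pathname2url
except ImportError:  # Python 3: urllib2 was merged into urllib.request
    from urllib.request import urlopen, pathname2url


def read_url(url):
    """Fetch a URL and return the raw body (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()


# Create a throwaway local file so that the example needs no network access.
fd, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'wb') as f:
    f.write(b'<html>hello</html>')

file_url = 'file:' + pathname2url(path)  # build a file: URL from the local path
content = read_url(file_url)
os.remove(path)

print(content)
```

The same read_url helper should work unchanged for http: and ftp: URLs, since urlopen picks the handler from the scheme.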
urllib2 uses a Request object to represent the HTTP request you want to make. You create a Request object and pass it to urlopen, which returns the corresponding response object. The response behaves like a file object, so you can call .read() on it. The code can be modified as follows:

    import urllib2

    req = urllib2.Request('http://www.baidu.com')
    response = urllib2.urlopen(req)
    page = response.read()
    print page

The result is the same as before the change. Before sending an HTTP request, you often also need to do two things: 1. send form data; 2. set header information.
1. Sending form data. This is common when simulating a login, since logging in generally requires sending data to the server. It mainly uses the POST method: as with an ordinary HTML form, the data must be encoded in the standard form format and is then passed to the Request object as the data parameter. The encoding is done with a function from urllib, not urllib2. The test code is as follows:

    import urllib
    import urllib2

    url = 'http://www.server.com/register.php'

    postdata = {'useid': 'user',
                'pwd': '***',
                'language': 'Python'}

    data = urllib.urlencode(postdata)    # encode the form data
    req = urllib2.Request(url, data)     # build the request together with the data
    response = urllib2.urlopen(req)      # receive the response
    page = response.read()               # read the response content
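To see how the data parameter changes the request, the following sketch builds two Request objects and inspects them without sending anything. It is only an illustration: the reduced field set is an assumption, and the import fallback covers Python 3, where Request lives in urllib.request and urlencode in urllib.parse.

```python
try:  # Python 2
    from urllib2 import Request
    from urllib import urlencode
except ImportError:  # Python 3
    from urllib.request import Request
    from urllib.parse import urlencode

url = 'http://www.server.com/register.php'
postdata = {'useid': 'user', 'language': 'Python'}

# urlencode yields text; encode it because Python 3 requires bytes for the body.
data = urlencode(postdata).encode('ascii')

get_req = Request(url)         # no data: urlopen would send a GET
post_req = Request(url, data)  # with data: urlopen would send a POST

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

Nothing is transmitted until one of these objects is passed to urlopen, so this is a convenient way to check a request before sending it.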

urllib2 can also send data with the GET method. The code is as follows:

    import urllib2
    import urllib

    data = {}

    data['useid'] = 'user'
    data['pwd'] = '***'
    data['language'] = 'Python'

    url_values = urllib.urlencode(data)
    print url_values    # the encoded query string (key order may vary)

    url = 'http://www.example.com/example.php'
    full_url = url + '?' + url_values

    data = urllib2.urlopen(full_url)
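The query-string construction above can be checked offline. The sketch below encodes a dictionary, appends it to the URL, and then parses the result back to confirm that nothing was lost. The note field, with its space and ampersand, is a hypothetical value added only to show the escaping.

```python
try:  # Python 2
    from urllib2 import Request
    from urllib import urlencode
    from urlparse import urlparse, parse_qs
except ImportError:  # Python 3
    from urllib.request import Request
    from urllib.parse import urlencode, urlparse, parse_qs

url = 'http://www.example.com/example.php'
data = {'useid': 'user', 'language': 'Python', 'note': 'a b&c'}

url_values = urlencode(data)         # special characters are percent-escaped
full_url = url + '?' + url_values    # GET parameters travel in the URL itself

req = Request(full_url)              # no data argument, so the method stays GET

# Parse the query string back into a dict of lists to verify the round trip.
parsed = parse_qs(urlparse(req.get_full_url()).query)
print(parsed)
```

Because the values are escaped by urlencode, the ampersand inside the note value does not get mistaken for a parameter separator.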

2. Setting header information. Some sites restrict access based on the source of the request, so here we simulate a browser by setting the User-Agent header. The code is as follows:

    import urllib
    import urllib2

    url = 'http://www.server.com/register.php'

    user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
    values = {'useid': 'user',
              'pwd': '***',
              'language': 'Python'}

    headers = {'User-Agent': user_agent}
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    page = response.read()
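Headers can also be inspected on the Request object before anything is sent. The sketch below is a minimal illustration; the Referer value is a hypothetical addition, and add_header is shown as an alternative to passing a headers dictionary.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.request import Request

url = 'http://www.server.com/register.php'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'

# Headers may be supplied when the request is created...
req = Request(url, headers={'User-Agent': user_agent})
# ...or added afterwards, one at a time (hypothetical Referer value).
req.add_header('Referer', 'http://www.server.com/')

# Request stores header names via str.capitalize(), so the lookup key
# is 'User-agent' rather than 'User-Agent'.
print(req.get_header('User-agent'))
print(req.get_header('Referer'))
```

The capitalization only affects how the header is looked up on the Request object; HTTP header names themselves are case-insensitive.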

That concludes the introduction to the basic use of urllib2.

Exception handling
A URLError is typically raised when there is no network connection or when the server address is unreachable. In that case the exception has a reason attribute that contains the error number and the error message. The following code tests this:

    import urllib
    import urllib2

    url = 'http://www.server.com/register.php'

    user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
    values = {'useid': 'user',
              'pwd': '***',
              'language': 'Python'}

    headers = {'User-Agent': user_agent}
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    try:
        response = urllib2.urlopen(req)
        page = response.read()
    except urllib2.URLError, e:
        print e.reason    # error number and error message

Errno 10061 (WSAECONNREFUSED on Windows) means that the target machine actively refused the connection.
The other exception is HTTPError. Once a normal connection has been established between the client and the server, urllib2 begins to process the response. If it encounters a response that it cannot handle, it raises an HTTPError; common error codes include "404" (page not found), "403" (request forbidden), and "401" (authentication required). The status code indicates the state of the HTTP response; the common codes are described in the HTTP status code documentation.
An HTTPError has a code attribute, which is the error number sent by the server. When an HTTPError is raised, the server has returned an error number together with an error page. The following code verifies this:
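A refused connection can be reproduced locally, without depending on any remote site. The sketch below asks the operating system for an unused port, releases it, and then tries to open a URL on that port. The helper names free_local_port and try_open are assumptions made for this example, and the exact text of the reason depends on the platform.

```python
import socket

try:  # Python 2
    from urllib2 import urlopen, URLError
except ImportError:  # Python 3
    from urllib.request import urlopen
    from urllib.error import URLError


def free_local_port():
    """Ask the OS for an unused port, then release it so nothing listens there."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(('127.0.0.1', 0))
    port = s.getsockname()[1]
    s.close()
    return port


def try_open(url):
    """Return None on success, or the URLError that was caught."""
    try:
        urlopen(url, timeout=5).close()
    except URLError as e:
        return e
    return None


error = try_open('http://127.0.0.1:%d/' % free_local_port())
if error is not None:
    print('Could not reach the server: %s' % (error.reason,))
```

On Windows the reason would normally carry errno 10061; on Linux the equivalent is errno 111 (ECONNREFUSED).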

    import urllib2

    req = urllib2.Request('http://www.python.org/callmewhy')

    try:
        urllib2.urlopen(req)

    except urllib2.URLError, e:
        print e.code

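The output is 404, which indicates that the page could not be found.
To catch the exception and handle it, note that HTTPError is a subclass of URLError, so a single except clause catches both and hasattr can tell them apart. The implementation code is as follows: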
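The code attribute can also be observed against a throwaway local server instead of a real site. The following sketch is only an illustration under several assumptions: NotFoundHandler is a hypothetical handler that answers every GET with 404, and the opener is built with an empty ProxyHandler so that proxy settings do not interfere with requests to localhost.

```python
import threading

try:  # Python 2
    from urllib2 import build_opener, ProxyHandler, HTTPError
    from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
except ImportError:  # Python 3
    from urllib.request import build_opener, ProxyHandler
    from urllib.error import HTTPError
    from http.server import HTTPServer, BaseHTTPRequestHandler


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404, 'Not Found')

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(('127.0.0.1', 0), NotFoundHandler)  # port 0: the OS picks one
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()

opener = build_opener(ProxyHandler({}))  # ignore proxy settings for localhost
url = 'http://127.0.0.1:%d/callmewhy' % server.server_address[1]

status = None
body = b''
try:
    opener.open(url, timeout=5)
except HTTPError as e:
    status = e.code   # numeric status code sent by the server
    body = e.read()   # the error page; an HTTPError is also file-like

server.shutdown()
server.server_close()
print(status)
```

Reading the body from the exception is how a crawler can still get at the error page that the server returned along with the status code.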


    # -*- coding: utf-8 -*-
    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request('http://www.python.org/callmewhy')

    try:
        response = urlopen(req)

    except URLError, e:
        if hasattr(e, 'code'):
            print 'The server could not respond properly to this request!'
            print 'Error code: ', e.code
        elif hasattr(e, 'reason'):
            print 'Could not connect to the server'
            print 'Reason: ', e.reason

    else:
        print 'No exception occurred'

The exception is caught successfully!
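An alternative to checking attributes with hasattr is to use two except clauses. Since HTTPError is a subclass of URLError, the HTTPError clause has to come first. The sketch below simulates both failures by raising the exceptions directly rather than contacting a server; the function names describe, raise_http_error and raise_url_error are assumptions made for this example.

```python
import io

try:  # Python 2
    from urllib2 import URLError, HTTPError
except ImportError:  # Python 3
    from urllib.error import URLError, HTTPError


def describe(action):
    """Run action() and report which kind of failure occurred, if any."""
    try:
        action()
    except HTTPError as e:   # must come first: HTTPError subclasses URLError
        return 'HTTP error %d' % e.code
    except URLError as e:
        return 'connection failed: %s' % (e.reason,)
    else:
        return 'ok'


def raise_http_error():
    # Simulated server reply; arguments are url, code, msg, hdrs, fp.
    raise HTTPError('http://www.python.org/callmewhy', 404, 'Not Found',
                    {}, io.BytesIO(b''))


def raise_url_error():
    raise URLError('connection refused')  # simulated network failure


results = [describe(raise_http_error),
           describe(raise_url_error),
           describe(lambda: None)]
print(results)
```

If the two except clauses were swapped, the URLError clause would catch the HTTPError as well and the HTTP branch would never run.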
