When you talk about web crawling in Python, you have to mention the powerful urllib2 component. urllib2 is the Python (2) component for fetching URLs (Uniform Resource Locators), and it provides a very simple interface in the form of the urlopen function. The following code gives a quick feel for what urllib2 can do:
```python
import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html
```
Run it, then view the source of http://www.baidu.com/: you will find it is exactly the same as what the script printed. Besides http:, the URL scheme here can also be ftp: or file:.
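The file: scheme can be tried without any network access. The article targets Python 2's urllib2; the following sketch uses Python 3, where the same functionality lives in urllib.request, to fetch a local file through a file: URL:

```python
import pathlib
import tempfile
from urllib.request import urlopen  # Python 3 home of urllib2's urlopen

# write a small local file so we have something to fetch
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('hello from a file: URL')
    path = pathlib.Path(f.name)

# urlopen accepts file: URLs as well as http: and ftp:
response = urlopen(path.as_uri())
content = response.read().decode('utf-8')
print(content)

path.unlink()  # clean up the temporary file
```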
urllib2 uses a Request object to represent the HTTP request you want to make. You create a Request object and pass it to urlopen, which returns a response object for the requested URL. The response behaves like a file object, so you can call .read() on it. The code, modified accordingly, is as follows:
```python
import urllib2

req = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(req)
page = response.read()
print page
```
The result turns out to be the same as before the change. Before making an HTTP request you often also need to do two things: 1. send form data; 2. set header information.
1. Sending form data. This is common when simulating a login: during the login operation you generally need to send data to the server, mostly with the POST method. As with an ordinary HTML form, the data needs to be encoded into a standard format and then passed to the Request object as its data argument. The encoding work is done with a function from urllib, not urllib2. The test code is as follows:
```python
import urllib
import urllib2

url = 'http://www.server.com/register.php'
postdata = {'useid': 'user',
            'pwd': '***',
            'language': 'Python'}
data = urllib.urlencode(postdata)  # do the encoding work
req = urllib2.Request(url, data)   # send the request together with the data
response = urllib2.urlopen(req)    # accept the server's feedback
page = response.read()             # read the content of the feedback
```
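For reference, here is a hedged sketch of the same POST construction under Python 3, where urllib2 became urllib.request and the encoded string must additionally be converted to bytes before it can be sent (the URL and field names are the article's placeholder values, and no actual request is made):

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://www.server.com/register.php'  # placeholder URL from the article
postdata = {'useid': 'user', 'pwd': '***', 'language': 'Python'}

# In Python 3 the encoded string must be turned into bytes before sending.
data = urlencode(postdata).encode('utf-8')
req = Request(url, data)

# A Request carrying a data argument is sent with the POST method.
print(req.get_method())  # → POST
```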
urllib2 can also transfer the data with the GET method. The code is as follows:
```python
import urllib
import urllib2

data = {}
data['useid'] = 'user'
data['pwd'] = '***'
data['language'] = 'Python'
url_values = urllib.urlencode(data)
print url_values  # e.g. useid=user&pwd=%2A%2A%2A&language=Python (order may vary)

url = 'http://www.example.com/example.php'
full_url = url + '?' + url_values
response = urllib2.urlopen(full_url)  # note: urlopen, not urllib2.open
```
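The GET pattern above (encode the data, append it after a '?') can be checked offline. In Python 3, urlencode moved to urllib.parse, and parse_qs lets us verify the round trip; the host and field names below are the article's placeholders:

```python
from urllib.parse import urlencode, urlparse, parse_qs

data = {'useid': 'user', 'language': 'Python'}
url_values = urlencode(data)  # 'useid=user&language=Python'
full_url = 'http://www.example.com/example.php' + '?' + url_values

# parse the query string back out to check the round trip
query = parse_qs(urlparse(full_url).query)
print(query)  # {'useid': ['user'], 'language': ['Python']}
```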
2. Setting header information. Some sites restrict access depending on where the request comes from, so here we fake the User-Agent header. The code is as follows:
```python
import urllib
import urllib2

url = 'http://www.server.com/register.php'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
values = {'useid': 'user',
          'pwd': '***',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
page = response.read()
```
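That a custom User-Agent actually ends up on the request can be checked without contacting any server. A hedged Python 3 sketch (Request lives in urllib.request there, and headers can be passed as a keyword argument; the URL is the article's placeholder):

```python
from urllib.request import Request

url = 'http://www.server.com/register.php'  # placeholder URL from the article
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
headers = {'User-Agent': user_agent}

req = Request(url, headers=headers)

# Request stores header names capitalized, so we query 'User-agent'.
print(req.get_header('User-agent'))  # prints the Firefox UA string above
```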
That is all for the introduction to urllib2!
Exception handling
A URLError typically occurs when there is no network connection or the server address cannot be reached. In that case the exception has a reason attribute containing an error number and an error message. The following code tests the effect:
```python
import urllib2

# a request to a server that cannot be reached
req = urllib2.Request('http://www.server.com/register.php')
try:
    urllib2.urlopen(req)
except urllib2.URLError, e:
    print e.reason
```

An errno of 10061 indicates that the server side is actively refusing the connection.
Besides URLError there is HTTPError. Once a normal connection has been established between client and server, urllib2 starts processing the related data. If it runs into a situation it cannot handle, it raises a corresponding HTTPError, such as the common error codes "404" (page not found), "403" (request forbidden) and "401" (authentication required). The HTTP status code indicates the kind of response in the HTTP protocol; the common codes are described in the HTTP status code reference.
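The standard reason phrases for these codes can be looked up programmatically. In Python 3's standard library the mapping from code to reason phrase is exposed as http.client.responses (shown here as a quick reference, not part of the urllib2 API itself):

```python
from http.client import responses

# standard reason phrases for the codes mentioned above
for code in (401, 403, 404):
    print(code, responses[code])  # e.g. 404 Not Found
```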
An HTTPError has a code attribute, which is the error number sent by the server. When an HTTPError is raised, the server returns the related error number and an error page. The following code verifies this:
```python
import urllib2

req = urllib2.Request('http://www.python.org/callmewhy')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
```
The output code 404 indicates that the page could not be found.
Catching the exception and handling it is implemented as follows:
```python
# -*- coding: utf-8 -*-
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request('http://www.python.org/callmewhy')
try:
    response = urlopen(req)
except URLError, e:
    # HTTPError is a subclass of URLError, so both are caught here
    if hasattr(e, 'code'):
        print 'The server could not fulfil the request!'
        print 'Error code:', e.code
    elif hasattr(e, 'reason'):
        print 'Failed to reach the server'
        print 'Reason:', e.reason
else:
    print 'No exception occurred'
```
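The hasattr-based dispatch above can be exercised without any network access. A hedged Python 3 sketch (urllib2's exceptions live in urllib.error there) that builds the two exception types by hand and runs them through the same checks; the describe helper is introduced here for illustration only:

```python
import io
from urllib.error import URLError, HTTPError

def describe(e):
    """Mirror the hasattr checks used in the handler above."""
    if hasattr(e, 'code'):
        return 'Error code: %s' % e.code
    elif hasattr(e, 'reason'):
        return 'Reason: %s' % e.reason
    return 'no exception info'

# construct the exceptions by hand instead of hitting the network;
# HTTPError takes (url, code, msg, hdrs, fp)
http_err = HTTPError('http://www.python.org/callmewhy', 404,
                     'Not Found', None, io.BytesIO(b''))
url_err = URLError('connection refused')

print(describe(http_err))  # Error code: 404
print(describe(url_err))   # Reason: connection refused
```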
The exception is caught successfully!