A simple example of using the urllib2 module in Python to write Crawlers

Last Update:2018-07-18 Source: Internet

Author: User

Tags python web crawler

This article describes how to write a crawler in Python with the urllib2 module and how to add exception handling to it. Any discussion of Python web crawlers has to mention urllib2, the component Python uses to fetch web pages. urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function. The following code briefly shows what urllib2 does:

import urllib2
response = urllib2.urlopen('http://www.baidu.com/')  # fetch the page
html = response.read()  # read the response body
print html

Running the script prints the HTML source of the page.

Open http://www.baidu.com/ in a browser and view the page source; it should match what the script printed. Besides http:, the URL scheme can also be ftp: or file:.
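The file: scheme can be tried without any network access. The sketch below is only an illustration: it writes a temporary file (its name and contents are made up for the example) and opens it through urlopen. The try/except import is an assumption added so that the same code also runs on Python 3, where the urllib2 functions live in urllib.request.

```python
import os
import tempfile

try:
    # Python 2, as used in this article
    from urllib2 import urlopen
    from urllib import pathname2url
except ImportError:
    # Python 3: the same functions live in urllib.request
    from urllib.request import urlopen, pathname2url

# Create a small local file to stand in for a web page.
handle, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(handle, 'wb') as f:
    f.write(b'<html><body>hello</body></html>')

# Build a file: URL and open it exactly like an http: URL.
file_url = 'file:' + pathname2url(path)
response = urlopen(file_url)
html = response.read()
response.close()
os.remove(path)

print(html)
```

The response is read in the same way as for an http: URL, which is why the rest of the crawler code does not need to care about the scheme.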
urllib2 uses a Request object to represent the HTTP request you want to make. You create a Request object and pass it to urlopen, which returns a response object for that request. The response object behaves like a file object, so you can call response.read() on it. The code can be modified as follows:

import urllib2
req = urllib2.Request('http://www.baidu.com')  # describe the request
response = urllib2.urlopen(req)  # send it
page = response.read()
print page
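Creating a Request object does not send anything by itself; it only describes the request until it is passed to urlopen. A minimal sketch of inspecting such an object is shown below. The try/except import is an assumption so the snippet also runs on Python 3 (urllib.request).

```python
try:
    from urllib2 import Request  # Python 2
except ImportError:
    from urllib.request import Request  # Python 3

req = Request('http://www.baidu.com')

# Nothing is sent yet; the object only describes the request.
print(req.get_full_url())  # the URL that urlopen() would fetch
print(req.get_method())    # 'GET', because no data was attached
```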

The output is the same as before the change. In addition, you may need to do two things before sending an HTTP request: 1. Send form data. 2. Set header information.
1. Send form data. Data is typically sent to the server when simulating a login, and the POST method is mainly used for this. HTML form data must be encoded in the standard format and is then passed to the Request object as the data argument. The encoding is done with a function from urllib (urllib.urlencode), not urllib2. The test code is as follows:

import urllib
import urllib2
url = 'http://www.server.com/register.php'
postData = {'useid': 'user', 'pwd': '***', 'language': 'python'}
data = urllib.urlencode(postData)  # encode the form data
req = urllib2.Request(url, data)  # build the request with the data attached
response = urllib2.urlopen(req)  # send the request and receive the response
page = response.read()  # read the response body
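To see what the encoding step produces without contacting a server, the request can be built and inspected but not sent. This is a sketch under a few assumptions: the import fallback targets Python 3, a list of pairs is used instead of a dict so the field order is predictable, and the encoded string is converted to bytes because Python 3 requires a bytes request body.

```python
try:
    from urllib import urlencode          # Python 2
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode    # Python 3
    from urllib.request import Request

url = 'http://www.server.com/register.php'
# A list of pairs keeps the field order predictable.
post_fields = [('useid', 'user'), ('pwd', '***'), ('language', 'Python')]

# urlencode() produces text; encode it to bytes for the request body.
data = urlencode(post_fields).encode('ascii')
req = Request(url, data)

print(data)
print(req.get_method())  # 'POST', because a data argument was supplied
```

Attaching data is what switches the request from GET to POST; no separate method argument is involved here.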

urllib2 can also send data with the GET method, by appending the encoded data to the URL as a query string. The code is as follows:

import urllib2
import urllib
data = {}
data['useid'] = 'user'
data['pwd'] = '***'
data['language'] = 'Python'
url_values = urllib.urlencode(data)
print url_values  # e.g. useid=user&pwd=%2A%2A%2A&language=Python (key order may vary)
url = 'http://www.example.com/example.php'
full_url = url + '?' + url_values  # the data travels in the query string
response = urllib2.urlopen(full_url)
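The query string can be checked by decoding it again. The following sketch builds the full URL and parses it back; the parameter names are the illustrative ones used above, and the import fallback is an assumption for Python 3, where these functions live in urllib.parse.

```python
try:
    from urllib import urlencode                    # Python 2
    from urlparse import urlparse, parse_qs
except ImportError:
    from urllib.parse import urlencode, urlparse, parse_qs  # Python 3

url = 'http://www.example.com/example.php'
params = [('useid', 'user'), ('language', 'Python')]

full_url = url + '?' + urlencode(params)
print(full_url)

# Decode the query string again to confirm nothing was lost.
decoded = parse_qs(urlparse(full_url).query)
print(decoded)
```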

2. Set header information. Some sites restrict which clients may access them, so here the request imitates a browser by setting the User-Agent header. The code is as follows:

import urllib
import urllib2
url = 'http://www.server.com/register.php'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
values = {'useid': 'user', 'pwd': '***', 'language': 'Python'}
headers = {'User-Agent': user_agent}  # pretend to be a browser
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
page = response.read()
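Headers can be inspected on the Request object before anything is sent. The sketch below is illustrative only: the Referer value is a made-up example, and the import fallback is an assumption for Python 3. Note that Request stores header names in capitalized form, so the User-Agent header is looked up as 'User-agent'.

```python
try:
    from urllib2 import Request  # Python 2
except ImportError:
    from urllib.request import Request  # Python 3

user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
req = Request('http://www.server.com/register.php',
              headers={'User-Agent': user_agent})

# Headers can also be added after the object is created.
# The Referer value is a made-up example.
req.add_header('Referer', 'http://www.server.com/')

# Request stores header names capitalized as 'User-agent'.
print(req.get_header('User-agent'))
print(req.has_header('Referer'))
```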

That concludes the introduction to the basics of urllib2.

Exception Handling
A URLError is generally raised when there is no network connection or the server address cannot be reached. In this case, the exception has a reason attribute containing the error code and error message. The following code tests this:

import urllib
import urllib2
url = 'http://www.server.com/register.php'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
values = {'useid': 'user', 'pwd': '***', 'language': 'Python'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
try:
    response = urllib2.urlopen(req)
    page = response.read()
except urllib2.URLError, e:
    print e.reason  # error code and message

The output contains "Errno 10061". Looking this code up shows that it means the server actively refused the connection.
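A refused connection can be reproduced locally without relying on a remote host. The sketch below is one possible way to do it, not the article's original test: it asks the operating system for a free port on 127.0.0.1, closes it so that nothing is listening, and then tries to open a URL on that port. The empty ProxyHandler and the import fallback for Python 3 are assumptions added for the example.

```python
import socket

try:
    import urllib2 as request            # Python 2
    from urllib2 import URLError
except ImportError:
    import urllib.request as request     # Python 3
    from urllib.error import URLError

# Ask the OS for a free port, then close it so nothing is listening there.
probe = socket.socket()
probe.bind(('127.0.0.1', 0))
port = probe.getsockname()[1]
probe.close()

# An empty ProxyHandler bypasses any system proxy settings.
opener = request.build_opener(request.ProxyHandler({}))

reason = None
try:
    opener.open('http://127.0.0.1:%d/' % port, timeout=5)
except URLError as e:
    reason = e.reason  # usually a socket error saying the connection was refused

print(reason)
```

The exact error number in the message depends on the operating system; 10061 is the value reported on Windows.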
The other exception is HTTPError. Once a connection between the client and the server has been established, urllib2 starts processing the response. If it receives a response it cannot handle, it raises an HTTPError. Common error codes when visiting websites include "404" (page not found), "403" (request forbidden), and "401" (authentication required). An HTTP status code indicates the status of the response returned over the HTTP protocol. For details about common status codes, see HTTP status codes.
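The standard library ships a table of status codes and their standard reason phrases, which can serve as a quick reference for the codes mentioned above. A small sketch, with the import fallback assumed for Python 3 (http.client instead of httplib):

```python
try:
    from httplib import responses       # Python 2
except ImportError:
    from http.client import responses   # Python 3

# Look up the standard reason phrase for each status code.
for code in (401, 403, 404):
    print(code, responses[code])
```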
HTTPError has a code attribute, which is the status code sent by the server. When an HTTPError is raised, the server has returned an error code together with an error page. The following code verifies this:

import urllib2
req = urllib2.Request('http://www.python.org/callmewhy')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:  # HTTPError is a subclass of URLError
    print e.code

The output is 404, which means the page was not found.
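The same behaviour can be observed without depending on an external site. The sketch below starts a throwaway local server that answers every GET with 404 and then requests a page from it. The handler class and variable names are hypothetical, and the import fallback is an assumption for Python 3.

```python
import threading

try:
    import urllib2 as request                                    # Python 2
    from urllib2 import HTTPError
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
except ImportError:
    import urllib.request as request                             # Python 3
    from urllib.error import HTTPError
    from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, *args):
        pass  # keep the console quiet


# Port 0 lets the OS pick a free port.
server = HTTPServer(('127.0.0.1', 0), NotFoundHandler)
worker = threading.Thread(target=server.serve_forever)
worker.daemon = True
worker.start()

opener = request.build_opener(request.ProxyHandler({}))  # ignore system proxies
code = None
body = None
try:
    opener.open('http://127.0.0.1:%d/callmewhy' % server.server_address[1], timeout=5)
except HTTPError as e:
    code = e.code    # status code sent by the server
    body = e.read()  # the error page sent along with it
finally:
    server.shutdown()
    server.server_close()

print(code)  # 404
```

Because an HTTPError also carries the response body, the error page can be read from the exception in the same way as from a normal response.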
The following code catches and handles these exceptions:

# -*- coding: utf-8 -*-
from urllib2 import Request, urlopen, URLError, HTTPError
req = Request('http://www.python.org/callmewhy')
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'code'):  # HTTPError: the server answered with an error status
        print 'The server could not fulfill the request.'
        print 'Error code: ', e.code
    elif hasattr(e, 'reason'):  # plain URLError: the server was not reached
        print 'Could not establish a connection with the server.'
        print 'Reason: ', e.reason
else:
    print 'No exception occurred.'

The exception is caught successfully!
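The attribute checks can be tried without a network by building the exceptions by hand. The helper below is a hypothetical sketch of the same idea, not part of urllib2; the import fallback is an assumption for Python 3 (urllib.error). The code attribute is tested first because HTTPError is a subclass of URLError.

```python
try:
    from urllib2 import URLError, HTTPError       # Python 2
except ImportError:
    from urllib.error import URLError, HTTPError  # Python 3


def describe(error):
    """Summarize a URLError or HTTPError (hypothetical helper)."""
    # Check 'code' first: HTTPError is a subclass of URLError.
    if hasattr(error, 'code'):
        return 'HTTP error, code %s' % error.code
    if hasattr(error, 'reason'):
        return 'connection failed: %s' % error.reason
    return 'unknown error'


# Exceptions built by hand, so no network access is needed.
http_error = HTTPError('http://www.python.org/callmewhy', 404, 'Not Found', None, None)
url_error = URLError('connection refused')

print(describe(http_error))
print(describe(url_error))
```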

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.