A simple example of using the urllib2 module in Python to write crawlers

Source: Internet
Author: User
Tags python web crawler
This article describes how to use the urllib2 module to write a crawler in Python, and how to add the relevant exception handling. Any discussion of Python web crawlers has to mention urllib2, the component Python uses to fetch web pages. urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It provides a very simple interface in the form of the urlopen function. The following code briefly shows what urllib2 does:

    import urllib2

    response = urllib2.urlopen('http://www.baidu.com/')
    html = response.read()
    print html

Running the code prints the HTML source of the page. You can compare it with the source your browser shows for http://www.baidu.com/. In addition to http:, the URL scheme can also be ftp: or file:.
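As a network-free illustration of the file: scheme mentioned above, the following sketch reads a local file through urlopen. The helper name read_via_file_url and the fallback import are assumptions made for illustration: urllib2 exists only on Python 2, and on Python 3 the same functions live in urllib.request.

```python
import os

try:
    # Python 2, as used in this article
    from urllib2 import urlopen
    from urllib import pathname2url
except ImportError:
    # Python 3: the same functions live in urllib.request
    from urllib.request import urlopen, pathname2url


def read_via_file_url(path):
    """Read a local file through urlopen using a file: URL (hypothetical helper)."""
    file_url = 'file:' + pathname2url(os.path.abspath(path))
    response = urlopen(file_url)
    try:
        return response.read()  # raw bytes, just like an HTTP response body
    finally:
        response.close()
```

Calling read_via_file_url with the path of a saved HTML page returns the file's contents, so the parsing part of a crawler can be tried out before any real request is sent.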
urllib2 uses a Request object to represent the HTTP request you are making. You create a Request object and pass it to urlopen, which returns a response object for that request. The response object is file-like, so you can call response.read() on it. Modify the code as follows:

    import urllib2

    req = urllib2.Request('http://www.baidu.com')
    response = urllib2.urlopen(req)
    page = response.read()
    print page

The result is the same as before the change. Before sending an HTTP request, you may also need to do the following: 1. Send form data. 2. Set header information.
1. Send form data. Data is typically sent to the server when simulating a login, and the POST method is mainly used for this. As with an HTML form, the data must be encoded in the standard format and is then passed to the Request object as the data argument. The encoding is done with a function from urllib (urllib.urlencode), not from urllib2. The test code is as follows:

    import urllib
    import urllib2

    url = 'http://www.server.com/register.php'
    postData = {'useid': 'user', 'pwd': '***', 'language': 'python'}

    data = urllib.urlencode(postData)   # encode the form data
    req = urllib2.Request(url, data)    # build the request with the data attached
    response = urllib2.urlopen(req)     # send the request and receive the response
    page = response.read()              # read the response body
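One way to check what the POST example sends, without contacting a server, is to build the Request and inspect it. This is only a sketch: build_post_request is a hypothetical helper name, the URL is the placeholder from the example, and the fallback import assumes Python 3, where the body must be bytes (hence the encode call).

```python
try:
    from urllib2 import Request          # Python 2, as in the article
    from urllib import urlencode
except ImportError:
    from urllib.request import Request   # Python 3 equivalents
    from urllib.parse import urlencode


def build_post_request(url, fields):
    """Return a Request whose body is the URL-encoded form fields."""
    # Sorting the pairs fixes the field order; encode() yields the bytes
    # that Python 3 requires for a request body (harmless on Python 2).
    body = urlencode(sorted(fields.items())).encode('ascii')
    return Request(url, body)


req = build_post_request('http://www.server.com/register.php',
                         {'useid': 'user', 'pwd': '***', 'language': 'python'})
print(req.get_method())  # POST, because a data argument was supplied
print(req.data)          # the encoded form body that would be sent
```

Supplying the data argument is what switches the request from GET to POST; nothing is sent until the Request is passed to urlopen.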

urllib2 can also send data with the GET method, by appending the encoded data to the URL. The code is as follows:

    import urllib
    import urllib2

    data = {}
    data['useid'] = 'user'
    data['pwd'] = '***'
    data['language'] = 'Python'

    url_values = urllib.urlencode(data)
    print url_values   # e.g. useid=user&pwd=%2A%2A%2A&language=Python (key order may vary)

    url = 'http://www.example.com/example.php'
    full_url = url + '?' + url_values   # the query string follows the '?'

    response = urllib2.urlopen(full_url)
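The URL-building step of the GET example can be isolated and checked on its own. The sketch below assumes a hypothetical helper name, build_get_url, and sorts the parameters so that the result is the same on every run; the fallback import covers Python 3, where urlencode lives in urllib.parse.

```python
try:
    from urllib import urlencode         # Python 2
except ImportError:
    from urllib.parse import urlencode   # Python 3


def build_get_url(base_url, params):
    """Append params to base_url as a query string (hypothetical helper)."""
    # Sorting the pairs makes the output deterministic.
    query = urlencode(sorted(params.items()))
    return base_url + '?' + query


full_url = build_get_url('http://www.example.com/example.php',
                         {'useid': 'user', 'pwd': '***', 'language': 'Python'})
print(full_url)
# http://www.example.com/example.php?language=Python&pwd=%2A%2A%2A&useid=user
```

Note that urlencode escapes characters such as '*' as %2A, which is why the password placeholder looks different in the URL.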

2. Set header information. Some sites restrict access depending on the client that sends the request, so here the User-Agent header is set to imitate a browser. The code is as follows:

    import urllib
    import urllib2

    url = 'http://www.server.com/register.php'

    user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
    values = {'useid': 'user',
              'pwd': '***',
              'language': 'Python'}

    headers = {'User-Agent': user_agent}
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    page = response.read()
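To confirm that a header was attached before sending anything, the Request object can be inspected directly. This is a minimal sketch, with the fallback import for Python 3 as an assumption; it also shows a detail that is easy to trip over, namely that Request stores header names in capitalized form.

```python
try:
    from urllib2 import Request          # Python 2
except ImportError:
    from urllib.request import Request   # Python 3

user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
req = Request('http://www.server.com/register.php',
              headers={'User-Agent': user_agent})

# Request normalizes header names with str.capitalize(), so the key is
# stored as 'User-agent' and must be looked up with that spelling.
print(req.has_header('User-agent'))                 # True
print(req.has_header('User-Agent'))                 # False
print(req.get_header('User-agent') == user_agent)   # True
```

The header is still sent to the server as a User-Agent header; only the lookup key on the Request object is affected by the capitalization.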

That concludes the basic introduction to urllib2!

Exception Handling
Generally, URLError is raised when there is no network connection or the server address cannot be reached. In this case the exception carries a reason attribute containing the error code and error message. The following code tests this:

    import urllib
    import urllib2

    url = 'http://www.server.com/register.php'

    user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0'
    values = {'useid': 'user',
              'pwd': '***',
              'language': 'Python'}

    headers = {'User-Agent': user_agent}
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)

    try:
        response = urllib2.urlopen(req)
        page = response.read()
    except urllib2.URLError, e:
        print e.reason   # error code and message describing the failure

The output contains the error "Errno 10061". Looking this code up shows that it means the server refused the connection.
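The reason attribute can also be observed without a network connection. In the sketch below, a file: URL that points at a file which does not exist stands in for an unreachable server; that substitution, the helper name try_open, and the Python 3 fallback import are all assumptions made for illustration.

```python
import os
import tempfile

try:
    from urllib2 import urlopen, URLError             # Python 2
    from urllib import pathname2url
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3
    from urllib.error import URLError


def try_open(url):
    """Return (True, body) on success or (False, reason) when URLError is raised."""
    try:
        response = urlopen(url)
    except URLError as e:
        return False, e.reason
    try:
        return True, response.read()
    finally:
        response.close()


# The missing file plays the role of a server that cannot be reached.
missing = os.path.join(tempfile.mkdtemp(), 'no-such-page.html')
ok, reason = try_open('file:' + pathname2url(missing))
print(ok)       # False
print(reason)   # the underlying OS error describing why the open failed
```

For a real refused connection the reason would be a socket error such as the Errno 10061 mentioned above, but the handling code is the same.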
In addition, there is HTTPError. Once a normal connection is established between the client and the server, urllib2 starts to process the response. If the response is an error that urllib2 cannot handle, the corresponding HTTPError is raised. Common error codes when accessing websites include "404" (the page cannot be found), "403" (the request is forbidden), "401" (authentication is required), and so on. The HTTP status code indicates the status of the response under the HTTP protocol. For details about common status codes, see HTTP status codes.
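The standard reason phrases for the status codes mentioned above can be looked up in the standard library. This sketch assumes a hypothetical helper name, describe_status; the responses table is in httplib on Python 2 and in http.client on Python 3.

```python
try:
    from httplib import responses        # Python 2
except ImportError:
    from http.client import responses    # Python 3


def describe_status(code):
    """Return the standard reason phrase for an HTTP status code."""
    return responses.get(code, 'Unknown status')


for code in (401, 403, 404):
    print('%d %s' % (code, describe_status(code)))
# 401 Unauthorized
# 403 Forbidden
# 404 Not Found
```

A crawler can use such a lookup to log a readable message next to the numeric code it receives.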
HTTPError has a 'code' attribute, which is the error code sent by the server. When an HTTPError is raised, the server has returned an error code along with an error page. The following code verifies this:

    import urllib2

    req = urllib2.Request('http://www.python.org/callmewhy')

    try:
        urllib2.urlopen(req)
    except urllib2.HTTPError, e:   # only HTTPError is guaranteed to have a code
        print e.code

Code 404 is output, indicating that the page cannot be found.
Both kinds of exception can be captured and handled together. The implementation code is as follows:


    # -*- coding: UTF-8 -*-
    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request('http://www.python.org/callmewhy')

    try:
        response = urlopen(req)
    except URLError, e:
        if hasattr(e, 'code'):
            # HTTPError: the server answered, but with an error status
            print 'The server cannot respond to this request normally!'
            print 'Error code: ', e.code
        elif hasattr(e, 'reason'):
            # plain URLError: the server could not be reached at all
            print 'Cannot establish a connection with the server'
            print 'Reason: ', e.reason
    else:
        print 'No exception occurred'

Running the code shows that the exception is caught!
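Because HTTPError is a subclass of URLError, an alternative to the hasattr checks is to use two except clauses with HTTPError listed first. The sketch below is one possible arrangement, not the article's own code: fetch, fake_404 and fake_refused are hypothetical names, the fake openers replace urlopen so that the example runs without a network, and the fallback import assumes Python 3.

```python
import io

try:
    from urllib2 import urlopen, URLError, HTTPError   # Python 2
except ImportError:
    from urllib.request import urlopen                 # Python 3
    from urllib.error import URLError, HTTPError


def fetch(url, opener=urlopen):
    """Return ('ok', body), ('http_error', code) or ('url_error', reason)."""
    try:
        response = opener(url)
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return 'http_error', e.code
    except URLError as e:
        return 'url_error', e.reason
    else:
        return 'ok', response.read()


def fake_404(url):
    """Hypothetical stand-in for urlopen that always fails with a 404."""
    raise HTTPError(url, 404, 'Not Found', {}, io.BytesIO(b''))


def fake_refused(url):
    """Hypothetical stand-in for urlopen that simulates a refused connection."""
    raise URLError('connection refused')


print(fetch('http://www.python.org/callmewhy', opener=fake_404))
print(fetch('http://www.server.com/register.php', opener=fake_refused))
```

With the default opener, fetch performs a real request; passing a different opener is only a convenience for trying out the error paths.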
