[Python] web crawler (2): using urllib2 to fetch webpage content from a specified URL

Source: Internet
Author: User

Webpage capturing means reading the network resources identified by a URL from the network stream and saving them to the local machine.
It is similar to using a program to simulate what a browser does: the URL is sent to the server as part of an HTTP request, and the server's response is then read back.


In Python, we use the urllib2 component to fetch webpages.
Urllib2 is a Python component for obtaining URLs (Uniform Resource Locators).

It provides a very simple interface in the form of the urlopen function.

The simplest urllib2 application takes only four lines of code.

Create a new urllib2_test01.py file to see what urllib2 does:

import urllib2
response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html

Press F5 to view the running result:


If we open the Baidu homepage in a browser, right-click, and choose to view the page source (in either Firefox or Google Chrome), we will find the same content.

That is to say, the above four lines of code print all the HTML we receive when accessing Baidu.

This is the simplest urllib2 example.


In addition to "http:", the URL scheme can also be "ftp:", "file:", and so on.
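As a minimal sketch of a non-HTTP scheme, the snippet below fetches a local file through a "file:" URL, so it needs no network at all. (The try/except import is an addition for readers on Python 3, where urlopen moved to urllib.request; the original article targets Python 2. The path handling assumes a Unix-style filesystem.)

```python
import os
import tempfile

# urllib2 exists only in Python 2; in Python 3 the same
# function lives in urllib.request, so fall back to it.
try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3

# Write a small local file, then fetch it back via a "file:" URL.
path = os.path.join(tempfile.mkdtemp(), 'hello.txt')
with open(path, 'w') as f:
    f.write('hello from a file: URL')

response = urlopen('file://' + path)
print(response.read())
```

The response object returned for a "file:" URL supports the same read() interface as one returned for an "http:" URL.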

HTTP is based on a request-and-response mechanism:

the client initiates a request, and the server provides a response.


Urllib2 uses a Request object to represent your HTTP request.

In its simplest form, you create a Request object with the address you want to request.

Calling urlopen and passing in the Request object returns a response object for that request.

This response object behaves like a file object, so you can call read() on the response.

Let's create a new urllib2_test02.py file to try it out:

import urllib2

req = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page


The output is the same as in test01.

Urllib2 uses the same interface to handle all URL schemes. For example, you can create an FTP request as follows:

req = urllib2.Request('ftp://example.com/')

In HTTP requests, you can perform two additional tasks.

1. Sending form data

This will be familiar to anyone who has worked on the Web side:

sometimes you want to send some data to a URL (usually a URL associated with a CGI [Common Gateway Interface] script, or some other web application).

In HTTP, this is often done with the well-known POST request.

This is usually done by your browser when you submit an HTML form.

Not all POSTs come from forms: you can use POST to submit arbitrary data to your own program.

For ordinary HTML forms, the data must be encoded into a standard format and then passed to the Request object as the data parameter.

The encoding is done with a function from urllib, not urllib2.
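To see just the encoding step in isolation (no server involved), urlencode turns a dict or a sequence of key/value pairs into an application/x-www-form-urlencoded string, percent-escaping characters that are not URL-safe. (The try/except import is an addition for Python 3 compatibility, where the function moved to urllib.parse; the article itself uses Python 2.)

```python
# urlencode lives in urllib on Python 2 and in urllib.parse on Python 3.
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

print(urlencode({'name': 'why'}))                       # name=why

# A list of pairs keeps the order stable; spaces and special
# characters are escaped automatically.
print(urlencode([('q', 'hello world'), ('lang', 'C++')]))  # q=hello+world&lang=C%2B%2B
```

The resulting string is exactly what gets sent as the body of the POST request in the example below.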

Let's create a new urllib2_test03.py file to try it out:

import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'
values = {'name': 'why', 'location': 'sdu', 'language': 'Python'}
data = urllib.urlencode(values)    # do the encoding
req = urllib2.Request(url, data)   # build the request, passing the form data
response = urllib2.urlopen(req)    # send it and accept the response
the_page = response.read()         # read the response

If the data argument is not passed, urllib2 uses a GET request instead.

The difference between GET and POST requests is that POST requests usually have "side effects":

they may change the state of the system in some way (for example, placing an order for something to be delivered to your door).

Data can also be encoded directly into the URL of a GET request.

import urllib2
import urllib

data = {}
data['name'] = 'WHY'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.urlencode(data)
print url_values  # e.g. name=WHY&language=Python&location=SDU (key order may vary)

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values
response = urllib2.urlopen(full_url)

In this way, data is transmitted via GET.
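To check what the server would see on the other end, the standard library can also decode such a query string back into a dict. (This round-trip check is an addition for illustration; parse_qs lives in urlparse on Python 2 and in urllib.parse on Python 3.)

```python
try:
    from urlparse import parse_qs        # Python 2
except ImportError:
    from urllib.parse import parse_qs    # Python 3

query = 'name=WHY&location=SDU&language=Python'
params = parse_qs(query)

# parse_qs returns each value as a list, because a key may repeat
# in a query string (e.g. 'a=1&a=2').
print(params['name'])      # ['WHY']
print(params['language'])  # ['Python']
```

This is essentially what a CGI script does with the query string it receives.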


2. Setting headers on HTTP requests

Some websites do not like being accessed by programs (rather than by humans), or send different content to different browsers.

By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor Python version numbers, e.g. Python-urllib/2.7).

This identity may confuse the site, or simply not work at all.

A browser identifies itself through the User-Agent header. When you create a Request object, you can give it a dictionary containing the header data.

The following example sends the same content as above, but masquerades as Internet Explorer.

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
