[Python] Web crawler (2): using urllib2 to fetch the content of a webpage at a specified URL

Version note: this article is written against Python 2.7.5. urllib2 changed greatly in Python 3 (its functionality was moved into urllib.request and urllib.error), so the code below will not run unmodified on Python 3; see a Python 3 tutorial for the newer API.

Fetching a webpage means reading the network resource identified by a URL from the network stream and saving it to the local machine.
It is similar to using a program to imitate what a browser such as IE does: the URL is sent to the server as part of an HTTP request, and the server's response is then read back.


In Python, we use the urllib2 module to fetch webpages.
urllib2 is a Python component for obtaining URLs (Uniform Resource Locators).

It provides a very simple interface in the form of the urlopen function.

The simplest urllib2 application takes only four lines of code.

Create a new file, urllib2_test01.py, to see what urllib2 does:

import urllib2

response = urllib2.urlopen('http://www.baidu.com/')  # send the request
html = response.read()                               # read the response body
print html

Press F5 (in IDLE) to run it and view the result.
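As the version note above says, urllib2 no longer exists under that name in Python 3: its functionality was moved into urllib.request. For readers on Python 3, here is a minimal sketch of the same fetch, assuming Python 3.x:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com/')
html = response.read()        # bytes in Python 3, not str
print(html.decode('utf-8'))   # decode before printing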

Let's create a new urllib2_test02.py file and try the Request object:

import urllib2

req = urllib2.Request('http://www.baidu.com')  # build a Request object first
response = urllib2.urlopen(req)                # then open it
the_page = response.read()
print the_page

The output is the same as that of test01.

urllib2 uses the same interface for all URL schemes, not just HTTP. For example, you can create an FTP request like this:

req = urllib2.Request('ftp://example.com/')
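Opening and reading the response then works exactly as in the HTTP examples. A brief sketch (example.com is only a placeholder here and does not actually serve FTP, so this is illustrative only):

import urllib2

req = urllib2.Request('ftp://example.com/')
response = urllib2.urlopen(req)  # same call as for HTTP; raises URLError if unreachable
listing = response.read()        # for an FTP directory URL, typically a listing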

HTTP requests allow you to do two additional things.


1. Sending form data

If you have done any web development, this will be familiar: sometimes you want to send data to a URL (usually a URL that points to a CGI (Common Gateway Interface) script or some other web application).

In HTTP, this is usually done with the well-known POST request.

This is usually done by your browser when you submit an HTML form.

Not all POSTs come from forms: you can use POST to submit arbitrary data to your own program.
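For instance, here is a hedged sketch of POSTing a raw JSON body instead of form data. The URL is hypothetical; whatever string you pass as the data argument becomes the request body, and because it is not a form we set the Content-Type header ourselves:

import urllib2

url = 'http://www.someserver.com/api'  # hypothetical endpoint, for illustration only
body = '{"name": "WHY", "language": "Python"}'
req = urllib2.Request(url, body, {'Content-Type': 'application/json'})
response = urllib2.urlopen(req)        # sent as POST because data is present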

For ordinary HTML forms, the data must be encoded in the standard format and then passed to the Request object as the data parameter.

The encoding is done with a function from urllib, not urllib2.

Let's create a new urllib2_test03.py file to try it:

import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}

data = urllib.urlencode(values)   # encode the form data
req = urllib2.Request(url, data)  # build a POST request carrying the data
response = urllib2.urlopen(req)   # send the request and receive the response
the_page = response.read()        # read the response body


If no data parameter is passed, urllib2 uses a GET request.

The difference between GET and POST requests is that POST requests often have "side effects": they change the state of the system in some way (for example, placing an order for goods to be delivered to your door).

Data can also be encoded into the URL of a GET request.

import urllib
import urllib2

data = {}
data['name'] = 'WHY'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.urlencode(data)
print url_values   # prints e.g. language=Python&location=SDU&name=WHY (order may vary)

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values
response = urllib2.urlopen(full_url)  # note: urlopen, not open

This is how data is transmitted in a GET request.



2. Setting headers on HTTP requests

Some websites do not like being accessed by programs (as opposed to humans), or send different versions of content to different browsers.

By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor Python version numbers, for example Python-urllib/2.7).
This identity may confuse the site, or the site may simply refuse to serve it.

A browser establishes its identity through the User-Agent header. When you create a Request object, you can give it a dictionary containing the headers.

The following example sends the same content as above, but identifies itself as Internet Explorer.
(Thanks to the readers who pointed out that this demo URL is no longer reachable; the principle remains the same.)

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}

headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
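Once urlopen succeeds, the response object also lets you inspect what the server sent back. A short sketch using the standard urllib2 response methods:

import urllib2

response = urllib2.urlopen('http://www.baidu.com')
print response.geturl()   # the final URL, after any redirects
print response.getcode()  # the HTTP status code, e.g. 200
print response.info()     # the response headers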

That concludes [Python] Web crawler (2): using urllib2 to fetch the content of a webpage at a specified URL.

