[Python] Web crawler (2): using urllib2 to fetch the content of a webpage at a specified URL

Version note: this article is written against Python 2.7.5. urllib2 changed greatly in Python 3 (its functionality was moved into urllib.request and urllib.error), so the code below will not run unmodified on Python 3; see a Python 3 tutorial for the newer API.

Fetching a webpage means reading the network resource identified by a URL from the network stream and saving it to the local machine.
It is similar to using a program to imitate what a browser such as IE does: the URL is sent to the server as part of an HTTP request, and the server's response is then read back.


In Python, we use the urllib2 module to fetch webpages.
urllib2 is a Python component for obtaining URLs (Uniform Resource Locators).

It provides a very simple interface in the form of the urlopen function.

The simplest urllib2 application takes only four lines of code.

Create a new file, urllib2_test01.py, to see what urllib2 does:

import urllib2

response = urllib2.urlopen('http://www.baidu.com/')  # send the request
html = response.read()                               # read the response body
print html

Press F5 (in IDLE) to run it and view the result.
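As the version note above says, urllib2 no longer exists under that name in Python 3: its functionality was moved into urllib.request. For readers on Python 3, here is a minimal sketch of the same fetch, assuming Python 3.x:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com/')
html = response.read()        # bytes in Python 3, not str
print(html.decode('utf-8'))   # decode before printing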

Let's create a new urllib2_test02.py file and try the Request object:

import urllib2

req = urllib2.Request('http://www.baidu.com')  # build a Request object first
response = urllib2.urlopen(req)                # then open it
the_page = response.read()
print the_page

The output is the same as that of test01.

urllib2 uses the same interface for all URL schemes, not just HTTP. For example, you can create an FTP request like this:

req = urllib2.Request('ftp://example.com/')
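Opening and reading the response then works exactly as in the HTTP examples. A brief sketch (example.com is only a placeholder here and does not actually serve FTP, so this is illustrative only):

import urllib2

req = urllib2.Request('ftp://example.com/')
response = urllib2.urlopen(req)  # same call as for HTTP; raises URLError if unreachable
listing = response.read()        # for an FTP directory URL, typically a listing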

HTTP requests allow you to do two additional things.


1. Sending form data

If you have done any web development, this will be familiar: sometimes you want to send data to a URL (usually a URL that points to a CGI (Common Gateway Interface) script or some other web application).

In HTTP, this is usually done with the well-known POST request.

This is usually done by your browser when you submit an HTML form.

Not all POSTs come from forms: you can use POST to submit arbitrary data to your own program.
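For instance, here is a hedged sketch of POSTing a raw JSON body instead of form data. The URL is hypothetical; whatever string you pass as the data argument becomes the request body, and because it is not a form we set the Content-Type header ourselves:

import urllib2

url = 'http://www.someserver.com/api'  # hypothetical endpoint, for illustration only
body = '{"name": "WHY", "language": "Python"}'
req = urllib2.Request(url, body, {'Content-Type': 'application/json'})
response = urllib2.urlopen(req)        # sent as POST because data is present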

For ordinary HTML forms, the data must be encoded in the standard format and then passed to the Request object as the data parameter.

The encoding is done with a function from urllib, not urllib2.

Let's create a new urllib2_test03.py file to try it:

import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}

data = urllib.urlencode(values)   # encode the form data
req = urllib2.Request(url, data)  # build a POST request carrying the data
response = urllib2.urlopen(req)   # send the request and receive the response
the_page = response.read()        # read the response body


If no data parameter is passed, urllib2 uses a GET request.

The difference between GET and POST requests is that POST requests often have "side effects": they change the state of the system in some way (for example, placing an order for goods to be delivered to your door).

Data can also be encoded into the URL of a GET request.

import urllib
import urllib2

data = {}
data['name'] = 'WHY'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.urlencode(data)
print url_values   # prints e.g. language=Python&location=SDU&name=WHY (order may vary)

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values
response = urllib2.urlopen(full_url)  # note: urlopen, not open

This is how data is transmitted in a GET request.



2. Setting headers on HTTP requests

Some websites do not like being accessed by programs (as opposed to humans), or send different versions of content to different browsers.

By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor Python version numbers, for example Python-urllib/2.7).
This identity may confuse the site, or the site may simply refuse to serve it.

A browser establishes its identity through the User-Agent header. When you create a Request object, you can give it a dictionary containing the headers.

The following example sends the same content as above, but identifies itself as Internet Explorer.
(Thanks to the readers who pointed out that this demo URL is no longer reachable; the principle remains the same.)

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}

headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
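Once urlopen succeeds, the response object also lets you inspect what the server sent back. A short sketch using the standard urllib2 response methods:

import urllib2

response = urllib2.urlopen('http://www.baidu.com')
print response.geturl()   # the final URL, after any redirects
print response.getcode()  # the HTTP status code, e.g. 200
print response.info()     # the response headers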

That concludes [Python] Web crawler (2): using urllib2 to fetch the content of a webpage at a specified URL.

