Writing a Python Crawler from Scratch: Using the urllib2 Module to Fetch Web Content

Source: Internet
Author: User
Tags: urlencode
Version: Python 2.7.5. The urllib2 module changed significantly in Python 3 (it was split into urllib.request and urllib.error); if you are on Python 3, look for a different tutorial.

Web crawling means reading the network resource at a given URL from the network stream and saving it locally.
It is similar to using a program to simulate what a browser does: send the URL to the server as an HTTP request, then read the resource the server sends back in response.

In Python, we use the urllib2 module to fetch Web pages.
urllib2 is Python's standard-library component for fetching URLs (Uniform Resource Locators).

It provides a very simple interface in the form of the urlopen function.

The simplest urllib2 program needs only four lines of code.

Let's create a new file, urllib2_test01.py, to see urllib2 in action:

import urllib2

# Open the URL and read the raw HTML of the page
response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html


Run the script (F5 in IDLE) to see the result:


If we open the Baidu homepage in a browser, right-click, and choose "View Page Source" (this works in Firefox or Chrome), we will find exactly the same content.

In other words, these four lines of code print out everything the browser receives when we visit Baidu.

This is one of the simplest examples of urllib2.
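
Since the point of crawling, as described above, is to save the network resource locally, here is a minimal sketch that writes the fetched page to a file; the file name baidu.html is just an illustrative choice:

import urllib2

# Fetch the page and save it to a local file (the file name is arbitrary)
response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()

with open('baidu.html', 'w') as f:
    f.write(html)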

In addition to "http:", the URL scheme can also be "ftp:", "file:", and so on.
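
For example, urlopen can read a local file through a "file:" URL. A small sketch, assuming /tmp/test.txt exists on your machine (adjust the path as needed):

import urllib2

# file:// URLs go through the same interface (the path is an assumption)
response = urllib2.urlopen('file:///tmp/test.txt')
print response.read()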

HTTP is based on a request/response mechanism:

the client issues a request, and the server provides a response.

urllib2 uses a Request object to represent the HTTP request you are making.

In its simplest form, you create a Request object with the address you want to fetch;

calling urlopen with that Request object returns a response object for the requested URL.

This response object behaves like a file object, so you can call .read() on it.

Let's create a new file, urllib2_test02.py, to try it:

import urllib2

# Build a Request object, then open it with urlopen
req = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page

You can see that the output is the same as in test01.
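
Besides .read(), the response object exposes some metadata about the fetch; these accessors are part of urllib2's response object in Python 2:

import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
print response.geturl()     # the final URL, after any redirects
print response.getcode()    # the HTTP status code, e.g. 200
print response.info()       # the response headers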

urllib2 uses the same interface to handle all URL schemes. For example, you can create an FTP request like this:

req = urllib2.Request('ftp://example.com/')

There are two extra things that HTTP requests allow you to do.

1. Sending form data

Anyone who has done Web development will be familiar with this.

Sometimes you want to send data to a URL (usually a URL pointing to a CGI [Common Gateway Interface] script or some other Web application).

In HTTP, this is usually done with the well-known POST request.

This is what your browser does when you submit an HTML form.

Not all POSTs come from forms; you can use POST to submit arbitrary data to your own program.

For an ordinary HTML form, the data needs to be encoded in the standard way and then passed to the Request object as the data argument.

The encoding is done with a function from urllib, not urllib2.

Let's create a new file, urllib2_test03.py, to try it:

import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'
values = {'name': 'why',
          'location': 'SDU',
          'language': 'Python'}

data = urllib.urlencode(values)      # encode the form data
req = urllib2.Request(url, data)     # passing data makes this a POST request
response = urllib2.urlopen(req)      # send the request
the_page = response.read()           # read the response body

If the data argument is not passed, urllib2 sends a GET request instead.

One difference between GET and POST requests is that POST requests often have "side effects":

they change the state of the system in some way (for example, placing an order for something to be delivered to your door).

Data can also be transmitted in a GET request by encoding it into the URL itself.

import urllib
import urllib2

data = {}
data['name'] = 'why'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.urlencode(data)
print url_values
# prints e.g.: name=why&language=Python&location=SDU  (key order may vary)

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values

This completes the GET-style transfer of data; a full request is sketched below.
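
To actually send the request, pass full_url to urlopen just as with any other URL. A minimal sketch (www.example.com is a placeholder host and will not really serve this CGI script):

import urllib
import urllib2

data = {'name': 'why', 'location': 'SDU', 'language': 'Python'}
full_url = 'http://www.example.com/example.cgi' + '?' + urllib.urlencode(data)

# A GET request: no data argument is passed to urlopen
response = urllib2.urlopen(full_url)
the_page = response.read()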

2. Setting headers on the HTTP request

Some sites dislike being accessed by programs (non-human visits), or send different content to different browsers.

By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor Python version numbers, e.g. Python-urllib/2.7).

This identity may confuse the site, or simply get the request refused.

A browser identifies itself through the User-Agent header; when you create a Request object, you can pass it a dictionary of headers.

The following example sends the same data as above, but identifies itself as Internet Explorer.

(Thanks to a reader for pointing out that this demo URL no longer works, but the principle is unchanged.)

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'why',
          'location': 'SDU',
          'language': 'Python'}

headers = {'User-Agent': user_agent}   # identify ourselves as IE 5.5
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
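
In practice a request can fail: the host may be unreachable, or the server may return an error status. A minimal defensive sketch using urllib2's URLError and HTTPError (the URL is the same placeholder as above):

import urllib2

try:
    response = urllib2.urlopen('http://www.someserver.com/cgi-bin/register.cgi')
    print response.read()
except urllib2.HTTPError, e:
    # the server answered, but with an error status code
    print 'HTTP error:', e.code
except urllib2.URLError, e:
    # we failed to reach the server at all
    print 'Failed to reach the server:', e.reason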

That is how Python uses urllib2 to fetch the content of a Web page at a given URL. It is very simple; I hope you find it helpful.
