Writing a Python Crawler from Scratch: Using the urllib2 Module to Fetch Web Content

Source: Internet
Author: User
Tags: urlencode
Version: Python 2.7.5. The urllib2 module changed significantly in Python 3 (it was split into urllib.request and urllib.error); if you are on Python 3, look for a different tutorial.

Web crawling means reading the network resource at a given URL from the network stream and saving it locally.
It is similar to using a program to simulate what a browser does: send the URL to the server as an HTTP request, then read the resource the server sends back in response.

In Python, we use the urllib2 module to fetch Web pages.
urllib2 is Python's standard-library component for fetching URLs (Uniform Resource Locators).

It provides a very simple interface in the form of the urlopen function.

The simplest urllib2 program needs only four lines of code.

Let's create a new file, urllib2_test01.py, to see urllib2 in action:

import urllib2

# Open the URL and read the raw HTML of the page
response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html


Run the script (F5 in IDLE) to see the result:


If we open the Baidu homepage in a browser, right-click, and choose "View Page Source" (this works in Firefox or Chrome), we will find exactly the same content.

In other words, these four lines of code print out everything the browser receives when we visit Baidu.

This is one of the simplest examples of urllib2.
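
Since the point of crawling, as described above, is to save the network resource locally, here is a minimal sketch that writes the fetched page to a file; the file name baidu.html is just an illustrative choice:

import urllib2

# Fetch the page and save it to a local file (the file name is arbitrary)
response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()

with open('baidu.html', 'w') as f:
    f.write(html)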

In addition to "http:", the URL scheme can also be "ftp:", "file:", and so on.
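
For example, urlopen can read a local file through a "file:" URL. A small sketch, assuming /tmp/test.txt exists on your machine (adjust the path as needed):

import urllib2

# file:// URLs go through the same interface (the path is an assumption)
response = urllib2.urlopen('file:///tmp/test.txt')
print response.read()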

HTTP is based on a request/response mechanism:

the client issues a request, and the server provides a response.

urllib2 uses a Request object to represent the HTTP request you are making.

In its simplest form, you create a Request object with the address you want to fetch;

calling urlopen with that Request object returns a response object for the requested URL.

This response object behaves like a file object, so you can call .read() on it.

Let's create a new file, urllib2_test02.py, to try it:

import urllib2

# Build a Request object, then open it with urlopen
req = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page

You can see that the output is the same as in test01.
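
Besides .read(), the response object exposes some metadata about the fetch; these accessors are part of urllib2's response object in Python 2:

import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
print response.geturl()     # the final URL, after any redirects
print response.getcode()    # the HTTP status code, e.g. 200
print response.info()       # the response headers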

urllib2 uses the same interface to handle all URL schemes. For example, you can create an FTP request like this:

req = urllib2.Request('ftp://example.com/')

There are two extra things that HTTP requests allow you to do.

1. Sending form data

Anyone who has done Web development will be familiar with this.

Sometimes you want to send data to a URL (usually a URL pointing to a CGI [Common Gateway Interface] script or some other Web application).

In HTTP, this is usually done with the well-known POST request.

This is what your browser does when you submit an HTML form.

Not all POSTs come from forms; you can use POST to submit arbitrary data to your own program.

For an ordinary HTML form, the data needs to be encoded in the standard way and then passed to the Request object as the data argument.

The encoding is done with a function from urllib, not urllib2.

Let's create a new file, urllib2_test03.py, to try it:

import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'
values = {'name': 'why',
          'location': 'SDU',
          'language': 'Python'}

data = urllib.urlencode(values)      # encode the form data
req = urllib2.Request(url, data)     # passing data makes this a POST request
response = urllib2.urlopen(req)      # send the request
the_page = response.read()           # read the response body

If the data argument is not passed, urllib2 sends a GET request instead.

One difference between GET and POST requests is that POST requests often have "side effects":

they change the state of the system in some way (for example, placing an order for something to be delivered to your door).

Data can also be transmitted in a GET request by encoding it into the URL itself.

import urllib
import urllib2

data = {}
data['name'] = 'why'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.urlencode(data)
print url_values
# prints e.g.: name=why&language=Python&location=SDU  (key order may vary)

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values

This completes the GET-style transfer of data; a full request is sketched below.
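
To actually send the request, pass full_url to urlopen just as with any other URL. A minimal sketch (www.example.com is a placeholder host and will not really serve this CGI script):

import urllib
import urllib2

data = {'name': 'why', 'location': 'SDU', 'language': 'Python'}
full_url = 'http://www.example.com/example.cgi' + '?' + urllib.urlencode(data)

# A GET request: no data argument is passed to urlopen
response = urllib2.urlopen(full_url)
the_page = response.read()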

2. Setting headers on the HTTP request

Some sites dislike being accessed by programs (non-human visits), or send different content to different browsers.

By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor Python version numbers, e.g. Python-urllib/2.7).

This identity may confuse the site, or simply get the request refused.

A browser identifies itself through the User-Agent header; when you create a Request object, you can pass it a dictionary of headers.

The following example sends the same data as above, but identifies itself as Internet Explorer.

(Thanks to a reader for pointing out that this demo URL no longer works, but the principle is unchanged.)

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'why',
          'location': 'SDU',
          'language': 'Python'}

headers = {'User-Agent': user_agent}   # identify ourselves as IE 5.5
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
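
In practice a request can fail: the host may be unreachable, or the server may return an error status. A minimal defensive sketch using urllib2's URLError and HTTPError (the URL is the same placeholder as above):

import urllib2

try:
    response = urllib2.urlopen('http://www.someserver.com/cgi-bin/register.cgi')
    print response.read()
except urllib2.HTTPError, e:
    # the server answered, but with an error status code
    print 'HTTP error:', e.code
except urllib2.URLError, e:
    # we failed to reach the server at all
    print 'Failed to reach the server:', e.reason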

That is how Python uses urllib2 to fetch the content of a Web page at a given URL. It is very simple; I hope you find it helpful.
