http://blog.csdn.net/pleasecallmewhy/article/details/8923067
Version: Python 2.7.5. Python 3 changed things considerably; if you are on Python 3, look for another tutorial.
So-called web crawling means reading a network resource, specified by a URL, out of the network stream and saving it locally.
It is similar to using a program to simulate the function of a browser: the URL is sent to the server as the content of an HTTP request, and then the resource in the server's response is read back.
In Python, we use the urllib2 component to crawl a web page.
urllib2 is a Python component for fetching URLs (Uniform Resource Locators).
It provides a very simple interface in the form of the urlopen function.
The simplest urllib2 application needs only four lines of code.
We create a new file, urllib2_test01.py, to get a feel for what urllib2 does:
import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html
Press F5 to see the results of the run:
We can open the Baidu homepage, right-click, and choose "View Page Source" (Firefox and Chrome can both do this), and we will find exactly the same content.
In other words, those four lines of code print out everything the browser receives when we visit Baidu.
This is one of the simplest examples of urllib2.
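The response object returned by urlopen offers a bit more than read(): it also exposes the final URL, the HTTP status code, and the response headers. A small sketch, reusing the same Baidu homepage:

import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
print response.geturl()   # the final URL, after any redirects
print response.getcode()  # the HTTP status code, e.g. 200
print response.info()     # the response headers sent by the server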
In addition to "http:", the URL can also use other schemes such as "ftp:" or "file:", as the sketch below shows.
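For instance, a "file:" URL reads a local file through the same urlopen interface. A minimal sketch (the path /etc/hosts is just an assumed example; substitute any local file on your machine):

import urllib2

# 'file:' URLs go through the same interface as 'http:' ones;
# /etc/hosts is an assumed example path, use any local file
response = urllib2.urlopen('file:///etc/hosts')
print response.read()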
HTTP is based on a request-and-response mechanism:
the client makes a request, and the server provides a response.
urllib2 uses a Request object to represent the HTTP request you make.
In its simplest form, you create a Request object with the address you want to request.
Calling urlopen and passing in the Request object returns a response object for that request.
This response object behaves like a file object, so you can call response.read() on it.
We create a new file, urllib2_test02.py, to try it out:
import urllib2

req = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page
You can see that the output is the same as in test01.
urllib2 uses the same interface to handle all URL schemes. For example, you can create an FTP request as follows.
req = urllib2.Request('ftp://example.com/')

In the case of HTTP, Request objects allow you to do two extra things.
1. Sending form data
Anyone who has done web development will be familiar with this.
Sometimes you want to send some data to a URL (usually a URL pointing to a CGI [Common Gateway Interface] script or some other web application).
In HTTP, this is often done with the familiar POST request.
This is usually what your browser does when you submit an HTML form.
Not all POSTs come from forms: you can use POST to submit arbitrary data to your own program.
For ordinary HTML forms, the data needs to be encoded in the standard way and then passed to the Request object as the data parameter.
The encoding is done with urllib functions, not urllib2.
We create a new file, urllib2_test03.py, to try it out:
import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}
data = urllib.urlencode(values)   # do the encoding work
req = urllib2.Request(url, data)  # send the request, passing the data form
response = urllib2.urlopen(req)   # accept the response
the_page = response.read()        # read the content of the response
If the data argument is not passed, urllib2 uses a GET request.
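In fact, the Request object itself can tell you which method urllib2 will use: its get_method() method returns 'POST' when data is attached and 'GET' otherwise. A small sketch, reusing the hypothetical register.cgi URL from above:

import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'  # hypothetical URL from the example above
data = urllib.urlencode({'name': 'WHY'})

print urllib2.Request(url).get_method()        # 'GET'  - no data attached
print urllib2.Request(url, data).get_method()  # 'POST' - data attached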
The difference between GET and POST requests is that a POST request usually has "side effects":
it changes the state of the system in some way (for example, placing an order that gets a pile of junk delivered to your door).
Data can also be transmitted by encoding it into the URL of the GET request itself.
import urllib
import urllib2

data = {}
data['name'] = 'WHY'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.urlencode(data)
print url_values
# prints something like: name=WHY&language=Python&location=SDU

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values
response = urllib2.urlopen(full_url)
This accomplishes the transmission of data in a GET request.
2. Setting headers on the HTTP request
Some sites do not like being visited by programs (as opposed to humans), or send different content to different browsers.
By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor Python version numbers, e.g. Python-urllib/2.7).
This identity may confuse the site, or simply not work.
A browser identifies itself through the User-Agent header; when you create a Request object, you can give it a dictionary containing the header data.
The following example sends the same content as above, but pretends to be Internet Explorer.
(Thanks to everyone for the reminder: this demo URL is no longer available, but the principle still stands.)
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'WHY',
          'location': 'SDU',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
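Instead of passing a headers dictionary to the constructor, you can also set headers one at a time with the Request object's add_header() method. A small sketch of the equivalent call:

import urllib2

req = urllib2.Request('http://www.someserver.com/cgi-bin/register.cgi')
# equivalent to passing {'User-Agent': ...} in the Request constructor
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = urllib2.urlopen(req)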