Python: In-depth understanding of urllib, urllib2 and requests (requests not recommended?)

Source: Internet
Author: User
Tags: http cookie, http redirect, urlencode

Deep understanding of urllib, urllib2 and requests



Python is an object-oriented, interpreted programming language, created by Guido van Rossum at the end of 1989; the first public release appeared in 1991. The Python source code is released under a GPL-compatible open-source license [1]. Python's syntax is concise and clear, and it comes with a rich and powerful set of libraries.

Differences between urllib and urllib2

Both the urllib and urllib2 modules request URLs, but they provide different functionality.
urllib2.urlopen accepts either an instance of the Request class or a URL, whereas urllib.urlopen only accepts a URL. When given a Request object, urllib2.urlopen can therefore set headers for the request.
urllib has urlencode and urllib2 does not, which is why urllib and urllib2 are so often used together.

import urllib
import urllib2

r = urllib2.Request(url='http://www.mysite.com')
r.add_header('User-Agent', 'awesome fetcher')
r.add_data(urllib.urlencode({'foo': 'bar'}))  # attaching data turns this into a POST
response = urllib2.urlopen(r)  # POST method
urllib Module

I. urlencode cannot process Unicode objects directly; a Unicode string must be encoded first (Unicode to UTF-8), for example:

urllib.urlencode({'name': u'bl'.encode('utf-8')})  # urlencode expects a mapping (the 'name' key here is illustrative); encode Unicode values to UTF-8 first

II. Example

import urllib

# Sohu mobile home page
url = 'http://m.sohu.com/?v=3&_once_=000025_v2tov3&_smuid=ICvXXapq5EfTpQTVq6Tpz'
resp = urllib.urlopen(url)
page = resp.read()
f = open('./urllib_index.html', 'w')
f.write(page)
print dir(resp)

Results:

['__doc__', '__init__', '__iter__', '__module__', '__repr__', 'close', 'code', 'fileno', 'fp', 'getcode', 'geturl', 'headers', 'info', 'next', 'read', 'readline', 'readlines', 'url']

print resp.getcode(), resp.geturl(), resp.info(), resp.headers, resp.url
# resp.url and resp.geturl() give the same result

III. Encoding and decoding examples: urllib.quote and urllib.urlencode both encode, but they are not the same.

s = urllib.quote('This is python')              # encode
print 'quote:\t' + s                            # spaces become %20
s_un = urllib.unquote(s)                        # decode
print 'unquote:\t' + s_un
s_plus = urllib.quote_plus('This is python')    # encode
print 'quote_plus:\t' + s_plus                  # spaces become +
s_unplus = urllib.unquote_plus(s_plus)          # decode
print 's_unplus:\t' + s_unplus
s_dict = {'name': 'dkf', 'pass': '1234'}
s_encode = urllib.urlencode(s_dict)             # encode a dict into URL parameters
print 's_encode:\t' + s_encode

Results:

quote:      This%20is%20python
unquote:    This is python
quote_plus: This+is+python
s_unplus:   This is python
s_encode:   name=dkf&pass=1234

IV. urlretrieve(): urlretrieve is best suited when you only need to download something, optionally displaying download progress.

url = 'http://m.sohu.com/?v=3&_once_=000025_v2tov3&_smuid=ICvXXapq5EfTpQTVq6Tpz'
urllib.urlretrieve(url, './retrieve_index.html')
# Downloads the page at url straight into retrieve_index.html; good for simple downloads.
# Signature: urllib.urlretrieve(url, local_name, reporthook)
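
As a minimal sketch of the progress use case (the callback name report_progress is ours), urlretrieve calls the reporthook after each block is fetched:

import urllib

def report_progress(block_count, block_size, total_size):
    # called by urlretrieve after each block; total_size may be -1 if unknown
    if total_size > 0:
        percent = min(100.0, block_count * block_size * 100.0 / total_size)
        print 'downloaded %.1f%%' % percent

url = 'http://m.sohu.com/'
urllib.urlretrieve(url, './retrieve_index.html', report_progress)
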
urllib2

I. The functions and classes defined by the urllib2 module are used to fetch URLs (primarily HTTP), and it provides interfaces for more complex processing: basic authentication, redirection, cookies, and so on.

II. Common methods and classes

II.1 urllib2.urlopen(url[, data][, timeout])  # when passed a plain URL, it is used like urllib.urlopen

II.1.1 It opens the given URL, which can be either a string or a Request object. The optional timeout parameter gives a timeout in seconds for blocking operations such as the connection attempt (if not specified, the global default timeout setting is used). It actually applies only to HTTP, HTTPS and FTP connections.

import urllib2

url = 'http://m.sohu.com/?v=3&_once_=000025_v2tov3&_smuid=ICvXXapq5EfTpQTVq6Tpz'
resp = urllib2.urlopen(url)
page = resp.read()
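
A small sketch of the optional timeout parameter described above (the 5-second value is an arbitrary choice):

import urllib2

try:
    # timeout applies to blocking operations such as the connection attempt
    resp = urllib2.urlopen('http://m.sohu.com/', timeout=5)
    page = resp.read()
except urllib2.URLError as e:
    print 'request failed:', e.reason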

II.1.2 The urlopen method can also explicitly specify the URL to fetch by constructing a Request object. Calling urlopen returns a response object for the requested URL. The response is similar to a file object, so you can operate on it with the .read() method.

import urllib2

url = 'http://m.sohu.com/?v=3&_once_=000025_v2tov3&_smuid=ICvXXapq5EfTpQTVq6Tpz'
req = urllib2.Request(url)
resp = urllib2.urlopen(req)
page = resp.read()

II.2 class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])

II.2.1 The Request class is an abstraction of a URL request. Its 5 parameters are described as follows:
II.2.1.1 url -- a string containing a valid URL.
II.2.1.2 data -- a string specifying additional data to send to the server, or None if no data needs to be sent. At present only HTTP requests use data: when a request carries the data parameter, the HTTP request is a POST rather than a GET. The data should be encoded in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or a sequence of 2-tuples and returns a string in this format. Put plainly, you use it when you want to send data to a URL (usually to a CGI script or some other web application). For example, when a form is filled in online, the browser POSTs the form's contents; that data must be encoded into the standard format and then passed to the Request object as the data parameter. The encoding is done in the urllib module, not in urllib2. Here's an example:

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)  # sends a POST
response = urllib2.urlopen(req)
page = response.read()

II.2.1.3 headers -- a dictionary. The header dictionary can be passed directly to Request as a parameter, or each key and value can be added by calling the add_header() method. The User-Agent header, which identifies the browser, is often used for spoofing, because some HTTP servers only allow requests coming from common browsers rather than scripts, or return different versions to different browsers. For example, the Mozilla Firefox browser identifies itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11". By default, urllib2 identifies itself as Python-urllib/x.y (where x.y are the major and minor version numbers of the Python release; e.g., in Python 2.6 the default User-Agent string of urllib2 is "Python-urllib/2.6"). The following example differs from the one above by adding headers to the request, imitating an IE browser submitting the request.

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

The standard headers (Content-Length, Content-Type and Host) are only added when the Request object is passed to urlopen() (as in the example above) or to OpenerDirector.open(). There are two ways to add headers: construct the Request object with the headers parameter, as in the previous example, which initializes the headers when the Request object is created; or call the Request object's add_header(key, val) method to append headers, as the following example does (the Request object's methods are described below):

import urllib2

req = urllib2.Request('http://www.example.com/')
# HTTP is a stateless protocol: a client's previous request is unrelated to its
# next request to the server, so most code omits setting a Referer.
req.add_header('Referer', 'http://www.python.org/')
r = urllib2.urlopen(req)

OpenerDirector automatically adds a User-Agent header to every request, so the second method is as follows (urllib2.build_opener returns an OpenerDirector object; more on urllib2.build_opener below):

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')

II.3 urllib2.install_opener(opener) and urllib2.build_opener([handler, ...])
The install_opener and build_opener methods are usually used together, though build_opener is sometimes used alone to obtain an OpenerDirector object.
install_opener takes an OpenerDirector instance and installs it as the global opener, so that later calls to urlopen use it. If you don't want to install it globally, you can skip install_opener and simply call OpenerDirector.open() instead of urlopen().
build_opener also returns an OpenerDirector object, and its handler parameters can be instances of BaseHandler or its subclasses. The following subclasses are added by default: ProxyHandler (used if proxy settings are detected; this one is important), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.

import urllib2

req = urllib2.Request('http://www.python.org/')
opener = urllib2.build_opener()
urllib2.install_opener(opener)
f = opener.open(req)

As above, urllib2.install_opener() sets urllib2's global opener. That can be handy later, but it gives no finer-grained control, such as using two different proxy settings within one program. Good practice is not to change the global settings with install_opener, but simply to call the opener's open method instead of the global urlopen method.
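
For instance, here is a minimal sketch (the proxy addresses are invented) of two openers with different proxy settings used side by side, without touching the global opener:

import urllib2

# two hypothetical proxies; each opener routes through its own
proxy_a = urllib2.ProxyHandler({'http': 'http://proxy-a.example.com:8080'})
proxy_b = urllib2.ProxyHandler({'http': 'http://proxy-b.example.com:8080'})
opener_a = urllib2.build_opener(proxy_a)
opener_b = urllib2.build_opener(proxy_b)
resp_a = opener_a.open('http://www.example.com/')
resp_b = opener_b.open('http://www.example.com/')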

Talk of openers and handlers can sound dizzying, but the idea is clear. When you fetch a URL, you use an opener (an instance of urllib2.OpenerDirector, which can be created with build_opener). Normally a program simply uses the default opener through urlopen (that is, when you call urlopen, the default opener object is used implicitly), but you can also create custom openers built from collections of handlers. All the hard work is delegated to these handlers. Each handler knows how to open URLs for a particular protocol (HTTP, FTP, etc.), or how to handle some aspect of opening a URL, such as an HTTP redirect or an HTTP cookie. To create an opener with special handlers installed (for example, an opener that handles cookies, as sketched below, or one that does not handle redirects), instantiate an OpenerDirector object and then call its add_handler(some_handler_instance) method repeatedly. Alternatively, use build_opener, a convenience function that creates an opener object in a single call. build_opener adds many handlers by default and provides a quick way to add more handlers and/or disable the default ones.
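
A minimal sketch of a cookie-handling opener, assuming the standard library's cookielib module:

import cookielib
import urllib2

# HTTPCookieProcessor stores cookies in the jar and sends them back
# on subsequent requests made through this opener
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
resp = opener.open('http://www.example.com/')
for cookie in cj:
    print cookie.name, cookie.value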

install_opener can also be used with an opener object created as above, but then that opener becomes the (global) default opener. This means that calls to urlopen will use the opener you just created, so the code above is equivalent to the snippet below, which ultimately goes through the default opener. In general we use build_opener to create custom openers; calling install_opener is unnecessary unless it is for convenience.

import urllib2

req = urllib2.Request('http://www.python.org/')
opener = urllib2.build_opener()   # create an opener object
urllib2.install_opener(opener)    # make it the global default opener
f = urllib2.urlopen(req)          # urlopen uses the default opener, and install_opener
                                  # has made our opener the global default, so this
                                  # goes through the opener built above

III. Exception handling (see also http://www.jb51.net/article/63711.htm). Calling urllib2.urlopen does not always go smoothly; just as the browser sometimes reports an error when opening a URL, we have to handle exceptions (a try/except sketch follows the list of exception types below). Before turning to exceptions, let's look at a few common methods of the returned response object:
geturl() -- returns the URL of the retrieved resource, i.e. the real URL of the response; commonly used to check for redirects.
info() -- returns the page's meta-information, such as the headers, in the form of a mimetools.Message instance (see the description of HTTP headers for the format).
getcode() -- returns the HTTP status code of the response; e.g., the sketch below should print code 200 on success.
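
A minimal sketch of these three methods (any reachable URL will do):

import urllib2

resp = urllib2.urlopen('http://www.python.org/')
print resp.geturl()    # the real URL, after any redirect
print resp.info()      # mimetools.Message holding the headers
print resp.getcode()   # e.g. 200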

When a response cannot be handled, urlopen raises a URLError (the usual built-in Python exceptions such as ValueError, TypeError, etc. may of course also be raised).

    • HTTPError is the subclass of URLError raised for HTTP URLs in particular circumstances. Let's discuss URLError and HTTPError in detail. URLError -- raised when the handlers run into a problem (usually because there is no network connection, no route to the specified server, or the specified server does not exist).

    • HTTPError -- HTTPError is a subclass of URLError. Every HTTP response from the server carries a "status code". Sometimes the status code indicates that the server cannot fulfil the request. The default handlers process some of these error responses for you: for example, when urllib2 finds that the response's URL differs from the URL you requested, a redirect has occurred and it is handled automatically. For requests it cannot handle, urlopen raises an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden) and '401' (authentication required). It has 2 important attributes: reason and code.

    • Redirects are handled by default by the program.
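
Here is the minimal try/except sketch promised above (the URL is a placeholder; HTTPError must be caught before URLError since it is a subclass):

import urllib2

try:
    resp = urllib2.urlopen('http://www.example.com/nonexistent')
except urllib2.HTTPError as e:
    print 'HTTPError:', e.code      # e.g. 404
except urllib2.URLError as e:
    print 'URLError:', e.reason     # e.g. no route to host
else:
    print resp.getcode(), resp.geturl()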

Summary

If you are only downloading, or only showing download progress, and do not need to process the downloaded content (e.g., downloading images, CSS or JS files), use urllib.urlretrieve().
If the download requires filling in a form, entering an account and password, etc., it is recommended to use urllib2.urlopen(urllib2.Request()).
When encoding dictionary data, use urllib.urlencode().

Requests

I. requests uses urllib3 and inherits all the features of urllib2. requests supports HTTP keep-alive and connection pooling, using cookies to maintain sessions, file upload, automatic detection of response content encoding, and internationalized URLs with automatic encoding of POST data.

II. For example:

import requests
...
resp = requests.get('http://www.mywebsite.com/user')
userdata = {"firstname": "John", "lastname": "Doe", "password": "jdoe123"}
resp = requests.post('http://www.mywebsite.com/user', params=userdata)
resp = requests.put('http://www.mywebsite.com/user/put')
resp = requests.delete('http://www.mywebsite.com/user/delete')
resp.json()                    # if the response is JSON data
resp.text                      # the response body as Unicode text
resp.headers['content-type']   # returns e.g. text/html; charset=utf-8
f = open('request_index.html', 'w')
f.write(resp.text.encode('utf8'))
# Testing shows a page fetched by requests must be encoded before being
# written to a file (it comes back as Unicode), while pages from urllib
# and urllib2 can be written directly, because they are fetched as str.

III. Other features

    • International domains and URLs
    • Keep-alive & connection pooling
    • Sessions with persistent cookies
    • Browser-style SSL verification
    • Basic/Digest authentication
    • Elegant key/value cookies
    • Automatic decoding of Unicode response bodies
    • Multipart file uploads
    • Connection timeouts
    • .netrc support
    • Python 2.6-3.4
    • Thread safety
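
As a small sketch of the persistent-cookie sessions mentioned above (the URLs are placeholders), requests.Session keeps cookies and pooled connections across requests:

import requests

s = requests.Session()
s.get('http://www.mywebsite.com/login')        # cookies set here are remembered
resp = s.get('http://www.mywebsite.com/user')  # and sent again here
print resp.status_code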

IV. requests is not part of the Python standard library; it must be installed separately with easy_install or pip install.

V. Drawbacks of requests: used directly it cannot be called asynchronously, and it is said to be slow (reported by others). The standard library's urllib can take its place.

VI. Personally, I do not recommend using the requests module.

More detailed descriptions of the relevant modules:

urllib official website
urllib2 official website


This article is from the "Mr_computer" blog; please keep this source when reposting: http://caochun.blog.51cto.com/4497308/1746987

