Changes and Introduction of the urllib Modules in Python 3


urllib and urllib2 in Python 3.x

Python 3.5.2 has now been released.

In Python 3, urllib2 no longer exists as a separate module (importing urllib2 raises an error saying there is no such module); it has been merged into the urllib package as urllib.request and urllib.error.

The urllib package as a whole is divided into urllib.request, urllib.parse, and urllib.error.

Examples:
urllib2.urlopen() becomes urllib.request.urlopen()
urllib2.Request() becomes urllib.request.Request()
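
A minimal sketch of the renamed calls; the URL below is only a placeholder for illustration:

    # Python 2:
    #   import urllib2
    #   response = urllib2.urlopen('http://example.com/')

    # Python 3: the same functionality now lives in urllib.request
    import urllib.request

    req = urllib.request.Request('http://example.com/')   # was urllib2.Request()
    with urllib.request.urlopen(req) as response:          # was urllib2.urlopen()
        html = response.read()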

The difference between the urllib and urllib2 modules (in Python 2)
    1. In Python 2, urllib and urllib2 are not interchangeable.

    2. Overall, urllib2 is an enhanced version of urllib, but urllib still has functions that urllib2 lacks.

    3. urllib2 can modify the request headers by passing a Request object to urllib2.urlopen. If you visit a website and want to change the User-Agent (for example, to disguise your script as a browser), you need urllib2.

    4. urllib provides urllib.urlencode for encoding parameters. When simulating a login you usually need to POST encoded parameters, so to implement such a login without a third-party library you also need urllib.

    5. urllib is therefore generally used in conjunction with urllib2.

Official documentation

For Python 3.5.2, the corresponding urllib documentation is at:

https://docs.python.org/3.5/library/urllib.html

Overview of urllib

Original address: https://docs.python.org/3.5/library/urllib.html

The section translated below corresponds to 21.6. urllib.request - Extensible library for opening URLs.

urllib - URL handling modules

Source code: Lib/urllib/

urllib is a package that collects several modules for working with URLs:

urllib.request for opening and reading URLs
urllib.error containing the exceptions raised by urllib.request
urllib.parse for parsing URLs
urllib.robotparser for parsing robots.txt files

urllib.request

Original address:
https://docs.python.org/3.5/library/urllib.request.html#module-urllib.request

urllib.request - Extensible library for opening URLs

Source code: Lib/urllib/request.py

The urllib.request module defines functions and classes that help in opening URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies, and more.

The urllib.request module defines the following functions:

urllib.request.urlopen()

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Open the URL url, which can be either a string or a Request object.

data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed. data may also be an iterable object, in which case Content-Length must be specified in the request headers. Currently HTTP requests are the only ones that use data; when the data parameter is provided, the HTTP request will be a POST instead of a GET.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns an ASCII string in this format. It should be encoded to bytes before being used as the data parameter.
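
Below is a small sketch of POSTing urlencoded form data; the endpoint http://httpbin.org/post and the form fields are placeholders used only for illustration:

    import urllib.parse
    import urllib.request

    # Encode a mapping as application/x-www-form-urlencoded, then to bytes
    params = {'name': 'value', 'lang': 'python'}
    data = urllib.parse.urlencode(params).encode('ascii')

    # Supplying data turns the request into a POST instead of a GET
    with urllib.request.urlopen('http://httpbin.org/post', data=data, timeout=10) as resp:
        print(resp.read().decode('utf-8'))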

The urllib.request module uses the HTTP/1.1 protocol and includes Connection: close in its HTTP request headers.

The optional timeout parameter specifies a timeout in seconds for blocking operations such as the connection attempt (if not specified, the global default timeout setting is used). This actually only works for HTTP, HTTPS, and FTP connections.

If context is specified, it must be an ssl.SSLContext instance describing the various SSL options. See HTTPSConnection for more details.

The optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, whereas capath should point to a directory of hashed certificate files. See ssl.SSLContext.load_verify_locations() for more information.

The cadefault parameter is ignored.

This function always returns an object which can work as a context manager and provides these methods:

    • geturl() -- return the URL of the resource retrieved, commonly used to determine whether a redirect was followed

    • info() -- return the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers)

    • getcode() -- return the HTTP status code of the response.

For HTTP and HTTPS URLs, this function returns a slightly modified http.client.HTTPResponse object. In addition to the three new methods above, the msg attribute contains the same information as the reason attribute (the reason phrase returned by the server) instead of the response headers as specified in the documentation for HTTPResponse.
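
A short sketch of working with the returned response object; the URL is only a placeholder:

    import urllib.request

    with urllib.request.urlopen('http://www.example.com/') as response:
        print(response.geturl())    # final URL, after any redirects
        print(response.getcode())   # HTTP status code, e.g. 200
        print(response.info())      # header/meta-information of the page
        html = response.read()      # response body as bytes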

For FTP, file, and data URLs, and for requests explicitly handled by the legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object.

urllib.request.urlopen() raises URLError on protocol errors.

Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).

In addition, if proxy settings are detected (for example, when a *_proxy environment variable such as http_proxy is set), ProxyHandler is installed by default to make sure requests are handled through the proxy.

The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen. Proxy handling, which in urllib.urlopen was done by passing a dictionary parameter, can be obtained by using ProxyHandler objects.

Changed in version 3.2: cafile and capath were added.

Changed in version 3.2: HTTPS virtual hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be an iterable object.

Changed in version 3.3: cadefault was added.

Changed in version 3.4.3: context was added.

urllib.request.install_opener(opener)

Install an OpenerDirector instance as the default global opener. Installing an opener is only necessary if you want urlopen to use that opener; otherwise, simply call OpenerDirector.open() instead of urlopen(). The code does not check for a real OpenerDirector, and any class with the appropriate interface will work.

urllib.request.build_opener([handler, ...])

Return an OpenerDirector instance, which chains the handlers in the order given. handlers can be either instances of BaseHandler, or subclasses of BaseHandler (in which case the constructor must be callable without any parameters). Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them, or subclasses of them: ProxyHandler (if proxy settings are detected), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.

If the Python installation has SSL support (that is, if the ssl module can be imported), HTTPSHandler will also be added.

A BaseHandler subclass may also change its handler_order attribute to modify its position in the handlers list.
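
A hedged sketch of building and installing a custom opener; the extra header, the debug level, and the URL are only illustrative:

    import urllib.request

    # Build an opener with an extra handler, then install it globally so that
    # plain urlopen() calls go through it.
    opener = urllib.request.build_opener(urllib.request.HTTPHandler(debuglevel=1))
    opener.addheaders = [('User-Agent', 'my-client/0.1')]
    urllib.request.install_opener(opener)

    # From now on urlopen() uses the installed opener; alternatively, call
    # opener.open(...) directly and skip install_opener() entirely.
    with urllib.request.urlopen('http://www.example.com/') as resp:
        print(resp.getcode())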

urllib.request.pathname2url(path)

Convert the pathname path from the local syntax for a path to the form used in the path component of a URL. This does not produce a complete URL. The return value is already quoted using the quote() function.

urllib.request.url2pathname(path)

Convert the path component path from a percent-encoded URL to the local syntax for a path. This does not accept a complete URL. This function uses unquote() to decode path.
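
A small sketch of converting between a local path and a URL path component; the example path is a placeholder and the exact output depends on the operating system:

    from urllib.request import pathname2url, url2pathname

    # Local path -> quoted URL path component, e.g. '/tmp/some%20file.txt' on POSIX
    url_path = pathname2url('/tmp/some file.txt')
    print(url_path)

    # Quoted URL path component -> local path again
    print(url2pathname(url_path))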

urllib.request.getproxies()

This helper function returns a dictionary of scheme to proxy server URL mappings. It scans the environment for variables named <scheme>_proxy, in a case-insensitive way, for all operating systems first, and when it cannot find them, it looks for proxy information in Mac OS X's System Configuration and the Windows System Registry. If both lowercase and uppercase environment variables exist (and disagree), lowercase is preferred.

Note that if the environment variable REQUEST_METHOD is set, which usually indicates you are running in a CGI script environment, the environment variable HTTP_PROXY (uppercase _PROXY) will be ignored. This is because that variable can be injected by a client using the "Proxy:" HTTP header. If you need to use an HTTP proxy in a CGI environment, either use ProxyHandler explicitly, or make sure the variable name is lowercase (or at least the _proxy suffix).

The urllib.request module provides the following classes:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

This class is an abstraction of a URL request.

url should be a string containing a valid URL.

data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed. data may also be an iterable object, in which case Content-Length must be specified in the request headers. Currently HTTP requests are the only ones that use data; when the data parameter is provided, the HTTP request will be a POST instead of a GET.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns an ASCII string in this format. It should be encoded to bytes before being used as the data parameter.

headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to "spoof" the User-Agent header value, which a browser uses to identify itself; some common HTTP servers only allow requests coming from browsers rather than scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).

An example of using a Content-Type header together with the data argument would be sending the headers dictionary {"Content-Type": "application/x-www-form-urlencoded"}.

The final two arguments are only of interest for correct handling of third-party HTTP cookies:

origin_req_host should be the request-host of the origin transaction, as defined by RFC 2965. It defaults to http.cookiejar.request_host(self). This is the host name or IP address of the original request initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.

unverifiable should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document and the user had no option to approve the automatic fetching of the image, this should be True.

method should be a string that indicates the HTTP request method that will be used (for example, 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). Subclasses may indicate a different default method by setting the method attribute in the class itself.

Changed in version 3.3: The Request.method argument was added to the Request class.

Changed in version 3.4: A default Request.method may be indicated at the class level.
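
Below is a hedged sketch of constructing a Request with custom headers, data, and an explicit method; the endpoint http://httpbin.org/post, the form field, and the User-Agent string are placeholders:

    import urllib.parse
    import urllib.request

    url = 'http://httpbin.org/post'
    data = urllib.parse.urlencode({'user': 'alice'}).encode('ascii')

    headers = {
        # "Spoof" the client identity; the default would be Python-urllib/3.x
        'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11',
        'Content-Type': 'application/x-www-form-urlencoded',
    }

    req = urllib.request.Request(url, data=data, headers=headers, method='POST')
    print(req.get_method())   # 'POST'

    with urllib.request.urlopen(req) as resp:
        print(resp.getcode())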

class urllib.request.OpenerDirector

The OpenerDirector class opens URLs via BaseHandlers chained together. It manages the chaining of handlers and recovery from errors.

class urllib.request.BaseHandler

This is the base class for all registered handlers.

class urllib.request.HTTPRedirectHandler

A class to handle redirections.

class urllib.request.HTTPCookieProcessor(cookiejar=None)

A class to handle HTTP cookies.
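
A small sketch of cookie handling through an opener; the URL is only a placeholder:

    import http.cookiejar
    import urllib.request

    # Collect cookies set by the server in a CookieJar; the same opener sends
    # them back automatically on subsequent requests.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    with opener.open('http://www.example.com/') as resp:
        resp.read()

    for cookie in jar:
        print(cookie.name, cookie.value)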

class urllib.request.ProxyHandler(proxies=None)

Cause requests to go through a proxy. If proxies is given, it must be a dictionary mapping protocol names to URLs of proxies. The default is to read the list of proxies from the <protocol>_proxy environment variables. If no proxy environment variables are set, then in a Windows environment proxy settings are obtained from the registry's Internet Settings section, and in a Mac OS X environment proxy information is retrieved from the OS X System Configuration framework.

To disable autodetected proxies, pass an empty dictionary.

The no_proxy environment variable can be used to specify hosts which should not be reached via proxy; if set, it should be a comma-separated list of hostname suffixes, optionally with :port appended, for example cern.ch,ncsa.uiuc.edu,some.host:8080.

Note: HTTP_PROXY will be ignored if the variable REQUEST_METHOD is set; see the documentation for getproxies().
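
A hedged sketch of routing requests through an explicit proxy; the proxy address and the URL are placeholders:

    import urllib.request

    proxy = urllib.request.ProxyHandler({
        'http': 'http://127.0.0.1:8080',
        'https': 'http://127.0.0.1:8080',
    })
    opener = urllib.request.build_opener(proxy)

    # Passing an empty dictionary instead would disable any autodetected proxies:
    #   urllib.request.build_opener(urllib.request.ProxyHandler({}))

    with opener.open('http://www.example.com/') as resp:
        print(resp.getcode())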

class urllib.request.HTTPPasswordMgr

Keep a database of (realm, uri) -> (user, password) mappings.

class urllib.request.HTTPPasswordMgrWithDefaultRealm

Keep a database of (realm, uri) -> (user, password) mappings. A realm of None is considered a catch-all realm, which is searched if no other realm fits.

class urllib.request.HTTPPasswordMgrWithPriorAuth

A variant of HTTPPasswordMgrWithDefaultRealm that also keeps a database of uri -> is_authenticated mappings. It can be used by a BasicAuth handler to determine when to send authentication credentials immediately instead of waiting for a 401 response first.

class urllib.request.AbstractBasicAuthHandler(password_mgr=None)

This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section HTTPPasswordMgr Objects for information on the interface that must be supported. If passwd_mgr also provides is_authenticated and update_authenticated methods (see HTTPPasswordMgrWithPriorAuth Objects), then the handler will use the is_authenticated result for a given URI to determine whether or not to send authentication credentials with the request. If is_authenticated returns True for the URI, credentials are sent. If is_authenticated is False, credentials are not sent, and then, if a 401 response is received, the request is re-sent with the authentication credentials. If authentication succeeds, update_authenticated is called to set is_authenticated to True for the URI, so that subsequent requests to the URI or any of its super-URIs will automatically include the authentication credentials.

New in version 3.5: Added is_authenticated support.

class urllib.request.HTTPBasicAuthHandler(password_mgr=None)

Handle authentication with the remote host. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section HTTPPasswordMgr Objects for information on the interface that must be supported. HTTPBasicAuthHandler will raise a ValueError when presented with a wrong authentication scheme.
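
A hedged sketch of HTTP Basic authentication; the protected URL and the credentials are placeholders:

    import urllib.request

    # Store the credentials; a realm of None means "default realm"
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, 'http://example.com/protected/', 'user', 'secret')

    auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    opener = urllib.request.build_opener(auth_handler)

    # Credentials are sent after the server answers 401 with a Basic challenge
    with opener.open('http://example.com/protected/page.html') as resp:
        print(resp.getcode())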

class urllib.request.ProxyBasicAuthHandler(password_mgr=None)

Handle authentication with the proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section HTTPPasswordMgr Objects for information on the interface that must be supported.

class urllib.request.AbstractDigestAuthHandler(password_mgr=None)

This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section HTTPPasswordMgr Objects for information on the interface that must be supported.

class urllib.request.HTTPDigestAuthHandler(password_mgr=None)

Handle authentication with the remote host. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section HTTPPasswordMgr Objects for information on the interface that must be supported. When both a Digest authentication handler and a Basic authentication handler are added, Digest authentication is always tried first. If the host returns a 40x response again, it is sent to the Basic authentication handler to handle. This handler will raise a ValueError when presented with an authentication scheme other than Digest or Basic.

Changed in version 3.3: Raise ValueError on unsupported authentication schemes.

class urllib.request.ProxyDigestAuthHandler(password_mgr=None)

Handle authentication with the proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section HTTPPasswordMgr Objects for information on the interface that must be supported.

class urllib.request.HTTPHandler

A class to handle opening of HTTP URLs.

class urllib.request.HTTPSHandler(debuglevel=0, context=None, check_hostname=None)

A class to handle opening of HTTPS URLs. context and check_hostname have the same meaning as in http.client.HTTPSConnection.

Changed in version 3.2: context and check_hostname were added.
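
A hedged sketch of passing an SSL context to HTTPSHandler; here the platform's default CA certificates are used and the URL is a placeholder:

    import ssl
    import urllib.request

    # Every HTTPS request made through this opener will use the given context
    context = ssl.create_default_context()
    https_handler = urllib.request.HTTPSHandler(context=context)
    opener = urllib.request.build_opener(https_handler)

    with opener.open('https://www.python.org/') as resp:
        print(resp.getcode())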

class urllib.request.FileHandler

Open local files.

class urllib.request.DataHandler

Open data URLs.

class urllib.request.FTPHandler

Open FTP URLs.

class urllib.request.CacheFTPHandler

Open FTP URLs, keeping a cache of open FTP connections to minimize delays.

class urllib.request.UnknownHandler

A catch-all class to handle unknown URLs.

class urllib.request.HTTPErrorProcessor

Process HTTP error responses.

