The urllib.request module in Python

Source: Internet
Author: User

I am studying this module because it comes up often in the Python Challenge, and because it is a useful foundation for learning web crawling systematically.

When I first learned it I read a variety of tutorials but never touched the official documentation (I was still put off by English). The official documentation is, however, the most authoritative and valuable source, so I decided to translate it: this exercises my English and deepens my understanding of the urllib module at the same time.

Since this is mainly for my own review, I translated the text sentence by sentence, English against Chinese. If you are interested in the original, follow the links to the official documentation below.

Corrections to any shortcomings in the translation are welcome.

----------

urllib and urllib2 in Python 3.x

At the time of writing, the current Python release is 3.5.2.

In Python 3, urllib2 no longer exists as a separate module (importing urllib2 fails with an error saying that there is no such module). It was merged into urllib as urllib.request and urllib.error.

The urllib package as a whole is divided into urllib.request, urllib.parse, and urllib.error (plus urllib.robotparser, listed further down).
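As a quick check of the layout described above, here is a minimal sketch that maps a few Python 2 names to their Python 3 homes. The RENAMES table is my own illustration, not something from the official documentation, and the last line assumes no third-party package called urllib2 is installed.

```python
import importlib.util

import urllib.error
import urllib.parse
import urllib.request

# Python 2 name -> the Python 3 object that replaces it.
RENAMES = {
    "urllib2.urlopen": urllib.request.urlopen,
    "urllib2.Request": urllib.request.Request,
    "urllib2.URLError": urllib.error.URLError,
    "urllib.urlencode": urllib.parse.urlencode,
}

# find_spec() returns None when a top-level module cannot be found.
urllib2_available = importlib.util.find_spec("urllib2") is not None

for old_name, new_obj in RENAMES.items():
    print(old_name, "->", new_obj.__module__ + "." + new_obj.__name__)
print("urllib2 importable:", urllib2_available)
```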

Examples:
urllib2.urlopen() became urllib.request.urlopen()
urllib2.Request() became urllib.request.Request()

The difference between the urllib and urllib2 modules

In Python 2, urllib and urllib2 cannot be substituted for each other.

Overall, urllib2 is an enhancement of urllib, but urllib contains functions that urllib2 lacks.

urllib2 can modify request headers: urllib2.urlopen() accepts a Request object on which headers can be set. If you visit a website and want to change the User-Agent (to disguise the script as a browser), you need urllib2.

urllib provides the encoding function urllib.urlencode(). Simulated logins often need to POST encoded parameters, so if you want to simulate a login without a third-party library, you need urllib as well.

Official documentation addresses for urllib

This translation follows the documentation for Python 3.5.2.

Overall introduction to urllib: https://docs.python.org/3.5/library/urllib.html

The translated section is 21.6, urllib.request - Extensible library for opening URLs.

urllib - URL handling modules

Source code: Lib/urllib/

urllib is a package that collects several modules for working with URLs:

urllib.request: opens and reads URLs
urllib.error: contains the exceptions raised by urllib.request
urllib.parse: parses URLs
urllib.robotparser: parses robots.txt files

urllib.request

Original address:
https://docs.python.org/3.5/library/urllib.request.html#module-urllib.request

urllib.request - Extensible library for opening URLs

Source code: Lib/urllib/request.py

The urllib.request module defines functions and classes that help in opening URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies, and more.

---------- The urllib.request module defines the following functions: ----------

urllib.request.urlopen()

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Opens the URL url, which can be either a string or a Request object.

data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed. data may also be an iterable object, in which case the Content-Length value must be specified in the headers. Currently HTTP requests are the only ones that use data; when the data parameter is provided, the HTTP request is a POST instead of a GET.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or a sequence of 2-tuples and returns an ASCII text string in this format. It should be encoded to bytes before being used as the data parameter.
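The two paragraphs about data can be tried without sending anything. This is a small sketch in which the URL and the form fields are placeholders I made up; it only builds Request objects and asks which HTTP method each would use.

```python
import urllib.parse
import urllib.request

URL = "http://www.example.com/login"  # placeholder, never contacted here

# urlencode() accepts a mapping or a sequence of 2-tuples and returns a str.
form = urllib.parse.urlencode([("user", "alice"), ("lang", "zh CN")])
body = form.encode("ascii")  # data must be bytes, not str

get_req = urllib.request.Request(URL)              # no data: GET
post_req = urllib.request.Request(URL, data=body)  # with data: POST

print(form)
print(get_req.get_method(), post_req.get_method())
```

Passing post_req to urllib.request.urlopen() would then send the form as the body of a POST request.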

The urllib.request module uses HTTP/1.1 and includes the Connection: close header in its HTTP requests.

The optional timeout parameter specifies a timeout in seconds for blocking operations such as the connection attempt (if it is not specified, the global default timeout setting is used). This actually only works for HTTP, HTTPS, and FTP connections.

If context is specified, it must be an ssl.SSLContext instance describing the various SSL options. See HTTPSConnection for more details.

The optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, whereas capath should point to a directory of hashed certificate files. See ssl.SSLContext.load_verify_locations() for more information.

The cadefault parameter is ignored.
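A sketch of how timeout and context might be combined. The helper name fetch and the ten-second default are my own choices, and the context comes from ssl.create_default_context(), which is one reasonable way (not the only one) to obtain an ssl.SSLContext.

```python
import ssl
import urllib.request


def fetch(url, timeout=10.0):
    """Return the body of url as bytes, verifying HTTPS certificates."""
    # create_default_context() loads the system's trusted CA certificates.
    context = ssl.create_default_context()
    # timeout is given in seconds and applies to blocking socket operations.
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return resp.read()
```

Calling fetch("https://www.python.org/") would download that page; the function is only defined here so that the example itself needs no network access.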

This function always returns an object that can work as a context manager and provides these methods:

geturl() -- returns the URL of the resource retrieved, commonly used to determine whether a redirect was followed

info() -- returns the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers)

getcode() -- returns the HTTP status code of the response.
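The three methods can be exercised offline. In this sketch I use a data: URL (my own choice, so that nothing is fetched over the network); because it is not an HTTP URL, getcode() has no status code to report and returns None.

```python
import urllib.request

# A data: URL carries its own payload, so no network access is needed.
URL = "data:text/plain;charset=utf-8,hello%20world"

with urllib.request.urlopen(URL) as resp:   # works as a context manager
    final_url = resp.geturl()               # URL actually retrieved
    headers = resp.info()                   # message object holding headers
    status = resp.getcode()                 # None for non-HTTP URLs
    text = resp.read().decode("utf-8")

print(final_url)
print(headers["Content-Type"])
print(status, text)
```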

For HTTP and HTTPS URLs, this function returns a slightly modified http.client.HTTPResponse object. In addition to the three methods above, its msg attribute contains the same information as the reason attribute (the reason phrase returned by the server) instead of the response headers, as specified in the documentation for HTTPResponse.

For FTP, file, and data URLs, and for requests explicitly handled by the legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object.

urllib.request.urlopen() raises URLError on protocol errors.

Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).
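A minimal sketch of catching URLError. The helper try_open and the made-up scheme nosuchscheme are assumptions for illustration; an unknown scheme is a convenient way to reach UnknownHandler without touching the network.

```python
import urllib.error
import urllib.request


def try_open(url):
    """Return (ok, detail): the body on success, the error reason otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return True, resp.read()
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so it is caught here as well.
        return False, str(exc.reason)


ok, detail = try_open("nosuchscheme://example.invalid/")
print(ok, detail)
```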

In addition, if proxy settings are detected (for example, when a *_proxy environment variable such as http_proxy is set), ProxyHandler is installed by default and makes sure that requests are handled through the proxy.

The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen. Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be obtained by using ProxyHandler objects.

Changed in version 3.2: cafile and capath were added.

Changed in version 3.2: HTTPS virtual hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be an iterable object.

Changed in version 3.3: cadefault was added.

Changed in version 3.4.3: context was added.

urllib.request.install_opener(opener)

Installs an OpenerDirector instance as the default global opener. Installing an opener is only necessary if you want urlopen() to use that opener; otherwise, simply call OpenerDirector.open() instead of urlopen(). The code does not check for a real OpenerDirector, so any class with the appropriate interface will work.

urllib.request.build_opener([handler, ...])

Returns an OpenerDirector instance, which chains the handlers in the order given. Each handler can be either an instance of BaseHandler or a subclass of BaseHandler (in which case it must be possible to call the constructor without any parameters). Instances of the following classes are placed in front of the given handlers, unless the handlers contain them, instances of them, or subclasses of them: ProxyHandler (if proxy settings are detected), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.

If the Python installation has SSL support (that is, if the ssl module can be imported), HTTPSHandler is also added.
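To see build_opener() and install_opener() together, here is a sketch under my own assumptions: the particular handlers, the User-Agent string "demo-client/0.1", and the data: URLs are chosen only for illustration.

```python
import http.cookiejar
import urllib.request

jar = http.cookiejar.CookieJar()

# Handlers may be passed as instances (with arguments) or as bare classes.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar),  # instance: keeps cookies in jar
    urllib.request.ProxyHandler({}),          # instance: ignore proxy settings
    urllib.request.HTTPRedirectHandler,       # class: constructed without args
)
opener.addheaders = [("User-Agent", "demo-client/0.1")]  # hypothetical name

# Use the opener directly ...
with opener.open("data:,direct") as resp:
    direct = resp.read()

# ... or install it so that plain urlopen() goes through it as well.
urllib.request.install_opener(opener)
with urllib.request.urlopen("data:,installed") as resp:
    installed = resp.read()

print(direct, installed)
```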

A BaseHandler subclass may also change its handler_order attribute to modify its position in the handlers list.

urllib.request.pathname2url(path)

Converts the pathname path from the local syntax for a path to the form used in the path component of a URL. This does not produce a complete URL. The return value is already quoted using the quote() function.

urllib.request.url2pathname(path)
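The two path helpers, pathname2url() and url2pathname(), are inverses of each other, which a round trip makes visible. This sketch assumes a POSIX system; the path is hypothetical and nothing is read from disk.

```python
import urllib.request

local_path = "/tmp/my notes/report 1.txt"  # hypothetical POSIX-style path

url_path = urllib.request.pathname2url(local_path)  # percent-encodes the path
round_trip = urllib.request.url2pathname(url_path)  # decodes it again

print(url_path)
print(round_trip)
```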

Converts the path component path from a percent-encoded URL to the local syntax for a path. This does not accept a complete URL. This function uses unquote() to decode path.

urllib.request.getproxies()

This helper function returns a dictionary mapping URL schemes to proxy server URLs. It first scans the environment, case-insensitively, for variables named after the scheme with a _proxy suffix (for example http_proxy) on all operating systems; when it cannot find any, it looks for proxy information in the Mac OS X System Configuration on Mac OS X and in the Windows Systems Registry on Windows. If both lowercase and uppercase environment variables exist (and disagree), lowercase is preferred.

Note: if the environment variable REQUEST_METHOD is set, which usually indicates that your script is running in a CGI environment, the environment variable HTTP_PROXY (uppercase _PROXY) is ignored. This is because that variable can be injected by a client using the "Proxy:" HTTP header. If you need to use an HTTP proxy in a CGI environment, either use ProxyHandler explicitly, or make sure the variable name is in lowercase (or at least the _proxy suffix).
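A small sketch of what getproxies() reports when a lowercase proxy variable is set. The helper proxies_with_env and the proxy address are my own inventions; the helper restores the environment afterwards so the experiment leaves no trace.

```python
import os
import urllib.request


def proxies_with_env(name, value):
    """Return getproxies() as seen with one environment variable set."""
    saved = os.environ.get(name)
    os.environ[name] = value
    try:
        return urllib.request.getproxies()
    finally:
        # Restore the caller's environment whatever happens.
        if saved is None:
            del os.environ[name]
        else:
            os.environ[name] = saved


proxies = proxies_with_env("http_proxy", "http://proxy.example:3128")
print(proxies.get("http"))
```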

---------- The urllib.request module provides the following classes: ----------

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

This class is an abstraction of a URL request.

url should be a string containing a valid URL.

data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed. data may also be an iterable object, in which case the Content-Length value must be specified in the headers. Currently HTTP requests are the only ones that use data; when the data parameter is provided, the HTTP request is a POST instead of a GET.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or a sequence of 2-tuples and returns an ASCII text string in this format. It should be encoded to bytes before being used as the data parameter.

headers should be a dictionary, and is treated as if add_header() was called with each key and value as arguments. This is often used to "spoof" the User-Agent header value, which a browser uses to identify itself: some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while the default user agent string of urllib is "Python-urllib/2.6" (on Python 2.6).

An example of using the Content-Type header with the data argument would be sending a dictionary like {"Content-Type": "application/x-www-form-urlencoded"}.
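Putting headers and data together, a sketch might look like the following. The URL, the query field, and the User-Agent value are made up for illustration, and no request is actually sent.

```python
import urllib.parse
import urllib.request

body = urllib.parse.urlencode([("q", "urllib")]).encode("ascii")

req = urllib.request.Request(
    "http://www.example.com/search",  # placeholder URL, not contacted
    data=body,
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) demo",  # made-up value
        "Content-Type": "application/x-www-form-urlencoded",
    },
)
# Has the same effect as putting the header in the dictionary above.
req.add_header("Accept-Language", "en")

# Request stores header names capitalized, e.g. "User-agent".
print(req.get_header("User-agent"))
print(sorted(req.header_items()))
```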

The final two arguments are only of interest for correct handling of third-party HTTP cookies:

origin_req_host should be the request-host of the origin transaction, as defined by RFC 2965. It defaults to http.cookiejar.request_host(self). This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.

unverifiable should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true.
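The image example from the text can be written down as follows. Both host names are hypothetical; the sketch only inspects the attributes that the Request object stores.

```python
import urllib.request

page_host = "www.example.com"                  # page the user asked for
image_url = "http://cdn.example.net/logo.png"  # hypothetical embedded image

# Default: origin_req_host is derived from the request URL itself.
plain = urllib.request.Request(image_url)

# Image fetched automatically on behalf of a page from another host.
embedded = urllib.request.Request(
    image_url, origin_req_host=page_host, unverifiable=True
)

print(plain.origin_req_host, plain.unverifiable)
print(embedded.origin_req_host, embedded.unverifiable)
```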

method should be a string that indicates the HTTP request method that will be used (for example 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). Subclasses may indicate a default method by setting the method attribute in the class itself.
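Both ways of choosing the method can be compared side by side. PutRequest is a hypothetical subclass written for this sketch, and the URL is a placeholder that is never contacted.

```python
import urllib.request

URL = "http://www.example.com/resource"  # placeholder, not contacted


class PutRequest(urllib.request.Request):
    """Request subclass whose default HTTP method is PUT."""
    method = "PUT"


head_req = urllib.request.Request(URL, method="HEAD")  # per-instance method
put_req = PutRequest(URL, data=b"payload")             # class-level default
default_req = urllib.request.Request(URL)              # falls back to GET

print(head_req.get_method(), put_req.get_method(), default_req.get_method())
```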

Changed in version 3.3: the Request.method argument was added to the Request class.

Changed in version 3.4: the default Request.method may be indicated at the class level.

class urllib.request.OpenerDirector

The OpenerDirector class opens URLs via BaseHandlers chained together. It manages the chaining of handlers and recovery from errors.

class urllib.request.BaseHandler

This is the base class for all registered handlers.

class urllib.request.HTTPRedirectHandler

A class to handle redirections.

class urllib.request.HTTPCookieProcessor(cookiejar=None)

A class to handle HTTP cookies.

class urllib.request.ProxyHandler(proxies=None)

Causes requests to go through a proxy. If proxies is given, it must be a dictionary mapping protocol names to URLs of proxies. The default is to read the list of proxies from the environment variables named after the protocol with a _proxy suffix. If no proxy environment variables are set, then in a Windows environment the proxy settings are obtained from the Internet Settings section of the registry, and in a Mac OS X environment the proxy information is retrieved from the OS X System Configuration Framework.

To disable an autodetected proxy, pass an empty dictionary.

The no_proxy environment variable can be used to specify hosts that should not be reached via a proxy; if set, it should be a comma-separated list of hostname suffixes, optionally with :port appended, for example cern.ch,ncsa.uiuc.edu,some.host:8080.
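A sketch of the two common ProxyHandler configurations. The proxy address is hypothetical, and the only request made uses a data: URL so that no proxy or network is involved.

```python
import urllib.request

# Explicit mapping of protocol name to proxy URL (hypothetical proxy host).
via_proxy = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": "http://proxy.example:3128"})
)

# An empty dictionary disables any proxy found in the environment.
direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# Building an opener contacts nothing; a data: URL needs no network either.
with direct.open("data:,no%20proxy") as resp:
    payload = resp.read()

print(payload)
```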

Note: HTTP_PROXY is ignored if the variable REQUEST_METHOD is set; see the documentation on getproxies().

class urllib.request.HTTPPasswordMgr

Keeps a database of (realm, uri) -> (user, password) mappings.

class urllib.request.HTTPPasswordMgrWithDefaultRealm

Keeps a database of (realm, uri) -> (user, password) mappings. A realm of None is considered a catch-all realm, which is searched if no other realm fits.

class urllib.request.HTTPPasswordMgrWithPriorAuth

A variant of HTTPPasswordMgrWithDefaultRealm that also keeps a database of uri -> is_authenticated mappings. It can be used by a BasicAuth handler to determine when to send authentication credentials immediately instead of waiting for a 401 response first.

class urllib.request.AbstractBasicAuthHandler(password_mgr=None)

This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be compatible with HTTPPasswordMgr; see the section HTTPPasswordMgr Objects for information on the interface that must be supported. If password_mgr also provides is_authenticated and update_authenticated methods (see HTTPPasswordMgrWithPriorAuth Objects), the handler uses the is_authenticated result for a given URI to decide whether to send authentication credentials with the request. If is_authenticated returns True for the URI, credentials are sent. If is_authenticated is False, credentials are not sent, and if a 401 response is then received, the request is re-sent with the authentication credentials. If authentication succeeds, update_authenticated is called to set is_authenticated to True for the URI, so that subsequent requests to the URI or any of its super-URIs automatically include the authentication credentials.
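The password managers can be queried without any network traffic, which makes their matching rules easy to try. In this sketch the URL, user name, and password are invented; the opener is built but never used to send a request.

```python
import urllib.request

API = "http://www.example.com/api/"  # hypothetical protected area

mgr = urllib.request.HTTPPasswordMgrWithPriorAuth()
# realm=None is the catch-all realm; is_authenticated=True asks the handler
# to send credentials on the first request instead of waiting for a 401.
mgr.add_password(None, API, "alice", "s3cret", is_authenticated=True)

opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))

# URIs below the registered one match as well.
credentials = mgr.find_user_password("Any Realm", API + "items")
prior_auth = mgr.is_authenticated(API + "items")

print(credentials, prior_auth)
```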

New in version 3.5: is_authenticated support was added.

class urllib.request.HTTPBasicAuthHandler(password_mgr=None)

Handles authentication with the remote host. password_mgr, if given, should be compatible with HTTPPasswordMgr; see the section HTTPPasswordMgr Objects for information on the interface that must be supported. HTTPBasicAuthHandler raises a ValueError when presented with a wrong authentication scheme.

class urllib.request.ProxyBasicAuthHandler(password_mgr=None)

Handles authentication with the proxy. password_mgr, if given, should be compatible with HTTPPasswordMgr; see the section HTTPPasswordMgr Objects for information on the interface that must be supported.

class urllib.request.AbstractDigestAuthHandler(password_mgr=None)

This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be compatible with HTTPPasswordMgr; see the section HTTPPasswordMgr Objects for information on the interface that must be supported.

class urllib.request.HTTPDigestAuthHandler(password_mgr=None)

Handles authentication with the remote host. password_mgr, if given, should be compatible with HTTPPasswordMgr; see the section HTTPPasswordMgr Objects for information on the interface that must be supported. When both the digest authentication handler and the basic authentication handler are added, digest authentication is always tried first. If digest authentication returns a 40x response again, the request is sent to the basic authentication handler. This handler method raises a ValueError when presented with an authentication scheme other than Digest or Basic.
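A sketch of adding both authentication handlers to one opener. The credentials and URL are placeholders, and comparing the handler_order attributes is simply my way of illustrating why digest authentication comes first; no request is sent.

```python
import urllib.request

mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, "http://www.example.com/", "alice", "s3cret")

# Both handlers share one password manager.
opener = urllib.request.build_opener(
    urllib.request.HTTPDigestAuthHandler(mgr),
    urllib.request.HTTPBasicAuthHandler(mgr),
)

# A lower handler_order places a handler earlier in the chain.
digest_order = urllib.request.HTTPDigestAuthHandler.handler_order
basic_order = urllib.request.HTTPBasicAuthHandler.handler_order
print(digest_order, basic_order)
```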

Changed in version 3.3: ValueError is raised on an unsupported authentication scheme.

class urllib.request.ProxyDigestAuthHandler(password_mgr=None)

Handles authentication with the proxy. password_mgr, if given, should be compatible with HTTPPasswordMgr; see the section HTTPPasswordMgr Objects for information on the interface that must be supported.

class urllib.request.HTTPHandler

A class to handle opening of HTTP URLs.

class urllib.request.HTTPSHandler(debuglevel=0, context=None, check_hostname=None)

A class to handle opening of HTTPS URLs. context and check_hostname have the same meaning as in http.client.HTTPSConnection.
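For an opener that should always use the same TLS settings, the context can be given to HTTPSHandler once. This is a minimal sketch that assumes Python was built with SSL support; it builds the opener but opens nothing.

```python
import ssl
import urllib.request

# Default context: certificate verification and hostname checking enabled.
context = ssl.create_default_context()

https_handler = urllib.request.HTTPSHandler(context=context)
opener = urllib.request.build_opener(https_handler)

print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```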

Changed in version 3.2: context and check_hostname were added.

class urllib.request.FileHandler

Opens local files.

class urllib.request.DataHandler

Opens data URLs.

class urllib.request.FTPHandler

Opens FTP URLs.

class urllib.request.CacheFTPHandler

Opens FTP URLs, keeping a cache of open FTP connections to minimize delays.

class urllib.request.UnknownHandler

A catch-all class to handle unknown URLs.

class urllib.request.HTTPErrorProcessor

Processes HTTP error responses.
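Of the scheme-specific handlers above, FileHandler is the easiest to try by hand. The sketch below writes a temporary file (the file name and contents are arbitrary) and reads it back through a file: URL with the default opener.

```python
import pathlib
import tempfile
import urllib.request

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "hello.txt"
    path.write_text("hello from a local file", encoding="utf-8")

    # Path.as_uri() builds a file: URL, which FileHandler opens.
    with urllib.request.urlopen(path.as_uri()) as resp:
        file_text = resp.read().decode("utf-8")
        content_length = resp.info()["Content-Length"]

print(file_text, content_length)
```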
