VII. The urllib Library (I)

Source: Internet
Author: User

In Python 2 there were two libraries, urllib and urllib2. Python 3 unifies them into a single urllib library.

It is Python's built-in HTTP request library and contains four modules:

    • request: The most basic HTTP request module, used to simulate sending requests. Just as you would enter a URL in a browser, you pass the URL and any additional parameters to the library's methods to send the request.
    • error: The exception handling module. If a request fails, you can catch the exception and retry or take other action so that the program does not terminate unexpectedly.
    • parse: A utility module that provides URL handling methods such as splitting, parsing, and joining.
    • robotparser: Mainly used to parse a site's robots.txt file and determine which pages may be crawled and which may not.
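
As a rough illustration of the two helper modules, the sketch below splits and joins a URL with urllib.parse and checks a rule set with urllib.robotparser. The example.com URLs and the robots.txt rules are made up for this example, and nothing is fetched from the network.

```python
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical URL, used only for illustration.
page_url = "https://www.example.com/docs/index.html?page=2#intro"

# Split the URL into its components.
parts = urlsplit(page_url)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Resolve a relative link against the page URL.
about_url = urljoin(page_url, "../about.html")
print(about_url)

# Parse robots.txt rules supplied as lines of text (made-up rules).
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
print(robots.can_fetch("*", "https://www.example.com/private/data.html"))
print(robots.can_fetch("*", "https://www.example.com/public.html"))
```

In a real crawler, set_url() and read() would load the rules from the site's own robots.txt instead of a hard-coded list.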

Official documentation:

https://docs.python.org/3/library/urllib.request.html

Sending requests:

The urllib.request module:

1. urlopen()

The urllib.request module provides the most basic way to construct an HTTP request. It simulates the process of a browser initiating a request, and it also handles authentication, redirection, browser cookies, and more.

For example, urlopen() can be used to fetch the Python website (https://www.python.org).
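
A minimal sketch of such a request follows. The helper name fetch_text and the assumption that the page is UTF-8 encoded are choices made for this example, not part of urllib.

```python
import urllib.request


def fetch_text(url, encoding="utf-8"):
    """Open the URL with urlopen() and return the body decoded as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


# Example call (needs network access):
# print(fetch_text("https://www.python.org"))
```

Using the response as a context manager closes the connection as soon as the body has been read.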

Checking the type of the object returned by urlopen()

shows that it is an http.client.HTTPResponse. Its main methods are:

read()

readinto()

getheader(name)

getheaders()

fileno()

Its main attributes are:

msg

version

status

reason

debuglevel

closed
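
The sketch below shows how a few of these methods and attributes might be read from a response. The helper names summarize_response and fetch_summary are invented for this example.

```python
import urllib.request


def summarize_response(response):
    """Collect a few HTTPResponse attributes and header values into a dict."""
    return {
        "type": type(response).__name__,
        "status": response.status,      # numeric status code, e.g. 200
        "reason": response.reason,      # reason phrase, e.g. "OK"
        "version": response.version,    # 10 for HTTP/1.0, 11 for HTTP/1.1
        "server": response.getheader("Server"),
        "headers": response.getheaders(),
    }


def fetch_summary(url):
    """Open the URL and summarize the response; needs network access."""
    with urllib.request.urlopen(url) as response:
        return summarize_response(response)


# Example call:
# print(fetch_summary("https://www.python.org"))
```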

urlopen() parameters

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

    • data parameter
    • data is optional. If supplied, it must be of type bytes; content of any other type has to be converted with bytes() first. If data is passed, the request is sent as a POST; otherwise it is a GET.

    • The first argument of bytes() must be a string, so urllib.parse.urlencode() is used to convert the parameter dictionary to a string; the second argument specifies the encoding.
    • In the result, the parameters that were passed appear in the form field, which indicates that a form submission was simulated and the data was transmitted with POST.
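
The conversion described above can be sketched as follows. The helper names encode_form and post_form are invented for this example, and httpbin.org is only an assumed echo service; the source does not say which URL was used.

```python
import urllib.parse
import urllib.request


def encode_form(params, encoding="utf-8"):
    """Turn a parameter dictionary into the bytes expected by the data argument."""
    query_string = urllib.parse.urlencode(params)
    return bytes(query_string, encoding=encoding)


def post_form(url, params):
    """Send params as form data; passing data makes urlopen() use POST."""
    with urllib.request.urlopen(url, data=encode_form(params)) as response:
        return response.read()


print(encode_form({"word": "hello"}))

# Example call (needs network access; the URL is an assumption):
# print(post_form("https://httpbin.org/post", {"word": "hello"}).decode("utf-8"))
```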

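timeout parameter

Sets the timeout in seconds. If the request receives no response within this time, an exception is raised.

On a timeout the exception is typically urllib.error.URLError whose reason attribute is socket.timeout.

The exception can be caught with try/except so that the program can carry on.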
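
A minimal sketch of catching the timeout, assuming a hypothetical helper named fetch_or_none that returns None instead of letting the exception propagate:

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=1.0):
    """Return the response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        # A connection timeout is usually wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            print("request timed out")
        else:
            print("request failed:", exc.reason)
    except socket.timeout:
        # A timeout while reading the body is raised directly.
        print("timed out while reading the response")
    return None


# Example call (needs network access); a very small timeout forces the error:
# fetch_or_none("https://www.python.org", timeout=0.01)
```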

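context: Must be of type ssl.SSLContext; it is used to specify the SSL settings.

cafile, capath: Specify the CA certificate and its path, respectively.

cadefault: Defaults to False; deprecated. The cafile, capath, and cadefault parameters have been deprecated since Python 3.6 in favor of context and were removed in Python 3.13.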
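
The sketch below passes an explicit SSL context to urlopen(). The helper name fetch_https and the CA bundle path in the comment are hypothetical.

```python
import ssl
import urllib.request

# Default context: verifies certificates against the system CA store.
context = ssl.create_default_context()

# To trust a specific CA bundle instead, pass cafile (hypothetical path):
# context = ssl.create_default_context(cafile="/path/to/ca-bundle.pem")


def fetch_https(url, timeout=10):
    """Open an HTTPS URL with the explicit SSL context defined above."""
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.status


# Example call (needs network access):
# print(fetch_https("https://www.python.org"))
```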

The Request class

Requests are still sent with urlopen(), except that the argument is a Request object rather than a URL string.

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url: The URL to request; required.

data: Must be of type bytes. If it is a dictionary, first encode it with urlencode() from the urllib.parse module.

headers: The request headers, as a dictionary. They can be passed directly in the constructor or added afterwards by calling add_header().

origin_req_host: The host name or IP address of the requester.

unverifiable: Defaults to False. Indicates whether the request is unverifiable, that is, whether the user lacks sufficient permission to choose to receive the result of the request.

method: A string that specifies the HTTP method to use, such as GET, POST, or PUT.
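
As an illustration of these parameters, the sketch below builds a Request with form data, custom headers, and an explicit method. The URL, the User-Agent string, and the form field are assumptions made for this example.

```python
import urllib.parse
import urllib.request

# Hypothetical target URL and header values, for illustration only.
url = "https://www.example.com/post"
headers = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}
form = {"name": "urllib"}
data = urllib.parse.urlencode(form).encode("utf-8")

request = urllib.request.Request(url, data=data, headers=headers, method="POST")
# Headers can also be added after construction.
request.add_header("Accept", "text/html")

print(request.get_method(), request.full_url)
print(request.header_items())

# Sending it works the same way as with a plain URL (needs network access):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Note that Request stores header names in capitalized form, so "User-Agent" is looked up as "User-agent" with get_header().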

Advanced usage: handlers

The BaseHandler class in urllib.request is the parent class of all other handlers. It provides the most basic methods, such as default_open() and protocol_request().

Handler classes that build on BaseHandler include:

    • HTTPDefaultErrorHandler: Handles HTTP response errors; errors raise exceptions of type HTTPError.
    • HTTPRedirectHandler: Handles redirects.
    • HTTPCookieProcessor: Handles cookies.
    • ProxyHandler: Sets a proxy. By default no proxy is given explicitly, in which case the proxy settings are read from the environment.
    • HTTPPasswordMgr: Manages passwords; it maintains a table of user names and passwords. Strictly speaking it is a helper used by the authentication handlers rather than a handler itself.
    • HTTPBasicAuthHandler: Manages authentication; if opening a link requires authentication, this handler can take care of it.
    • For the other handler classes, see the official documentation:
    • https://docs.python.org/3/library/urllib.request.html#urllib.request.BaseHandler
The OpenerDirector class

    • An OpenerDirector is usually just called an opener. The urlopen() method is in fact an opener that urllib provides for us.
    • Request and urlopen() are convenience wrappers that the library provides around the most common request operations, and they are enough for basic requests. More advanced functionality requires working with lower-level instances, which is where openers come in.
    • An opener has an open() method, which is used in the same way as urlopen().
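
A minimal sketch of working with an opener directly. The helper name open_with_opener and the User-Agent value are assumptions made for this example.

```python
import urllib.request

# build_opener() returns an OpenerDirector with the default handlers installed.
opener = urllib.request.build_opener()
print(type(opener).__name__)

# Extra headers sent with every request made through this opener (assumed value).
opener.addheaders = [("User-Agent", "example-crawler/0.1")]


def open_with_opener(url, timeout=10):
    """Fetch a URL through the opener instead of calling urlopen()."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()


# Example call (needs network access):
# print(open_with_opener("https://www.python.org")[:100])
```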

Using handlers to build an opener

Examples:

1. Authentication

For example, when you open certain websites, a dialog box prompts you to log in before you can view the page.

    • Instantiate an HTTPPasswordMgrWithDefaultRealm object and call add_password() to add the user name and password. Passing this object to HTTPBasicAuthHandler creates a handler that takes care of authentication.
    • Pass this handler to build_opener() to build an opener. Requests sent through this opener behave as if authentication had already succeeded.
    • Then call the opener's open() method to open the link. This completes the authentication and returns the source of the authenticated page.
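
The three steps above can be sketched as follows. The user name, password, and URL are placeholders, and fetch_protected_page is a helper name invented for this example.

```python
from urllib.error import URLError
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)

# Hypothetical credentials and URL, for illustration only.
USERNAME = "username"
PASSWORD = "password"
URL = "http://localhost:5000/"

# Store the credentials; realm=None means "use for any realm at this URL".
password_mgr = HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, URL, USERNAME, PASSWORD)

# Build the authentication handler and an opener that uses it.
auth_handler = HTTPBasicAuthHandler(password_mgr)
opener = build_opener(auth_handler)


def fetch_protected_page():
    """Open the protected URL; return the page source, or None on failure."""
    try:
        with opener.open(URL, timeout=10) as response:
            return response.read().decode("utf-8")
    except URLError as exc:
        print("request failed:", exc.reason)
        return None
```

Calling fetch_protected_page() only succeeds if a server that expects these credentials is actually running at the placeholder URL.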

2. Proxies

Crawlers inevitably need to use proxies. To add a proxy:

Use ProxyHandler. Its argument is a dictionary whose keys are protocol types, such as http or https, and whose values are proxy URLs. Multiple proxies can be added.

Then use this handler with build_opener() to construct an opener and send the request.
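
A sketch of that setup, assuming a proxy listening on 127.0.0.1:8080. The address and the helper name fetch_via_proxy are placeholders for this example.

```python
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# Hypothetical local proxy addresses; replace with proxies you control.
proxy_handler = ProxyHandler({
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
})
opener = build_opener(proxy_handler)


def fetch_via_proxy(url, timeout=10):
    """Fetch the URL through the configured proxy; return None if it fails."""
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except URLError as exc:
        print("request failed:", exc.reason)
        return None


# Example call (needs a running proxy and network access):
# print(fetch_via_proxy("https://www.python.org"))
```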

3. Cookies

Get the cookies from a website and print them:

First declare a CookieJar object, then use HTTPCookieProcessor to build a handler, and finally use build_opener() to build an opener and call its open() method.
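
These steps might look like the sketch below. The helper name print_cookies is invented for this example, and the URL in the comment is only a suggestion.

```python
import http.cookiejar
import urllib.request

# The jar collects cookies; the processor stores and sends them automatically.
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(cookie_handler)


def print_cookies(url, timeout=10):
    """Open the URL and print each cookie the site set as name=value."""
    with opener.open(url, timeout=timeout):
        pass  # Only the response headers matter here.
    for cookie in cookie_jar:
        print(f"{cookie.name}={cookie.value}")


# Example call (needs network access):
# print_cookies("https://www.python.org")
```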

Saving cookies to a file

To generate a file, replace CookieJar with MozillaCookieJar. It is a subclass of CookieJar that handles cookie-related file operations such as reading and saving, and it can save cookies in the Mozilla browser's cookie format.

LWPCookieJar can also read and save cookies, but its format differs from MozillaCookieJar's: it saves cookies in the libwww-perl (LWP) format.
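
A sketch of saving cookies in either format. The helper names build_file_cookie_opener and save_site_cookies, and the file name in the comment, are assumptions made for this example.

```python
import http.cookiejar
import urllib.request


def build_file_cookie_opener(filename, lwp=False):
    """Return (jar, opener); the jar can save its cookies to filename."""
    jar_class = http.cookiejar.LWPCookieJar if lwp else http.cookiejar.MozillaCookieJar
    jar = jar_class(filename)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return jar, opener


def save_site_cookies(url, filename, lwp=False, timeout=10):
    """Open the URL, then write the received cookies to filename."""
    jar, opener = build_file_cookie_opener(filename, lwp=lwp)
    with opener.open(url, timeout=timeout):
        pass
    # Keep session cookies and already-expired cookies in the file as well.
    jar.save(ignore_discard=True, ignore_expires=True)


# Example call (needs network access):
# save_site_cookies("https://www.python.org", "cookies.txt")
```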

Reading and using cookies

The load() method reads a local cookie file and retrieves its contents. You can then build the handler and opener as before to complete the operation.
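
A minimal sketch of loading a cookie file, assuming the file was saved in LWP format. The helper name build_opener_from_cookie_file and the file name in the comment are hypothetical.

```python
import http.cookiejar
import urllib.request


def build_opener_from_cookie_file(filename):
    """Load cookies saved in LWP format and return (jar, opener)."""
    jar = http.cookiejar.LWPCookieJar()
    # Also load session cookies and cookies that have already expired.
    jar.load(filename, ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return jar, opener


# Example use (needs an existing cookie file and network access):
# jar, opener = build_opener_from_cookie_file("cookies_lwp.txt")
# with opener.open("https://www.python.org") as response:
#     print(response.status)
```

For a file saved by MozillaCookieJar, the same approach applies with MozillaCookieJar in place of LWPCookieJar.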

Official documentation:

https://docs.python.org/3/library/urllib.request.html#basehandler-objects
