Python 2 had two libraries, urllib and urllib2; Python 3 unifies them into a single urllib library.
It is Python's built-in HTTP request library and contains four modules:
- request: The most basic HTTP request module, used to simulate sending a request. Just as you would enter a URL in a browser, you pass the URL and any additional parameters to the library's methods
- error: The exception handling module. If a request fails, you can catch the exception and then retry or take other action, so that the program does not terminate unexpectedly
- parse: A utility module that provides URL handling methods such as splitting, parsing, and merging
- robotparser: Used mainly to parse a site's robots.txt file and determine which pages may be crawled and which may not
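As a quick illustration of the two helper modules, here is a minimal sketch of parse and robotparser. The example.com URLs and the robots.txt rules are made up for the example, and the rules are fed in directly so no network access is needed.

```python
from urllib.parse import urlparse, urlunparse, urljoin
from urllib.robotparser import RobotFileParser

# Splitting and parsing: break a URL into its components.
parts = urlparse("https://www.example.com/docs/index.html?lang=en#intro")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Merging: rebuild a URL from its components, or resolve a relative link.
rebuilt = urlunparse(parts)
joined = urljoin("https://www.example.com/docs/index.html", "faq.html")
print(rebuilt)
print(joined)

# robotparser: hand it robots.txt rules (hypothetical ones here) and ask
# whether a given page may be crawled.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(robots.can_fetch("*", "https://www.example.com/docs/"))
print(robots.can_fetch("*", "https://www.example.com/private/x"))
```

For a real site you would normally call set_url() and read() on the RobotFileParser instead of passing the rules in by hand.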
Official documentation:
https://docs.python.org/3/library/urllib.request.html
Sending requests:
The urllib.request module:
1. urlopen()
The urllib.request module provides the most basic way to construct an HTTP request. It simulates a request initiated by a browser and also handles authentication, redirection, browser cookies, and more
Fetch the Python website with urlopen() and check the type of the returned object.
It is an http.client.HTTPResponse, whose main methods are:
read()
readinto()
getheader(name)
getheaders()
fileno()
Its main attributes are:
msg
version
status
reason
debuglevel
closed
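The methods and attributes listed above can be tried out with a small helper like the one below. This is only a sketch: summarize_response() and demo() are names invented for this example, and demo() needs network access, so it is defined but not called.

```python
import urllib.request


def summarize_response(response):
    """Collect a few of the methods and attributes listed above."""
    body = response.read()  # read() returns the body as bytes
    return {
        "type": type(response).__name__,
        "status": response.status,
        "reason": response.reason,
        "server": response.getheader("Server"),
        "headers": response.getheaders(),  # list of (name, value) tuples
        "body_length": len(body),
    }


def demo():
    """Fetch the Python website and print the summary (needs network)."""
    with urllib.request.urlopen("https://www.python.org") as response:
        for key, value in summarize_response(response).items():
            print(key, "=", value)
```

Calling demo() against the live site should report the type as HTTPResponse, together with the status code and the response headers.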
Parameters of urlopen()
urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)
- data parameter
- data is optional. If it is supplied it must be of type bytes, so other content has to be converted with bytes() first. If data is passed, the request is sent as POST; otherwise it is sent as GET
- The first argument of bytes() must be a string, so use urllib.parse.urlencode() to convert the parameter dictionary to a string; the second argument specifies the encoding
- In the result, the passed parameters appear in the form field, which shows that the request simulated a form submission and was sent as POST
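The conversion described in the points above can be sketched as follows. The form fields and the httpbin.org test endpoint are assumptions chosen for illustration; the Request objects are built only to show, without any network traffic, how the presence of data switches the method.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields; any dictionary of strings works the same way.
params = {"word": "hello", "page": "1"}

# urlencode() turns the dictionary into a query string, and bytes()
# encodes that string using the encoding given as its second argument.
data = bytes(urllib.parse.urlencode(params), encoding="utf-8")
print(data)

# The endpoint is an assumption. No request is sent by these two lines.
url = "https://httpbin.org/post"
print(urllib.request.Request(url).get_method())             # without data
print(urllib.request.Request(url, data=data).get_method())  # with data


def demo():
    """Send the form as a POST request (needs network)."""
    with urllib.request.urlopen(url, data=data) as response:
        print(response.read().decode("utf-8"))
```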
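timeout parameter
Sets the timeout in seconds; an exception is thrown if the request has not received a response within this time
A timed-out request raises a timeout error, typically a urllib.error.URLError whose reason is socket.timeout
The exception can be caught so that the program handles the timeout instead of terminating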
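One way to catch the timeout is sketched below. fetch() is a helper name invented for this example; it returns None on any failure, which is a simplification rather than a recommended error-handling policy.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=1.0):
    """Return the response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        # A connection timeout is usually wrapped in URLError.reason.
        if isinstance(exc.reason, socket.timeout):
            print("TIME OUT")
        else:
            print("Request failed:", exc.reason)
    except socket.timeout:
        # A timeout while reading the body can surface unwrapped.
        print("TIME OUT")
    return None
```

With a very small timeout, for example fetch("https://httpbin.org/get", timeout=0.1) (the URL is again an assumption), the call would most likely print TIME OUT and return None.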
context: must be of type ssl.SSLContext; used to specify SSL settings
cafile, capath: specify the CA certificate and its path
cadefault: defaults to False; deprecated
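Since cafile and capath have been deprecated since Python 3.6 in favor of context, a minimal sketch of the context parameter may be more useful. The CA bundle file name in the comment is hypothetical, and demo() needs network access, so it is not called here.

```python
import ssl
import urllib.request

# A default context verifies server certificates against the system CA store.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)

# To trust a specific CA bundle instead, the context can load it.
# The file name is hypothetical:
# context.load_verify_locations(cafile="my-ca-bundle.pem")


def demo():
    """Open an HTTPS page with the explicit context (needs network)."""
    url = "https://www.python.org"
    with urllib.request.urlopen(url, context=context) as response:
        print(response.status)
```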
Request parameters
Requests are still sent with urlopen(), but the argument is now a Request object instead of a URL string
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
url: the URL to request; required
data: must be of type bytes; if it is a dictionary, encode it first with urlencode() from the urllib.parse module
headers: the request headers, as a dictionary; they can be passed directly to the constructor or added afterwards by calling add_header()
origin_req_host: the host name or IP address of the requesting party
unverifiable: defaults to False; indicates whether the request is unverifiable, that is, the user had no option to approve the request
method: a string that specifies the HTTP method to use, such as GET, POST, or PUT
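Putting these parameters together, a Request might be built as in the sketch below. The URL, the User-Agent value, and the form field are illustrative assumptions, and nothing is sent until demo() is called.

```python
import urllib.parse
import urllib.request

# The URL, header value, and form field are assumptions for illustration.
url = "https://httpbin.org/post"
headers = {"User-Agent": "Mozilla/5.0 (compatible; urllib-demo)"}
data = bytes(urllib.parse.urlencode({"name": "example"}), encoding="utf-8")

request = urllib.request.Request(url, data=data, headers=headers, method="POST")
# Headers can also be added after construction.
request.add_header("Accept", "application/json")

print(request.full_url)
print(request.get_method())
print(request.header_items())


def demo():
    """Send the prepared Request with urlopen() (needs network)."""
    with urllib.request.urlopen(request) as response:
        print(response.read().decode("utf-8"))
```

Note that Request normalizes header names with str.capitalize(), so the header above is stored as User-agent.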
Advanced usage: Handler
The BaseHandler class in urllib.request is the parent class of all other handlers and provides the most basic methods, such as default_open() and protocol_request()
Various handler subclasses inherit from BaseHandler:
- HTTPDefaultErrorHandler: handles HTTP response errors; errors are raised as HTTPError exceptions
- HTTPRedirectHandler: handles redirects
- HTTPCookieProcessor: handles cookies
- ProxyHandler: sets proxies; when none are given, the proxy settings are read from the environment
- HTTPPasswordMgr: manages passwords by maintaining a table of user names and passwords
- HTTPBasicAuthHandler: manages authentication; use it when opening a link requires authentication
- For the other handler classes, see the official documentation:
- https://docs.python.org/3/library/urllib.request.html#urllib.request.BaseHandler
The OpenerDirector class
- Called an opener for short; the urlopen() method is itself an opener provided by urllib
- Request and urlopen() are the library's wrappers around the most common request patterns and are enough for basic requests. More advanced features need lower-level instances, which is where an opener comes in
- An opener provides an open() method, which is used in the same way as urlopen()
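To see that open() behaves like urlopen(), here is a small sketch using an opener with only the default handlers. A data: URL is used purely so the example runs without a network connection; that choice is mine, not part of the original notes.

```python
import urllib.request

# build_opener() with no arguments returns an OpenerDirector that already
# contains the default handlers.
opener = urllib.request.build_opener()
print(type(opener).__name__)

# A data: URL carries its own content, so no network access is needed.
url = "data:text/plain,hello"
via_opener = opener.open(url).read()
via_urlopen = urllib.request.urlopen(url).read()
print(via_opener, via_urlopen)
```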
Using handlers to build an opener
Examples:
1. Authentication
For example, some websites show a prompt when opened, and you must log in before you can view the page
- Instantiate an HTTPPasswordMgrWithDefaultRealm object and use add_password() to add the user name and password, then use it to create an HTTPBasicAuthHandler that handles the authentication
- Use this handler with build_opener() to build an opener; requests sent through this opener carry the credentials, as if authentication had already succeeded
- Call the opener's open() method to open the link and complete the authentication; the result is the source code of the page behind the login
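The three steps above can be sketched as follows. The local URL and the credentials are hypothetical placeholders, and demo() only works if such a protected server actually exists, so it is defined but not called.

```python
import urllib.error
import urllib.request

# Hypothetical site and credentials, used for illustration only.
url = "http://localhost:5000/"
username = "username"
password = "password"

# 1. Register the credentials with a password manager.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, username, password)

# 2. Wrap the manager in a handler and build an opener from it.
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(auth_handler)


def demo():
    """3. Open the protected page through the opener."""
    try:
        with opener.open(url) as response:
            print(response.read().decode("utf-8"))
    except urllib.error.URLError as exc:
        print("Request failed:", exc.reason)
```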
2. Proxies
Crawlers inevitably need to use proxies. To add one:
Use ProxyHandler, whose argument is a dictionary: each key is a protocol type, such as http or https, and each value is the proxy URL. Multiple proxies can be added
Then use this handler with build_opener() to construct an opener and send the request
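A minimal sketch of the proxy setup follows. The proxy address is a made-up local one and the target URL is an assumption; replace both with values that exist in your environment before calling demo().

```python
import urllib.error
import urllib.request

# Hypothetical proxy address; replace it with a proxy you can actually use.
proxies = {
    "http": "http://127.0.0.1:9743",
    "https": "http://127.0.0.1:9743",
}
proxy_handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_handler)


def demo():
    """Send a request through the proxy (needs network and a live proxy)."""
    try:
        with opener.open("https://www.python.org") as response:
            print(response.status)
    except urllib.error.URLError as exc:
        print("Request failed:", exc.reason)
```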
3. Cookies
Get a website's cookies and print them
First declare a CookieJar object, then use HTTPCookieProcessor to build a handler, and finally use build_opener() to build an opener and call its open() method
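The sequence just described might look like the sketch below. The target URL is an assumption (any site that sets cookies will do), and demo() needs network access, so it is not called here.

```python
import http.cookiejar
import urllib.request

# Declare the jar, wrap it in a handler, and build the opener.
cookie_jar = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(handler)


def demo():
    """Visit a site and print the cookies it set (needs network)."""
    with opener.open("https://www.python.org"):
        pass  # the handler stores the cookies while the response arrives
    for cookie in cookie_jar:
        print(cookie.name + "=" + cookie.value)
```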
Saving cookies to a file
To write a file, replace CookieJar with MozillaCookieJar. It is a subclass of CookieJar that handles cookie-related file operations such as reading and saving, and it stores cookies in the Mozilla browser's cookie format
LWPCookieJar can also read and save cookies, but its format differs from MozillaCookieJar's: it saves the cookie file in libwww-perl (LWP) format
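Here is a rough sketch of saving with both jar types. make_cookie_opener() is a helper name invented for this example, and the temporary directory and file names are assumptions. The jars are left empty so the sketch runs offline, which means only the format header of each file is written.

```python
import http.cookiejar
import os
import tempfile
import urllib.request


def make_cookie_opener(jar):
    """Build an opener that stores received cookies in the given jar."""
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


# Temporary directory and file names are assumptions for this sketch.
folder = tempfile.mkdtemp()
mozilla_jar = http.cookiejar.MozillaCookieJar(os.path.join(folder, "cookies.txt"))
lwp_jar = http.cookiejar.LWPCookieJar(os.path.join(folder, "cookies.lwp"))

# In real use, call make_cookie_opener(jar).open(url) first so the jar has
# cookies to write. Here the jars are still empty.
for jar in (mozilla_jar, lwp_jar):
    # ignore_discard keeps session cookies; ignore_expires keeps expired ones.
    jar.save(ignore_discard=True, ignore_expires=True)
    with open(jar.filename) as handle:
        print(handle.readline().strip())  # first line identifies the format
```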
Reading and using cookies
The load() method reads a local cookie file and retrieves its contents; then build a handler and an opener as before to complete the request
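A self-contained sketch of load() is shown below. To avoid depending on an earlier session, it first writes a cookie file containing one hand-made cookie; the cookie's name, value, domain, and the file path are all invented for the example.

```python
import http.cookiejar
import os
import tempfile
import urllib.request

# Write a cookie file first; in practice it would come from an earlier run.
path = os.path.join(tempfile.mkdtemp(), "cookies.lwp")
saved_jar = http.cookiejar.LWPCookieJar(path)
saved_jar.set_cookie(http.cookiejar.Cookie(
    version=0, name="session", value="abc123",
    port=None, port_specified=False,
    domain="example.com", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
))
saved_jar.save(ignore_discard=True, ignore_expires=True)

# load() must use the same jar class that wrote the file.
cookie_jar = http.cookiejar.LWPCookieJar()
cookie_jar.load(path, ignore_discard=True, ignore_expires=True)
for cookie in cookie_jar:
    print(cookie.name + "=" + cookie.value)

# Build the handler and opener as before; requests made through this
# opener send the loaded cookies to matching domains.
handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(handler)
```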
Official documentation:
https://docs.python.org/3/library/urllib.request.html#basehandler-objects