The Request Library: urllib
urllib is mainly divided into several parts:

urllib.request: sending requests
urllib.error: exceptions raised while handling a request
urllib.parse: processing URLs
urllib.robotparser: parsing robots.txt, which defines a site's crawler permissions
urllib.request methods
data = urllib.request.urlopen(url)  # returns a response object
data.read()      # the page source (bytes; decode() it to utf-8 text). Note: the source does not include content rendered by JS
data.info()      # the response headers
data.getcode()   # the HTTP status code
data.geturl()    # the URL that was actually requested
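Put together, a minimal sketch (http://www.baidu.com stands in here for any reachable page):

import urllib.request

url = 'http://www.baidu.com'
data = urllib.request.urlopen(url)   # send the request, get a response object
print(data.getcode())                # e.g. 200
print(data.geturl())                 # the URL that was actually requested
print(data.info())                   # the response headers
html = data.read().decode('utf-8')   # the page source as text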
Requests sent from a script carry a User-Agent header of Python-urllib/3.6; some sites use this to detect that a request was issued by a script and filter out the crawler. So how do we simulate a browser?
Use urllib.request.Request() to carry header information.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
# Instantiate a Request() object; headers can be passed in when it is initialized.
# urlopen() accepts not only a URL string but also a Request object.
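A complete sketch of a browser-like request (the URL is only an example):

import urllib.request

url = 'https://www.jianshu.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
request = urllib.request.Request(url, headers=headers)  # attach the browser headers
data = urllib.request.urlopen(request)                  # urlopen accepts the Request object
print(data.getcode())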
How does urllib.request decide whether a request is a GET or a POST?
Looking at the source code of urllib.request, you can see that if the request carries a data parameter it is a POST request; otherwise it is a GET request.
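For example, a minimal POST sketch (http://httpbin.org/post is assumed here as a test endpoint that echoes the form back):

import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
form = {'name': 'walker', 'age': 99}
data = urllib.parse.urlencode(form).encode('utf-8')  # POST data must be bytes

request = urllib.request.Request(url, data=data)  # the data parameter makes this a POST
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))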
Use of cookies
The standard pattern:
import http.cookiejar

# Create a CookieJar object
cookie_jar = http.cookiejar.CookieJar()
# Use HTTPCookieProcessor to create a cookie handler, and build an opener with it
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
# Install the opener in place of urlopen; after installation urlopen saves cookies too.
# If you don't want to install it, open URLs directly with opener.open(url)
urllib.request.install_opener(opener)
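A usage sketch: because the opener holds the CookieJar, cookies set by the first response are sent back automatically on later requests (https://www.baidu.com is just an example):

import urllib.request
import http.cookiejar

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))

opener.open('https://www.baidu.com')   # the first response sets some cookies
for cookie in cookie_jar:              # the jar now holds them
    print(cookie.name, cookie.value)
opener.open('https://www.baidu.com')   # the cookies are sent back automatically here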
Setting up a proxy
The standard pattern:
proxy = {'http': '183.232.188.18:80', 'https': '183.232.188.18:80'}  # proxy ip:port
# Create a proxy handler
proxies = urllib.request.ProxyHandler(proxy)
# Create an opener object
opener = urllib.request.build_opener(proxies, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
# If you don't want to install it, open URLs directly with opener.open(url);
# that way the opener uses the proxy while urlopen does not
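To check that the proxy is actually used, request a service that echoes the caller's IP; http://httpbin.org/ip is assumed here, and the proxy address above may well be dead by now:

import urllib.request

proxy = {'http': '183.232.188.18:80'}  # the example address above; likely no longer alive
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
response = opener.open('http://httpbin.org/ip', timeout=10)
print(response.read().decode('utf-8'))  # should show the proxy's IP, not yours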
urllib.error

URLError is the parent class: raised when the network is down or the server does not exist; it has a reason attribute but no code attribute.
HTTPError is a subclass: the server exists but the requested address does not; it has both code and reason attributes.
# url = 'http://www.adfdsfdsdsa.com'                   # raises URLError
# url = 'https://jianshu.com/p/jkhsdhgjkasdhgjkadfhg'  # raises HTTPError
How to tell them apart:
try:
    data = urllib.request.urlopen(url)
    print(data.read().decode())
except urllib.error.URLError as e:
    # HTTPError is a subclass of URLError and has a code attribute, so check it first
    if hasattr(e, 'code'):
        print('HTTPError')
    elif hasattr(e, 'reason'):
        print('URLError')
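Equivalently, the two classes can be caught separately; HTTPError must be listed first because it is a subclass of URLError:

import urllib.request
import urllib.error

url = 'https://jianshu.com/p/jkhsdhgjkasdhgjkadfhg'  # the bad-path example from above
try:
    data = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:   # the subclass first
    print('HTTPError', e.code, e.reason)
except urllib.error.URLError as e:    # the parent class second
    print('URLError', e.reason)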
urllib.parse

import urllib.parse

# urllib.parse.urljoin()    # join URLs; if both parts carry a domain, the second one wins
# urllib.parse.urlencode()  # turn a dict into a query string
# urllib.parse.quote()      # percent-encode a URL into ASCII; needed when it contains Chinese
# urllib.parse.unquote()    # URL-decode

urlencode:

>>> from urllib import parse
>>> query = {'name': 'walker', 'age': 99}
>>> parse.urlencode(query)
'name=walker&age=99'

quote / quote_plus:

>>> from urllib import parse
>>> parse.quote('a&b/c')       # does not encode the slash
'a%26b/c'
>>> parse.quote_plus('a&b/c')  # encodes the slash
'a%26b%2Fc'

unquote / unquote_plus:

>>> from urllib import parse
>>> parse.unquote('1+2')       # does not decode the plus sign
'1+2'
>>> parse.unquote_plus('1+2')  # decodes the plus sign to a space
'1 2'

url = 'http://www.baidu.com/?wd=书包'   # 书包 means 'schoolbag'
urllib.request.urlopen(url)             # raises an encoding error because the URL contains Chinese
# To request it correctly, percent-encode the Chinese part first:
url = 'http://www.baidu.com/?wd={}'.format(urllib.parse.quote('书包'))
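urljoin() is listed above but not demonstrated; a quick sketch of the two cases:

>>> from urllib import parse
>>> parse.urljoin('http://www.baidu.com/path/', 'page.html')
'http://www.baidu.com/path/page.html'
>>> # when both parts carry a domain, the second one wins:
>>> parse.urljoin('http://www.baidu.com/path/', 'https://www.jianshu.com/p/1')
'https://www.jianshu.com/p/1'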
urllib3

urllib3 is a separate third-party HTTP library. A minimal GET that turns off redirects:

import urllib3

http = urllib3.PoolManager()
response = http.request('GET', 'https://www.jianshu.com', redirect=False)  # turn off redirects
print(response.status)
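With redirect=False the 3xx response is returned as-is instead of being followed, which the status code makes visible. A sketch using http://httpbin.org/redirect/1, assumed here as a test endpoint that answers with a 302:

import urllib3

http = urllib3.PoolManager()
followed = http.request('GET', 'http://httpbin.org/redirect/1')  # redirects are followed by default
print(followed.status)       # 200

raw = http.request('GET', 'http://httpbin.org/redirect/1', redirect=False)
print(raw.status)            # 302: the redirect response itself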