1. SSL certificate issues
In the previous article we wrote a small crawler that downloaded several pages of Shanghai Lianjia property listings. In practice, when we make requests with urllib, we run into a problem: access can be blocked by SSL certificate verification.
When handling HTTPS requests, urllib performs SSL certificate verification; if the certificate does not pass verification, it warns the user that the certificate is untrusted (that is, it has not been authenticated by a CA).
When SSL verification fails like this, we have to handle the certificate ourselves: let the program explicitly ignore the SSL certificate verification error so the page can be accessed normally. As an example, we visit 12306.
    from urllib import request
    # import python's SSL processing module
    import ssl

    # Ignore SSL validation failure
    context = ssl._create_unverified_context()

    url = "https://www.12306.cn/mormhweb/"

    response = request.urlopen(url, context=context)
    html = response.read()
    print(html)
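For comparison, here is a minimal sketch (not from the original article) of what happens without the unverified context. Assuming the site's certificate still fails verification, as it did when this was written, urlopen() raises URLError wrapping the underlying SSL error:

    from urllib import request, error

    try:
        # No context argument, so the default (verifying) SSL context is used
        request.urlopen("https://www.12306.cn/mormhweb/")
    except error.URLError as e:
        # e.reason carries the underlying certificate verification error
        print("SSL verification failed:", e.reason)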
2. Handler processors and custom openers
So far we have been using urlopen(), which is a special opener that the urllib.request module has already built for us. But this basic urlopen() does not support proxies, cookies, or other advanced HTTP/HTTPS features, so we need handler processors to build a custom opener with the features we require.
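Before adding a proxy, here is a minimal sketch of the handler/opener pattern by itself (not from the original article; the test URL is just an example). We build an opener from a plain HTTPHandler and open a URL with it; setting debuglevel=1 makes urllib print the request and response it exchanges, which is handy when debugging a crawler.

    import urllib.request

    # Build an opener from a basic HTTP handler; debuglevel=1 prints the
    # outgoing request and incoming response headers for debugging
    http_handler = urllib.request.HTTPHandler(debuglevel=1)
    opener = urllib.request.build_opener(http_handler)

    # Use the custom opener instead of the module-level urlopen()
    response = opener.open("http://www.baidu.com/")
    print(response.read().decode('utf-8'))

The example below uses exactly the same pattern, only with a ProxyHandler so the request goes out through a proxy IP.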
    import urllib.request

    url = "http://www.whatismyip.com.tw/"

    # The parameter is a dictionary: the key is the proxy type, the value is the proxy IP and port number
    proxy_support = urllib.request.ProxyHandler({'http': '117.86.199.19:8118'})

    # then create an opener that contains the proxy handler
    opener = urllib.request.build_opener(proxy_support)
    opener.addheaders = [("User-Agent",
                          "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) "
                          "AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50")]

    # Option 1: install the opener into the default environment with install_opener();
    # from then on, urlopen() uses this custom opener
    urllib.request.install_opener(opener)
    response = urllib.request.urlopen(url)

    # Option 2: use a one-off opener.open() instead
    # req = urllib.request.Request(url)
    # response = opener.open(req)

    html = response.read().decode('utf-8')
    print(html)
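The same handler/opener pattern covers the other advanced features mentioned above. Below is a minimal sketch (not from the original article) using HTTPCookieProcessor: it stores whatever cookies the server sets and sends them back automatically on later requests made through the same opener.

    import http.cookiejar
    import urllib.request

    # A CookieJar holds cookies in memory; HTTPCookieProcessor wires it into
    # the opener so cookies are saved and resent automatically
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))

    response = opener.open("http://www.baidu.com/")
    for cookie in cookie_jar:
        print(cookie.name, "=", cookie.value)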
We can see that the IP visiting the website has been replaced by the proxy IP. In the setup above we also used addheaders to attach a User-Agent to the request. The User-Agent (often abbreviated UA) is part of the HTTP protocol and belongs to the request headers. It is a special string that tells the website the type and version of the browser you are using, the operating system and version, the browser engine, and so on. Setting it is also one of the most common ways to get around anti-crawler measures.
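Besides opener.addheaders, the User-Agent can also be set per request by passing a headers dictionary to a Request object. A minimal sketch, with an example UA string (use whichever browser identity you need):

    import urllib.request

    # Example User-Agent string; swap in the browser identity you want to present
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    }

    req = urllib.request.Request("http://www.whatismyip.com.tw/", headers=headers)
    response = urllib.request.urlopen(req)
    print(response.read().decode('utf-8'))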