Python 2.7
import urllib2
import ssl

weburl = "https://www.douban.com/"
webheader = {
    'Accept': 'text/html, application/xhtml+xml, */*',
    # 'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Host': 'www.douban.com'
}
# Skip certificate verification for this request only
context = ssl._create_unverified_context()
req = urllib2.Request(url=weburl, headers=webheader)
webpage = urllib2.urlopen(req, context=context)
data = webpage.read().decode('utf-8')
print data
print type(data)
print type(webpage)
print webpage.geturl()
print webpage.info()
print webpage.getcode()
Python 3.6
import urllib.request
import ssl

weburl = "https://www.douban.com/"
webheader = {
    'Accept': 'text/html, application/xhtml+xml, */*',
    # 'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Host': 'www.douban.com'
}
# Skip certificate verification for this request only
context = ssl._create_unverified_context()
req = urllib.request.Request(url=weburl, headers=webheader)
webpage = urllib.request.urlopen(req, context=context)
data = webpage.read().decode('utf-8')
print(data)
print(type(webpage))
print(webpage.geturl())
print(webpage.info())
print(webpage.getcode())
When crawling Douban with the crawler, the error "SSL: CERTIFICATE_VERIFY_FAILED" is raised. Python 2.7.9 introduced a new feature: when you open an HTTPS link with urlopen, the SSL certificate is verified. This exception is thrown when the target website uses a self-signed certificate.
There are two solutions:
1) Create an unverified context with ssl and pass it to urlopen through the context parameter
import ssl
context = ssl._create_unverified_context()
webpage = urllib.request.urlopen(req, context=context)
2) Disable certificate verification globally
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
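The first solution can also be sketched with public ssl APIs instead of the underscore-prefixed helper. This is a minimal Python 3 sketch, not the only way to do it; the function names build_unverified_context and fetch are hypothetical and chosen only for illustration. Nothing is downloaded until fetch is called.

```python
import ssl
import urllib.request


def build_unverified_context():
    """Build an SSL context that skips certificate checks (hypothetical helper)."""
    context = ssl.create_default_context()
    # check_hostname has to be switched off before verify_mode can be CERT_NONE
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch(url, headers=None, context=None):
    """Open url with the given headers and SSL context, return the decoded body."""
    req = urllib.request.Request(url=url, headers=headers or {})
    with urllib.request.urlopen(req, context=context) as webpage:
        return webpage.read().decode('utf-8')


context = build_unverified_context()
print(context.verify_mode == ssl.CERT_NONE)  # verification is disabled
print(context.check_hostname)                # hostname matching is disabled too
```

A call such as fetch("https://www.douban.com/", webheader, context) would then behave like the per-request solution above, while leaving certificate verification enabled for every other request in the program.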
In addition, if you use the get method of the requests module, it takes a verify parameter; set it to False.
Fixing "'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte"
'Accept-Encoding': 'gzip, deflate',
This header tells the server that the client can accept compressed data. The server compresses large responses before sending them back, and a browser such as IE decompresses them locally after receiving them. The error occurs because your program does not decompress the response, so deleting this line avoids the problem.
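If you would rather keep the header, the response can be decompressed before decoding. The byte 0x8b in position 1 is the second byte of the gzip magic number (0x1f 0x8b), which is why the decoder fails on a compressed body. The following is a minimal Python 3 sketch; decode_body is a hypothetical helper name, and it only handles gzip, not deflate.

```python
import gzip


def decode_body(raw, content_encoding=None):
    """Return the response body as text, decompressing gzip data first.

    raw is the bytes returned by read(); content_encoding is the value of
    the Content-Encoding response header, if any (hypothetical helper).
    """
    encoding = (content_encoding or "").lower()
    # Trust the header, but also check for the gzip magic number 0x1f 0x8b
    if encoding == "gzip" or raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")


# Simulate a compressed response body without touching the network
sample = gzip.compress("<html>douban</html>".encode("utf-8"))
print(hex(sample[1]))
print(decode_body(sample, "gzip"))
```

With a real response this would be called as decode_body(webpage.read(), webpage.info().get('Content-Encoding')).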
Crawling HTTPS Web sites