Crawling HTTPS Web sites

Source: Internet
Author: User

Python 2.7

import urllib2
import ssl

weburl = "https://www.douban.com/"
webheader = {
    'Accept': 'text/html, application/xhtml+xml, */*',
    # 'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Host': 'www.douban.com'
}

# Build an SSL context that skips certificate verification, then fetch the page.
context = ssl._create_unverified_context()
req = urllib2.Request(url=weburl, headers=webheader)
webpage = urllib2.urlopen(req, context=context)
data = webpage.read().decode('utf-8')

print data
print type(data)
print type(webpage)
print webpage.geturl()
print webpage.info()
print webpage.getcode()

Python 3.6

import urllib.request
import ssl

weburl = "https://www.douban.com/"
webheader = {
    'Accept': 'text/html, application/xhtml+xml, */*',
    # 'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Host': 'www.douban.com'
}

# Build an SSL context that skips certificate verification, then fetch the page.
context = ssl._create_unverified_context()
req = urllib.request.Request(url=weburl, headers=webheader)
webpage = urllib.request.urlopen(req, context=context)
data = webpage.read().decode('utf-8')

print(data)
print(type(webpage))
print(webpage.geturl())
print(webpage.info())
print(webpage.getcode())

Crawling Douban with a crawler raised the error "ssl: certificate_verify_failed". Starting with Python 2.7.9, a new behavior was introduced: when urlopen opens an HTTPS link, the SSL certificate is verified. This exception is thrown when the target website uses a self-signed (or otherwise unverifiable) certificate.

There are two solutions:

1) Create an unverified context with ssl and pass it in via the context parameter of urlopen

import ssl

context = ssl._create_unverified_context()

webpage = urllib.request.urlopen(req, context=context)

2) Globally disable certificate verification

import ssl

ssl._create_default_https_context = ssl._create_unverified_context
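
As a rough sketch (assuming Python 3 and the same Douban URL and User-Agent as in the examples above), once the global override is installed, urlopen can be called without passing a context at all:

import ssl
import urllib.request

# A rough sketch: once the default HTTPS context factory is replaced,
# urlopen no longer needs an explicit context= argument.
ssl._create_default_https_context = ssl._create_unverified_context

req = urllib.request.Request(
    url="https://www.douban.com/",
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'},
)
webpage = urllib.request.urlopen(req)
print(webpage.getcode())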

In addition, if you are using the get method of the requests module, it has a verify parameter; set it to False.
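
For example, a minimal sketch assuming the requests library is installed (same URL and User-Agent as above; requests/urllib3 will typically emit an InsecureRequestWarning when verification is turned off):

import requests

# verify=False disables certificate verification for this single request.
resp = requests.get(
    "https://www.douban.com/",
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'},
    verify=False,
)
print(resp.status_code)
print(resp.text[:200])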

Fix: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

'Accept-Encoding': 'gzip, deflate',

This header tells the server that the client can accept compressed data. When it is present, the server may compress large responses before sending them back, and the client is expected to decompress the body after receiving it. The error occurs because the program decodes the still-compressed (gzip) response as UTF-8 without decompressing it first, so deleting this header line makes the problem go away.
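
Alternatively, if you want to keep the Accept-Encoding header, a rough sketch of decompressing the response yourself (assuming Python 3; the header set here is abbreviated, and the Content-Encoding response header is checked before decompressing):

import gzip
import ssl
import urllib.request

weburl = "https://www.douban.com/"
webheader = {
    'Accept-Encoding': 'gzip, deflate',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
}
context = ssl._create_unverified_context()
req = urllib.request.Request(url=weburl, headers=webheader)
webpage = urllib.request.urlopen(req, context=context)
raw = webpage.read()

# Only decompress if the server actually answered with a gzip-encoded body.
if webpage.info().get('Content-Encoding') == 'gzip':
    raw = gzip.decompress(raw)

data = raw.decode('utf-8')
print(data[:200])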
