Python Crawler Basics (II): GET and POST Methods of the urllib2 Library


By default, urllib2 supports only the GET and POST methods of HTTP/HTTPS.

I. GET Requests

GET requests are generally used to fetch data from the server. For example, when we search Baidu for "秦时明月" (Qin Shi Ming Yue), the effective URL in the address bar is: https://www.baidu.com/s?wd=秦时明月

Capturing the request with a packet-capture tool shows that the actual target URL of the GET is: https://www.baidu.com/s?wd=%E7%A7%A6%E6%97%B6%E6%98%8E%E6%9C%88

These two URLs are actually the same: the string after wd= is simply the URL encoding of "秦时明月". So we can try using the default GET method to send the request.
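That equivalence is easy to verify with urllib.quote() and urllib.unquote(), Python 2's per-string encoding helpers (a minimal sketch added here for illustration):

# -*- coding: utf-8 -*-
import urllib

word = u"秦时明月".encode("utf-8")  # Baidu expects the UTF-8 bytes of the query
print urllib.quote(word)            # -> %E7%A7%A6%E6%97%B6%E6%98%8E%E6%9C%88
print urllib.unquote("%E7%A7%A6%E6%97%B6%E6%98%8E%E6%9C%88")  # -> 秦时明月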

# -*- coding: utf-8 -*-
import urllib   # urllib handles the URL encoding
import urllib2

url = "http://www.baidu.com/s"
word = {"wd": "秦时明月"}
# convert the dict to URL-encoded form (a string)
word = urllib.urlencode(word)
# "?" is the delimiter between the URL and its query string
newurl = url + "?" + word
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
request = urllib2.Request(newurl, headers=headers)
response = urllib2.urlopen(request)
print response.read()

Executing this code returns the same page as entering "秦时明月" into the Baidu search box. So for a GET request: use urllib.urlencode() to URL-encode the query string, splice it onto the full URL, and then send the request.

II. POST Requests

The Request object takes a data parameter, which is what a POST uses: the data we want to transmit is passed through this parameter as key-value pairs in a dictionary, URL-encoded before sending.
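In other words, urllib2 decides between GET and POST by whether data is supplied. A minimal sketch of the pattern (http://www.example.com/ is just a placeholder):

import urllib
import urllib2

data = urllib.urlencode({"key": "value"})  # the form data, URL-encoded
# with data present the request becomes a POST; without it, a GET
request = urllib2.Request("http://www.example.com/", data=data)
print request.get_method()  # -> POST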

Open Youdao Translate at http://fanyi.youdao.com/ and enter some test input ("python"). We find that the URL in the address bar does not change, but a packet-capture tool can reveal the destination address of the POST request:

With that, we can try to simulate this POST request:

import urllib
import urllib2

# the destination URL of the POST request
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
formdata = {
    "i": "python",
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": "15082966550971",
    "sign": "2a6d78290492d163dbd6803b29e2489c",
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_ENTER",
    "typoResult": "true"
}
data = urllib.urlencode(formdata)
request = urllib2.Request(url, data=data, headers=headers)
response = urllib2.urlopen(request)
print response.read()

This is a simple way to send a POST request, and following this idea you can write an interface program for Youdao Translate; a rough sketch follows.
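The sketch below wraps the captured request in a function and parses the JSON reply. The trimmed-down form fields and the "translateResult" layout are assumptions taken from one observed response; the server may reject requests if it validates the omitted salt/sign fields:

import json
import urllib
import urllib2

def youdao_translate(text):
    # reuse the captured POST endpoint; form fields are trimmed to the
    # essentials here and may need the full captured set (salt, sign, ...)
    url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
    formdata = {"i": text, "from": "AUTO", "to": "AUTO", "doctype": "json"}
    data = urllib.urlencode(formdata)
    request = urllib2.Request(url, data=data, headers=headers)
    result = json.loads(urllib2.urlopen(request).read())
    # assumed reply shape: {"translateResult": [[{"src": ..., "tgt": ...}]], ...}
    return result["translateResult"][0][0]["tgt"]

print youdao_translate("python")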

III. Getting AJAX-Loaded Content

Some web content is loaded via AJAX, so sending a request to the page directly does not return the data. But AJAX endpoints generally return JSON, so you can GET or POST the AJAX address directly and receive the JSON data.

Take the Douban movie rankings as an example: https://movie.douban.com/typerank?type_name=剧情&type=11&interval_id=100:90&action=. Packet capture yields the URL that delivers the JSON file:

In this URL, the two parameters start and limit control which slice of the data is loaded, so we can pass them in separately as parameters:

import urllib
import urllib2

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}

# change these two parameters to fetch `limit` entries starting from `start`
formdata = {
    'start': '0',
    'limit': '10'
}

data = urllib.urlencode(formdata)
request = urllib2.Request(url, data=data, headers=headers)
response = urllib2.urlopen(request)
print response.read()
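Since the endpoint returns JSON, the body can be parsed directly instead of printed raw. A minimal self-contained sketch; the "title" and "score" field names are assumptions taken from one captured response:

import json
import urllib
import urllib2

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
data = urllib.urlencode({'start': '0', 'limit': '10'})
response = urllib2.urlopen(urllib2.Request(url, data=data, headers=headers))

# iterate over the returned JSON list of movies
for movie in json.loads(response.read()):
    # field names observed in one capture; not guaranteed by Douban
    print movie["title"], movie["score"]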

For dynamic pages like this, the key is to find where the data actually comes from; alternatively, you can use Selenium with PhantomJS to simulate a browser and obtain the data.

IV. Handling SSL Certificate Verification in HTTPS Requests

Like a web browser, urllib2 can verify the SSL certificate of an HTTPS request. If the website's SSL certificate is CA-signed, the site can be accessed normally. But if certificate validation fails, or the operating system does not trust the server's security certificate, access is refused; for example, when a browser visits the 12306 website at https://www.12306.cn/mormhweb/, it warns the user that the certificate is not trusted.

Import"https://www.12306.cn/mormhweb/"= {"  User-agent""mozilla/5.0 (Windows NT 10.0; Win64; x64) applewebkit/537.36 (khtml, like Gecko) chrome/54.0.2840.99 safari/537.36"= urllib2. Request (URL, headers == urllib2.urlopen (request)print response.read ()

Running this program produces the following error:

urllib2.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:661)>

Therefore, when we encounter such a site, we need to handle the SSL certificate separately and make the program ignore SSL certificate validation errors, so that the site can be accessed normally.

import urllib
import urllib2
# import Python's SSL handling module
import ssl

# ignore unverified SSL certificate authentication
context = ssl._create_unverified_context()

url = "https://www.12306.cn/mormhweb/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
request = urllib2.Request(url, headers=headers)
# specify the context parameter in the urlopen() call
response = urllib2.urlopen(request, context=context)
print response.read()
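Note that ssl._create_unverified_context() is an underscore-prefixed, nominally private helper. On Python 2.7.9+ the same effect can be achieved through the public API, roughly like this:

import ssl
import urllib2

# build a default context, then disable hostname checking and verification
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

response = urllib2.urlopen("https://www.12306.cn/mormhweb/", context=context)
print response.read()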

