Python Crawler Basics (II): GET and POST Methods of the urllib2 Library


By default, urllib2 supports only the GET and POST methods of HTTP/HTTPS.

I. GET Requests

GET requests are generally used to fetch data from the server. For example, when we search Baidu for "秦时明月" (Qin Shi Ming Yue), the effective URL in the address bar is: https://www.baidu.com/s?wd=秦时明月

Capturing the request with a packet-capture tool shows that the actual target URL of the GET is: https://www.baidu.com/s?wd=%E7%A7%A6%E6%97%B6%E6%98%8E%E6%9C%88

These two URLs are actually the same: the string after wd= is simply the URL encoding of "秦时明月". So we can try using the default GET method to send the request.
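That equivalence is easy to verify with urllib.quote() and urllib.unquote(), Python 2's per-string encoding helpers (a minimal sketch added here for illustration):

# -*- coding: utf-8 -*-
import urllib

word = u"秦时明月".encode("utf-8")  # Baidu expects the UTF-8 bytes of the query
print urllib.quote(word)            # -> %E7%A7%A6%E6%97%B6%E6%98%8E%E6%9C%88
print urllib.unquote("%E7%A7%A6%E6%97%B6%E6%98%8E%E6%9C%88")  # -> 秦时明月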

# -*- coding: utf-8 -*-
import urllib   # urllib handles the URL encoding
import urllib2

url = "http://www.baidu.com/s"
word = {"wd": "秦时明月"}
# convert the dict to URL-encoded form (a string)
word = urllib.urlencode(word)
# "?" is the delimiter between the URL and its query string
newurl = url + "?" + word
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
request = urllib2.Request(newurl, headers=headers)
response = urllib2.urlopen(request)
print response.read()

Executing this code returns the same page as entering "秦时明月" into the Baidu search box. So for a GET request: use urllib.urlencode() to URL-encode the query string, splice it onto the full URL, and then send the request.

II. POST Requests

The Request object takes a data parameter, which is what a POST uses: the data we want to transmit is passed through this parameter as key-value pairs in a dictionary, URL-encoded before sending.
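In other words, urllib2 decides between GET and POST by whether data is supplied. A minimal sketch of the pattern (http://www.example.com/ is just a placeholder):

import urllib
import urllib2

data = urllib.urlencode({"key": "value"})  # the form data, URL-encoded
# with data present the request becomes a POST; without it, a GET
request = urllib2.Request("http://www.example.com/", data=data)
print request.get_method()  # -> POST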

Open Youdao Translate at http://fanyi.youdao.com/ and enter some test input ("python"). We find that the URL in the address bar does not change, but a packet-capture tool can reveal the destination address of the POST request:

With that, we can try to simulate this POST request:

import urllib
import urllib2

# the destination URL of the POST request
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
formdata = {
    "i": "python",
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": "15082966550971",
    "sign": "2a6d78290492d163dbd6803b29e2489c",
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_ENTER",
    "typoResult": "true"
}
data = urllib.urlencode(formdata)
request = urllib2.Request(url, data=data, headers=headers)
response = urllib2.urlopen(request)
print response.read()

This is a simple way to send a POST request, and following this idea you can write an interface program for Youdao Translate; a rough sketch follows.
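The sketch below wraps the captured request in a function and parses the JSON reply. The trimmed-down form fields and the "translateResult" layout are assumptions taken from one observed response; the server may reject requests if it validates the omitted salt/sign fields:

import json
import urllib
import urllib2

def youdao_translate(text):
    # reuse the captured POST endpoint; form fields are trimmed to the
    # essentials here and may need the full captured set (salt, sign, ...)
    url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
    formdata = {"i": text, "from": "AUTO", "to": "AUTO", "doctype": "json"}
    data = urllib.urlencode(formdata)
    request = urllib2.Request(url, data=data, headers=headers)
    result = json.loads(urllib2.urlopen(request).read())
    # assumed reply shape: {"translateResult": [[{"src": ..., "tgt": ...}]], ...}
    return result["translateResult"][0][0]["tgt"]

print youdao_translate("python")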

III. Getting AJAX-Loaded Content

Some web content is loaded via AJAX, so sending a request to the page directly does not return the data. But AJAX endpoints generally return JSON, so you can GET or POST the AJAX address directly and receive the JSON data.

Take the Douban movie rankings as an example: https://movie.douban.com/typerank?type_name=剧情&type=11&interval_id=100:90&action=. Packet capture yields the URL that delivers the JSON file:

In this URL, the two parameters start and limit control which slice of the data is loaded, so we can pass them in separately as parameters:

import urllib
import urllib2

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}

# change these two parameters to fetch `limit` entries starting from `start`
formdata = {
    'start': '0',
    'limit': '10'
}

data = urllib.urlencode(formdata)
request = urllib2.Request(url, data=data, headers=headers)
response = urllib2.urlopen(request)
print response.read()
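Since the endpoint returns JSON, the body can be parsed directly instead of printed raw. A minimal self-contained sketch; the "title" and "score" field names are assumptions taken from one captured response:

import json
import urllib
import urllib2

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
data = urllib.urlencode({'start': '0', 'limit': '10'})
response = urllib2.urlopen(urllib2.Request(url, data=data, headers=headers))

# iterate over the returned JSON list of movies
for movie in json.loads(response.read()):
    # field names observed in one capture; not guaranteed by Douban
    print movie["title"], movie["score"]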

For dynamic pages like this, the key is to find where the data actually comes from; alternatively, you can use Selenium with PhantomJS to simulate a browser and obtain the data.

IV. Handling SSL Certificate Verification in HTTPS Requests

Like a web browser, urllib2 can verify the SSL certificate of an HTTPS request. If the website's SSL certificate is CA-signed, the site can be accessed normally. But if certificate validation fails, or the operating system does not trust the server's security certificate, access is refused; for example, when a browser visits the 12306 website at https://www.12306.cn/mormhweb/, it warns the user that the certificate is not trusted.

Import"https://www.12306.cn/mormhweb/"= {"  User-agent""mozilla/5.0 (Windows NT 10.0; Win64; x64) applewebkit/537.36 (khtml, like Gecko) chrome/54.0.2840.99 safari/537.36"= urllib2. Request (URL, headers == urllib2.urlopen (request)print response.read ()

Running this program produces the following error:

urllib2.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:661)>

Therefore, when we encounter such a site, we need to handle the SSL certificate separately and make the program ignore SSL certificate validation errors, so that the site can be accessed normally.

import urllib
import urllib2
# import Python's SSL handling module
import ssl

# ignore unverified SSL certificate authentication
context = ssl._create_unverified_context()

url = "https://www.12306.cn/mormhweb/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
request = urllib2.Request(url, headers=headers)
# specify the context parameter in the urlopen() call
response = urllib2.urlopen(request, context=context)
print response.read()
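Note that ssl._create_unverified_context() is an underscore-prefixed, nominally private helper. On Python 2.7.9+ the same effect can be achieved through the public API, roughly like this:

import ssl
import urllib2

# build a default context, then disable hostname checking and verification
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

response = urllib2.urlopen("https://www.12306.cn/mormhweb/", context=context)
print response.read()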

