25-3 cookie and proxy operations of the requests Module


1. Cookie operations based on the requests Module

Introduction: Sometimes, when we use a crawler to scrape user-related data (for example, the personal homepage of a specific user on Renren), the regular requests-module operations covered earlier often fail to achieve the desired result, for example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests

if __name__ == "__main__":
    # URL of the target user's personal profile page on Renren
    url = 'http://www.renren.com/289676607/profile'
    # Disguise the UA
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    }
    # Send the request and get the Response object
    response = requests.get(url=url, headers=headers)
    # Write the response content to a file
    with open('./renren.html', 'w', encoding='utf-8') as fp:
        fp.write(response.text)

 

- The result shows that the data written to the file is not the target user's profile page but the Renren login page. Why? First, let's review the concept and function of cookies:

- Cookie concept: when a user first accesses a domain through a browser, the web server sends a piece of data to the client so that state can be maintained between the web server and the client. That piece of data is a cookie.

- Cookie function: browsers constantly exchange data with servers. For example, when you log in to your email or some other site, you often tick "remember me for 30 days" or an automatic-login option. How is that information recorded? The answer is today's protagonist, the cookie, which is set by the HTTP server and stored by the browser. HTTP itself is a stateless protocol: once a data exchange is complete, the connection between server and client is closed, and a new connection has to be opened for the next exchange. It is like shopping at a supermarket without a membership card: after you buy something, the supermarket keeps no record of your purchases; once you have a card, it does. The cookie is like the membership card that accumulates points, the goods are our information, the supermarket is the server's backend system, and HTTP is the transaction process.
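To make the mechanism concrete, here is a minimal sketch of how a server-set cookie shows up in requests. Note that httpbin.org is used purely as an illustrative test endpoint and is not part of the original example:

import requests

# Ask the test server to set a cookie named "token" on its response.
# allow_redirects=False keeps the Set-Cookie response itself instead of following the redirect.
response = requests.get('http://httpbin.org/cookies/set?token=abc123', allow_redirects=False)

# The cookie the server set is exposed on the Response object
print(response.cookies.get_dict())   # {'token': 'abc123'}

# A second plain requests.get() call knows nothing about this cookie;
# HTTP is stateless, which is exactly the problem described above.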

- Now that cookies have been introduced, you know why the case above crawled the login page rather than the user's profile page. So how do we capture that profile page?

Ideas:

1. Use the crawler to send the login request to Renren and capture the cookie data created during login.

2. When requesting the URL of the profile page, the request must carry the cookie from step 1. Only when the cookie is carried can the server identify which user is making the request and respond with that user's profile page.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests

if __name__ == "__main__":
    # Login request URL (obtained with a packet-capture tool)
    post_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=201873958471'
    # Create a session object; it automatically stores and carries the cookies set by the server
    session = requests.session()
    # Disguise the UA
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    }
    # Login form data captured from the login request; replace email/password with your own account
    formdata = {
        'email': '000000',
        'icode': '',
        'origURL': 'http://www.renren.com/home',
        'domain': 'renren.com',
        'key_id': '1',
        'captcha_type': 'web_login',
        'password': 'login',
        'rkey': '44fd96c219c593f3c9612360c80310a3',
        'f': 'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3D%26wd%3D%26eqid%3Dba95daf5000065ce000000035b120219',
    }
    # Send the login request with the session so the returned cookie is stored in it
    session.post(url=post_url, data=formdata, headers=headers)

    get_url = 'http://www.renren.com/960481378/profile'
    # Send the profile request with the same session; it automatically carries the cookie
    response = session.get(url=get_url, headers=headers)
    # Set the encoding of the response content
    response.encoding = 'utf-8'
    # Write the response content to a file
    with open('./renren.html', 'w', encoding='utf-8') as fp:
        fp.write(response.text)
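As a side note, if you prefer not to use a Session object, requests also accepts a cookies= argument on each call. A minimal sketch, assuming you have logged in manually in a browser and copied the session cookie from the developer tools (the cookie name 't' and its value below are hypothetical placeholders):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
# Hypothetical cookie copied from the browser after a manual login;
# the actual key names depend on the target site.
cookies = {'t': 'paste-the-session-cookie-value-here'}

# The cookie is attached to this single request only
response = requests.get('http://www.renren.com/960481378/profile',
                        headers=headers, cookies=cookies)
print(response.status_code)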

 

2. Proxy operations based on the requests Module

  • What is a proxy?
    • A proxy is a third party that handles transactions on behalf of the principal. Everyday examples of agents: purchasing agents, intermediaries, and so on.

  • Why do crawlers need a proxy?

    • Some websites have anti-crawler measures. For example, many websites track how many times a given IP address visits within a certain period of time; if the access frequency is too high to look like a normal visitor, the site may block that IP. Therefore, we set up a pool of proxy IP addresses and switch to a different one every so often, so that even if one IP is blocked we can change IPs and keep crawling (a rotation sketch follows the code example below).

  • Proxy categories:

    • Forward proxy: obtains data on behalf of the client. A forward proxy protects the client from being identified and held accountable.

    • Reverse proxy: provides data on behalf of the server. A reverse proxy protects the server and can also handle load balancing.

  • Free proxy IP

    • http://www.goubanjia.com/

    • Xici proxy

    • Kuaidaili

  • Code

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import requests
    import random

    if __name__ == "__main__":
        # UAs of different browsers
        header_list = [
            # Maxthon
            {"User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)"},
            # Firefox
            {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"},
            # Google Chrome
            {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"},
        ]
        # Different proxy IPs (free proxies like these expire quickly)
        proxy_list = [
            {"http": "112.115.57.20:3128"},
            {"http": "121.41.171.223:3128"},
        ]
        # Pick a random UA and proxy IP
        header = random.choice(header_list)
        proxy = random.choice(proxy_list)

        url = 'http://www.baidu.com/s?ie=utf-8&wd=ip'
        # Third parameter: set the proxy
        response = requests.get(url=url, headers=header, proxies=proxy)
        response.encoding = 'utf-8'

        with open('daili.html', 'wb') as fp:
            fp.write(response.content)
        # Switch back to the original IP (an empty proxy value means no proxy)
        requests.get(url, proxies={"http": ""})
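As mentioned in the "why do crawlers need a proxy" item above, the point of a proxy pool is to keep crawling after an IP is blocked. A rough sketch of that rotation idea, assuming a small pool of HTTP proxies (the addresses below are placeholders and will almost certainly have expired):

import random
import requests

# Placeholder free proxies; refresh this list from a proxy site before use
proxy_pool = ['112.115.57.20:3128', '121.41.171.223:3128']

def fetch_with_rotation(url, retries=3):
    """Retry the request through different proxies until one works or retries run out."""
    for _ in range(retries):
        proxy = random.choice(proxy_pool)
        try:
            return requests.get(url, proxies={'http': proxy}, timeout=5)
        except requests.exceptions.RequestException:
            continue  # this proxy failed or was blocked; try another one
    return None

response = fetch_with_rotation('http://www.baidu.com/s?ie=utf-8&wd=ip')
if response is not None:
    print(response.status_code)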

     
