1. Cookie operations based on the requests Module
Introduction: Sometimes, when we use a crawler to scrape user-specific data (for example, the data on zhangsan's personal homepage on Renren), the ordinary requests operations covered earlier often fail to achieve the desired result, for example:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests

if __name__ == "__main__":
    # URL of zhangsan's personal information page on Renren
    url = 'http://www.renren.com/289676607/profile'
    # disguise the UA
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    }
    # send the request and get the Response object
    response = requests.get(url=url, headers=headers)
    # write the response content to a file
    with open('./renren.html', 'w', encoding='utf-8') as fp:
        fp.write(response.text)
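Before looking at why, it helps to confirm what was actually fetched. The sketch below is an addition, not part of the original case: it re-runs the request and inspects the Response object. requests follows redirects by default, so response.history and response.url reveal whether we were bounced to another page.

import requests

# hypothetical re-check of the request above, to see where it really ended up
url = 'http://www.renren.com/289676607/profile'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url=url, headers=headers)

print(response.history)   # e.g. [<Response [302]>] if the server redirected us
print(response.url)       # the URL that was finally served (the login page here)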
-The result shows that the data written to the file is not zhangsan's profile page but the Renren login page. Why? First, let's review the concept and function of cookies:
-Cookie concept: when a user accesses a domain through a browser for the first time, the Web server being visited sends a piece of data to the client in order to maintain state between the server and the client. That piece of data is the cookie.
-Cookie function: browsers constantly exchange data with servers. For example, when we log in to a mailbox or another site, we often tick "remember me for 30 days" or "log in automatically". How is that information recorded? The answer is today's protagonist, the cookie, which is set by the HTTP server and saved in the browser. HTTP itself is a stateless protocol: once a data exchange finishes, the connection between server and client is closed, and a new connection is needed for the next exchange. It is like shopping at a supermarket without a membership card: after we buy something, the supermarket keeps no record of our purchase; once we get a membership card, the supermarket has our purchase history. The cookie is like the membership card that accumulates points, the goods are our information, the supermarket is the server's backend system, and HTTP is the transaction process. A small illustration of this mechanism with requests is sketched below.
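As a small illustration with requests (a sketch only; httpbin.org is used merely as an example endpoint that sets a cookie and is not part of the original case), the cookies a server sets are collected into a RequestsCookieJar and can be sent back on a later request:

import requests

# ask the example endpoint to set a cookie; allow_redirects=False keeps the
# response that actually carries the Set-Cookie header, so its jar is easy to inspect
response = requests.get('http://httpbin.org/cookies/set?flavor=chocolate',
                        allow_redirects=False)
print(response.cookies.get_dict())   # {'flavor': 'chocolate'}

# sending the jar back lets the server "recognize" us on the next request
followup = requests.get('http://httpbin.org/cookies', cookies=response.cookies)
print(followup.text)                 # the server echoes the cookie it received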
-After this introduction to cookies, you should understand why the case above crawled the login page instead of zhangsan's personal information page. So how do we crawl zhangsan's personal information page?
Ideas:
1. We need to use the crawler to send the login request to Renren and capture the cookie produced by that request.
2. When requesting the URL of the personal information page, the request must carry the cookie from step 1. Only if the cookie is carried can the server identify which user the request belongs to and respond with that user's information page.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests

if __name__ == "__main__":
    # login request URL (obtained with a packet-capture tool)
    post_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=201873958471'
    # create a session object; it automatically stores the cookie produced by the
    # login request and carries it on later requests
    session = requests.session()
    # disguise the UA
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    }
    # login form data captured with the packet-capture tool (values belong to one
    # specific login session)
    formdata = {
        'email': '000000',
        'icode': '',
        'origURL': 'http://www.renren.com/home',
        'domain': 'renren.com',
        'key_id': '1',
        'captcha_type': 'web_login',
        'password': 'login',
        'rkey': '44fd96c219c593f3c9612360c80310a3',
        'f': 'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3D%26wd%3D%26eqid%3Dba95daf5000065ce000000035b120219',
    }
    # send the login request with session; the goal is to store the cookie from the
    # response in the session
    session.post(url=post_url, data=formdata, headers=headers)

    get_url = 'http://www.renren.com/960481378/profile'
    # send the request with session again; this request already carries the cookie
    response = session.get(url=get_url, headers=headers)
    # set the encoding of the response content
    response.encoding = 'utf-8'
    # write the response content to a file
    with open('./renren.html', 'w') as fp:
        fp.write(response.text)
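As an alternative to the session object, requests also accepts a cookies parameter on individual calls, which is handy if you have already copied the login cookie from the browser or a packet-capture tool. A minimal sketch (the cookie name and value below are placeholders, not real Renren credentials):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}
# placeholder cookie copied manually after logging in through the browser
cookies = {'t': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'}

response = requests.get('http://www.renren.com/960481378/profile',
                        headers=headers, cookies=cookies)
print(response.status_code)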
2. Proxy operations based on the requests Module
- What is a proxy?
A proxy is a third party that handles business on behalf of the principal. For example, agents in everyday life: purchasing agents, intermediaries, and so on.
Why do crawlers need a proxy?
Some websites have anti-crawler measures. For example, many sites track how many times an IP address visits within a certain period; if the access frequency is too high to look like a normal visitor, the site may deny access to that IP address. Therefore, we need to set up several proxy IP addresses and switch to a different one every so often: even if one IP gets blocked, we can change to another IP and continue crawling.
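One way to put this rotation into practice is to keep a small pool of proxies and retry with a different one whenever a request fails; a minimal sketch follows (the proxy addresses are placeholders, not guaranteed live servers):

import random
import requests

# placeholder proxy pool; real addresses would come from a proxy provider
proxy_pool = [
    {'http': 'http://112.115.57.20:3128'},
    {'http': 'http://121.41.171.223:3128'},
]

def get_with_rotating_proxy(url, headers, retries=3):
    """Try the request through randomly chosen proxies, switching on failure."""
    for _ in range(retries):
        proxy = random.choice(proxy_pool)
        try:
            return requests.get(url, headers=headers, proxies=proxy, timeout=5)
        except requests.exceptions.RequestException:
            continue   # this proxy failed or timed out, try another one
    return None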
Proxy category:
Forward Proxy: the proxy obtains data on behalf of the client. A forward proxy is used to protect the client from being identified or held accountable.
Reverse Proxy: the proxy provides data on behalf of the server. A reverse proxy is used to protect the server or to perform load balancing.
Free proxy IP
Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import random

if __name__ == "__main__":
    # UAs of different browsers
    header_list = [
        # Maxthon (Aoyou)
        {"User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)"},
        # Firefox
        {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"},
        # Google Chrome
        {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"},
    ]
    # different proxy IPs
    proxy_list = [
        {"http": "112.115.57.20:3128"},
        {"http": "121.41.171.223:3128"},
    ]
    # pick a random UA and proxy IP
    header = random.choice(header_list)
    proxy = random.choice(proxy_list)

    url = 'http://www.baidu.com/s?ie=utf-8&wd=ip'
    # the proxies parameter sets the proxy
    response = requests.get(url=url, headers=header, proxies=proxy)
    response.encoding = 'utf-8'

    with open('daili.html', 'wb') as fp:
        fp.write(response.content)
    # switch back to the original IP
    requests.get(url, proxies={"http": ""})
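One detail worth noting: the keys of the proxies dict are URL schemes, so a request to an https:// address only goes through the proxy if an 'https' entry is present. A minimal sketch (the proxy address is the same placeholder used above):

import requests

# map both schemes to the same placeholder proxy so http and https requests use it
proxies = {
    'http': 'http://112.115.57.20:3128',
    'https': 'http://112.115.57.20:3128',
}
response = requests.get('https://www.baidu.com/s?ie=utf-8&wd=ip',
                        proxies=proxies, timeout=5)
print(response.status_code)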