Python version: Python 3
IDE: PyCharm 2017.3.3
I. Why Set the User-Agent
Some sites do not want to be visited by crawlers, so they inspect who is making the request; if it looks like a crawler, access is refused. By setting a User-Agent we can hide the crawler's identity. (The User-Agent is commonly abbreviated UA.)
The User-Agent lives in the request headers, and the server decides who is visiting by inspecting it. In Python, if you do not set a User-Agent, the program falls back to a default value that contains the word "Python", and anti-crawling sites will deny such requests.
Python lets us change this User-Agent to simulate a browser visit.
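You can see the tell-tale default identifier for yourself by inspecting a freshly built opener (a minimal sketch; no network access is made):

```python
from urllib import request

# A default opener carries a ('User-agent', 'Python-urllib/x.y') header,
# which is exactly what gives the crawler away to anti-bot checks.
opener = request.build_opener()
print(opener.addheaders)
```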
II. Common User-Agents
1.Android
- Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
- Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
- Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
2.Firefox
- Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
- Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0
3.Google Chrome
- Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
- Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19
4.iOS
- Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3
- Mozilla/5.0 (iPod; U; CPU like Mac OS X; en) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/3A101a Safari/419.3
Any of these user agents can be copied and used directly.
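In practice, crawlers often keep such strings in a pool and pick one at random per request, so that a single fixed UA is harder to flag. A small sketch (the pool below is an illustrative subset of the agents listed above):

```python
import random

# Illustrative pool built from the user agents listed above
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
    'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19',
]

def random_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}

print(random_headers())
```

The returned dict can be passed straight to `Request(url, headers=random_headers())` as shown in Method One below.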
III. How to Set the User-Agent
Method One:
Using the first Android user agent above, pass the headers parameter when creating the Request object. The code is as follows:
```python
from urllib import request

# Take CSDN as an example; CSDN is inaccessible without changing the user agent
url = 'http://www.csdn.net/'
head = {}
# Write the user agent information
head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'
# Create the Request object
req = request.Request(url, headers=head)
# Pass in the created Request object
response = request.urlopen(req)
# Read the response and decode it
html = response.read().decode('utf-8')
# Print the information
print(html)
```
Running it prints the HTML of the page (result screenshot omitted).
Method Two:
This also uses the first Android user agent above, but does not pass the headers parameter when creating the Request object; instead, it calls the add_header() method after creation to add the header. The code is as follows:
```python
from urllib import request

url = 'http://www.csdn.net/'
# Create the Request object
req = request.Request(url)
# Add the headers
req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19')
# Pass in the created Request object
response = request.urlopen(req)
# Read the response and decode it
html = response.read().decode('utf-8')
print(html)
```
The result is the same as with Method One.
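As a quick sanity check, you can confirm the header landed on the Request object without sending any request (note that Request normalizes header names internally, so the lookup key is capitalized):

```python
from urllib import request

url = 'http://www.csdn.net/'
req = request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) '
                             'AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19')

# Request stores header names in capitalized form, so query 'User-agent'
print(req.get_header('User-agent'))
```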
IV. Using IP Proxies
1. Why Use IP Proxy
Programs run very fast. If a crawler fetches pages from one fixed IP, the request rate will be far higher than any human could produce; a person cannot issue requests every few milliseconds. Many sites therefore set a threshold on per-IP access frequency, and an IP that exceeds it is assumed to be a crawler rather than a person.
2. Steps
(1) Call urllib.request.ProxyHandler(); its proxies parameter is a dictionary
(2) Create an opener with build_opener() (it plays the same role as urlopen, but lets us customize the handlers ourselves)
(3) Install the opener with install_opener()
3. Proxy IP Selection
Pick a proxy IP from the Xici (West Thorn) free proxy site, for example 111.155.116.249.
4. The code is as follows
```python
from urllib import request

if __name__ == "__main__":
    # URL to visit
    url = 'http://www.whatismyip.com.tw/'
    # This is the proxy IP
    proxy = {'http': '60.184.175.145'}
    # Create the ProxyHandler
    proxy_support = request.ProxyHandler(proxy)
    # Create the opener
    opener = request.build_opener(proxy_support)
    # Add the user agent
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19')]
    # Install the opener
    request.install_opener(opener)
    # Use the installed opener
    response = request.urlopen(url)
    # Read the response and decode it
    html = response.read().decode('utf-8')
    # Print the information
    print(html)
```
Python3 Web crawler (3): Hide identities using the user agent and proxy IP