Python version: Python 3
IDE: PyCharm 2017.3.3
I. Why Set the User-Agent
Some sites do not want to be visited by crawlers, so they inspect who is making the request; if it looks like a crawler, access is refused. By setting a User-Agent we can hide the crawler's identity. (The User-Agent is commonly abbreviated UA.)
The User-Agent lives in the request headers, and the server decides who is visiting by inspecting it. In Python, if you do not set a User-Agent, the program falls back to a default value that contains the word "Python", and anti-crawling sites will deny such requests.
Python lets us change this User-Agent to simulate a browser visit.
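You can see the tell-tale default identifier for yourself by inspecting a freshly built opener (a minimal sketch; no network access is made):

```python
from urllib import request

# A default opener carries a ('User-agent', 'Python-urllib/x.y') header,
# which is exactly what gives the crawler away to anti-bot checks.
opener = request.build_opener()
print(opener.addheaders)
```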
II. Common User-Agents
1.Android
- Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
- Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
- Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
2.Firefox
- Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
- Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0
3.Google Chrome
- Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
- Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19
4.iOS
- Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3
- Mozilla/5.0 (iPod; U; CPU like Mac OS X; en) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/3A101a Safari/419.3
Any of these user agents can be copied and used directly.
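In practice, crawlers often keep such strings in a pool and pick one at random per request, so that a single fixed UA is harder to flag. A small sketch (the pool below is an illustrative subset of the agents listed above):

```python
import random

# Illustrative pool built from the user agents listed above
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
    'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19',
]

def random_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}

print(random_headers())
```

The returned dict can be passed straight to `Request(url, headers=random_headers())` as shown in Method One below.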
III. How to Set the User-Agent
Method One:
Using the first Android user agent above, pass the headers parameter when creating the Request object. The code is as follows:
```python
from urllib import request

# Take CSDN as an example; CSDN is inaccessible without changing the user agent
url = 'http://www.csdn.net/'
head = {}
# Write the user agent information
head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'
# Create the Request object
req = request.Request(url, headers=head)
# Pass in the created Request object
response = request.urlopen(req)
# Read the response and decode it
html = response.read().decode('utf-8')
# Print the information
print(html)
```
Running it prints the HTML of the page (result screenshot omitted).
Method Two:
This also uses the first Android user agent above, but does not pass the headers parameter when creating the Request object; instead, it calls the add_header() method after creation to add the header. The code is as follows:
```python
from urllib import request

url = 'http://www.csdn.net/'
# Create the Request object
req = request.Request(url)
# Add the headers
req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19')
# Pass in the created Request object
response = request.urlopen(req)
# Read the response and decode it
html = response.read().decode('utf-8')
print(html)
```
The result is the same as with Method One.
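As a quick sanity check, you can confirm the header landed on the Request object without sending any request (note that Request normalizes header names internally, so the lookup key is capitalized):

```python
from urllib import request

url = 'http://www.csdn.net/'
req = request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) '
                             'AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19')

# Request stores header names in capitalized form, so query 'User-agent'
print(req.get_header('User-agent'))
```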
IV. Using IP Proxies
1. Why Use IP Proxy
Programs run very fast. If a crawler fetches pages from one fixed IP, the request rate will be far higher than any human could produce; a person cannot issue requests every few milliseconds. Many sites therefore set a threshold on per-IP access frequency, and an IP that exceeds it is assumed to be a crawler rather than a person.
2. Steps
(1) Call urllib.request.ProxyHandler(); its proxies parameter is a dictionary
(2) Create an opener with build_opener() (it plays the same role as urlopen, but lets us customize the handlers ourselves)
(3) Install the opener with install_opener()
3. Proxy IP Selection
Pick a proxy IP from the Xici (West Thorn) free proxy site, for example 111.155.116.249.
4. The code is as follows
```python
from urllib import request

if __name__ == "__main__":
    # URL to visit
    url = 'http://www.whatismyip.com.tw/'
    # This is the proxy IP
    proxy = {'http': '60.184.175.145'}
    # Create the ProxyHandler
    proxy_support = request.ProxyHandler(proxy)
    # Create the opener
    opener = request.build_opener(proxy_support)
    # Add the user agent
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19')]
    # Install the opener
    request.install_opener(opener)
    # Use the installed opener
    response = request.urlopen(url)
    # Read the response and decode it
    html = response.read().decode('utf-8')
    # Print the information
    print(html)
```
Python3 Web crawler (3): Hide identities using the user agent and proxy IP