Python3 Web Crawler (3): Hiding Your Identity with the User-Agent and Proxy IPs

Source: Internet
Author: User

Python version: Python3

IDE: PyCharm 2017.3.3

First, why set the User-Agent

Some sites do not like being visited by crawlers, so they inspect each visitor and refuse access if it is a crawler. By setting the User-Agent we can hide the crawler's identity. "User-Agent", abbreviated UA, is the header field that identifies the client software making the request.

The User-Agent resides in the request headers, and the server decides who is visiting by inspecting it. In Python, if you do not set a User-Agent, the program sends its own default, which contains the word "Python", and anti-crawler sites will refuse such requests.

Python lets us change this User-Agent to simulate browser access.
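To see what that default identity looks like, one can inspect the headers carried by urllib's default opener (a minimal sketch, not from the original post; the exact version suffix depends on your Python installation):

```python
import urllib.request

# The default opener already carries a User-Agent of the form
# "Python-urllib/3.x" — this is what servers see if you never set one.
opener = urllib.request.build_opener()
print(opener.addheaders)
```

The `Python-urllib` prefix in that header is exactly what anti-crawler sites look for.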

Second, common User-Agents

1.Android

    • Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
    • Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
    • Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1

2.Firefox

    • Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
    • Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0

3.Google Chrome

    • Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
    • Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19

4.iOS

    • Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3
    • Mozilla/5.0 (iPod; U; CPU like Mac OS X; en) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/3A101a Safari/419.3

Any of these User-Agent strings can be copied and used directly.
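One way to use them (my addition, not from the original post) is to keep several of the strings above in a list and pick one at random for each request, so successive requests appear to come from different browsers:

```python
import random

# A pool built from some of the User-Agent strings listed above
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
    'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19',
]

def random_user_agent():
    """Return one of the copied User-Agent strings at random."""
    return random.choice(USER_AGENTS)
```

The returned string can then be set as the `User-Agent` header using either of the two methods described below.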

Third, how to set the User-Agent

Method One:

Using the first Android User-Agent above, pass the headers parameter when creating the Request object. The code is as follows:

    from urllib import request

    # Take CSDN as an example; CSDN refuses access if the User-Agent is not changed
    url = 'http://www.csdn.net/'
    head = {}
    # Write the User-Agent information
    head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'
    # Create a Request object
    req = request.Request(url, headers=head)
    # Pass the created Request object to urlopen
    response = request.urlopen(req)
    # Read the response and decode it
    html = response.read().decode('utf-8')
    # Print the information
    print(html)

The run prints the HTML of the page.

Method Two:

Using the same Android User-Agent, do not pass headers when creating the Request object; instead, call the add_header() method afterwards to add the header. The code is as follows:

    from urllib import request

    url = 'http://www.csdn.net/'
    # Create a Request object
    req = request.Request(url)
    # Add the header with add_header()
    req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19')
    # Pass the created Request object to urlopen
    response = request.urlopen(req)
    # Read the response and decode it
    html = response.read().decode('utf-8')
    print(html)

The result is the same as with Method One.

Fourth, using an IP proxy

1. Why use an IP proxy

A program runs very fast, so if we use a crawler to fetch pages, a single fixed IP will produce a very high access frequency. This does not match human behavior, because a person cannot issue requests every few milliseconds. Some sites therefore set a threshold on per-IP access frequency: if an IP exceeds it, the visitor is assumed to be a crawler rather than a person, and the IP is blocked. Routing requests through proxy IPs spreads the traffic across many addresses.

2. Steps

(1) Call urllib.request.ProxyHandler(); its proxies parameter is a dictionary mapping protocol to proxy address.

(2) Create an opener with build_opener() (similar to urlopen, but built this way it can be customized).

(3) Install the opener with install_opener() so that subsequent urlopen() calls use it.
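The three steps can be sketched as follows (the proxy address here is a placeholder for illustration, not a working proxy):

```python
from urllib import request

# (1) ProxyHandler takes a dict mapping scheme -> proxy address
proxy_support = request.ProxyHandler({'http': '127.0.0.1:8080'})  # placeholder address
# (2) Build a custom opener around the handler
opener = request.build_opener(proxy_support)
# (3) Install it globally so plain urlopen() calls go through the proxy
request.install_opener(opener)
```

The full, commented version of this code appears in section 4 below.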

3. Choosing a proxy IP

From the Xici (西刺) free proxy site, select a proxy IP, e.g. 111.155.116.249.

4. The code is as follows

    from urllib import request

    if __name__ == "__main__":
        # URL to visit (the page shows the visitor's IP)
        url = 'http://www.whatismyip.com.tw/'
        # This is the proxy IP
        proxy = {'http': '60.184.175.145'}
        # Create a ProxyHandler
        proxy_support = request.ProxyHandler(proxy)
        # Create an opener
        opener = request.build_opener(proxy_support)
        # Add a User-Agent
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19')]
        # Install the opener
        request.install_opener(opener)
        # Use the installed opener
        response = request.urlopen(url)
        # Read the response and decode it
        html = response.read().decode("utf-8")
        # Print the information
        print(html)
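Free proxy IPs die quickly, so the urlopen call above often fails in practice. A minimal error-handling wrapper (my addition, not part of the original post) keeps the crawler from crashing when a proxy refuses the connection or times out:

```python
from urllib import request

def fetch(url, timeout=10):
    """Fetch a URL with the currently installed opener; return None on failure."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except OSError as exc:  # URLError is a subclass of OSError
        # Free proxies frequently refuse connections or time out
        print('request failed:', exc)
        return None
```

On failure the caller can simply switch to the next proxy in its list and retry.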

