Python 3 web crawler (IV): hiding your identity with the User-Agent and a proxy IP

Source: Internet
Author: User

First, why set the User-Agent

Some sites do not like being accessed by crawlers, so they inspect each incoming connection. If the visitor looks like a crawler, that is, not a human clicking through pages, the site refuses further access. To let the program run properly, the crawler has to hide its identity. We can do this by setting the User-Agent, abbreviated UA.

The User-Agent is stored in the request headers, and the server decides who is visiting by reading it. If a Python program does not set the User-Agent, urllib sends its default value, which contains the word Python (Python-urllib/ followed by the version number). A server that checks the User-Agent will therefore refuse a Python program that has not set one.
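To see the default value for yourself, the short sketch below inspects the headers that a freshly built urllib opener would send. It does not contact any server, and the variable names are illustrative only.

```python
from urllib import request

# build_opener() returns the same kind of opener that urlopen() uses by default.
opener = request.build_opener()

# addheaders holds the headers sent with every request made through this opener.
default_headers = dict(opener.addheaders)
default_ua = default_headers["User-agent"]

# The value starts with "Python-urllib/" followed by the Python version.
print(default_ua)
```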

Python lets us change the User-Agent so that a request looks like it comes from a browser.

Second, common User-Agent strings

1. Android

    • Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
    • Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
    • Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1

2. Firefox

    • Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
    • Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0

3. Google Chrome

    • Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
    • Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19

4. iOS

    • Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3
    • Mozilla/5.0 (iPod; U; CPU like Mac OS X; en) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/3A101a Safari/419.3

The list above gives some User-Agent strings for Android, Firefox, Google Chrome, and iOS; they can be copied and used directly.
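One possible way to keep such strings handy is to store a few of them in a list and build the headers dictionary from it. The sketch below picks one at random on each call. The random rotation and the names USER_AGENTS and pick_headers are assumptions made for illustration; the article itself simply copies a single string.

```python
import random

# A small pool taken from the strings listed above (hypothetical selection).
USER_AGENTS = [
    "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19",
    "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36",
]


def pick_headers(pool=USER_AGENTS):
    """Return a headers dict with one User-Agent chosen at random from the pool."""
    return {"User-Agent": random.choice(pool)}


headers = pick_headers()
print(headers["User-Agent"])
```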

Third, how to set the User-Agent

First, look at urllib.request.Request(url, data=None, headers={}, ...).

Its constructor accepts a headers parameter when the Request object is created, so there are two ways to set the User-Agent:

1. When creating the Request object, pass the headers parameter (a dictionary that includes the User-Agent).

2. Create the Request object without the headers parameter, then call its add_header() method to add the header.
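Before the full scripts, the two ways can be compared side by side without sending any request. This is a minimal sketch; the names UA, URL, req1, and req2 are illustrative. One detail worth knowing is that Request normalizes header names with str.capitalize(), so the header is stored under the key "User-agent".

```python
from urllib import request

UA = "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19"
URL = "http://www.csdn.net/"

# Way 1: pass a headers dictionary to the constructor.
req1 = request.Request(URL, headers={"User-Agent": UA})

# Way 2: create the object first, then call add_header().
req2 = request.Request(URL)
req2.add_header("User-Agent", UA)

# Both objects now carry the same header; look it up with the normalized key.
print(req1.get_header("User-agent") == req2.get_header("User-agent"))
print(req2.header_items())
```

Nothing is sent until one of these objects is passed to urlopen().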

Method One:

Create the file urllib_test09.py. Using the first Android User-Agent listed above, pass the headers parameter when creating the Request object:

    # -*- coding: UTF-8 -*-
    from urllib import request

    if __name__ == "__main__":
        # Using CSDN as the example; CSDN cannot be reached without changing the User-Agent
        url = 'http://www.csdn.net/'
        head = {}
        # Write the User-Agent information
        head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'
        # Create the Request object, passing in the headers
        req = request.Request(url, headers=head)
        # Pass in the Request object we created
        response = request.urlopen(req)
        # Read the response and decode it
        html = response.read().decode('utf-8')
        # Print the result
        print(html)

Running the script prints the HTML source of the page.

Method Two:

Create the file urllib_test10.py. Using the same Android User-Agent, do not pass the headers parameter when creating the Request object; after creation, call the add_header() method to add the header:

    # -*- coding: UTF-8 -*-
    from urllib import request

    if __name__ == "__main__":
        # Using CSDN as the example; CSDN cannot be reached without changing the User-Agent
        url = 'http://www.csdn.net/'
        # Create the Request object
        req = request.Request(url)
        # Add the header
        req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19')
        # Pass in the Request object we created
        response = request.urlopen(req)
        # Read the response and decode it
        html = response.read().decode('utf-8')
        # Print the result
        print(html)

The output is the same as with the first method.

Fourth, using IP proxies

1. Why use an IP proxy

The User-Agent is now set, but there is another problem to consider. The program runs fast, so when a crawler fetches pages from a site, the access frequency from one fixed IP becomes very high. That does not look like human behavior, since a person cannot make such frequent visits within a few milliseconds. Some sites therefore set a threshold on per-IP access frequency; an IP that exceeds it is assumed to be a crawler rather than a person.

2. General steps

A very simple solution is to add a delay between requests, but that defeats the purpose of crawling quickly. A better way is to use IP proxies. The steps are:

(1) Call urllib.request.ProxyHandler(); its proxies parameter is a dictionary.

(2) Create an opener with build_opener() (similar to urlopen(), but customized by us).

(3) Install the opener with install_opener().

After install_opener() is called, the program's default urlopen() is replaced: any later call to urlopen() in that program uses the opener you created. If you only want to use the opener temporarily without replacing the default, call opener.open(url) instead, which leaves the default urlopen() unaffected.
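The three steps, and the difference between a temporary and a global opener, can be sketched as follows. The proxy address 127.0.0.1:8080 is a hypothetical placeholder, and the helper names fetch_temporarily and fetch_globally are made up for this illustration. The two functions are only defined here, not called, so the sketch sends no request.

```python
from urllib import request

# Hypothetical proxy address, for illustration only; replace it with a working one.
PROXY = {"http": "127.0.0.1:8080"}

# (1) Create the handler from a dictionary mapping scheme to proxy address.
proxy_support = request.ProxyHandler(PROXY)

# (2) Build a custom opener that routes http requests through the proxy.
opener = request.build_opener(proxy_support)


def fetch_temporarily(url):
    """Use the custom opener for one call; the default urlopen() is unaffected."""
    with opener.open(url, timeout=10) as response:
        return response.read()


def fetch_globally(url):
    """(3) Install the opener, so every later urlopen() call goes through it."""
    request.install_opener(opener)
    with request.urlopen(url, timeout=10) as response:
        return response.read()


# No request is sent here; we only confirm the opener carries our proxy handler.
print(proxy_support in opener.handlers)
```

Keeping the temporary variant separate is convenient when only some requests should go through the proxy.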

3. Choosing a proxy IP

Before writing the code, pick an IP address from a proxy IP site. The Xici proxy site (西刺代理) is recommended.

URL: http://www.xicidaili.com/

Note: you could also write a regular expression and crawl the IPs straight from that site, but do not crawl too often and add a delay between requests. Frequent requests put pressure on the server, and the server will block you outright; I was blocked for two days.
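A minimal sketch of such a delay is shown below. The function name crawl_politely, its parameters, and the example.com URLs are all hypothetical. The fetch and sleep functions are passed in as arguments so the pacing logic can be tried without touching the network; in real use, fetch would wrap urlopen() and sleep would stay at its default, time.sleep.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Dry run: a stand-in fetch function, and pauses recorded instead of slept.
pauses = []
pages = crawl_politely(
    ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    fetch=lambda url: "page for " + url,
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(pages)
print(pauses)
```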

From the Xici site, pick an IP with good signal quality. My choice is 106.46.136.112:808.

The code will access http://www.whatismyip.com.tw/, a site that reports your IP address: the server returns the IP of the visitor.

4. Code examples

Create the file urllib_test11.py and write the following code:

    # -*- coding: UTF-8 -*-
    from urllib import request

    if __name__ == "__main__":
        # URL to visit
        url = 'http://www.whatismyip.com.tw/'
        # The proxy IP
        proxy = {'http': '106.46.136.112:808'}
        # Create the ProxyHandler
        proxy_support = request.ProxyHandler(proxy)
        # Create the opener
        opener = request.build_opener(proxy_support)
        # Add the User-Agent
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36')]
        # Install the opener
        request.install_opener(opener)
        # Use the opener we installed
        response = request.urlopen(url)
        # Read the response and decode it
        html = response.read().decode("utf-8")
        # Print the result
        print(html)

In the output, the visitor IP reported by the site is 106.46.136.112, so the request has been disguised as coming from the proxy.
