Using proxy IPs with a Python crawler to quickly increase blog page views
Preface
Boosting the read count is not the real goal; the exercise is mainly about gaining a more detailed understanding of websites' anti-crawling mechanisms. If you really want to increase your blog's readership, high-quality content is essential.
Websites generally implement anti-crawling measures in the following ways:
1. Anti-crawling via Headers
Checking the Headers of user requests is the most common anti-crawling strategy. Many websites check the User-Agent in the Headers, and some also check the Referer (some resource sites check the Referer for hotlink protection).
If you run into this type of mechanism, you can add the Headers directly in the crawler: copy the browser's User-Agent into the crawler's Headers, or set the Referer to the target site's domain name. For anti-crawling that checks Headers, modifying or adding Headers in the crawler is usually enough to bypass it.
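For example, here is a minimal sketch with urllib; the URL and header values below are placeholders I added, not taken from the article:

from urllib import request

# Minimal sketch of passing Headers checks; the target URL is a placeholder.
url = 'http://www.example.com/'
headers = {
    # Copy a real browser's User-Agent string
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
    # Set the Referer to the target site's own domain
    'Referer': 'http://www.example.com/',
}
req = request.Request(url, headers=headers)
html = request.urlopen(req).read().decode('utf-8')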
2. Anti-crawling based on user behavior
Some websites detect user behavior, such as many requests to the same page from the same IP address within a short time, or many operations by the same account within a short time.
Most websites fall into the first case, which can be handled with proxy IPs. We could detect proxy IPs in advance and store them in a file, but this approach is not ideal, because proxy IPs go stale very quickly. A better choice is to scrape fresh proxy IPs from a proxy-listing website in real time.
In the second case, you can wait a random number of seconds after each request before sending the next one. For websites with logical loopholes, you can also log out after a few requests, log back in, and keep requesting, which bypasses the limit on requests from the same account within a short time.
In addition, cookies are often used to determine whether a visitor is a valid user, a technique common on websites that require logging in. Going a step further, some websites dynamically update the verification data at login. For example, an authenticity_token used for login verification is randomly allocated when the login page is served, and it must be sent back to the server together with the user name and password the user submits.
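As a rough sketch of that flow (the login URL and form field names below are hypothetical placeholders, not from the article):

from urllib import request, parse
from http.cookiejar import CookieJar
import re

# Keep cookies across requests so the session set up at login is reused.
cookies = CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookies))

# 1. Fetch the login page and pull out the randomly allocated token.
login_page = opener.open('http://www.example.com/login').read().decode('utf-8')
token = re.search(r'name="authenticity_token" value="(.*?)"', login_page).group(1)

# 2. Send the token back together with the user name and password.
data = parse.urlencode({
    'username': 'your_name',
    'password': 'your_password',
    'authenticity_token': token,
}).encode('utf-8')
response = opener.open('http://www.example.com/login', data=data)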
3. Anti-crawling based on dynamic pages
Sometimes the key information on a captured page is blank and only the framework code is there. This is because the website returns its data dynamically through XHR POST requests. To solve this, use the browser's developer tools (FireBug and the like) to analyze the site's traffic, find the separate request that carries the content (often JSON), and crawl that request directly to obtain the data you need.
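For instance, once the developer tools reveal the JSON endpoint behind the page, you can request it directly; the endpoint URL and field names below are hypothetical:

from urllib import request
import json

# Hypothetical JSON endpoint discovered with the browser's developer tools.
api_url = 'http://www.example.com/api/list?page=1'
req = request.Request(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(request.urlopen(req).read().decode('utf-8'))
for item in data.get('items', []):  # the field names depend on the real response
    print(item)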
What is more troublesome is when the dynamic requests are encrypted, so the parameters cannot be worked out and the data cannot be crawled directly. In that case you can drive a real browser kernel through tools such as mechanize or Selenium RC, just as if you were browsing the site yourself, which maximizes the chance of a successful capture at the cost of efficiency. In my own test, capturing 30 pages of job listings took a little over 30 seconds with urllib, but 2-3 minutes with a simulated browser kernel.
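A minimal Selenium sketch, assuming the selenium package and a Chrome driver are installed (the URL is a placeholder):

from selenium import webdriver

# Drive a real browser so dynamic and encrypted requests are handled for us.
driver = webdriver.Chrome()        # requires chromedriver on the PATH
driver.get('http://www.example.com/jobs')
html = driver.page_source          # the fully rendered page, including XHR content
driver.quit()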
4. Restricting access from certain IP addresses
Free proxy IPs can be obtained from many websites. Since crawlers can use these proxy IPs to scrape a site, the site can in turn use the same lists against crawlers: it can scrape those IP addresses, keep them on the server, and restrict any crawler that comes in through them.
On to the main topic
Now let's write a crawler that accesses a website through proxy IPs.
First, obtain proxy IP addresses to crawl with.
def Get_proxy_ip():
    headers = {
        'Host': 'www.xicidaili.com',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://www.xicidaili.com/',
    }
    # a website that publishes free proxy IPs
    req = request.Request(r'http://www.xicidaili.com/nn/', headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    proxy_list = []
    ip_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
    port_list = re.findall(r'<td>\d+</td>', html)
    for i in range(len(ip_list)):
        ip = ip_list[i]
        port = re.sub(r'<td>|</td>', '', port_list[i])
        proxy = '%s:%s' % (ip, port)
        proxy_list.append(proxy)
    return proxy_list
By the way, some websites restrict crawling by checking the real IP address behind the proxy, so here is a bit of extra knowledge about proxy IP types.
What do "transparent", "anonymous", and "high-anonymity" mean for proxy IPs?
A transparent proxy hides its own existence from the client, but it still passes the client's real IP address along. Using a transparent proxy therefore cannot bypass a limit on the number of requests allowed from one IP within a given period.
An ordinary anonymous proxy hides the client's real IP address, but it modifies the request, so the server can tell that a proxy is probably being used. With this kind of proxy, the website does not learn your real IP, but it still knows you are using a proxy and may ban that proxy IP from accessing the site.
A high-anonymity proxy does not modify the client's request at all, so to the server it looks like a real browser is visiting. The client's real IP is hidden, and the website does not realize a proxy is being used.
To sum up, high-anonymity proxy IPs are the best choice for a crawler.
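One rough way to check what a given proxy exposes (this check is my own addition; it uses httpbin.org/ip, a public service that simply echoes the caller's IP):

from urllib import request
import json

proxy_ip = '1.2.3.4:8080'   # placeholder proxy address
opener = request.build_opener(request.ProxyHandler({'http': proxy_ip}))
seen = json.loads(opener.open('http://httpbin.org/ip', timeout=10).read().decode('utf-8'))
print(seen)                 # if this still shows your real IP, the proxy is transparent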
user_agent_list holds the User-Agent strings from the request headers of mainstream browsers; with it we can imitate requests from a variety of browsers.
user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]
By waiting a random amount of time before each visit, we can bypass the request-interval restrictions of some websites.
def Proxy_read(proxy_list, user_agent_list, i):
    global count
    proxy_ip = proxy_list[i]
    print('Current proxy ip: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current proxy user_agent: %s' % user_agent)
    sleep_time = random.randint(1, 3)
    print('Wait time: %s' % sleep_time)
    time.sleep(sleep_time)  # set a random wait time
    print('Get started')
    headers = {
        'Host': 's9-im-policy.csdn.net',
        'Origin': 'http://blog.csdn.net',
        'User-Agent': user_agent,
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://blog.csdn.net/u037920031/article/details/51068703',
    }
    # route this request through the proxy
    proxy_support = request.ProxyHandler({'http': proxy_ip})
    opener = request.build_opener(proxy_support)
    request.install_opener(opener)
    req = request.Request(r'http://blog.csdn.net/u042520031/article/details/51068703', headers=headers)
    try:
        html = request.urlopen(req).read().decode('utf-8')
    except Exception as e:
        print('****** open failed! ******')
    else:
        count += 1
        print('OK! Total %s successful!' % count)
These are the basics of using proxies in a crawler. They are still fairly simple, but they cover most scenarios.
Complete source code
#!/usr/bin/env python3
from urllib import request
import random
import time
import re

user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

count = 0


def Get_proxy_ip():
    headers = {
        'Host': 'www.xicidaili.com',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://www.xicidaili.com/',
    }
    # a website that publishes free proxy IPs
    req = request.Request(r'http://www.xicidaili.com/nn/', headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    proxy_list = []
    ip_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
    port_list = re.findall(r'<td>\d+</td>', html)
    for i in range(len(ip_list)):
        ip = ip_list[i]
        port = re.sub(r'<td>|</td>', '', port_list[i])
        proxy = '%s:%s' % (ip, port)
        proxy_list.append(proxy)
    return proxy_list


def Proxy_read(proxy_list, user_agent_list, i):
    global count
    proxy_ip = proxy_list[i]
    print('Current proxy ip: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current proxy user_agent: %s' % user_agent)
    sleep_time = random.randint(1, 3)
    print('Wait time: %s' % sleep_time)
    time.sleep(sleep_time)  # set a random wait time
    print('Get started')
    headers = {
        'Host': 's9-im-policy.csdn.net',
        'Origin': 'http://blog.csdn.net',
        'User-Agent': user_agent,
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://blog.csdn.net/u037920031/article/details/51068703',
    }
    # route this request through the proxy
    proxy_support = request.ProxyHandler({'http': proxy_ip})
    opener = request.build_opener(proxy_support)
    request.install_opener(opener)
    req = request.Request(r'http://blog.csdn.net/u042520031/article/details/51068703', headers=headers)
    try:
        html = request.urlopen(req).read().decode('utf-8')
    except Exception as e:
        print('****** open failed! ******')
    else:
        count += 1
        print('OK! Total %s successful!' % count)


if __name__ == '__main__':
    proxy_list = Get_proxy_ip()
    for i in range(100):
        Proxy_read(proxy_list, user_agent_list, i)
That is all for this article. I hope it helps you in your study or work, and thank you for your continued support of this site!