Using proxy IPs with a Python crawler to quickly increase blog page views
Preface
Boosting the read count is not the real goal; the exercise is mainly about gaining a more detailed understanding of websites' anti-crawling mechanisms. If you really want to increase your blog's readership, high-quality content is essential.
Websites generally implement anti-crawling measures in the following ways:
1. Anti-crawling via Headers
Checking the Headers of user requests is the most common anti-crawling strategy. Many websites check the User-Agent in the Headers, and some also check the Referer (some resource sites check the Referer for hotlink protection).
If you run into this type of mechanism, you can add the Headers directly in the crawler: copy the browser's User-Agent into the crawler's Headers, or set the Referer to the target site's domain name. For anti-crawling that checks Headers, modifying or adding Headers in the crawler is usually enough to bypass it.
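For example, here is a minimal sketch with urllib; the URL and header values below are placeholders I added, not taken from the article:

from urllib import request

# Minimal sketch of passing Headers checks; the target URL is a placeholder.
url = 'http://www.example.com/'
headers = {
    # Copy a real browser's User-Agent string
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
    # Set the Referer to the target site's own domain
    'Referer': 'http://www.example.com/',
}
req = request.Request(url, headers=headers)
html = request.urlopen(req).read().decode('utf-8')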
2. Anti-crawling based on user behavior
Some websites detect user behavior, such as many requests to the same page from the same IP address within a short time, or many operations by the same account within a short time.
Most websites fall into the first case, which can be handled with proxy IPs. We could detect proxy IPs in advance and store them in a file, but this approach is not ideal, because proxy IPs go stale very quickly. A better choice is to scrape fresh proxy IPs from a proxy-listing website in real time.
In the second case, you can wait a random number of seconds after each request before sending the next one. For websites with logical loopholes, you can also log out after a few requests, log back in, and keep requesting, which bypasses the limit on requests from the same account within a short time.
In addition, cookies are often used to determine whether a visitor is a valid user, a technique common on websites that require logging in. Going a step further, some websites dynamically update the verification data at login. For example, an authenticity_token used for login verification is randomly allocated when the login page is served, and it must be sent back to the server together with the user name and password the user submits.
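As a rough sketch of that flow (the login URL and form field names below are hypothetical placeholders, not from the article):

from urllib import request, parse
from http.cookiejar import CookieJar
import re

# Keep cookies across requests so the session set up at login is reused.
cookies = CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookies))

# 1. Fetch the login page and pull out the randomly allocated token.
login_page = opener.open('http://www.example.com/login').read().decode('utf-8')
token = re.search(r'name="authenticity_token" value="(.*?)"', login_page).group(1)

# 2. Send the token back together with the user name and password.
data = parse.urlencode({
    'username': 'your_name',
    'password': 'your_password',
    'authenticity_token': token,
}).encode('utf-8')
response = opener.open('http://www.example.com/login', data=data)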
3. Anti-crawling based on dynamic pages
Sometimes the key information on a captured page is blank and only the framework code is there. This is because the website returns its data dynamically through XHR POST requests. To solve this, use the browser's developer tools (FireBug and the like) to analyze the site's traffic, find the separate request that carries the content (often JSON), and crawl that request directly to obtain the data you need.
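For instance, once the developer tools reveal the JSON endpoint behind the page, you can request it directly; the endpoint URL and field names below are hypothetical:

from urllib import request
import json

# Hypothetical JSON endpoint discovered with the browser's developer tools.
api_url = 'http://www.example.com/api/list?page=1'
req = request.Request(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(request.urlopen(req).read().decode('utf-8'))
for item in data.get('items', []):  # the field names depend on the real response
    print(item)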
What is more troublesome is when the dynamic requests are encrypted, so the parameters cannot be worked out and the data cannot be crawled directly. In that case you can drive a real browser kernel through tools such as mechanize or Selenium RC, just as if you were browsing the site yourself, which maximizes the chance of a successful capture at the cost of efficiency. In my own test, capturing 30 pages of job listings took a little over 30 seconds with urllib, but 2-3 minutes with a simulated browser kernel.
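A minimal Selenium sketch, assuming the selenium package and a Chrome driver are installed (the URL is a placeholder):

from selenium import webdriver

# Drive a real browser so dynamic and encrypted requests are handled for us.
driver = webdriver.Chrome()        # requires chromedriver on the PATH
driver.get('http://www.example.com/jobs')
html = driver.page_source          # the fully rendered page, including XHR content
driver.quit()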
4. Restricting access from certain IP addresses
Free proxy IPs can be obtained from many websites. Since crawlers can use these proxy IPs to scrape a site, the site can in turn use the same lists against crawlers: it can scrape those IP addresses, keep them on the server, and restrict any crawler that comes in through them.
On to the main topic
Now let's write a crawler that accesses a website through proxy IPs.
First, obtain proxy IP addresses to crawl with.
def Get_proxy_ip():
    headers = {
        'Host': 'www.xicidaili.com',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://www.xicidaili.com/',
    }
    # a website that publishes free proxy IPs
    req = request.Request(r'http://www.xicidaili.com/nn/', headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    proxy_list = []
    ip_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
    port_list = re.findall(r'<td>\d+</td>', html)
    for i in range(len(ip_list)):
        ip = ip_list[i]
        port = re.sub(r'<td>|</td>', '', port_list[i])
        proxy = '%s:%s' % (ip, port)
        proxy_list.append(proxy)
    return proxy_list
By the way, some websites restrict crawling by checking the real IP address behind the proxy, so here is a bit of extra knowledge about proxy IP types.
What do "transparent", "anonymous", and "high-anonymity" mean for proxy IPs?
A transparent proxy hides its own existence from the client, but it still passes the client's real IP address along. Using a transparent proxy therefore cannot bypass a limit on the number of requests allowed from one IP within a given period.
An ordinary anonymous proxy hides the client's real IP address, but it modifies the request, so the server can tell that a proxy is probably being used. With this kind of proxy, the website does not learn your real IP, but it still knows you are using a proxy and may ban that proxy IP from accessing the site.
A high-anonymity proxy does not modify the client's request at all, so to the server it looks like a real browser is visiting. The client's real IP is hidden, and the website does not realize a proxy is being used.
To sum up, high-anonymity proxy IPs are the best choice for a crawler.
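One rough way to check what a given proxy exposes (this check is my own addition; it uses httpbin.org/ip, a public service that simply echoes the caller's IP):

from urllib import request
import json

proxy_ip = '1.2.3.4:8080'   # placeholder proxy address
opener = request.build_opener(request.ProxyHandler({'http': proxy_ip}))
seen = json.loads(opener.open('http://httpbin.org/ip', timeout=10).read().decode('utf-8'))
print(seen)                 # if this still shows your real IP, the proxy is transparent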
user_agent_list holds the User-Agent strings from the request headers of mainstream browsers; with it we can imitate requests from a variety of browsers.
user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]
By waiting a random amount of time before each visit, we can bypass the request-interval restrictions of some websites.
def Proxy_read(proxy_list, user_agent_list, i):
    global count
    proxy_ip = proxy_list[i]
    print('Current proxy ip: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current proxy user_agent: %s' % user_agent)
    sleep_time = random.randint(1, 3)
    print('Wait time: %s' % sleep_time)
    time.sleep(sleep_time)  # set a random wait time
    print('Get started')
    headers = {
        'Host': 's9-im-policy.csdn.net',
        'Origin': 'http://blog.csdn.net',
        'User-Agent': user_agent,
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://blog.csdn.net/u037920031/article/details/51068703',
    }
    # route this request through the proxy
    proxy_support = request.ProxyHandler({'http': proxy_ip})
    opener = request.build_opener(proxy_support)
    request.install_opener(opener)
    req = request.Request(r'http://blog.csdn.net/u042520031/article/details/51068703', headers=headers)
    try:
        html = request.urlopen(req).read().decode('utf-8')
    except Exception as e:
        print('****** open failed! ******')
    else:
        count += 1
        print('OK! Total %s successful!' % count)
These are the basics of using proxies in a crawler. They are still fairly simple, but they cover most scenarios.
Complete source code
#!/usr/bin/env python3
from urllib import request
import random
import time
import re

user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

count = 0


def Get_proxy_ip():
    headers = {
        'Host': 'www.xicidaili.com',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://www.xicidaili.com/',
    }
    # a website that publishes free proxy IPs
    req = request.Request(r'http://www.xicidaili.com/nn/', headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    proxy_list = []
    ip_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
    port_list = re.findall(r'<td>\d+</td>', html)
    for i in range(len(ip_list)):
        ip = ip_list[i]
        port = re.sub(r'<td>|</td>', '', port_list[i])
        proxy = '%s:%s' % (ip, port)
        proxy_list.append(proxy)
    return proxy_list


def Proxy_read(proxy_list, user_agent_list, i):
    global count
    proxy_ip = proxy_list[i]
    print('Current proxy ip: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current proxy user_agent: %s' % user_agent)
    sleep_time = random.randint(1, 3)
    print('Wait time: %s' % sleep_time)
    time.sleep(sleep_time)  # set a random wait time
    print('Get started')
    headers = {
        'Host': 's9-im-policy.csdn.net',
        'Origin': 'http://blog.csdn.net',
        'User-Agent': user_agent,
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://blog.csdn.net/u037920031/article/details/51068703',
    }
    # route this request through the proxy
    proxy_support = request.ProxyHandler({'http': proxy_ip})
    opener = request.build_opener(proxy_support)
    request.install_opener(opener)
    req = request.Request(r'http://blog.csdn.net/u042520031/article/details/51068703', headers=headers)
    try:
        html = request.urlopen(req).read().decode('utf-8')
    except Exception as e:
        print('****** open failed! ******')
    else:
        count += 1
        print('OK! Total %s successful!' % count)


if __name__ == '__main__':
    proxy_list = Get_proxy_ip()
    for i in range(100):
        Proxy_read(proxy_list, user_agent_list, i)
That is all for this article. I hope it helps you in your study or work, and thank you for your continued support of this site!