An explanation of using an automatic IP proxy in Python to crawl soft exam questions

Recently, in order to prepare for an exam, I crawled soft exam questions online and ran into some problems. This article mainly shares how to use Python with an automatic IP proxy to crawl the soft exam questions. The explanation is quite detailed; friends who need it, take a look below.

Objective

There is a Software Professional Level Test coming up (hereinafter "the soft exam"), and in order to review and prepare better, I intend to crawl the soft exam questions on www.rkpass.cn.

Let me first talk about the pits I stepped in while crawling the soft exam questions. At this point the crawler can automatically capture all the questions of a given module, for example:

At present, all 30 test questions for the Information Systems Supervisor exam can be captured. [Screenshot of the crawl results]

[Screenshot of the captured question content]

Although some information can be captured, the quality of the code is not high. Take the Information Systems Supervisor crawl as an example: because the goal was clear and the parameters were clear, and I wanted to grab the information in a short time, I did not do any exception handling, and last night I spent a long time filling in that pit.

Back to the topic: I am writing this blog today because of a new pit. As you can probably guess from the title, my requests were too frequent, so my IP was blocked by the website's anti-crawler mechanism.

But the living can't let a thing like this stop them. As the deeds of the revolutionary forebears teach us, we socialist successors must not give in to difficulty: open a road through the mountains, build a bridge when we meet water. To solve the IP problem, the idea of an IP proxy was born.

In the process of crawling information, if the crawl frequency exceeds the threshold set by the website, access will be forbidden. Usually, the website's anti-crawler mechanism identifies crawlers by IP.

So crawler developers usually take one of two measures to solve this problem:

1. Slow down the crawl speed and reduce the pressure on the target website. However, this reduces the amount crawled per unit of time (a minimal throttling sketch follows this list).

2. Set up proxy IPs and similar means to break through the anti-crawler mechanism and keep crawling at high frequency. This, however, requires multiple stable proxy IPs.
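For the first measure, a minimal throttling sketch is below. The page URLs and the 2-second delay are placeholders of my own, not from the original post; the idea is simply to sleep between requests so the crawl frequency stays under the site's threshold.

import time
import requests

# Hypothetical list of target pages; replace with the real question pages
urls = ['http://www.example.com/question/' + str(i) for i in range(1, 6)]
for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(2)  # pause between requests to keep the crawl frequency below the ban threshold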

Without further ado, here is the code:

# IP addresses come from the domestic high-anonymity proxy site: http://www.xicidaili.com/nn/
# Crawling just the first pages is generally enough for ordinary use
from bs4 import BeautifulSoup
import requests
import random

# Get the proxy IPs on the current page
def get_ip_list(url, headers):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'html.parser')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list

# Randomly take one IP from the captured list
def get_random_ip(ip_list):
    proxy_list = []
    for ip in ip_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'https': proxy_ip}
    return proxies

# Home address of the domestic high-anonymity proxy IP site
url = 'http://www.xicidaili.com/nn/'
# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
# Counter; loop over all the pages according to the counter
num = 0
# Array storing the captured IPs
ip_array = []
while num < 1537:
    num += 1
    ip_list = get_ip_list(url + str(num), headers=headers)
    ip_array.append(ip_list)
for ip in ip_array:
    print(ip)
# Randomly take one IP from the list
# proxies = get_random_ip(ip_list)
# print(proxies)
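Free proxies scraped this way are often dead or very slow, so before relying on them it may help to test each one. The helper below is my own sketch, not part of the original code; the test URL and 3-second timeout are assumptions.

import requests

# Assumed helper (not in the original post): keep only proxies that complete a test request
def filter_alive(ip_list, test_url='http://www.baidu.com', timeout=3):
    alive = []
    for ip in ip_list:
        proxies = {'http': 'http://' + ip, 'https': 'http://' + ip}
        try:
            requests.get(test_url, proxies=proxies, timeout=timeout)
            alive.append(ip)
        except requests.RequestException:
            pass  # dead or too slow; discard it
    return alive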

Operation result: the script prints the captured proxy IP lists page by page. [Screenshot of the output]

In this way, as long as each crawler request sets its IP to an automatically chosen proxy IP, it can effectively evade the simple anti-crawler tactic of blocking a fixed IP.
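For example, a request routed through a randomly chosen proxy might look like the sketch below. get_random_ip, ip_list, and headers come from the code above; the retry-once-on-failure logic is my own addition.

# Route a request through a random proxy from the captured list
proxies = get_random_ip(ip_list)
try:
    response = requests.get('http://www.rkpass.cn/', headers=headers, proxies=proxies, timeout=5)
    print(response.status_code)
except requests.RequestException:
    # This free proxy failed; pick another one and retry
    proxies = get_random_ip(ip_list)
    response = requests.get('http://www.rkpass.cn/', headers=headers, proxies=proxies, timeout=5)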
