Recently, while preparing for an exam, I needed to crawl soft exam questions online and ran into some problems along the way. This article mainly covers using Python with automatic IP proxies to crawl soft exam questions. The explanation is fairly detailed, so friends who need it can read on.
Objective
There is an upcoming software professional qualification test, hereinafter referred to as the "soft exam". To prepare and review more effectively, I planned to crawl the soft exam questions from www.rkpass.cn.
Let me start with the pits I fell into while crawling the soft exam questions. At this point I can automatically crawl all the questions of a given module, for example:
At present, the 30 test questions of the Information Systems Supervisor module can be captured, as shown in the result:
The captured question content looks like this:
Although some information can be captured, the quality of the code is not high. Taking the Information Systems Supervisor crawl as an example: because the target was clear and the parameters were clear, and I wanted to grab the information in a short time, I did no exception handling, and last night I spent a long time filling in the resulting pit.
Back to the topic: today's post exists because of a new pit. From the title you can probably guess what happened: too many requests were sent, so my IP was blocked by the site's anti-crawler mechanism.
But we cannot let a small obstacle stop us. As the saying goes, when you meet a mountain you dig a tunnel, and when you meet a river you build a bridge. To solve the blocked-IP problem, the idea of using IP proxies came up.
While crawling, if the crawl frequency exceeds the threshold set by the website, access will be forbidden. Usually, the site's anti-crawler mechanism identifies crawlers by their IP address.
So crawler developers usually take one of two measures to solve this problem:
1. Slow down the crawl speed to reduce the pressure on the target site. However, this reduces the amount crawled per unit of time (a minimal throttling sketch follows the list).
2. Set up proxy IPs or similar means to break through the anti-crawler mechanism and keep crawling at high frequency. However, this requires multiple stable proxy IPs.
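As a minimal sketch of the first measure (not part of the original crawler), the snippet below simply sleeps a random interval between requests so the request rate stays below the site's threshold; the User-Agent string and delay range are placeholder assumptions.

import random
import time

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder UA string

def crawl_slowly(urls):
    """Fetch each URL with a random pause between requests."""
    pages = []
    for url in urls:
        resp = requests.get(url, headers=headers)
        pages.append(resp.text)
        # sleep 2-5 seconds between requests; tune to the target site's tolerance
        time.sleep(random.uniform(2, 5))
    return pages

This post takes the second route, so the rest of the article focuses on proxy IPs.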
Without further ado, here is the code:
# IP addresses are taken from the domestic high-anonymity proxy site: http://www.xicidaili.com/nn/
# Crawling just the listing pages is enough for general use
from bs4 import BeautifulSoup
import requests
import random

# Get the IPs listed on the current page
def get_ip_list(url, headers):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'html.parser')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list

# Randomly pick one IP from the captured list and build a proxies dict
def get_random_ip(ip_list):
    proxy_list = []
    for ip in ip_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'https': proxy_ip}
    return proxies

# Main address of the domestic high-anonymity proxy IP site
url = 'http://www.xicidaili.com/nn/'
# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
# Counter used to loop over and crawl all listing pages
num = 0
# Array that stores the captured IPs
ip_array = []
while num < 1537:
    num += 1
    ip_list = get_ip_list(url + str(num), headers=headers)
    ip_array.append(ip_list)
for ip in ip_array:
    print(ip)
# Randomly pick one IP and build the proxies dict
# proxies = get_random_ip(ip_list)
# print(proxies)
Operation Result:
In this way, by setting the crawler's request IP to a randomly chosen proxy IP for each request, we can effectively get around the simple anti-crawler measure of blocking a fixed IP.
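The code above only prints the proxies, so as a minimal sketch of wiring a random proxy into an actual crawl request, the snippet below assumes the get_ip_list and get_random_ip functions, url and headers defined above; the target URL and retry count are hypothetical.

# Assumes get_ip_list, get_random_ip, url and headers are defined as above.
ip_list = get_ip_list(url, headers=headers)   # grab proxies from the first listing page only
target = 'http://www.rkpass.cn/'              # hypothetical target page

for attempt in range(3):
    proxies = get_random_ip(ip_list)          # pick a different random proxy on each attempt
    try:
        resp = requests.get(target, headers=headers, proxies=proxies, timeout=10)
        print(resp.status_code)
        break
    except requests.exceptions.RequestException:
        # free proxies are often unstable, so just try the next random one
        continue

Free high-anonymity proxies tend to drop frequently, which is why the sketch retries with a different random proxy instead of assuming the first one works.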