Recently, while preparing for an exam, I needed to crawl soft exam questions online and ran into some problems along the way. This article mainly covers using Python with automatic IP proxies to crawl soft exam questions. The explanation is fairly detailed, so friends who need it can read on.
Objective
There is an upcoming software professional qualification test, hereinafter referred to as the "soft exam". To prepare and review more effectively, I planned to crawl the soft exam questions from www.rkpass.cn.
Let me start with the pits I fell into while crawling the soft exam questions. At this point I can automatically crawl all the questions of a given module, for example:
At present, the 30 test questions of the Information Systems Supervisor module can be captured, as shown in the result:
The captured question content looks like this:
Although some information can be captured, the quality of the code is not high. Taking the Information Systems Supervisor crawl as an example: because the target was clear and the parameters were clear, and I wanted to grab the information in a short time, I did no exception handling, and last night I spent a long time filling in the resulting pit.
Back to the topic: today's post exists because of a new pit. From the title you can probably guess what happened: too many requests were sent, so my IP was blocked by the site's anti-crawler mechanism.
But we cannot let a small obstacle stop us. As the saying goes, when you meet a mountain you dig a tunnel, and when you meet a river you build a bridge. To solve the blocked-IP problem, the idea of using IP proxies came up.
While crawling, if the crawl frequency exceeds the threshold set by the website, access will be forbidden. Usually, the site's anti-crawler mechanism identifies crawlers by their IP address.
So crawler developers usually take one of two measures to solve this problem:
1. Slow down the crawl speed to reduce the pressure on the target site. However, this reduces the amount crawled per unit of time (a minimal throttling sketch follows the list).
2. Set up proxy IPs or similar means to break through the anti-crawler mechanism and keep crawling at high frequency. However, this requires multiple stable proxy IPs.
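As a minimal sketch of the first measure (not part of the original crawler), the snippet below simply sleeps a random interval between requests so the request rate stays below the site's threshold; the User-Agent string and delay range are placeholder assumptions.

import random
import time

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder UA string

def crawl_slowly(urls):
    """Fetch each URL with a random pause between requests."""
    pages = []
    for url in urls:
        resp = requests.get(url, headers=headers)
        pages.append(resp.text)
        # sleep 2-5 seconds between requests; tune to the target site's tolerance
        time.sleep(random.uniform(2, 5))
    return pages

This post takes the second route, so the rest of the article focuses on proxy IPs.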
Without further ado, here is the code:
# IP addresses are taken from the domestic high-anonymity proxy site: http://www.xicidaili.com/nn/
# Crawling just the listing pages is enough for general use
from bs4 import BeautifulSoup
import requests
import random

# Get the IPs listed on the current page
def get_ip_list(url, headers):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'html.parser')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list

# Randomly pick one IP from the captured list and build a proxies dict
def get_random_ip(ip_list):
    proxy_list = []
    for ip in ip_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'https': proxy_ip}
    return proxies

# Main address of the domestic high-anonymity proxy IP site
url = 'http://www.xicidaili.com/nn/'
# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
# Counter used to loop over and crawl all listing pages
num = 0
# Array that stores the captured IPs
ip_array = []
while num < 1537:
    num += 1
    ip_list = get_ip_list(url + str(num), headers=headers)
    ip_array.append(ip_list)
for ip in ip_array:
    print(ip)
# Randomly pick one IP and build the proxies dict
# proxies = get_random_ip(ip_list)
# print(proxies)
Operation Result:
In this way, by setting the crawler's request IP to a randomly chosen proxy IP for each request, we can effectively get around the simple anti-crawler measure of blocking a fixed IP.
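The code above only prints the proxies, so as a minimal sketch of wiring a random proxy into an actual crawl request, the snippet below assumes the get_ip_list and get_random_ip functions, url and headers defined above; the target URL and retry count are hypothetical.

# Assumes get_ip_list, get_random_ip, url and headers are defined as above.
ip_list = get_ip_list(url, headers=headers)   # grab proxies from the first listing page only
target = 'http://www.rkpass.cn/'              # hypothetical target page

for attempt in range(3):
    proxies = get_random_ip(ip_list)          # pick a different random proxy on each attempt
    try:
        resp = requests.get(target, headers=headers, proxies=proxies, timeout=10)
        print(resp.status_code)
        break
    except requests.exceptions.RequestException:
        # free proxies are often unstable, so just try the next random one
        continue

Free high-anonymity proxies tend to drop frequently, which is why the sketch retries with a different random proxy instead of assuming the first one works.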