Crawling soft exam questions with Python: automatic IP proxies for the crawler


Preface

Recently I have been preparing for the software professional qualification examination, hereinafter referred to as the soft exam. To review and prepare more efficiently, I plan to crawl the soft exam questions from www.rkpass.cn.

First, let me describe how I crawled the soft exam questions (and the pits I fell into along the way). The crawler can now automatically capture all the questions of a given module, for example:

Currently it can capture all 30 recorded question sets for the Information Systems Supervisor exam, as shown in the results:

Screenshot of the captured content:

Although the crawler captures the information I need, the quality of the code is not high. For example, because the Information Systems Supervisor crawler has a single, clear target and fixed parameters, and I wanted to grab the exam data in a short time, I did no exception handling at all, and last night that cost me a long fall into a pit.
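For what it's worth, a minimal sketch of the kind of exception handling that was missing might look like the following; the helper name fetch_with_retry, the retry count and the timeout are illustrative assumptions, not part of the original crawler.

import requests

# Sketch of the missing exception handling: the helper name, retry count and
# timeout are assumptions for illustration, not the original crawler's values.
def fetch_with_retry(url, headers, retries=3, timeout=10):
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=timeout)
            resp.raise_for_status()  # raise an error for 4xx/5xx responses
            return resp
        except requests.RequestException as e:
            print('Request failed (attempt %d): %s' % (attempt + 1, e))
    return None  # give up after all retries fail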

Back to the topic: I am writing this post today because I ran into a new challenge. As the title suggests, I sent too many requests, and my IP address was blocked by the website's anti-crawler mechanism.

But such a thing should not stop a living person. The deeds of our revolutionary predecessors tell us that, as successors of socialism, we must not give in to difficulties: when we meet a mountain we cut a road, and when we meet water we build a bridge. To solve the blocked IP problem, the idea of IP proxies came up.

When a crawler fetches information faster than a website's frequency threshold allows, the site blocks further access. In general, a website's anti-crawler mechanism identifies crawlers by their IP addresses.

Therefore, crawler developers usually take one of two measures to solve this problem:

1. Slow down the crawling speed to reduce the pressure on the target website (see the short sketch after this list). However, this reduces the crawling volume per unit of time.

2. Break through the anti-crawler mechanism and keep crawling at high frequency by setting proxy IP addresses. However, this requires multiple stable proxy IP addresses.
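
As a quick illustration of the first measure, here is a minimal sketch that simply throttles requests with time.sleep; the URL list and the two-second delay are hypothetical placeholders. The rest of this article focuses on the second measure.

import time
import requests

# Measure 1: slow the crawl down so it stays under the site's frequency threshold.
# The page URLs and the 2-second delay below are assumptions for illustration only.
urls = ['http://www.example.com/page/%d' % i for i in range(1, 6)]
for url in urls:
    resp = requests.get(url)
    print(url, resp.status_code)
    time.sleep(2)  # pause between requests to reduce pressure on the target site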

Without further ado, here is the code:

# IP addresses come from the domestic proxy IP site: http://www.xicidaili.com/nn/
# Crawling just the home page already yields enough IP addresses to use.
from bs4 import BeautifulSoup
import requests
import random

# Get the IP addresses on the current page
def get_ip_list(url, headers):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'html.parser')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list

# Get one IP address at random from the captured IP list
def get_random_ip(ip_list):
    proxy_list = []
    for ip in ip_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies

# Domestic high-anonymity proxy IP site
url = 'http://www.xicidaili.com/nn/'
# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
# Counter used to crawl the IP addresses of every page in a loop
num = 0
# Array that stores the captured IP addresses
ip_array = []
while num < 1537:
    num += 1
    ip_list = get_ip_list(url + str(num), headers=headers)
    ip_array.append(ip_list)
for ip in ip_array:
    print(ip)
# Pick an IP address at random from the captured list
# proxies = get_random_ip(ip_list)
# print(proxies)

Running result:

In this way, by setting the IP address of each crawl request to a randomly chosen proxy, the crawler can effectively evade the simple anti-crawler strategy of blocking a fixed IP address.
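For reference, a randomly chosen proxy from the list above could be plugged into a crawl request roughly like this; target_url is a hypothetical placeholder, not the real exam-question page, and the call reuses the get_random_ip function and headers defined earlier.

# Sketch: send one crawl request through a random proxy captured above.
# target_url is a placeholder for whatever page you actually crawl.
target_url = 'http://www.example.com/'
proxies = get_random_ip(ip_list)  # e.g. {'http': 'http://1.2.3.4:8080'}
try:
    resp = requests.get(target_url, headers=headers, proxies=proxies, timeout=10)
    print(resp.status_code)
except requests.RequestException as e:
    print('Proxy request failed:', e)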


For the stability of the proxy website, the crawling speed is still kept under control; after all, life is not easy for webmasters. Only 17 IP addresses were captured for this article.

Summary

That is all for this article. I hope it helps you learn or use Python. If you have any questions, please leave a message; thank you for your support.
