Implementing a proxy IP pool for Python crawlers


At work I built a stable proxy pool service for distributed deep-web crawlers. It supplies thousands of crawlers with proxy IPs that are valid for the sites they target, which keeps the crawlers running fast and stably. That work cannot be open-sourced, of course, but in my spare time I wanted to use some free resources to build a simple proxy pool service of my own.

1. Problems

Where do the proxy IPs come from?
When I was first teaching myself to write crawlers and had no proxy IPs, I would scrape free-proxy sites such as Xici and Kuaidaili, and some of those proxies actually worked. Of course, if you have access to a better proxy API, you can plug that in instead. Collecting free proxies is very simple, nothing more than: fetch the page -> extract with regular expressions or XPath -> save.
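A minimal sketch of that fetch -> extract -> save flow is below. The URL and XPath expressions are placeholders, since every free-proxy site lays out its pages differently; adapt them to the target page.

import requests
from lxml import etree

def fetch_free_proxies(url="http://example.com/free-proxy-list"):
    """Grab one page of a free-proxy site and yield 'ip:port' strings."""
    resp = requests.get(url, timeout=10)
    tree = etree.HTML(resp.text)
    # Assume each proxy sits in a table row, IP in the first cell and
    # port in the second cell (adjust the XPath to the real page).
    for row in tree.xpath("//table//tr[position()>1]"):
        cells = row.xpath("./td/text()")
        if len(cells) >= 2:
            yield "{}:{}".format(cells[0].strip(), cells[1].strip())

if __name__ == "__main__":
    for proxy in fetch_free_proxies():
        print(proxy)   # in the real pool these would be written to the DB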

How do we ensure proxy quality?
It is safe to say that most free proxy IPs are unusable; otherwise, why would anyone pay? (In fact, many paid proxies are also unstable and unusable.) So the collected proxy IPs cannot be used directly. You can write a detection program that keeps using these proxies to request a stable website and checks whether they work. Since checking proxies is slow, this process should use multiple threads or asynchronous I/O.
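Here is a sketch of that idea with a thread pool. The check URL and the 5-second timeout are arbitrary choices for illustration, not part of the original project.

import requests
from concurrent.futures import ThreadPoolExecutor

CHECK_URL = "http://httpbin.org/ip"   # any stable page works

def is_usable(proxy, timeout=5):
    """Return True if the proxy can fetch CHECK_URL within the timeout."""
    try:
        resp = requests.get(CHECK_URL,
                            proxies={"http": "http://" + proxy},
                            timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

def filter_usable(proxies, workers=20):
    """Check many proxies in parallel and keep only the working ones."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(is_usable, proxies))
    return [p for p, ok in zip(proxies, results) if ok]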

How do we store the collected proxies?
Here I recommend SSDB, a high-performance NoSQL database that supports multiple data structures and works well as a Redis alternative for storing proxies. It supports queues, hashes, sets, and key-value pairs, and handles terabyte-scale data, which makes it a good intermediate storage layer for distributed crawlers.
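SSDB speaks the Redis wire protocol for most commands, so one simple way to talk to it from Python is the standard redis client. The hash name and connection details below are assumptions for illustration, not the project's actual schema.

import redis

# SSDB listens on port 8888 by default; adjust host/port to your setup.
db = redis.StrictRedis(host="127.0.0.1", port=8888, decode_responses=True)

def save_proxy(proxy):
    # Keep proxies as fields of a hash; the value can hold metadata
    # such as a failure counter.
    db.hset("useful_proxy", proxy, 0)

def all_proxies():
    return db.hkeys("useful_proxy")

def delete_proxy(proxy):
    db.hdel("useful_proxy", proxy)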

How can crawlers use these proxies more easily?
The answer is to make it a service. Python has plenty of web frameworks; pick one and write an API for crawlers to call. This has many advantages: for example, when a crawler finds that a proxy does not work, it can actively delete that proxy IP through the API, and when a crawler notices the pool is running low, it can actively trigger a refresh of the pool. This is more reliable than relying on the detection program alone.
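As an illustration of what such a service can look like, here is a minimal Flask sketch. The routes mirror the get/delete/refresh interfaces described later in this article, but the responses are placeholders rather than the project's actual implementation.

from flask import Flask, request

app = Flask(__name__)

@app.route("/get/")
def get():
    # A real implementation would ask the proxy manager for one verified proxy.
    return "127.0.0.1:8080"      # placeholder

@app.route("/delete/")
def delete():
    proxy = request.args.get("proxy")
    # A real implementation would remove the reported proxy from the pool.
    return "deleted: {}".format(proxy)

@app.route("/refresh/")
def refresh():
    # A real implementation would trigger the proxy getter to refill the pool.
    return "refresh started"

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)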

2. Proxy pool design

The proxy pool consists of four parts:

ProxyGetter:
Proxy acquisition interface. There are currently five free proxy sources; each call fetches the latest proxies from those five sites and puts them into the DB. You can add extra proxy sources yourself;

DB:
Used to store proxy IPs. Only SSDB is supported at the moment. As for why SSDB was chosen, you can refer to this article; I personally think SSDB is a good alternative to Redis. If you have never used SSDB before, it is easy to install; see the SSDB documentation for details;

Schedule:
A scheduled task that periodically checks the availability of the proxies in the database and deletes the unavailable ones. It also actively fetches the latest proxies into the DB through the ProxyGetter;

ProxyApi:
The external interface of the proxy pool. Since the pool's functionality is fairly simple for now, I spent a couple of hours reading up on Flask and happily decided to use it. The API provides get, delete, refresh, and other endpoints so crawlers can use the pool directly.

[Design diagram]

3. Code modules

Python's high-level data structures, dynamic typing, and dynamic binding make it well suited to rapid application development and to gluing existing software components together. Building this proxy IP pool in Python is also very simple. The code is divided into six modules:

Api: API-related code. The API is currently implemented with Flask, and the code is very simple. Client requests go to Flask, which calls the corresponding implementation in ProxyManager, including get/delete/refresh/get_all;

DB: database-related code. SSDB is the only database used at the moment. The code uses the factory pattern so that other database types can be added easily in the future (a small sketch of the idea follows);
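The factory idea in miniature: the rest of the code asks a single factory for a client, and the concrete backend is chosen by name, so supporting another database later only means registering one more class. The class and method names here are illustrative, not the project's actual ones.

class SsdbClient(object):
    # Illustrative backend; a real client would talk to SSDB over the network.
    def __init__(self, host, port):
        self.host, self.port = host, port

class DbClient(object):
    # Maps the db type named in the configuration to a concrete client class.
    _backends = {"SSDB": SsdbClient}

    @classmethod
    def create(cls, db_type, host, port):
        try:
            return cls._backends[db_type](host, port)
        except KeyError:
            raise ValueError("unsupported db type: {}".format(db_type))

# client = DbClient.create("SSDB", "127.0.0.1", 8888)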

Manager: the concrete implementation of get/delete/refresh/get_all. The proxy pool is currently only responsible for managing proxies; more features may be added later, such as binding proxies to specific crawlers or to accounts;

ProxyGetter: the proxy-fetching code. It currently scrapes free proxies from five websites, including Kuaidaili (fast proxy), Proxy 66, Xici, and guobanjia. In testing, those five sites only yield sixty or seventy usable proxies per day, but you can of course extend it with your own proxy sources;

Schedule: code for the scheduled tasks. It currently refreshes the pool at regular intervals and re-verifies the usable proxies, using multiple processes;

Util: holds shared helper classes and functions, including GetConfig (a class that reads the Config.ini configuration file), ConfigParse (a rewrite of ConfigParser that makes it case sensitive), Singleton (a singleton implementation), LazyProperty (lazy evaluation of class attributes, sketched below), and so on;
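A common recipe for such a LazyProperty descriptor looks like this; it is a generic sketch, not necessarily the project's exact code. The wrapped method runs once, and its result then replaces the descriptor on the instance.

class LazyProperty(object):
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        if instance is None:
            return self
        value = self.func(instance)
        # Cache the result on the instance; later lookups find the attribute
        # directly and never reach the descriptor again.
        setattr(instance, self.func.__name__, value)
        return value

class Config(object):
    @LazyProperty
    def db_host(self):
        print("parsing config ...")   # expensive work happens only once
        return "127.0.0.1"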

Other files: the configuration file Config.ini, which holds the database settings and the proxy source settings. You can add a new proxy-fetching method in GetFreeProxy and register it in Config.ini;

4. Installation

Download Code:

git clone git@github.com:jhao104/proxy_pool.git, or download the zip directly from https://github.com/jhao104/proxy_pool

Install dependencies:

pip install -r requirements.txt

Start:

You need to start both the scheduled task and the API, after configuring your SSDB connection in Config.ini. In the Schedule directory:

>>> python ProxyRefreshSchedule.py

In the Api directory:

>>> python ProxyApi.py

5. Usage

After the scheduled task starts, it fetches all the proxies, writes them to the database, and verifies them. By default this runs every 20 minutes. A minute or two after the task first starts, you can already see verified, usable proxies appearing in SSDB.
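Roughly, one refresh cycle can be pictured like the sketch below. The helper callables stand in for the fetching, checking, and storage code sketched earlier in this article; only the 20-minute interval comes from the project's default.

import time

REFRESH_INTERVAL = 20 * 60   # seconds, i.e. the 20-minute default

def refresh_once(fetch, check, save, load_all, delete):
    """One refresh pass using injected helper callables."""
    for proxy in fetch():          # pull fresh proxies from the free sources
        if check(proxy):
            save(proxy)            # keep only the ones that pass the check
    for proxy in load_all():       # re-check what is already stored
        if not check(proxy):
            delete(proxy)          # drop proxies that have gone stale

def run_forever(**helpers):
    while True:
        refresh_once(**helpers)
        time.sleep(REFRESH_INTERVAL)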

The API can be wrapped in functions and used directly in crawler code, for example:

import requests

def get_proxy():
    # The api returns a single proxy as plain text, e.g. "127.0.0.1:8080".
    return requests.get("http://127.0.0.1:5000/get/").text.strip()

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5000/delete/?proxy={}".format(proxy))

# your spider code
def spider():
    # ....
    requests.get('https://www.example.com',
                 proxies={"http": "http://{}".format(get_proxy())})
    # ....
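In practice a crawler usually also retries with a fresh proxy and reports dead ones back through the delete interface. A minimal sketch of that pattern, reusing get_proxy()/delete_proxy() from above (the retry count and target URL are arbitrary):

import requests

def get_html(url, retry_count=5):
    # Try up to retry_count different proxies before giving up.
    while retry_count > 0:
        proxy = get_proxy()   # from the snippet above
        try:
            return requests.get(url,
                                proxies={"http": "http://{}".format(proxy)},
                                timeout=10).text
        except requests.RequestException:
            retry_count -= 1
            delete_proxy(proxy)   # the proxy failed; ask the pool to drop it
    return None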

6. Finally

Time was short, so the features and code are fairly simple; I will improve them when I have more time. If you find the project useful, please give it a star on GitHub. Thank you!

