A deep understanding of the Python crawler proxy pool service
At work I built a stable proxy pool service for distributed deep-web crawlers, providing effective proxies to thousands of crawlers and making sure each one gets valid proxy IPs for its target sites, so that the crawlers run fast and stably. Naturally, the code written for the company cannot be open-sourced. In my spare time, however, I wanted to use some free resources to build a simple proxy pool service.
1. Problems
Where does the proxy IP come from?
When I was still teaching myself web scraping and had no proxy IPs, I went to sites that publish free proxies, such as Xici and Kuaidaili, and crawled them; some of those proxies actually worked. Of course, if you have access to a better proxy API, you can plug that in instead.
Collecting free proxies is also very simple, nothing more than: fetch the page -> extract with regular expressions or XPath -> save.
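To make this concrete, here is a minimal sketch of the fetch -> extract -> save flow. The URL is a placeholder, and real free-proxy pages usually need XPath over their table rows rather than a bare regular expression:

Python

import re
import requests

def fetch_free_proxies(url="http://free-proxy-list.example.com/"):
    # Fetch a free-proxy listing page and pull out "ip:port" strings.
    # The URL is a placeholder; adapt the extraction to the real page layout.
    headers = {"User-Agent": "Mozilla/5.0"}
    html = requests.get(url, headers=headers, timeout=10).text
    return re.findall(r"\d{1,3}(?:\.\d{1,3}){3}:\d{2,5}", html)

if __name__ == "__main__":
    for proxy in fetch_free_proxies():
        print(proxy)  # in the real pool these would be written to the DB instead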
How do you ensure proxy quality?
It is safe to say that most free proxy IPs are unusable, otherwise why would anyone pay for them (although in practice many providers' paid IPs are also unstable, and plenty of those do not work either)? So the harvested proxy IPs cannot be used directly. Instead, you write a checker program that continually uses these proxies to request a stable website and sees whether they respond normally. Because checking proxies is a slow process, it should be done with multiple threads or asynchronously.
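As a rough illustration (not the project's actual checker), a thread pool can validate a batch of proxies against a stable test URL; httpbin.org/ip is used here only as an example target:

Python

from concurrent.futures import ThreadPoolExecutor
import requests

CHECK_URL = "http://httpbin.org/ip"  # example target; any stable site works

def is_alive(proxy, timeout=5):
    # A proxy counts as usable if a request through it returns HTTP 200 in time.
    try:
        resp = requests.get(CHECK_URL,
                            proxies={"http": "http://" + proxy, "https": "http://" + proxy},
                            timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

def filter_alive(proxies, workers=20):
    # Check many proxies concurrently, since each check can take several seconds.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(is_alive, proxies))
    return [p for p, ok in zip(proxies, results) if ok]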
How do I store the collected proxy?
Here I have to recommend SSDB, a high-performance NoSQL database that supports multiple data structures and works as a stand-in for Redis. It supports queues, hashes, sets, and key-value pairs, and can handle terabyte-scale data, which makes it a good intermediate storage layer for distributed crawlers.
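Since SSDB speaks a Redis-compatible protocol, a standard redis-py client can usually talk to it directly. The sketch below assumes SSDB is listening on its default port 8888 and keeps proxies in a hash named useful_proxy (the name that appears later in this article); it is an illustration, not the project's DB layer:

Python

import redis

# SSDB accepts Redis-protocol clients; 8888 is SSDB's default port.
db = redis.Redis(host="127.0.0.1", port=8888, decode_responses=True)

def save_proxy(proxy):
    db.hset("useful_proxy", proxy, 1)   # the value is just a placeholder flag

def all_proxies():
    return list(db.hkeys("useful_proxy"))

def remove_proxy(proxy):
    db.hdel("useful_proxy", proxy)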
How can crawlers use these proxies more easily?
The answer is definitely to make it a service. Python has plenty of web frameworks; pick one and write an API for the crawlers to call. This has many advantages: for example, when a crawler finds that a proxy does not work, it can actively delete that proxy IP through the API, and when a crawler finds the pool is running low on IPs, it can actively trigger a refresh of the pool. This is more reliable than relying on the checker program alone.
2. Proxy Pool Design
The proxy pool consists of four parts:
ProxyGetter:
The proxy retrieval interface. There are currently five free proxy sources; each call fetches the latest proxies from these five websites and puts them into the DB. You can add extra proxy retrieval interfaces yourself;
DB:
Used to store the proxy IP addresses. Currently only SSDB is supported. As for why SSDB was chosen, you can refer to this article; I personally think SSDB is a good alternative to Redis. If you have never used SSDB, it is easy to install; see here for instructions;
Schedule:
The scheduled task periodically checks the availability of the proxies in the database and deletes the unavailable ones. It also actively fetches the latest proxies into the DB through ProxyGetter;
ProxyApi:
The external interface of the proxy pool. Since the pool's functionality is fairly simple for now, I spent two hours reading up on Flask and happily decided to use it. The API provides get, delete, refresh, and other endpoints so crawlers can use the pool directly (a rough Flask sketch follows this list).
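As a rough sketch of what such an API can look like (the real ProxyApi delegates these calls to the ProxyManager described below; the in-memory set here only stands in for the DB):

Python

from flask import Flask, request

app = Flask(__name__)

POOL = {"1.2.3.4:8080"}   # stand-in for the DB-backed proxy pool

@app.route("/get/")
def get():
    # Hand out one proxy, or a fixed message when the pool is empty.
    return next(iter(POOL), "no proxy!")

@app.route("/delete/")
def delete():
    # Crawlers call this when they find a proxy no longer works.
    POOL.discard(request.args.get("proxy", ""))
    return "success"

@app.route("/get_all/")
def get_all():
    return ",".join(POOL)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)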
3. Code Module
Python's high-level data structures, dynamic typing, and dynamic binding make it well suited to rapid application development, and also to gluing existing software components together. Building this proxy IP pool in Python is therefore quite simple. The code is divided into six modules:
Api:
API-related code. The API is currently implemented with Flask, and the code is very simple. Client requests go to Flask, which calls the corresponding implementations in ProxyManager: get/delete/refresh/get_all;
DB:
Database-related code. Only SSDB is used at the moment. The code follows the factory pattern so that other database types can be added later (a minimal sketch of this idea appears after this module list);
Manager:
The concrete implementations of interfaces such as get/delete/refresh/get_all. For now the proxy pool is only responsible for managing proxies; more features may come later, such as binding proxies to crawlers or binding proxies to accounts;
ProxyGetter:
The proxy-fetching code. It currently crawls free proxies from five websites: Kuaidaili, Proxy 66, Youdaili, Xici, and guobanjia. Testing shows that these five sites only yield sixty or seventy usable proxies from their daily updates. Of course, you can also extend it with your own proxy interfaces;
Schedule:
The code for scheduled tasks. At the moment it only refreshes the proxies periodically and verifies which ones are usable, running in multi-process mode;
Util:
Holds common helper classes and functions, including GetConfig (a class that reads the Config.ini configuration file), ConfigParse (a subclass that overrides ConfigParser to make it case sensitive), Singleton (implements a singleton), LazyProperty (implements lazy evaluation of class properties), and so on;
Other files:
The configuration file Config.ini. You can add a new proxy acquisition method in GetFreeProxy and register it in Config.ini;
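To make the factory idea mentioned for the DB module concrete, here is a minimal sketch: a DbClient picks a concrete backend class by the type name read from Config.ini, so another backend could be plugged in later without touching callers. The class and method names are illustrative, not the project's actual ones:

Python

import redis

class SsdbClient(object):
    # SSDB speaks a Redis-compatible protocol, so redis-py can be reused here.
    def __init__(self, host, port):
        self.conn = redis.Redis(host=host, port=port, decode_responses=True)

    def put(self, name, proxy):
        self.conn.hset(name, proxy, 1)

    def get_all(self, name):
        return list(self.conn.hkeys(name))

    def delete(self, name, proxy):
        self.conn.hdel(name, proxy)

class DbClient(object):
    """Factory: choose the concrete backend by the db type configured in Config.ini."""
    backends = {"SSDB": SsdbClient}

    def __init__(self, db_type, host, port):
        self.client = self.backends[db_type](host, port)

    def __getattr__(self, item):
        # Delegate put/get_all/delete/... to the selected backend.
        return getattr(self.client, item)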
4. Installation
Download Code:
git clone git@github.com:jhao104/proxy_pool.git

Or go directly to https://github.com/jhao104/proxy_pool and download the zip.
Install dependencies:

pip install -r requirements.txt
Start:

Configure your SSDB connection in Config.ini first, then start the scheduled task and the API separately.

In the Schedule directory:
>>> python ProxyRefreshSchedule.py

In the Api directory:
>>> python ProxyApi.py
5. Usage
After the scheduled task starts, it fetches all proxies, puts them into the database, and verifies them. By default this runs every 20 minutes. About one or two minutes after the scheduled task has started, you can see the available proxies refreshed in SSDB:
[Screenshot: the useful_proxy hash in SSDB]
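To show how the pieces fit together, here is a minimal sketch of a periodic refresh loop of the kind ProxyRefreshSchedule.py performs: fetch new proxies, keep the usable ones, and sleep 20 minutes between rounds. fetch_free_proxies, filter_alive, and save_proxy are the illustrative helpers sketched earlier in this article, not the project's real functions:

Python

import time

REFRESH_INTERVAL = 20 * 60   # seconds; the pool refreshes every 20 minutes by default

def refresh_once():
    # Fetch candidates, validate them, and store the survivors.
    candidates = fetch_free_proxies()
    for proxy in filter_alive(candidates):
        save_proxy(proxy)

if __name__ == "__main__":
    while True:
        refresh_once()
        time.sleep(REFRESH_INTERVAL)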
After ProxyApi.py is started, you can call the API in a browser to fetch proxies:
Index page: [screenshot]
Get page: [screenshot]
Get_all page: [screenshot]
The API can be wrapped in helper functions for use in crawler code, for example:
Python
import requests

def get_proxy():
    # .text so the proxy comes back as a str
    return requests.get("http://127.0.0.1:5000/get/").text

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5000/delete/?proxy={}".format(proxy))

# your spider code
def spider():
    # ....
    requests.get('https://www.example.com',
                 proxies={"http": "http://{}".format(get_proxy())})
    # ....
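Building on these helpers, a crawler can also report dead proxies back to the pool when a request fails. The retry pattern below is only a sketch using the functions above:

Python

import requests

def fetch_with_proxy(url, retries=3):
    # Try up to `retries` different proxies, deleting each one that fails.
    for _ in range(retries):
        proxy = get_proxy()
        try:
            return requests.get(url,
                                proxies={"http": "http://{}".format(proxy)},
                                timeout=10)
        except requests.RequestException:
            delete_proxy(proxy)   # tell the pool this proxy is no longer usable
    raise RuntimeError("no working proxy after {} attempts".format(retries))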
6. Finally
Time was short, so the features and the code are still fairly simple; I will improve them when I have more time. If you find it useful, please give it a star on GitHub. Thank you!
GitHub address: https://github.com/jhao104/proxy_pool