This article introduces how to implement an asynchronous proxy crawler and proxy pool in Python; it should be a useful reference, so let's take a look.
Python's asyncio is used to implement an asynchronous proxy pool: free proxies are crawled from proxy websites according to rules, verified, and stored in redis if they are valid. The pool is topped up with new proxies regularly, and the proxies already in the pool are re-checked periodically so that invalid ones can be removed. At the same time, a server implemented with aiohttp lets other programs obtain proxies from the pool by requesting the corresponding url.
Source code
https://github.com/arrti/proxypool
Environment
Python 3.5+
Redis
PhantomJS (optional)
Supervisord (optional)
Because the code makes extensive use of asyncio's async and await syntax, which was only introduced in Python 3.5, it is best to use Python 3.5 or later. I use Python 3.6.
Dependency
redis
aiohttp
bs4
lxml
requests
selenium
The selenium package is mainly used to operate PhantomJS.
The following describes the code.
1. Crawler
Core code
async def start(self):
    for rule in self._rules:
        parser = asyncio.ensure_future(self._parse_page(rule))  # parse pages and extract proxies according to the rule
        logger.debug('{0} crawler started'.format(rule.rule_name))

        if not rule.use_phantomjs:
            await page_download(ProxyCrawler._url_generator(rule), self._pages, self._stop_flag)  # crawl the proxy website's pages
        else:
            await page_download_phantomjs(ProxyCrawler._url_generator(rule), self._pages,
                                          rule.phantomjs_load_flag, self._stop_flag)  # crawl with PhantomJS

        await self._pages.join()

        parser.cancel()

        logger.debug('{0} crawler finished'.format(rule.rule_name))
The core code above is essentially a producer-consumer model implemented with asyncio.Queue. Below is a simple implementation of this model:
import asyncio
from random import random

async def produce(queue, n):
    for x in range(1, n + 1):
        print('produce', x)
        await asyncio.sleep(random())
        await queue.put(x)  # put an item into the queue

async def consume(queue):
    while 1:
        item = await queue.get()  # wait to get an item from the queue
        print('consume', item)
        await asyncio.sleep(random())
        queue.task_done()  # notify the queue that the current item has been processed

async def run(n):
    queue = asyncio.Queue()
    consumer = asyncio.ensure_future(consume(queue))
    await produce(queue, n)  # wait until the producer finishes
    await queue.join()  # block until all items in the queue have been processed
    consumer.cancel()  # cancel the consumer task, otherwise it will block in the get method forever

def aio_queue_run(n):
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(run(n))  # run the event loop until run(n) finishes
    finally:
        loop.close()

if __name__ == '__main__':
    aio_queue_run(5)
Running the code above, one possible output is:
produce 1
produce 2
consume 1
produce 3
produce 4
consume 2
produce 5
consume 3
consume 4
consume 5
Crawling pages
async def page_download(urls, pages, flag):
    url_generator = urls
    async with aiohttp.ClientSession() as session:
        for url in url_generator:
            if flag.is_set():
                break

            await asyncio.sleep(uniform(delay - 0.5, delay + 1))
            logger.debug('crawling proxy web page {0}'.format(url))
            try:
                async with session.get(url, headers=headers, timeout=10) as response:
                    page = await response.text()
                    parsed = html.fromstring(decode_html(page))  # use bs4 to assist lxml in decoding web pages: http://lxml.de/elementsoup.html#using-only-the-encoding-detection
                    await pages.put(parsed)
                    url_generator.send(parsed)  # get the address of the next page based on the current page
            except asyncio.TimeoutError:
                logger.error('crawling {0} timeout'.format(url))
                continue  # TODO: use a proxy
            except Exception as e:
                logger.error(e)
This is the page-crawling function implemented with aiohttp. Most proxy websites can be crawled this way; for sites that generate pages dynamically with js, selenium can be used to control PhantomJS. This project does not require high crawler efficiency, and proxy websites update only so often, so frequent crawling is unnecessary and PhantomJS is perfectly adequate.
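As a rough illustration of that approach (this is not the project's page_download_phantomjs; the function name, its arguments, and the load-flag handling are assumptions), driving PhantomJS from selenium and handing the rendered page to lxml might look like this:

# A minimal sketch, not the project's actual code. It assumes an older selenium
# release where webdriver.PhantomJS is still available, and that load_flag is a
# string whose presence in the page source marks that rendering has finished.
from lxml import html
from selenium import webdriver

def fetch_with_phantomjs(url, load_flag, timeout=30):
    driver = webdriver.PhantomJS()
    try:
        driver.set_page_load_timeout(timeout)
        driver.get(url)                  # PhantomJS executes the page's js
        page = driver.page_source        # rendered html
        if load_flag not in page:        # page did not finish loading as expected
            return None
        return html.fromstring(page)     # same kind of lxml tree the parser expects
    finally:
        driver.quit()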
Parsing the proxies
The simplest approach is to parse the proxies with xpath. In Chrome, you can right-click a page element and copy its xpath.
Installing Chrome's "XPath Helper" extension lets you run and debug xpath expressions directly on the page, which is very convenient.
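Concretely, the kind of xpath you end up with might look like the following standalone sketch; the markup and xpaths are invented for illustration and are not taken from any particular proxy site:

from lxml import html

# Hypothetical markup; real proxy sites differ, so these xpaths are only illustrative.
page = html.fromstring('''
<table id="ip_list">
  <tr><th>IP</th><th>Port</th></tr>
  <tr><td>1.2.3.4</td><td>8080</td></tr>
  <tr><td>5.6.7.8</td><td>3128</td></tr>
</table>
''')

ips = page.xpath('//table[@id="ip_list"]//tr/td[1]')    # IP cells
ports = page.xpath('//table[@id="ip_list"]//tr/td[2]')  # port cells
print(['{0}:{1}'.format(i.text.strip(), p.text.strip()) for i, p in zip(ips, ports)])
# ['1.2.3.4:8080', '5.6.7.8:3128']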
BeautifulSoup does not support xpath, so lxml is used to parse pages. The code is as follows:
async def _parse_proxy(self, rule, page):
    ips = page.xpath(rule.ip_xpath)      # list of IP elements parsed from the page via xpath
    ports = page.xpath(rule.port_xpath)  # list of port elements parsed from the page via xpath

    if not ips or not ports:
        logger.warning('{2} crawler could not get ip(len={0}) or port(len={1}), please check the xpaths or network'.format(
            len(ips), len(ports), rule.rule_name))
        return

    proxies = map(lambda x, y: '{0}:{1}'.format(x.text.strip(), y.text.strip()), ips, ports)

    if rule.filters:  # filter proxies by the filter fields, e.g. "high anonymity" and "transparent"
        filters = []
        for i, ft in enumerate(rule.filters_xpath):
            field = page.xpath(ft)
            if not field:
                logger.warning('{1} crawler could not get {0} field, please check the filter xpath'.format(
                    rule.filters[i], rule.rule_name))
                continue
            filters.append(map(lambda x: x.text.strip(), field))

        filters = zip(*filters)
        selector = map(lambda x: x == rule.filters, filters)
        proxies = compress(proxies, selector)

    for proxy in proxies:
        await self._proxies.put(proxy)  # put the parsed proxy into asyncio.Queue
Crawler rules
The rules for crawling each website, parsing proxies, filtering, and so on are defined by a rule class per proxy website. A metaclass and a base class are used to manage the rule classes. The base class is defined as follows:
class CrawlerRuleBase(object, metaclass=CrawlerRuleMeta):

    start_url = None
    page_count = 0
    urls_format = None
    next_page_xpath = None
    next_page_host = ''

    use_phantomjs = False
    phantomjs_load_flag = None

    filters = ()

    ip_xpath = None
    port_xpath = None
    filters_xpath = ()
The meanings of parameters are as follows:
start_url
(Required)
The starting page of the crawler.
ip_xpath
(Required)
The xpath rule for crawling IP addresses.
port_xpath
(Required)
The xpath rule for crawling the port number.
page_count
The number of crawled pages.
urls_format
The format string for page addresses; urls_format.format(start_url, n) generates the address of page n. This is a common page-address format.
next_page_xpath
next_page_host
The xpath rule gets the url of the next page (usually a relative path), and the full url of the next page is then next_page_host + url.
use_phantomjs
phantomjs_load_flag
use_phantomjs indicates whether PhantomJS is used to crawl the website; if it is, phantomjs_load_flag (an element on the page, of type 'str') must also be defined as the marker that the PhantomJS page has finished loading.
filters
An iterable collection of filter fields, used to filter proxies.
filters_xpath
The xpath rules for crawling each filter field, matched one-to-one, in order, with the fields in filters.
The metaclass CrawlerRuleMeta manages the definition of the rule classes: for example, if use_phantomjs is defined as True, phantomjs_load_flag must also be defined, otherwise an exception is raised. This is not described in detail here.
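Purely as an illustration of that idea (the real CrawlerRuleMeta in the repository also does other bookkeeping, such as collecting the rule classes, and may differ in detail), such a check could be written as:

class CrawlerRuleMeta(type):
    # Illustrative sketch only: reject a rule class that sets use_phantomjs
    # without also defining phantomjs_load_flag.
    def __new__(mcs, name, bases, namespace):
        if namespace.get('use_phantomjs') and not namespace.get('phantomjs_load_flag'):
            raise ValueError('{0}: phantomjs_load_flag must be defined '
                             'when use_phantomjs is True'.format(name))
        return super().__new__(mcs, name, bases, namespace)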
Rules are currently implemented for the following proxy websites: Xibei proxy, QuickShield proxy, 360 proxy, 66 proxy, and secret proxy. Adding a rule class is also very simple: inherit from CrawlerRuleBase to define the new rule class YourRuleClass, put it in the proxypool/rules directory, and add from . import YourRuleClass to the __init__.py in that directory (so that CrawlerRuleBase.subclasses() can find all the rule classes), then restart the running proxy pool to apply the new rule.
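For example, a new rule class might look like the following sketch; the site, xpaths, and filter values are hypothetical placeholders, and the import path of CrawlerRuleBase should be checked against the repository layout:

# proxypool/rules/your_rule.py -- illustrative values only
from .rule_base import CrawlerRuleBase   # assumed module name; check the repo

class YourRuleClass(CrawlerRuleBase):
    rule_name = 'your_rule'                 # assumed attribute; the crawler logs rule.rule_name
    start_url = 'http://www.example-proxy-site.com/free/'
    page_count = 5
    urls_format = '{0}index_{1}.html'       # page n -> http://www.example-proxy-site.com/free/index_n.html
    ip_xpath = '//table[@id="ip_list"]//tr/td[1]'
    port_xpath = '//table[@id="ip_list"]//tr/td[2]'
    filters = ('high anonymity',)           # keep only proxies whose type field matches
    filters_xpath = ('//table[@id="ip_list"]//tr/td[3]',)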
2. Validation
Although there are many free proxies, not many of them are usable, so proxies need to be validated after crawling and only valid ones are put into the pool. Proxies also expire, so the proxies in the pool must be checked regularly and expired ones removed promptly.
This part is simple: aiohttp is used to access a website through the proxy, and if the request times out, the proxy is considered invalid.
async def validate(self, proxies):
    logger.debug('validator started')
    while 1:
        proxy = await proxies.get()
        async with aiohttp.ClientSession() as session:
            try:
                real_proxy = 'http://' + proxy
                async with session.get(self.validate_url, proxy=real_proxy, timeout=validate_timeout) as resp:
                    self._conn.put(proxy)
            except Exception as e:
                logger.error(e)
        proxies.task_done()
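The self._conn used above is the pool's redis connection wrapper, whose actual schema lives in the repository. Purely as a sketch, assuming proxies are kept in a single redis list, such a wrapper could look like this:

import redis

class RedisProxyStore:
    # Illustrative wrapper only; the project's real connection class may use a different schema.
    def __init__(self, host='127.0.0.1', port=6379, db=0, key='proxy_pool'):
        self._db = redis.StrictRedis(host=host, port=port, db=db)
        self._key = key

    def put(self, proxy):
        self._db.rpush(self._key, proxy)   # append a validated proxy

    def pop(self):
        return self._db.lpop(self._key)    # take one proxy out (bytes, or None if empty)

    def count(self):
        return self._db.llen(self._key)    # current pool size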
3. Server
A web server is implemented with aiohttp. After it is started, visiting http://host:port displays the home page:
Visit http://host:port/get to get one proxy from the pool, e.g. '127.0.0.1:8080';
Visit http://host:port/get/n to get n proxies from the pool, e.g. "['127.0.0.1:1080', '127.0.0.1:8080', '127.0.0.1:80']";
Visit http://host:port/count to get the pool's size, e.g. '42'.
Because the home page is a static html page, to avoid opening, reading, and closing the html file on every request, it is cached in redis, and the html file's modification time is used to decide whether the file has changed. If the modification time differs from the one stored with the redis cache, the html file has been modified, so it is read again and the cache is updated; otherwise the home page content is served from redis.
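A rough sketch of that caching logic, assuming the page and its modification time are stored under hypothetical redis keys ('main_page' and 'main_page_mtime'); the project's real handler may differ:

import os

def get_main_page(db, index_html_path):
    # db is a redis connection, index_html_path the path of the home page file (both assumed names).
    mtime = str(os.path.getmtime(index_html_path))
    cached_mtime = db.get('main_page_mtime')
    if cached_mtime is None or cached_mtime.decode('utf-8') != mtime:
        # not cached yet, or the file changed: re-read it and refresh the cache
        with open(index_html_path, 'rb') as f:
            page = f.read()
        db.set('main_page', page)
        db.set('main_page_mtime', mtime)
        return page
    return db.get('main_page')   # unchanged: serve the cached bytes directly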
The returned proxy is produced with aiohttp.web.Response(text=ip.decode('utf-8')).
Here text requires a str, while what comes from redis is bytes, so a conversion is needed. When multiple proxies are returned, the result can be converted to a list with eval.
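From the client side, fetching and converting the result could look like this; the host and port are placeholders, and ast.literal_eval is used here as a safer stand-in for eval:

import ast
import requests

# 127.0.0.1:8080 is a placeholder for wherever the proxy server is running
resp = requests.get('http://127.0.0.1:8080/get/3', timeout=5)
proxies = ast.literal_eval(resp.text)   # "['1.2.3.4:8080', ...]" -> list of str
for proxy in proxies:
    print(proxy)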
The home page is returned differently, via aiohttp.web.Response(body=main_page_cache, content_type='text/html').
Here body requires bytes, so the cache obtained from redis can be returned directly, but content_type='text/html' must be set;
otherwise the home page cannot be loaded in a browser and will be downloaded instead. Keep this in mind when running the sample code from the official documentation, which basically never sets content_type.
This part is not complicated; just pay attention to the points above. For the paths of the static resource files used by the home page, refer to the earlier blog post "add static resource path for aiohttp".
4. Running
The entire proxy pool is divided into three independent parts:
Proxypool
Checks the proxy pool's size regularly; when it falls below the lower limit, the proxy crawlers are started and the crawled proxies are validated. Proxies that pass validation are put into the pool, and the crawlers are stopped once the pool reaches the specified upper limit. A rough sketch of this control loop follows this list.
Proxyvalidator
It is used to regularly check the proxy in the proxy pool and remove the invalid proxy.
Proxyserver
Start the server.
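As a rough sketch of the ProxyPool control loop described above (the limits, the check interval, and the crawl-and-validate coroutine are all assumed names, not the project's actual run_proxypool.py):

import asyncio

LOWER_LIMIT = 50       # illustrative thresholds; the real project reads its own configuration
UPPER_LIMIT = 500
CHECK_INTERVAL = 60    # seconds between pool-size checks

async def maintain_pool(conn, crawl_and_validate):
    # conn.count() returns the pool size; crawl_and_validate is assumed to crawl,
    # validate and store proxies until the pool reaches the given target, then return.
    while True:
        if conn.count() < LOWER_LIMIT:
            await crawl_and_validate(target=UPPER_LIMIT)
        await asyncio.sleep(CHECK_INTERVAL)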
These three independent tasks run as three separate processes. On Linux, supervisord can be used to manage these processes. The following is an example supervisord configuration file:
; supervisord.conf
[unix_http_server]
file=/tmp/supervisor.sock

[inet_http_server]
port=127.0.0.1:9001

[supervisord]
logfile=/tmp/supervisord.log
logfile_maxbytes=5MB
logfile_backups=10
loglevel=debug
pidfile=/tmp/supervisord.pid
nodaemon=false
minfds=1024
minprocs=200

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///tmp/supervisor.sock

[program:proxyPool]
command=python /path/to/ProxyPool/run_proxypool.py
redirect_stderr=true
stdout_logfile=NONE

[program:proxyValidator]
command=python /path/to/ProxyPool/run_proxyvalidator.py
redirect_stderr=true
stdout_logfile=NONE

[program:proxyServer]
command=python /path/to/ProxyPool/run_proxyserver.py
autostart=false
redirect_stderr=true
stdout_logfile=NONE
Because the project has its own logging configured, supervisord does not need to capture stdout and stderr here. supervisord is started with supervisord -c supervisord.conf; proxyPool and proxyValidator then start automatically, while proxyServer needs to be started manually. You can visit http://127.0.0.1:9001 to manage the three processes through the web page:
The official supervisord documentation says that the current version (3.3.1) does not support Python 3, but I ran into no problems in practice, probably because I do not use any of supervisord's complex features; it simply serves as a tool for monitoring process status and starting/stopping processes.
The above describes how to implement an asynchronous proxy crawler and proxy pool in Python; I hope it serves as a useful reference.