This article introduces a Python implementation of an asynchronous proxy crawler and proxy pool; it should serve as a useful reference.
Using Python's asyncio, we implement an asynchronous proxy pool that crawls free proxies from proxy sites according to predefined rules, validates them, and stores the valid ones in Redis. The pool is periodically topped up with new proxies, and the proxies already in the pool are re-validated regularly, with invalid ones removed. At the same time, a server is implemented with aiohttp so that other programs can obtain proxies from the pool by requesting the corresponding URLs.
Source
https://github.com/arrti/proxypool
Environment
Python 3.5+
Redis
PhantomJS (optional)
Supervisord (optional)
Because the code makes extensive use of asyncio's async and await syntax, which was introduced in Python 3.5, Python 3.5 or later is required; I used Python 3.6.
Dependencies
redis
aiohttp
bs4
lxml
requests
selenium
The selenium package is mainly used to drive PhantomJS.
Here's a description of the code.
1. Crawler part
Core code
async def start(self):
    for rule in self._rules:
        parser = asyncio.ensure_future(self._parse_page(rule)) # parse pages according to the rule to extract proxies
        logger.debug('{0} crawler started'.format(rule.rule_name))

        if not rule.use_phantomjs:
            await page_download(ProxyCrawler._url_generator(rule), self._pages, self._stop_flag) # crawl the proxy site's pages
        else:
            await page_download_phantomjs(ProxyCrawler._url_generator(rule), self._pages,
                                          rule.phantomjs_load_flag, self._stop_flag) # crawl with PhantomJS

        await self._pages.join()

        parser.cancel()

        logger.debug('{0} crawler finished'.format(rule.rule_name))
The core code above is essentially a producer-consumer model implemented with asyncio.Queue. Here is a simple implementation of that model:
import asyncio
from random import random

async def produce(queue, n):
    for x in range(1, n + 1):
        print('produce', x)
        await asyncio.sleep(random())
        await queue.put(x) # put an item into the queue

async def consume(queue):
    while 1:
        item = await queue.get() # wait to get an item from the queue
        print('consume', item)
        await asyncio.sleep(random())
        queue.task_done() # notify the queue that the current item has been processed

async def run(n):
    queue = asyncio.Queue()
    consumer = asyncio.ensure_future(consume(queue))
    await produce(queue, n) # wait for the producer to finish
    await queue.join() # block until all items in the queue have been processed
    consumer.cancel() # cancel the consumer task, otherwise it will block forever in get()

def aio_queue_run(n):
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(run(n)) # run the event loop until run(n) finishes
    finally:
        loop.close()

if __name__ == '__main__':
    aio_queue_run(5)
Running the above code, one possible output is as follows:
produce 1
produce 2
consume 1
produce 3
produce 4
consume 2
produce 5
consume 3
consume 4
consume 5
Crawling pages
async def page_download(urls, pages, flag):
    url_generator = urls
    async with aiohttp.ClientSession() as session:
        for url in url_generator:
            if flag.is_set():
                break

            await asyncio.sleep(uniform(delay - 0.5, delay + 1))
            logger.debug('crawling proxy web page {0}'.format(url))
            try:
                async with session.get(url, headers=headers, timeout=10) as response:
                    page = await response.text()
                    parsed = html.fromstring(decode_html(page)) # use bs4 to help lxml decode the page: http://lxml.de/elementsoup.html#Using only the encoding detection
                    await pages.put(parsed)
                    url_generator.send(parsed) # get the address of the next page from the current page
            except StopIteration:
                break
            except asyncio.TimeoutError:
                logger.error('crawling {0} timeout'.format(url))
                continue # TODO: use a proxy
            except Exception as e:
                logger.error(e)
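The decode_html helper used above is not shown in this excerpt; following the lxml documentation page linked in the comment, a minimal version that uses BeautifulSoup only for encoding detection could look like this (the real helper in the project may differ):

from bs4 import UnicodeDammit

def decode_html(html_string):
    # Let bs4 detect the encoding, then hand the decoded text to lxml for parsing.
    converted = UnicodeDammit(html_string)
    if not converted.unicode_markup:
        raise ValueError('Failed to detect the encoding of the page')
    return converted.unicode_markup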
The page crawling function is implemented with aiohttp, and most proxy sites can be crawled with the method above. For pages generated dynamically with JS, selenium can be used to drive PhantomJS. This project does not demand high crawling efficiency: proxy sites update at a limited frequency and frequent crawling is unnecessary, so using PhantomJS is perfectly acceptable.
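As a rough, hypothetical illustration of that approach (this is not the project's page_download_phantomjs; the waiting strategy and the treatment of the load flag as an XPath are assumptions), fetching a JS-rendered page with selenium-driven PhantomJS could look like:

from lxml import html
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def fetch_with_phantomjs(url, load_flag_xpath, timeout=10):
    driver = webdriver.PhantomJS()   # requires the phantomjs binary on PATH
    try:
        driver.get(url)
        # wait until the element marking "page fully loaded" shows up
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, load_flag_xpath)))
        return html.fromstring(driver.page_source)
    finally:
        driver.quit()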
Parsing proxies
The simplest way to parse proxies is with XPath. In the Chrome browser you can get the XPath of a selected page element directly from the right-click menu:
Installing the Chrome extension "XPath Helper" lets you run and debug XPath expressions directly on the page:
BeautifulSoup does not support XPath, so lxml is used to parse the page. The code is as follows:
async def _parse_proxy(self, rule, page):
    ips = page.xpath(rule.ip_xpath) # list of IP address elements parsed via XPath
    ports = page.xpath(rule.port_xpath) # list of port elements parsed via XPath

    if not ips or not ports:
        logger.warning('{2} crawler could not get ips(len={0}) or ports(len={1}), please check the xpaths or network'.format(
            len(ips), len(ports), rule.rule_name))
        return

    proxies = map(lambda x, y: '{0}:{1}'.format(x.text.strip(), y.text.strip()), ips, ports)

    if rule.filters: # filter proxies by the filter fields, e.g. "high anonymity", "transparent"
        filters = []
        for i, ft in enumerate(rule.filters_xpath):
            field = page.xpath(ft)
            if not field:
                logger.warning('{1} crawler could not get {0} field, please check the filter xpath'.format(
                    rule.filters[i], rule.rule_name))
                continue
            filters.append(map(lambda x: x.text.strip(), field))

        filters = zip(*filters)
        selector = map(lambda x: x == rule.filters, filters)
        proxies = compress(proxies, selector)

    for proxy in proxies:
        await self._proxies.put(proxy) # put the parsed proxy into the asyncio.Queue
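The filtering at the end, zip(*filters) followed by itertools.compress, is the least obvious part; this small self-contained example (with made-up field values) shows what it does:

from itertools import compress

proxies = ['1.1.1.1:80', '2.2.2.2:8080', '3.3.3.3:3128']
# one list of values per filter field, in the same order as rule.filters
anonymity = ['high anonymity', 'transparent', 'high anonymity']
protocol = ['HTTP', 'HTTP', 'HTTPS']

wanted = ('high anonymity', 'HTTP')   # plays the role of rule.filters
rows = zip(anonymity, protocol)       # one tuple of field values per proxy
selector = map(lambda row: row == wanted, rows)
print(list(compress(proxies, selector)))   # ['1.1.1.1:80']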
Crawler rules
The rules for crawling a site, parsing its proxies, filtering, and so on are defined in a rule class for each proxy site; a metaclass and a base class are used to manage these rule classes. The base class is defined as follows:
class CrawlerRuleBase(object, metaclass=CrawlerRuleMeta):

    start_url = None
    page_count = 0
    urls_format = None
    next_page_xpath = None
    next_page_host = ''

    use_phantomjs = False
    phantomjs_load_flag = None

    filters = ()

    ip_xpath = None
    port_xpath = None
    filters_xpath = ()
The meanings of each parameter are as follows:
start_url
Required
The start page of the crawler.
ip_xpath
Required
The XPath rule for extracting IP addresses.
port_xpath
Required
The XPath rule for extracting port numbers.
page_count
The number of pages to crawl.
urls_format
The format string for page URLs; urls_format.format(start_url, n) generates the URL of page n. This is a fairly common form of page URL.
next_page_xpath, next_page_host
The XPath rule extracts the URL of the next page (usually a relative path), which is combined with the host to get the next page's address: next_page_host + url.
use_phantomjs, phantomjs_load_flag
use_phantomjs indicates whether crawling this site requires PhantomJS. If so, phantomjs_load_flag (some element on the page, of type str) must also be defined as the flag that the page has finished loading in PhantomJS.
filters
A collection (any iterable) of filter fields, used to filter proxies.
filters_xpath
The XPath rules for extracting each filter field, corresponding one-to-one, in order, with the entries of filters.
The metaclass CrawlerRuleMeta manages the definitions of the rule classes. For example, if use_phantomjs=True is defined, then phantomjs_load_flag must also be defined, otherwise an exception is raised; the details are not repeated here.
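For illustration, a check like that could be enforced in a metaclass roughly as follows; this is only a sketch of the idea, not the project's actual CrawlerRuleMeta:

class CrawlerRuleMeta(type):
    def __new__(mcs, name, bases, namespace):
        # Reject a rule class that enables PhantomJS but gives no load flag.
        if namespace.get('use_phantomjs') and not namespace.get('phantomjs_load_flag'):
            raise ValueError('{0}: phantomjs_load_flag must be defined when use_phantomjs is True'.format(name))
        return super().__new__(mcs, name, bases, namespace)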
Rules have already been implemented for the Xici, Kuaidaili, 360, 66ip and mimiip proxy sites. Adding a new rule class is also simple: define YourRuleClass by inheriting from CrawlerRuleBase, put it in the proxypool/rules directory, and add from . import YourRuleClass in the __init__.py of that directory (so that all rule classes can be obtained via CrawlerRuleBase.__subclasses__()), then restart the running proxy pool to apply the new rules.
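As an illustration only, the site, XPaths and names below are invented and are not one of the project's real rules; a new rule class might look roughly like this:

# proxypool/rules/your_rule.py (hypothetical module)
from . import CrawlerRuleBase   # exact import path is an assumption

class YourRuleClass(CrawlerRuleBase):
    rule_name = 'example_site'            # assumed attribute; may be handled by the metaclass
    start_url = 'http://www.example-proxy-list.com/free/'
    page_count = 5
    urls_format = '{0}page/{1}'           # page n -> start_url + 'page/n'
    ip_xpath = '//table//tr/td[1]'
    port_xpath = '//table//tr/td[2]'
    filters = ('high anonymity', 'HTTP')  # keep only rows whose fields match these values
    filters_xpath = ('//table//tr/td[3]', '//table//tr/td[4]')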
2. Validation part
Although free proxies are plentiful, few of them are actually usable, so crawled proxies need to be validated and only valid ones put into the pool. Proxies also go stale, so the ones already in the pool must be re-validated periodically and invalid ones removed promptly.
This part is simple: use aiohttp to request a website through the proxy; if the request times out, the proxy is considered invalid.
async def validate(self, proxies):
    logger.debug('validator started')
    while 1:
        proxy = await proxies.get()
        async with aiohttp.ClientSession() as session:
            try:
                real_proxy = 'http://' + proxy
                async with session.get(self.validate_url, proxy=real_proxy, timeout=validate_timeout) as resp:
                    self._conn.put(proxy)
            except Exception as e:
                logger.error(e)
        proxies.task_done()
3. Server part
A web server is implemented with aiohttp. After it starts, visiting http://host:port displays the home page:
Visit http://host:port/get to get one proxy from the pool, e.g. '127.0.0.1:1080';
Visit http://host:port/get/n to get n proxies from the pool, e.g. "['127.0.0.1:1080', '127.0.0.1:443', '127.0.0.1:80']";
Visit http://host:port/count to get the number of proxies in the pool, e.g. '42'.
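For example, another program could fetch proxies from the pool with the requests library; the host and port below are placeholders for wherever the server is actually running:

import requests

base = 'http://127.0.0.1:8080'   # placeholder host:port of the proxy pool server

proxy = requests.get(base + '/get').text               # e.g. '127.0.0.1:1080'
proxies = eval(requests.get(base + '/get/10').text)    # the returned string can be eval'ed into a list
count = int(requests.get(base + '/count').text)
print(proxy, len(proxies), count)

# use the obtained proxy for a request
resp = requests.get('http://httpbin.org/ip', proxies={'http': 'http://' + proxy}, timeout=10)
print(resp.text)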
Because the home page is a static HTML page, it is cached in Redis to avoid the cost of opening, reading and closing the HTML file on every request. The HTML file's modification time is used to decide whether it has changed: if the modification time differs from the one stored alongside the Redis cache, the file is considered modified, so it is read again and the cache is updated; otherwise the home page content is served from Redis.
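A simplified sketch of that caching scheme, with made-up key names and file path (the project's actual code will differ):

import os
import redis

_conn = redis.StrictRedis()
INDEX_PATH = 'proxypool/index.html'   # placeholder path of the home page file

def get_main_page():
    mtime = str(os.path.getmtime(INDEX_PATH))
    cached_mtime = _conn.get('main_page_mtime')
    if cached_mtime is None or cached_mtime.decode('utf-8') != mtime:
        # File changed (or first request): read it again and refresh the cache.
        with open(INDEX_PATH, 'rb') as f:
            content = f.read()
        _conn.set('main_page_cache', content)
        _conn.set('main_page_mtime', mtime)
        return content
    return _conn.get('main_page_cache')   # bytes, usable directly as the response body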
Returning a proxy is implemented with aiohttp.web.Response(text=ip.decode('utf-8')). Here text requires the str type, so the bytes obtained from Redis have to be decoded first; a response containing multiple proxies can be converted to a list on the client side with eval.
Returning the home page is different: it uses aiohttp.web.Response(body=main_page_cache, content_type='text/html'). Here body requires the bytes type, so the cache fetched from Redis can be passed in directly. content_type='text/html' is necessary, otherwise the home page cannot be loaded in the browser and will be downloaded instead. Keep this in mind when running the sample code in the official documentation, which mostly does not set content_type.
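Putting these pieces together, the handlers and routes might look roughly like the following; the route setup, Redis keys and entry point are assumptions rather than the project's exact code:

import redis
from aiohttp import web

_conn = redis.StrictRedis()

async def get_proxy(request):
    ip = _conn.lpop('proxies') or b''   # assumes the pool is stored as a Redis list named 'proxies'
    return web.Response(text=ip.decode('utf-8'))

async def index(request):
    # bytes cached in Redis (see the caching sketch above); fall back to an empty page
    main_page_cache = _conn.get('main_page_cache') or b'<html></html>'
    return web.Response(body=main_page_cache, content_type='text/html')

app = web.Application()
app.router.add_get('/', index)
app.router.add_get('/get', get_proxy)

if __name__ == '__main__':
    web.run_app(app, host='127.0.0.1', port=8080)   # placeholder host and port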
This part is not complicated; just pay attention to the points mentioned above. For the static resource file paths used by the home page, you can refer to the earlier post "Adding a static resource path in aiohttp".
4. Running
The functionality of the whole proxy pool is divided into 3 independent parts:
proxypool
Periodically checks the number of proxies in the pool; if it falls below the lower limit, the proxy crawler and validator are started, and proxies that pass validation are added to the pool. Once the specified number is reached, the crawler is stopped.
proxyvalidator
Periodically validates the proxies in the pool and removes invalid ones.
proxyserver
Starts the server.
These 3 independent tasks run as 3 separate processes. On Linux, supervisord can be used to manage them; here is an example of a supervisord configuration file:
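The following is only an illustrative configuration; the program names, commands and paths are placeholders and not the project's actual supervisord.conf:

[supervisord]
logfile = /tmp/supervisord.log
pidfile = /tmp/supervisord.pid

; web management page mentioned below
[inet_http_server]
port = 127.0.0.1:9001

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl = http://127.0.0.1:9001

; the project writes its own logs, so stdout/stderr are not captured here
[program:proxypool]
command = python3 run_proxypool.py
autostart = true
autorestart = true

[program:proxyvalidator]
; started manually from the web page (see below)
command = python3 run_proxyvalidator.py
autostart = false
autorestart = true

[program:proxyserver]
command = python3 run_proxyserver.py
autostart = true
autorestart = true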
Because the project already has its own logging configured, there is no need to have supervisord capture stdout and stderr. After starting supervisord with supervisord -c supervisord.conf, proxypool and proxyserver start automatically, while proxyvalidator needs to be started manually. Visiting http://127.0.0.1:9001 lets you manage these 3 processes through a web page:
Supervisord's official documentation says that the current version (3.3.1) does not support Python 3, but I did not run into any problems while using it, probably because I did not use supervisord's more complex features and only used it as a simple tool for monitoring process status and starting and stopping processes.