Python tornado queue example: a concurrent web crawler (code sharing)


Queue

Tornado's tornado.queues module implements an asynchronous producer/consumer queue for coroutine-based applications. It is similar to the queue module the Python standard library provides for multi-threaded programs.
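
For illustration, here is a minimal sketch of my own (it is not part of the original article): a consumer coroutine that waits on an empty tornado.queues.Queue is resumed as soon as a producer puts an item.

from tornado import gen, ioloop, queues


@gen.coroutine
def main():
    q = queues.Queue()

    @gen.coroutine
    def consumer():
        item = yield q.get()  # pauses until the queue has an entry
        print('consumer got %r' % item)

    future = consumer()       # start the consumer; it waits on the empty queue
    yield q.put('hello')      # hands the item to the waiting consumer
    yield future              # wait for the consumer to finish


if __name__ == '__main__':
    ioloop.IOLoop.current().run_sync(main)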

A coroutine that executes yield queue.get() pauses until there is an entry in the queue. If the queue has a maximum size set, a coroutine that executes yield queue.put() pauses until there is a free slot in the queue.
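
A minimal sketch of my own showing the second case (assuming the same Tornado version as the crawler below): with maxsize=1, a second put pauses until a get frees a slot.

from tornado import gen, ioloop, queues


@gen.coroutine
def main():
    q = queues.Queue(maxsize=1)
    yield q.put('first')       # returns immediately; the queue has room

    @gen.coroutine
    def delayed_put():
        yield q.put('second')  # pauses here: the queue is already full
        print('second item accepted')

    future = delayed_put()
    print('queue full, put is waiting')
    item = yield q.get()       # frees a slot, resuming the pending put
    print('got %r' % item)
    yield future


if __name__ == '__main__':
    ioloop.IOLoop.current().run_sync(main)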

The queue maintains a reference count of unfinished tasks: each call to put increases the count by one, and each call to task_done decreases it by one.
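
As a sketch of my own (not from the article): join pauses until the reference count falls back to zero, which is how a program can wait for every queued task to be processed.

from tornado import gen, ioloop, queues


@gen.coroutine
def main():
    q = queues.Queue()
    for i in range(3):
        yield q.put(i)        # each put raises the unfinished-task count

    @gen.coroutine
    def worker():
        while True:
            item = yield q.get()
            print('processing %r' % item)
            q.task_done()     # each task_done lowers the count

    worker()                  # start a single worker
    yield q.join()            # resolves once the count drops back to zero
    print('all tasks done')


if __name__ == '__main__':
    ioloop.IOLoop.current().run_sync(main)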

The following is an example of a simple web crawler:

At first, the queue contains only a single seed URL. When a worker pulls a URL out of the queue, it parses the links contained in the corresponding page, puts them into the queue, and then calls task_done once to decrease the reference count.

Eventually, a worker will pull out a URL whose page links have all been processed already, and no URLs remain in the queue. Its call to task_done then reduces the reference count to zero.

At that point, the join operation that was waiting in the main coroutine is unblocked, and the main coroutine finishes.

This crawler uses HTMLParser to parse HTML pages.

import time
from datetime import timedelta

try:
    from HTMLParser import HTMLParser
    from urlparse import urljoin, urldefrag
except ImportError:
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag

from tornado import httpclient, gen, ioloop, queues

base_url = 'http://www.tornadoweb.org/en/stable/'
concurrency = 10


@gen.coroutine
def get_links_from_url(url):
    """Download the page `url` and parse it for links.

    Returned links have had the fragment after '#' removed, and have been
    made absolute so, e.g. the URL 'gen.html#tornado.gen.coroutine'
    becomes 'http://www.tornadoweb.org/en/stable/gen.html'.
    """
    try:
        response = yield httpclient.AsyncHTTPClient().fetch(url)
        print('fetched %s' % url)

        html = response.body if isinstance(response.body, str) \
            else response.body.decode()
        urls = [urljoin(url, remove_fragment(new_url))
                for new_url in get_links(html)]
    except Exception as e:
        print('Exception: %s %s' % (e, url))
        raise gen.Return([])

    raise gen.Return(urls)


# Extract the real URL from a URL that contains a fragment.
def remove_fragment(url):
    pure_url, frag = urldefrag(url)
    return pure_url


def get_links(html):
    class URLSeeker(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.urls = []

        # Extract the href attribute from all <a> tags.
        def handle_starttag(self, tag, attrs):
            href = dict(attrs).get('href')
            if href and tag == 'a':
                self.urls.append(href)

    url_seeker = URLSeeker()
    url_seeker.feed(html)
    return url_seeker.urls


@gen.coroutine
def main():
    q = queues.Queue()
    start = time.time()
    fetching, fetched = set(), set()

    @gen.coroutine
    def fetch_url():
        current_url = yield q.get()
        try:
            if current_url in fetching:
                return

            print('fetching %s' % current_url)
            fetching.add(current_url)
            urls = yield get_links_from_url(current_url)
            fetched.add(current_url)

            for new_url in urls:
                # Only follow links beneath the base URL.
                if new_url.startswith(base_url):
                    yield q.put(new_url)

        finally:
            q.task_done()

    @gen.coroutine
    def worker():
        while True:
            yield fetch_url()

    q.put(base_url)

    # Start workers, then wait for the work queue to be empty.
    for _ in range(concurrency):
        worker()
    yield q.join(timeout=timedelta(seconds=300))
    assert fetching == fetched
    print('Done in %d seconds, fetched %s URLs.' % (
        time.time() - start, len(fetched)))


if __name__ == '__main__':
    import logging
    logging.basicConfig()
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(main)

Summary

As noted above, the introduction and example come from the user's guide on the official Tornado website; I simply translated them and carried the code over. Time was a little short, so I have not installed Tornado or tested the code in this article, and there is no demonstration of the output. Please forgive me.

That is the whole of this Python tornado queue example: a concurrent web crawler. I hope it helps you. If you are interested, you can continue to browse other related topics on this site. If anything is lacking, please leave a message. Thank you for your support!
