Web scraping is a topic that comes up often in Python discussions. There are many ways to scrape data from the web, and no single one seems to be the best. There are very mature frameworks such as scrapy, lighter-weight libraries such as mechanize, and DIY solutions are also popular: you can use requests, beautifulsoup, or pyquery.
The reason for this diversity of tools is that "scraping" actually covers several different problems: you don't use the same tool to extract data from thousands of pages as you do to automate a web workflow (such as filling in a few forms and then retrieving some data). I like the DIY approach because of its flexibility, but it is not well suited to fetching large amounts of data: requests is synchronous, so a large number of requests means a long wait.
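To make the problem concrete, here is roughly what the synchronous DIY approach looks like (a sketch of mine, not code from the article): each request blocks until it finishes, so fetching many pages takes about the sum of all the response times.

import requests

urls = ['http://example.com/page/{}'.format(i) for i in range(1000)]

# Each call blocks until the response arrives, so the total time is
# roughly the sum of the individual request times.
for url in urls:
    body = requests.get(url).text
    print(len(body))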
In this article, I will show you an alternative to requests based on the new asynchronous library aiohttp. I used it to write some small scrapers that turned out to be really fast, and I will show you how.
Basic concepts of asyncio
asyncio is the asynchronous IO library introduced in Python 3.4; it can also be installed on Python 3.3 from PyPI. It is quite complex, and I will not go into too much detail. Instead, I will explain just what you need to know in order to write asynchronous code with it.
In short, there are two things you need to know about: coroutines and event loops. Coroutines are like functions, but they can be paused and resumed at specific points in the code. This lets us pause a coroutine while it waits for an IO operation, such as an HTTP request, and execute another one in the meantime. We use the yield from keyword to mark a point where we need the return value of a coroutine. The event loop is what schedules the execution of coroutines.
There is much more to asyncio, but that is all we need to know for now. It might still seem a bit abstract, so let's look at some code.
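As a minimal illustration (my own sketch, not an example from the article), here is a coroutine that just sleeps instead of doing real IO, and an event loop that runs it to completion:

import asyncio

@asyncio.coroutine
def slow_operation():
    # Pausing here lets the event loop run other coroutines in the meantime.
    yield from asyncio.sleep(1)
    return 'done'

loop = asyncio.get_event_loop()
result = loop.run_until_complete(slow_operation())
print(result)  # prints 'done' after about one second

Nothing asynchronous is gained with a single coroutine; the point is only to show the decorator, yield from, and the event loop working together before aiohttp enters the picture.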
Aiohttp
aiohttp is a library designed to work with asyncio, and its API looks a lot like that of requests. So far the documentation is not very complete, but there are some very useful examples. We will demonstrate its basic usage here.
First, we define a coroutine that fetches a page and prints it. We use asyncio.coroutine to decorate a function and turn it into a coroutine. aiohttp.request is a coroutine, and so is the response's read_and_close method, so we need yield from to call them. Apart from that, the following code looks quite intuitive:
import asyncio
import aiohttp

@asyncio.coroutine
def print_page(url):
    response = yield from aiohttp.request('GET', url)
    body = yield from response.read_and_close(decode=True)
    print(body)
As you can see, we can use yield from to call a coroutine from another coroutine. To call a coroutine from synchronous code, we need an event loop. We can get the standard one with asyncio.get_event_loop() and run the coroutine with its run_until_complete() method. So, to run the previous coroutine, we only need to do the following:
loop = asyncio.get_event_loop()
loop.run_until_complete(print_page('http://example.com'))
A useful function is asyncio.wait, which takes a list of coroutines and returns a single coroutine that wraps them all, so we can write this:
loop.run_until_complete(asyncio.wait([print_page('http://example.com/foo'),
                                      print_page('http://example.com/bar')]))
Another one is asyncio.as_completed, which takes a list of coroutines and returns an iterator that yields them in the order in which they complete, so that when you iterate over it, you get each result as soon as it is available.
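Here is a rough sketch of how that looks (my own example, in the style of the code above; fetch_page and print_lengths are hypothetical helpers, not part of the article):

@asyncio.coroutine
def fetch_page(url):
    response = yield from aiohttp.request('GET', url)
    return (yield from response.read_and_close(decode=True))

@asyncio.coroutine
def print_lengths(urls):
    # Results arrive in completion order, not in the order of the input list.
    for future in asyncio.as_completed([fetch_page(u) for u in urls]):
        page = yield from future
        print(len(page))

loop.run_until_complete(print_lengths(['http://example.com/foo',
                                       'http://example.com/bar']))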
Scraping
Now that we know how to make asynchronous HTTP requests, we can write a scraper. We just need some tool to read the HTML; I used beautifulsoup for that, but the same could be done with pyquery or lxml.
In this example, we will write a small scraper that fetches the torrent links of a few Linux distributions from The Pirate Bay. (Translator's note, from Wikipedia: The Pirate Bay, abbreviated TPB, is a website dedicated to storing, classifying, and searching BitTorrent seed files, and claims to be "the world's largest BitTorrent tracker". In addition to public-domain collections, it hosts many audio files, videos, applications, and video games whose copyrights belong to their authors, making it one of the most important sites for online sharing and downloading.)
First, a helper coroutine to perform GET requests:
@asyncio.coroutine
def get(*args, **kwargs):
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close(decode=True))
Now for the parsing part. This article is not about beautifulsoup, so I will keep it short and simple: we grab the first magnet link of the page.
import bs4

def first_magnet(page):
    soup = bs4.BeautifulSoup(page)
    a = soup.find('a', title='Download this torrent using magnet')
    return a['href']
In this coroutine, the url points to search results sorted by number of seeders, so the first result is the most seeded one:
@asyncio.coroutine
def print_magnet(query):
    url = 'http://thepiratebay.se/search/{}/0/7/0'.format(query)
    page = yield from get(url, compress=True)
    magnet = first_magnet(page)
    print('{}: {}'.format(query, magnet))
Finally, the following code ties everything together:
distros = ['archlinux', 'ubuntu', 'debian']
loop = asyncio.get_event_loop()
f = asyncio.wait([print_magnet(d) for d in distros])
loop.run_until_complete(f)