A tutorial on fast web scraping with the asyncio library in Python 3

Web scraping is a topic that comes up often in Python discussions. There are many ways to scrape data from the web, and there doesn't seem to be one best way to do it. There are some very mature frameworks such as scrapy, as well as more lightweight libraries like mechanize. DIY solutions are also popular: you can build one with requests, beautifulsoup, or pyquery.

The reason for this diversity is that "scraping" actually covers a range of problems: you don't need the same tool to extract data from thousands of pages as you do to automate a web workflow (such as filling in a few forms and retrieving the data). I like the DIY approach because it is flexible, but it is not well suited to scraping large amounts of data, because requests is synchronous: making many requests means waiting a long time.

In this article, I'll show you an alternative to requests based on a new asynchronous library, aiohttp. I've used it to write small scrapers that are really fast, and I'll show you how.

The basic concepts of asyncio
asyncio is an asynchronous I/O library introduced in Python 3.4. You can also install it on Python 3.3 from PyPI. It is quite complex, and I won't go into much detail here; instead, I'll explain what you need to know in order to write asynchronous code with it.

In short, there are two things you need to know about: coroutines and the event loop. Coroutines are like functions, but they can be paused and resumed at specific points in the code. This lets us pause a coroutine while it waits on I/O (such as an HTTP request) and execute another one in the meantime. We use the keyword yield from to mark a point where we want the result of a coroutine. The event loop is what schedules the execution of the coroutines.

There is a lot more to asyncio, but that's all we need to know for now. It may still seem a bit abstract, so let's look at some code.
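
Before diving into aiohttp, here is a minimal sketch (my own illustration, not from the original article) of a coroutine and the event loop, using asyncio.sleep to stand in for waiting on I/O:

import asyncio

@asyncio.coroutine
def greet(name):
  # yield from suspends this coroutine until asyncio.sleep finishes;
  # the event loop is free to run other coroutines in the meantime.
  yield from asyncio.sleep(1)
  print('hello', name)

loop = asyncio.get_event_loop()
loop.run_until_complete(greet('asyncio'))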

aiohttp
aiohttp is a library built on top of asyncio, and its API looks a lot like that of requests. Its documentation is not very thorough so far, but it comes with some very useful examples. Here we'll demonstrate its basic usage.

First, we'll define a coroutine that fetches a page and prints it out. We use asyncio.coroutine to decorate a function, turning it into a coroutine. aiohttp.request is a coroutine, and so is the read method of the response, so we need to use yield from to call them. Apart from that, the code below is fairly straightforward:

import asyncio
import aiohttp

@asyncio.coroutine
def print_page(url):
  response = yield from aiohttp.request('GET', url)
  body = yield from response.read_and_close(decode=True)
  print(body)

As you can see, we can use yield from to call a coroutine from another coroutine. To call a coroutine from synchronous code, we need an event loop. We can get the standard one with asyncio.get_event_loop() and run a coroutine on it with its run_until_complete() method. So, to run the previous coroutine, we just need to do the following:

loop = asyncio.get_event_loop()
loop.run_until_complete(print_page('http://example.com'))

A useful function is asyncio.wait, which takes a list of coroutines and returns a single coroutine that wraps them all, so we can write this:

loop.run_until_complete(asyncio.wait([print_page('http://example.com/foo'),
                                      print_page('http://example.com/bar')]))

Another one is asyncio.as_completed, which takes a list of coroutines and returns an iterator that yields them in the order in which they complete, so that when you iterate over it, you get each result as soon as it is available.
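
As an illustration (my own sketch, not from the original article), here is how asyncio.as_completed could be used with the print_page coroutine defined above to print pages in whatever order they finish downloading:

@asyncio.coroutine
def print_in_completion_order(urls):
  coros = [print_page(url) for url in urls]
  for f in asyncio.as_completed(coros):
    # f is whichever coroutine finishes next, so the fastest page comes first
    yield from f

loop.run_until_complete(print_in_completion_order(['http://example.com/foo',
                                                   'http://example.com/bar']))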

Scraping the data
Now that we know how to make asynchronous HTTP requests, we can write a scraper. We only need some tool to parse the HTML pages; I use BeautifulSoup here, but the same could be done with pyquery or lxml.

In this example, we'll write a small scraper to grab the torrent links for a few Linux distributions from The Pirate Bay (TPB), a website that stores, categorizes, and lets you search for BitTorrent seed files, and which describes itself as "the world's largest BitTorrent tracker".

First, we need an auxiliary coroutine to perform GET requests:

@asyncio.coroutine
def get(*args, **kwargs):
  response = yield from aiohttp.request('GET', *args, **kwargs)
  return (yield from response.read_and_close(decode=True))

Now for the parsing part. This article is not about BeautifulSoup, so I'll keep this short: we extract the first magnet link on the page.

import bs4

def first_magnet(page):
  soup = bs4.BeautifulSoup(page)
  a = soup.find('a', title='Download this torrent using magnet')
  return a['href']

In this coroutine, the results for the URL are sorted by number of seeders, so the first result is the one with the most seeders:

@asyncio.coroutine
def print_magnet(query):
  url = 'http://thepiratebay.se/search/{}/0/7/0'.format(query)
  page = yield from get(url, compress=True)
  magnet = first_magnet(page)
  print('{}: {}'.format(query, magnet))

Finally, the following code ties all of the above together:

distros = ['archlinux', 'ubuntu', 'debian']
loop = asyncio.get_event_loop()
f = asyncio.wait([print_magnet(d) for d in distros])
loop.run_until_complete(f)

Conclusion
And that's it. You now have a small scraper that works asynchronously. That means the pages are downloaded concurrently, so this example runs about three times faster than the same code using requests. You should now be able to write your own scrapers in the same way.

You can find the resulting code, including the extra suggestions below, here.

Once you are comfortable with all of this, I suggest you take a look at asyncio's documentation and the aiohttp examples, which will show you what asyncio is capable of.

One limitation of this approach (in fact, of any hand-rolled approach) is that there is no standalone library for handling forms. Mechanized tools provide plenty of helpers that make it easy to submit forms, but if you don't use them, you have to handle forms yourself. That can lead to bugs, so I may end up writing such a library (but there's no need to hold your breath).

Extra advice: don't hammer the server
Making 3 requests at the same time is cool; making 5,000 is less so. If you try to make too many requests at once, connections may start breaking, and you may even get banned from the website.

To avoid this, you can use a semaphore. It is a synchronization tool that limits the number of coroutines doing something at the same time. We just need to create the semaphore before creating the loop, passing it the number of simultaneous requests we want to allow:

sem = asyncio.Semaphore(5)

Then, we just replace the line

page = yield from get(url, compress=True)

with the same thing, protected by the semaphore:

with (yield from sem):
  page = yield from get(url, compress=True)

This ensures that at most 5 requests are in flight at the same time.
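
To make the placement concrete, here is a sketch (my own, reusing the get and first_magnet helpers defined earlier) of print_magnet with the semaphore-protected request in place:

sem = asyncio.Semaphore(5)  # created once, before the loop starts

@asyncio.coroutine
def print_magnet(query):
  url = 'http://thepiratebay.se/search/{}/0/7/0'.format(query)
  with (yield from sem):  # at most 5 coroutines pass this point at once
    page = yield from get(url, compress=True)
  magnet = first_magnet(page)
  print('{}: {}'.format(query, magnet))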

Extra advice: a progress bar
This one is just for fun: tqdm is an excellent library for displaying progress bars. The following coroutine works like asyncio.wait, but displays a progress bar showing how far along we are.

import tqdm

@asyncio.coroutine
def wait_with_progress(coros):
  for f in tqdm.tqdm(asyncio.as_completed(coros), total=len(coros)):
    yield from f
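
It can be used as a drop-in replacement for asyncio.wait in the earlier example (a usage sketch of my own, not from the original article):

f = wait_with_progress([print_magnet(d) for d in distros])
loop.run_until_complete(f)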
