Fast web scraping in Python 3 with asyncio and aiohttp

Web scraping is a topic that comes up regularly in Python discussions. There are many ways to scrape data from the web, but there doesn't seem to be one best way. There are fully mature frameworks such as scrapy, and lighter-weight libraries such as mechanize. DIY solutions are also popular: you can use requests, beautifulsoup, or pyquery.

The reason for this diversity is that "scraping" actually covers several different problems: you don't need the same tool to extract data from thousands of pages as you do to automate a web workflow (for example, filling in a few forms and then retrieving some data). I like the DIY approach for its flexibility, but it is not well suited to fetching large amounts of data: requests works synchronously, so a large number of requests means a long wait.
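
For comparison, here is a minimal synchronous sketch (not part of the original article) using requests: each request blocks until the previous one has finished, so the total time is roughly the sum of all response times.

import requests

def print_pages(urls):
    # Fetch and print each page one after another; every call blocks.
    for url in urls:
        response = requests.get(url)
        print(response.text)

print_pages(['http://example.com/foo', 'http://example.com/bar'])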

In this article, I will show you an alternative to requests based on the new asynchronous library aiohttp. I used it to write some small scrapers that are really fast, and I will show you how.

Basic concepts of asyncio
Asyncio is the asynchronous IO library introduced in Python 3.4 (it can also be installed from PyPI for Python 3.3). It is quite complex, and I won't go into too much detail here. Instead, I will explain just what you need to know to write asynchronous code with it.

In short, there are two things you need to know about: coroutines and event loops. Coroutines are like functions, except they can be suspended and resumed at specific points in the code. This is used to pause a coroutine while it waits for IO (an HTTP request, for example) and execute another one in the meantime. We use the keyword yield from to mark a point where we need the result of another coroutine. The event loop is what schedules the execution of the coroutines.
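
As a minimal sketch of these two concepts (using asyncio.sleep in place of real IO; this example is not from the original article), a coroutine and the event loop that drives it look like this:

import asyncio

@asyncio.coroutine
def greet(name):
    # The coroutine is suspended here; the loop could run other coroutines meanwhile.
    yield from asyncio.sleep(1)
    print('Hello, {}'.format(name))

loop = asyncio.get_event_loop()
loop.run_until_complete(greet('asyncio'))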

There is a lot more to asyncio, but that is all we need to know for now. It might still be a little unclear, so let's look at some code.

Aiohttp
Aiohttp is a library built on top of asyncio. Its API looks a lot like that of requests. The documentation is not complete yet, but there are some very useful examples. We will demonstrate its basic usage here.

First, we define a coroutine that gets a page and prints it. We use asyncio.coroutine to decorate a function into a coroutine. aiohttp.request is a coroutine, and so is the response's read_and_close method, so we need to call both with yield from. Apart from that, the following code looks quite intuitive:
 

import asyncio
import aiohttp

@asyncio.coroutine
def print_page(url):
    response = yield from aiohttp.request('GET', url)
    body = yield from response.read_and_close(decode=True)
    print(body)

As you can see, we can call a coroutine from another coroutine with yield from. To call a coroutine from synchronous code, we need an event loop. We can get the standard one with asyncio.get_event_loop() and run a coroutine on it with its run_until_complete() method. So, to run the previous coroutine, we only need to do the following:
 

loop = asyncio.get_event_loop()
loop.run_until_complete(print_page('http://example.com'))

A useful function is asyncio.wait, which takes a list of coroutines and returns a single coroutine that wraps them all, so we can write this:
 

loop.run_until_complete(asyncio.wait([print_page('http://example.com/foo'),
                                      print_page('http://example.com/bar')]))

Another one is asyncio.as_completed, which takes a list of coroutines and returns an iterator that yields them in the order in which they complete, so that when you iterate over it, you get each result as soon as it is available.
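
As a rough sketch (reusing the print_page coroutine above, with placeholder URLs), it could be used like this:

@asyncio.coroutine
def print_pages_in_completion_order(urls):
    # Each page is handled as soon as its request finishes, regardless of list order.
    for future in asyncio.as_completed([print_page(url) for url in urls]):
        yield from future

loop.run_until_complete(print_pages_in_completion_order(
    ['http://example.com/foo', 'http://example.com/bar']))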

Scraping
Now that we know how to make asynchronous HTTP requests, we can write a scraper. We only need some tool to read the HTML of the pages. I used beautifulsoup for that; the rest could also be done with pyquery or lxml.

In this example, we will write a small scraper that gets the torrent links of a few Linux distributions from The Pirate Bay. (Translator's note, from Wikipedia: The Pirate Bay is a website dedicated to storing, classifying and searching BitTorrent seed files, and claims to be "the world's largest BitTorrent tracker". Besides freely licensed content, the torrents it hosts also include audio, video, application software and games still under copyright; it is one of the major websites for online sharing and downloading.)

First, we need a helper coroutine to perform GET requests:
 

@asyncio.coroutine
def get(*args, **kwargs):
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close(decode=True))

Now for the parsing part. This article is not about beautifulsoup, so I will keep it short: we just get the first magnet link of the page.
 

import bs4

def first_magnet(page):
    soup = bs4.BeautifulSoup(page)
    a = soup.find('a', title='Download this torrent using magnet')
    return a['href']

With these pieces, we can write the following coroutine. The URL returns results sorted by the number of seeders, so the first result is actually the most seeded one:
 

@asyncio.coroutine
def print_magnet(query):
    url = 'http://thepiratebay.se/search/{}/0/7/0'.format(query)
    page = yield from get(url, compress=True)
    magnet = first_magnet(page)
    print('{}: {}'.format(query, magnet))

Finally, the following code runs everything above:
 

distros = ['archlinux', 'ubuntu', 'debian']
loop = asyncio.get_event_loop()
f = asyncio.wait([print_magnet(d) for d in distros])
loop.run_until_complete(f)

Conclusion
And there we are. You have a small asynchronous scraper. That means multiple pages are downloaded at the same time, so this example runs three times faster than the equivalent code using requests. You should now be able to write your own scrapers in the same way.

You can find the resulting code here, including the extra suggestions below.

Once you are comfortable with all this, I recommend taking a look at the asyncio documentation and the aiohttp examples, which will show you what asyncio is capable of.

One limitation of this approach (in fact, of any hand-rolled approach) is that there is no standalone library for handling forms. Libraries like mechanize ship with many helpers that make it easy to submit forms, but if you don't use them you will have to handle forms yourself. That can lead to bugs, so I may end up writing such a library someday (but there is no need to worry about it so far).

Extra suggestion: Don't hammer the server
Making three requests at the same time is cool; making 5,000 at the same time is less fun. If you try to make too many requests at once, connections may start to break, and you may even get banned from the site.

To avoid this, you can use a semaphore. It is a synchronization tool that can be used to limit the number of coroutines that do something at the same time. We just need to create the semaphore before creating the loop, passing as an argument the number of simultaneous requests we want to allow:
 

sem = asyncio.Semaphore(5)

Then, we just need to replace
 

page = yield from get(url, compress=True)

with the same thing, protected by the semaphore:
 

with (yield from sem):
    page = yield from get(url, compress=True)

This ensures that up to five requests are processed at the same time.
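
Putting it together, a bounded version of the helper could look like this sketch (bounded_get is just an illustrative name, not part of the original code):

sem = asyncio.Semaphore(5)

@asyncio.coroutine
def bounded_get(*args, **kwargs):
    # At most 5 coroutines can hold the semaphore, so at most 5 requests run at once.
    with (yield from sem):
        return (yield from get(*args, **kwargs))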

Extra suggestion: Progress bar
This one is just a bonus: tqdm is an excellent library for making progress bars. The following coroutine works like asyncio.wait, but displays a progress bar showing how far along we are:
 

import tqdm

@asyncio.coroutine
def wait_with_progress(coros):
    for f in tqdm.tqdm(asyncio.as_completed(coros), total=len(coros)):
        yield from f
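
It could then replace asyncio.wait in the final step of the scraper, for example:

loop.run_until_complete(wait_with_progress([print_magnet(d) for d in distros]))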
