Web scraping is a topic that comes up often in Python discussions. There are many ways to scrape data from the web, and no single one seems to be the best. There are very mature frameworks such as scrapy, lighter-weight libraries such as mechanize, and DIY solutions are also popular: you can use requests, beautifulsoup, or pyquery.
The reason for this diversity of tools is that "scraping" actually covers several different problems: you don't use the same tool to extract data from thousands of pages as you do to automate a web workflow (such as filling in a few forms and then retrieving some data). I like the DIY approach because of its flexibility, but it is not well suited to fetching large amounts of data: requests is synchronous, so a large number of requests means a long wait.
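To make the problem concrete, here is roughly what the synchronous DIY approach looks like (a sketch of mine, not code from the article): each request blocks until it finishes, so fetching many pages takes about the sum of all the response times.

import requests

urls = ['http://example.com/page/{}'.format(i) for i in range(1000)]

# Each call blocks until the response arrives, so the total time is
# roughly the sum of the individual request times.
for url in urls:
    body = requests.get(url).text
    print(len(body))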
In this article, I will show you an alternative to requests based on the new asynchronous library aiohttp. I used it to write some small scrapers that turned out to be really fast, and I will show you how.
Basic concepts of asyncio
asyncio is the asynchronous IO library introduced in Python 3.4; it can also be installed on Python 3.3 from PyPI. It is quite complex, and I will not go into too much detail. Instead, I will explain just what you need to know in order to write asynchronous code with it.
In short, there are two things you need to know about: coroutines and event loops. Coroutines are like functions, but they can be paused and resumed at specific points in the code. This lets us pause a coroutine while it waits for an IO operation, such as an HTTP request, and execute another one in the meantime. We use the yield from keyword to mark a point where we need the return value of a coroutine. The event loop is what schedules the execution of coroutines.
There is much more to asyncio, but that is all we need to know for now. It might still seem a bit abstract, so let's look at some code.
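As a minimal illustration (my own sketch, not an example from the article), here is a coroutine that just sleeps instead of doing real IO, and an event loop that runs it to completion:

import asyncio

@asyncio.coroutine
def slow_operation():
    # Pausing here lets the event loop run other coroutines in the meantime.
    yield from asyncio.sleep(1)
    return 'done'

loop = asyncio.get_event_loop()
result = loop.run_until_complete(slow_operation())
print(result)  # prints 'done' after about one second

Nothing asynchronous is gained with a single coroutine; the point is only to show the decorator, yield from, and the event loop working together before aiohttp enters the picture.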
Aiohttp
aiohttp is a library designed to work with asyncio, and its API looks a lot like that of requests. So far the documentation is not very complete, but there are some very useful examples. We will demonstrate its basic usage here.
First, we define a coroutine that fetches a page and prints it. We use asyncio.coroutine to decorate a function and turn it into a coroutine. aiohttp.request is a coroutine, and so is the response's read_and_close method, so we need yield from to call them. Apart from that, the following code looks quite intuitive:
import asyncio
import aiohttp

@asyncio.coroutine
def print_page(url):
    response = yield from aiohttp.request('GET', url)
    body = yield from response.read_and_close(decode=True)
    print(body)
As you can see, we can use yield from to call a coroutine from another coroutine. To call a coroutine from synchronous code, we need an event loop. We can get the standard one with asyncio.get_event_loop() and run the coroutine with its run_until_complete() method. So, to run the previous coroutine, we only need to do the following:
loop = asyncio.get_event_loop()
loop.run_until_complete(print_page('http://example.com'))
A useful function is asyncio.wait, which takes a list of coroutines and returns a single coroutine that wraps them all, so we can write this:
loop.run_until_complete(asyncio.wait([print_page('http://example.com/foo'),
                                      print_page('http://example.com/bar')]))
Another one is asyncio.as_completed, which takes a list of coroutines and returns an iterator that yields them in the order in which they complete, so that when you iterate over it, you get each result as soon as it is available.
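Here is a rough sketch of how that looks (my own example, in the style of the code above; fetch_page and print_lengths are hypothetical helpers, not part of the article):

@asyncio.coroutine
def fetch_page(url):
    response = yield from aiohttp.request('GET', url)
    return (yield from response.read_and_close(decode=True))

@asyncio.coroutine
def print_lengths(urls):
    # Results arrive in completion order, not in the order of the input list.
    for future in asyncio.as_completed([fetch_page(u) for u in urls]):
        page = yield from future
        print(len(page))

loop.run_until_complete(print_lengths(['http://example.com/foo',
                                       'http://example.com/bar']))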
Scraping
Now that we know how to make asynchronous HTTP requests, we can write a scraper. We just need some tool to read the HTML; I used beautifulsoup for that, but the same could be done with pyquery or lxml.
In this example, we will write a small scraper that fetches the torrent links of a few Linux distributions from The Pirate Bay. (Translator's note, from Wikipedia: The Pirate Bay, abbreviated TPB, is a website dedicated to storing, classifying, and searching BitTorrent seed files, and claims to be "the world's largest BitTorrent tracker". In addition to public-domain collections, it hosts many audio files, videos, applications, and video games whose copyrights belong to their authors, making it one of the most important sites for online sharing and downloading.)
First, a helper coroutine to perform GET requests:
@asyncio.coroutine
def get(*args, **kwargs):
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close(decode=True))
Now for the parsing part. This article is not about beautifulsoup, so I will keep it short and simple: we grab the first magnet link of the page.
import bs4

def first_magnet(page):
    soup = bs4.BeautifulSoup(page)
    a = soup.find('a', title='Download this torrent using magnet')
    return a['href']
In this coroutine, the url points to search results sorted by number of seeders, so the first result is the most seeded one:
@asyncio.coroutine
def print_magnet(query):
    url = 'http://thepiratebay.se/search/{}/0/7/0'.format(query)
    page = yield from get(url, compress=True)
    magnet = first_magnet(page)
    print('{}: {}'.format(query, magnet))
Finally, the following code ties everything together:
distros = ['archlinux', 'ubuntu', 'debian']
loop = asyncio.get_event_loop()
f = asyncio.wait([print_magnet(d) for d in distros])
loop.run_until_complete(f)