Web scraping is a topic that comes up often in Python discussions. There are many ways to scrape web data, and there doesn't seem to be one best way. There are mature frameworks such as scrapy, and lighter-weight libraries such as mechanize. DIY solutions are also popular: you can go a long way with requests plus beautifulsoup or pyquery.
The reason for this diversity is that "scraping" actually covers many different problems: you don't need the same tools to extract data from thousands of pages as you do to automate a web workflow (such as filling out forms and retrieving data). I like the DIY approach because it is flexible, but it is not well suited to large-scale data extraction, because requests makes its calls synchronously: many requests mean a long wait.
In this article, I'll show you an alternative to requests built on the new asynchronous library asyncio: aiohttp. I used it to write some small scrapers that are really fast, and I'll show you how.
The basic concept of Asyncio
asyncio is the asynchronous IO library that was introduced in Python 3.4. It is also available from PyPI for Python 3.3. It is quite complex, and I won't go into much detail. Instead, I'll explain what you need to know to write asynchronous code with it.
In short, there are two things you need to know about: coroutines and the event loop. Coroutines are like functions, but they can be paused and resumed at specific points in the code. This is used to pause a coroutine while it waits for IO (an HTTP request, for example) and run another one in the meantime. We use the keyword yield from to mark a point where we want the return value of another coroutine. The event loop schedules the execution of the coroutines.
There is a lot more to asyncio, but that is all we need to know for now. If this is still unclear, let's look at some code.
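To make the pause-and-resume behaviour concrete, here is a minimal sketch that needs only the standard library. It assumes a recent Python, where coroutines are written with async def and await rather than the decorator and yield from used in this article (asyncio.coroutine was removed in Python 3.11). The worker function and the delays are made up for illustration, and asyncio.sleep stands in for a real network wait.

```python
import asyncio

events = []  # records the order in which things happen


async def worker(name, delay):
    events.append(name + " start")
    # The coroutine is paused here; the event loop runs other coroutines meanwhile.
    await asyncio.sleep(delay)
    events.append(name + " done")
    return name


async def main():
    # Run both workers concurrently; results come back in argument order.
    return await asyncio.gather(worker("slow", 0.2), worker("fast", 0.05))


results = asyncio.run(main())
print(events)
print(results)
```

Both workers start before either finishes, which is the whole point: the waiting time of one overlaps with the work of the other.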
Aiohttp
aiohttp is a library designed to work with asyncio, and its API looks much like that of requests. So far its documentation is not great, but there are some very useful examples. I'll first show its basic usage.
First, we'll define a coroutine that fetches a page and prints it. We use asyncio.coroutine to decorate a function and turn it into a coroutine. aiohttp.request is a coroutine, and so is the read_and_close method, so we need to use yield from to call them. Apart from that, the code looks pretty straightforward:
    import asyncio
    import aiohttp

    @asyncio.coroutine
    def print_page(url):
        # Both calls are coroutines, so they are invoked with yield from.
        response = yield from aiohttp.request('GET', url)
        body = yield from response.read_and_close(decode=True)
        print(body)
As you can see, we can use yield from to call a coroutine from another coroutine. To call a coroutine from synchronous code, we need an event loop. We can get the standard one with asyncio.get_event_loop() and run the coroutine with its run_until_complete() method. So, to run the previous coroutine, we just need to do the following:
    loop = asyncio.get_event_loop()
    loop.run_until_complete(print_page('http://example.com'))
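For readers on current releases, the same idea might look like the sketch below. It assumes aiohttp 3.x, where requests go through a ClientSession used as an async context manager and the body is read with response.text(); as far as I know, read_and_close is not present in current releases. The name print_page_modern is hypothetical, and aiohttp is imported inside the function only so that the snippet loads even where the library is not installed.

```python
import asyncio


async def print_page_modern(url):
    # Assumption: aiohttp 3.x is installed. The import is done lazily so that
    # merely defining this function does not require the library.
    import aiohttp

    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            body = await response.text()
    print(body)


# To run it (this performs a real HTTP request):
# asyncio.run(print_page_modern('http://example.com'))
```

asyncio.run creates the event loop, runs the coroutine to completion and closes the loop, so it plays the role of get_event_loop() plus run_until_complete().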
A useful function is asyncio.wait, which takes a list of coroutines and returns a single coroutine that wraps them all, so we can write:
    loop.run_until_complete(asyncio.wait([print_page('http://example.com/foo'),
                                          print_page('http://example.com/bar')]))
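One caveat for newer interpreters: since Python 3.11, asyncio.wait no longer accepts bare coroutine objects, so they have to be wrapped in tasks first. The sketch below shows that form under that assumption; fake_fetch is a hypothetical stand-in that sleeps instead of making an HTTP request.

```python
import asyncio


async def fake_fetch(path, delay):
    # Stand-in for an HTTP request: just wait, then return a fake body.
    await asyncio.sleep(delay)
    return "body of " + path


async def main():
    # Tasks must be created while the event loop is running.
    tasks = [
        asyncio.create_task(fake_fetch("/foo", 0.1)),
        asyncio.create_task(fake_fetch("/bar", 0.05)),
    ]
    done, pending = await asyncio.wait(tasks)
    # `done` is a set, so sort the results to get a stable order.
    return sorted(task.result() for task in done), len(pending)


bodies, n_pending = asyncio.run(main())
print(bodies, n_pending)
```

By default asyncio.wait returns once every task has finished, so the pending set is empty here.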
Another one is asyncio.as_completed, which takes a list of coroutines and returns an iterator that yields them in completion order, so that when you iterate over it, you get each result as soon as it is available.
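A small sketch of as_completed, again with async/await syntax and a hypothetical fake_fetch that sleeps instead of touching the network. The delays are chosen so that the second coroutine finishes first.

```python
import asyncio


async def fake_fetch(name, delay):
    # Stand-in for a request that takes `delay` seconds.
    await asyncio.sleep(delay)
    return name


async def main():
    coros = [fake_fetch("slow", 0.2), fake_fetch("fast", 0.05)]
    order = []
    # Each item yielded by as_completed is awaited to get the next finished result.
    for future in asyncio.as_completed(coros):
        order.append(await future)
    return order


completion_order = asyncio.run(main())
print(completion_order)
```

The results arrive in completion order, not in the order the coroutines were listed.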
Data fetching
Now that we know how to make asynchronous HTTP requests, we can write a scraper. All that is missing is something to read the HTML pages. I use BeautifulSoup for this, but others such as pyquery or lxml would do just as well.
In this example, we will write a small scraper to get the torrent links for a few Linux distributions from The Pirate Bay. (Translator's note, from Wikipedia: The Pirate Bay, abbreviated TPB, is a website dedicated to storing, classifying and searching BitTorrent torrent files. It claims to be "the world's largest BitTorrent tracker". Besides freely licensed material, it indexes torrents for audio, video, application software and video games whose authors claim copyright, and it is one of the major file-sharing and download sites on the web.)
First, a helper coroutine to perform GET requests:
    @asyncio.coroutine
    def get(*args, **kwargs):
        response = yield from aiohttp.request('GET', *args, **kwargs)
        return (yield from response.read_and_close(decode=True))
Now the parsing part. This article is not about BeautifulSoup, so I'll keep this part short: we get the first magnet link on the page.
    import bs4

    def first_magnet(page):
        soup = bs4.BeautifulSoup(page)
        # Find the first anchor whose title marks it as a magnet link.
        a = soup.find('a', title='Download this torrent using magnet')
        return a['href']
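To see what this lookup does without installing BeautifulSoup, here is a rough equivalent built on the standard library's html.parser. This is a swapped-in technique, not the article's method. It assumes the magnet anchor carries the title attribute shown above; FirstMagnetParser, first_magnet_stdlib and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

MAGNET_TITLE = "Download this torrent using magnet"


class FirstMagnetParser(HTMLParser):
    """Remembers the href of the first <a> whose title matches MAGNET_TITLE."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if self.href is not None or tag != "a":
            return  # already found, or not an anchor
        attributes = dict(attrs)
        if attributes.get("title") == MAGNET_TITLE:
            self.href = attributes.get("href")


def first_magnet_stdlib(page):
    parser = FirstMagnetParser()
    parser.feed(page)
    return parser.href  # None when no matching anchor exists


# Hypothetical page fragment with two results.
sample_page = (
    '<div><a href="/torrent/1">details</a>'
    '<a href="magnet:?xt=urn:btih:aaa" title="Download this torrent using magnet">m</a>'
    '<a href="magnet:?xt=urn:btih:bbb" title="Download this torrent using magnet">m</a></div>'
)
first_link = first_magnet_stdlib(sample_page)
print(first_link)
```

Unlike the BeautifulSoup version, which raises an error when no anchor is found, this sketch returns None in that case.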
Now the coroutine itself. With this URL, results are sorted by number of seeders, so the first result is actually the most seeded one:
    @asyncio.coroutine
    def print_magnet(query):
        url = 'http://thepiratebay.se/search/{}/0/7/0'.format(query)
        page = yield from get(url, compress=True)
        magnet = first_magnet(page)
        print('{}: {}'.format(query, magnet))
Finally, use the following code to call all of the above:
    distros = ['ArchLinux', 'Debian']
    loop = asyncio.get_event_loop()
    f = asyncio.wait([print_magnet(d) for d in distros])
    loop.run_until_complete(f)
Conclusion
And there you go. You have a small scraper that works asynchronously. This means that multiple pages are downloaded at the same time, so this example is 3 times faster than the same code written with requests. You should now be able to write your own scrapers in the same way.
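The speed-up comes from overlapping the waiting time of the requests. The following sketch illustrates this with simulated latency only; it is not a benchmark of aiohttp against requests. The fake_get function, the 0.1 second latency and the query names are assumptions chosen for illustration.

```python
import asyncio
import time


async def fake_get(query, latency=0.1):
    # Stand-in for an HTTP request that takes `latency` seconds.
    await asyncio.sleep(latency)
    return "{}: done".format(query)


async def sequential(queries):
    # One request at a time: total time is roughly the sum of the latencies.
    return [await fake_get(q) for q in queries]


async def concurrent(queries):
    # All requests in flight together: total time is roughly the longest latency.
    return await asyncio.gather(*(fake_get(q) for q in queries))


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


queries = ["query-1", "query-2", "query-3"]
seq_result, seq_time = timed(sequential(queries))
con_result, con_time = timed(concurrent(queries))
print("sequential: {:.2f}s, concurrent: {:.2f}s".format(seq_time, con_time))
```

With three simulated requests of equal latency, the concurrent run should take about a third of the sequential time, which matches the intuition behind the figure quoted above.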
You can find the resulting code here, including the additional advice below.
Once you are comfortable with all this, I suggest you take a look at asyncio's documentation and the aiohttp examples, which will show you the full potential of asyncio.
One limitation of this approach (in fact, of any DIY approach) is that there doesn't seem to be a standalone library for handling forms. mechanize has many helpers that make submitting forms easy, but if you don't use it, you have to handle forms yourself. This can lead to bugs, so I might write such a library someday (but don't count on it for now).
Additional advice: don't hammer the server
Making 3 requests at the same time is cool, but making 5,000 is less fun. If you try to make too many requests at once, connections may start to fail, and you may even get banned from the website.
To avoid this, you can use a semaphore. It is a synchronization tool that can be used to limit the number of coroutines that do something at the same time. We just need to create the semaphore before running the loop, passing the number of simultaneous requests we want to allow as its argument:
    sem = asyncio.Semaphore(5)
Then we just need to replace the following line:
    page = yield from get(url, compress=True)
with the same call, protected by the semaphore:
    with (yield from sem):
        page = yield from get(url, compress=True)
This guarantees that at most 5 requests are in flight at any one time.
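The limit is easy to check with a simulation. The sketch below assumes a recent Python, where the semaphore is entered with async with (the with (yield from sem) form was removed in Python 3.9). limited_fetch, the counters and the request count of 20 are hypothetical, and asyncio.sleep again stands in for the request.

```python
import asyncio


async def limited_fetch(sem, stats, i):
    async with sem:  # waits here while 5 other coroutines hold the semaphore
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        stats["active"] -= 1
    return i


async def main(n_requests=20, limit=5):
    # Created inside the running loop, which is the safe choice with asyncio.run.
    sem = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(limited_fetch(sem, stats, i) for i in range(n_requests))
    )
    return results, stats["peak"]


results, peak = asyncio.run(main())
print("peak concurrency:", peak)
```

All 20 coroutines are started at once, but the recorded peak never exceeds the semaphore's limit.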
Additional Advice: progress bar
This one is a freebie: tqdm is a great library for generating progress bars. The coroutine below works like asyncio.wait, but displays a progress bar showing how much has completed.
    import tqdm

    @asyncio.coroutine
    def wait_with_progress(coros):
        # as_completed yields in completion order, so the bar advances as each one finishes.
        for f in tqdm.tqdm(asyncio.as_completed(coros), total=len(coros)):
            yield from f
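The same pattern works without tqdm if a plain counter is enough. The sketch below is a standard-library variant, not the article's code: wait_with_counter, the report callback and fake_job are hypothetical names, and the syntax assumes a recent Python with async/await.

```python
import asyncio


async def wait_with_counter(coros, report):
    coros = list(coros)
    total = len(coros)
    results = []
    # as_completed yields in completion order, so the counter advances
    # each time any one of the coroutines finishes.
    for done_count, future in enumerate(asyncio.as_completed(coros), start=1):
        results.append(await future)
        report(done_count, total)
    return results


async def fake_job(delay, name):
    await asyncio.sleep(delay)  # stand-in for a page download
    return name


progress = []
results = asyncio.run(wait_with_counter(
    [fake_job(0.3, "c"), fake_job(0.05, "a"), fake_job(0.15, "b")],
    lambda done, total: progress.append((done, total)),
))
print(progress)
```

Replacing the callback with a call that prints or redraws a line gives a minimal progress display; tqdm does the same job with a much nicer rendering.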
Python: Fast crawl with Asyncio