Python web scraping: a full walkthrough



In this article, I'll show you an alternative to requests based on the new asynchronous library aiohttp. I used it to write some small data scrapers that are really fast, and I'll show you how. The reason there is such a diversity of tools is that "scraping" actually covers a number of different problems: you don't need the same tool to extract data from thousands of pages as you do to automate some web workflow (such as filling in a form and retrieving some data).




The basic concepts of asyncio

asyncio is an asynchronous IO library introduced in Python 3.4. It can also be installed from PyPI for Python 3.3. It's quite complex, and I won't go into too much detail here. Instead, I'll explain just what you need to know to write asynchronous code with it.


Coroutines and event loops

Coroutines are like functions, but they can be suspended and resumed at specific points in the code. This can be used to pause a coroutine while it waits for IO (such as an HTTP request) and execute another request in the meantime. We use the keyword yield from to mark a point where we want the result of a coroutine. The event loop is what schedules the execution of coroutines.
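To make this concrete, here is a minimal sketch (not taken from the article; tick, name and delay are just illustrative names) of two coroutines that pause on asyncio.sleep while the event loop runs the other one. asyncio.wait, which runs them together, is introduced further below:

import asyncio

@asyncio.coroutine
def tick(name, delay):
    # Pausing here hands control back to the event loop,
    # which can run the other coroutine in the meantime.
    yield from asyncio.sleep(delay)
    print(name, 'done after', delay, 'seconds')

loop = asyncio.get_event_loop()
# Both coroutines run concurrently: total time is about 2 seconds, not 3.
loop.run_until_complete(asyncio.wait([tick('a', 1), tick('b', 2)]))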


There's a lot more to asyncio, but that's all we need to know for now. It may still seem a bit abstract, so let's look at some real code.


aiohttp is a library designed to work with asyncio, and its API looks a lot like that of requests. Its documentation is still a bit sparse at the moment. We use asyncio.coroutine to turn a function into a coroutine. aiohttp.request is a coroutine, and so is the read method of the response, so we need to use yield from to call them. Apart from that, the following code should look fairly straightforward:

@asyncio.coroutine
def print_page(url):
    response = yield from aiohttp.request('GET', url)
    body = yield from response.read_and_close(decode=True)
    print(body)


We can use yield from to call a coroutine from another coroutine. To call a coroutine from synchronous code, we need an event loop. We can get the standard one with asyncio.get_event_loop() and run a coroutine with its run_until_complete() method. So, to run the previous coroutine, we only have to do the following:

loop = asyncio.get_event_loop()
loop.run_until_complete(print_page('http://example.com'))

A useful function is asyncio.wait, which takes a list of coroutines and returns a single coroutine that wraps them all, so we can write:

loop.run_until_complete(asyncio.wait([print_page('http://example.com/foo'),
                                      print_page('http://example.com/bar')]))


Another one is asyncio.as_completed, which takes a list of coroutines and returns an iterator that yields them in the order they complete, so that when you iterate over it, you get each result as soon as it is available.
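As an illustrative sketch (not part of the original article; print_pages_as_completed is just a made-up name), here is how asyncio.as_completed could be used with the print_page coroutine defined above, handling each page as soon as its download finishes:

@asyncio.coroutine
def print_pages_as_completed(urls):
    coros = [print_page(url) for url in urls]
    # as_completed yields futures in the order they finish, not the order
    # they were submitted, so each page is printed as soon as it is ready.
    for f in asyncio.as_completed(coros):
        yield from f

loop.run_until_complete(print_pages_as_completed(['http://example.com/foo',
                                                  'http://example.com/bar']))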

Data fetching

Now that we know how to make asynchronous HTTP requests, we can write a scraper. The only other thing we need is something to read the HTML; I use BeautifulSoup for that, but alternatives such as pyquery or lxml would work just as well.


We'll write a small scraper to grab the magnet links of a few Linux distributions from The Pirate Bay. (The Pirate Bay, abbreviated TPB, is a website dedicated to storing, indexing and searching BitTorrent files, and claims to be "the world's largest BitTorrent tracker". Besides freely licensed content it also hosts many allegedly copyrighted audio, video, application and video game torrents, making it one of the major sites for sharing and downloading over the network. Translator's note, from Wikipedia.)

First, we need a helper coroutine to perform GET requests:

@asyncio.coroutine
def get(*args, **kwargs):
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close(decode=True))


Now the parsing part. This article isn't about BeautifulSoup, so I'll keep it short: we grab the first magnet link on the page.


def first_magnet(page):
    soup = bs4.BeautifulSoup(page)
    a = soup.find('a', title='Download this torrent using magnet')
    return a['href']

In this coroutine, the URL's results are sorted by the number of seeders, so the first result is actually the most seeded one:

@asyncio.coroutine
def print_magnet(query):
    url = 'http://thepiratebay.se/search/{}/0/7/0'.format(query)
    page = yield from get(url, compress=True)
    magnet = first_magnet(page)
    print('{}: {}'.format(query, magnet))


Finally, the following code ties all of the above together:

distros = ['archlinux', 'ubuntu', 'debian']
loop = asyncio.get_event_loop()
f = asyncio.wait([print_magnet(d) for d in distros])
loop.run_until_complete(f)

And that's it. You now have a small scraper that works asynchronously. That means the pages are downloaded in parallel, so this example runs about three times faster than the same code written with requests. You should now be able to write your own scrapers in the same way.
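For reference, here is the whole scraper assembled in one place, with the imports the snippets above assume (a sketch that keeps the old aiohttp 0.x API used in this article, i.e. aiohttp.request and read_and_close, which differ from current aiohttp releases):

import asyncio

import aiohttp
import bs4

@asyncio.coroutine
def get(*args, **kwargs):
    # Fetch a page and return its decoded body.
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close(decode=True))

def first_magnet(page):
    # Plain synchronous parsing with BeautifulSoup.
    soup = bs4.BeautifulSoup(page)
    a = soup.find('a', title='Download this torrent using magnet')
    return a['href']

@asyncio.coroutine
def print_magnet(query):
    url = 'http://thepiratebay.se/search/{}/0/7/0'.format(query)
    page = yield from get(url, compress=True)
    magnet = first_magnet(page)
    print('{}: {}'.format(query, magnet))

distros = ['archlinux', 'ubuntu', 'debian']
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([print_magnet(d) for d in distros]))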


Once you are comfortable with all of this, I suggest you take a look at asyncio's documentation and the aiohttp examples, which will show you just how much potential asyncio has.

One limitation of this approach (in fact, of any hand-written approach) is that there is no standalone library for dealing with forms. Mechanize-style tools come with a lot of helpers that make it easy to submit forms; if you don't use them, you'll have to handle forms yourself, which can lead to bugs. Maybe I'll write such a library some day (but I haven't needed it so far).

Additional advice: do not make too many requests to the server at once

Making 3 requests at the same time is cool; making 5,000 at a time is less fun. If you try to make too many requests at once, connections may start to break, and you may even get banned from the site.

To avoid this, you can use a semaphore. It's a synchronization tool that can be used to limit the number of coroutines doing something at the same time. We just create the semaphore before building the loop, passing the number of simultaneous requests we want to allow as an argument:


sem = asyncio.Semaphore(5)


Then, we just need to replace the following line:



page = yield from get(url, compress=True)


with the same thing protected by the semaphore:



with (yield from sem):
    page = yield from get(url, compress=True)

This guarantees that at most 5 requests are in progress at the same time.
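Put together, only print_magnet changes and the rest of the scraper stays the same. A sketch (not from the original article) of the modified coroutine:

sem = asyncio.Semaphore(5)

@asyncio.coroutine
def print_magnet(query):
    url = 'http://thepiratebay.se/search/{}/0/7/0'.format(query)
    # Acquire the semaphore before making the request; it is released when
    # the with-block exits, so at most 5 downloads run at once.
    with (yield from sem):
        page = yield from get(url, compress=True)
    magnet = first_magnet(page)
    print('{}: {}'.format(query, magnet))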

tqdm is an excellent library for producing progress bars. The following coroutine works like asyncio.wait, but displays a progress bar showing how far along the work is:

@asyncio.coroutine
def wait_with_progress(coros):
    for f in tqdm.tqdm(asyncio.as_completed(coros), total=len(coros)):
        yield from f
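To use it, you would swap it in for the asyncio.wait call in the driver code from before (a short sketch; it assumes import tqdm at the top of the file):

import tqdm

f = wait_with_progress([print_magnet(d) for d in distros])
loop.run_until_complete(f)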


This article is from the "Tianajiejue" blog; please keep this source when republishing: http://10068262.blog.51cto.com/10058262/1627628

