Usage of asyncio in Python

Since mid-February, I've been looking at Python's aiohttp (that is, using the asynchronous library to write a crawler). Last weekend I finally wrote a passable version. Here I write down my own experience.

Async/await

These are central to Python's coroutines, and several of the posts I wrote in February were a preface to this. A month ago I was still ignorant of them and did not really understand coroutines; after a month of study I was suddenly enlightened, and I now roughly understand their past and present.

Async and await are new syntax introduced in Python 3.5 to enhance Python's support for coroutines. However, there is too little information about them, especially in Chinese. Thanks to Ipfans <ipfanscn at gmail.com> for translating PEP 0492 into Chinese. Here is the hyperlink: PEP 0492 coroutines with async and await syntax, Chinese translation. If the link fails, you can try my copy: Back up

A first look at asyncio

My own understanding of it is still shallow, limited to the early stage of using it (this is also the first asynchronous library I have used), so if there are errors you are welcome to point them out.

Before introducing asyncio, let's talk about Future, although this Future is different from the Future in asyncio. PEP 3148 – futures – execute computations asynchronously explains it. Here is part of my translation:

PEP 3148 – futures – execute computations asynchronously

Summary

This PEP proposes a design for an easy-to-use package that evaluates callables using threads and processes.

Motivation

Python currently has powerful primitives for building multi-threaded and multi-process applications, but even simple parallel operations require a lot of work: start the threads/processes, build the work/result queues, and wait for completion or some other termination condition (such as failure or timeout). It is also difficult to design an application with a global process/thread limit when each component invents its own parallel execution strategy.
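
To make the motivation concrete, here is a minimal sketch (my own illustration, not taken from the PEP) of the kind of code the futures package is meant to make easy: the executor starts the threads, queues the work and collects the results. The URL list is made up.

from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

URLS = ['http://www.example.com', 'http://www.python.org']  # made-up list

def load(url):
    # any callable can be submitted; here we fetch a page in a worker thread
    with urllib.request.urlopen(url, timeout=10) as conn:
        return url, len(conn.read())

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(load, url) for url in URLS]
    for future in as_completed(futures):
        url, size = future.result()
        print(url, size)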

Specification

Naming

The proposed package is called "futures" and it lives in a new "concurrent" top-level package. The futures library is placed in the "concurrent" namespace for several reasons. First, it prevents conflict with the existing "from __future__ import x" syntax in Python. In addition, the "concurrent" prefix indicates what the library is about, namely concurrency, which removes any ambiguity, since not everyone is familiar with Java's Futures, and it distinguishes the name from the "futures" term used in the stock market.

Finally, this opens up a new namespace in the standard library, aptly named "concurrent". We hope to add or move existing concurrency-related libraries here in the future. A prime example is the multiprocessing.Pool work, as well as other add-ons included in that module, which cover both threads and processes.

asyncio is a standard library introduced in Python 3.4; it can be used with at least Python 3.3 (installed manually). Since yield from was introduced in 3.3, Python has had the basic prerequisites for running asyncio. In version 3.5, the new syntax made Python coroutines native objects rather than a special kind of generator.

Here is the introduction from the official documentation:

This module provides infrastructure for writing single-threaded concurrent code using coroutines, multiplexing I/O access, running network clients and servers, and other related primitives.

The following is a detailed list of package contents:

A pluggable event loop with various system-specific implementations

Transport and protocol abstractions (similar to those in Twisted); Twisted is an asynchronous library for Python

Concrete support for TCP, UDP, SSL, subprocess pipes, delayed calls, and others (some may be system-dependent)

A Future class that mimics the one in the concurrent.futures module, but adapted for use with the event loop

Coroutines and Tasks based on yield from (PEP 380), to help write concurrent code in a sequential fashion

Cancellation support for Futures and coroutines

Synchronization primitives for use between coroutines in a single thread, mimicking those in the threading module

An interface for passing work off to a thread pool, for when you have to use a library that makes blocking I/O calls (a small sketch follows)
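
To illustrate that last item, here is a minimal sketch (my own, not from the documentation) of handing a blocking call to the default thread pool with loop.run_in_executor(); time.sleep() stands in for any blocking library call:

import asyncio
import time

def blocking_io():
    time.sleep(1)  # stands in for a library call that blocks
    return 'done'

async def main(loop):
    # the blocking call runs in the default thread pool, so the event loop stays free
    result = await loop.run_in_executor(None, blocking_io)
    print(result)

loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
loop.close()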

[The New Asyncio Module in Python 3.4: Event Loops]; here is part of my translation of it:

The asyncio module comprises the following main components:

Event loop

The event loop multiplexes I/O, serializes event handling, and works in a policy pattern that is very flexible for custom platforms and frameworks. For example, Tornado, Twisted, and gevent can work with asyncio or build on top of it. In fact, the event loop design was influenced by Tornado and Twisted. In addition, asyncio selects the best I/O mechanism for each platform: Unix and Linux work with selectors, while Windows-based systems work with IOCP (short for I/O Completion Ports).

Futures

This is the abstraction over a deferred result (or exception) produced by a task. The asyncio.futures.Future class is similar to the Future introduced in Python 3.2 and described in PEP 3148, i.e. the concurrent.futures.Future class. However, here Future is adapted for use with coroutines, and its API differs from the class described in PEP 3148. The asyncio module does not reuse the existing concurrent.futures.Future class because that one is designed to work with threads. Instead, the module encourages the use of yield from to suspend the current task while waiting for a result inside a coroutine, which avoids blocking your application: your coroutine is suspended until the result is produced, but the event loop is not blocked. If the same event loop has other tasks scheduled, they can run in the meantime. When the result is produced, the suspended coroutine resumes, and you keep writing code in the same sequential style; you can read the code as if the yield from were not there. When you use yield from on the Future returned by a function, you can forget the details of how the Future is executed and its specific API. If an exception is raised, for example because the function you call does not return a Future but fails during ordinary sequential execution, the exception is simply propagated. So writing asynchronous code looks just like writing synchronous code, except for the added yield from. If you have experience with Twisted, you will notice that its @defer.inlineCallbacks decorator has the same effect.
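
A minimal sketch of waiting on an asyncio Future (my own example, written with the 3.5 await spelling; the value and the delay are made up):

import asyncio

async def waiter(fut):
    # the coroutine suspends here, but the event loop keeps running
    value = await fut
    print('got', value)

loop = asyncio.get_event_loop()
fut = asyncio.Future()
loop.call_later(0.1, fut.set_result, 'hello')  # a producer sets the result later
loop.run_until_complete(waiter(fut))
loop.close()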

Coroutines

These are generator functions that can both produce and accept values, and they must be decorated with @coroutine (@asyncio.coroutine). The @coroutine decorator indicates that you use yield from to wait on each Future. The decorator also makes sure that whoever reads the code knows it is written in the asynchronous style. Inside these generator functions you must use yield from. If you are familiar with C# 5.0, you will notice that @coroutine and yield from play the same role as the async and await keywords in C#.
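
A sketch of the pre-3.5 generator-based style described above (my own example; the sleep merely stands in for some awaited Future):

import asyncio

@asyncio.coroutine
def compute(x):
    # yield from suspends this coroutine until the awaited Future (here a sleep) completes
    yield from asyncio.sleep(0.1)
    return x * 2

loop = asyncio.get_event_loop()
print(loop.run_until_complete(compute(21)))  # 42
loop.close()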

Tasks

Each task is a coroutine wrapped in a Future that runs with the event loop. The asyncio.Task class is a subclass of asyncio.Future. As you may guess, tasks also work with yield from.
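
For example, wrapping coroutines into Tasks so the event loop can run them side by side (a sketch with made-up coroutines):

import asyncio

async def tick(name, delay):
    await asyncio.sleep(delay)
    print(name, 'finished')

loop = asyncio.get_event_loop()
# ensure_future() wraps each coroutine into an asyncio.Task scheduled on the loop
tasks = [asyncio.ensure_future(tick('a', 0.2)),
         asyncio.ensure_future(tick('b', 0.1))]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()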

Transports

They are equivalent to connections, such as sockets and pipes.

Protocols

They are equivalent to applications, such as HTTP servers, SMTP, and FTP.

That is a long list, but what it really boils down to is this:

In asyncio, a Future is an abstraction over a result that is not ready yet
Futures are used inside coroutines
A function decorated with the asyncio.coroutine decorator is a coroutine, and it can be wrapped into a Future (a Task)
asyncio.Task is a subclass of asyncio.Future
Futures run together with the event loop
The new async and await syntax in Python 3.5 can be thought of simply as a nicer spelling of asyncio.coroutine and yield from. The new syntax also brings async with EXPR as VAR, the asynchronous context manager (roughly replacing with (yield from EXPR) as VAR), and async for TARGET in ITER, the asynchronous iterator (roughly replacing for TARGET in (yield from ITER)). Unlike yield from, await applies only to awaitables such as native coroutines carrying the CO_COROUTINE flag, that is, objects defined with the async def syntax.
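
A sketch of what the new async with syntax expects: a class whose __aenter__/__aexit__ are themselves coroutines (the class and its sleeps are made up for illustration):

import asyncio

class Session:
    # an asynchronous context manager: __aenter__/__aexit__ are coroutines
    async def __aenter__(self):
        await asyncio.sleep(0.1)  # e.g. open a connection
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await asyncio.sleep(0.1)  # e.g. close the connection

async def main():
    async with Session() as s:
        print('inside the async context manager', s)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()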

So how do you write a coroutine with the Python 3.5+ syntax? Here is an example from the official documentation:

# Example of coroutine displaying "Hello world"
import asyncio

async def hello_world():
    print("Hello world!")

loop = asyncio.get_event_loop()
# Blocking call which returns when the hello_world() coroutine is done
loop.run_until_complete(hello_world())
loop.close()
This code does the following things:

Defines a coroutine (async def), which can be wrapped into a Future object
Creates the default event loop (loop = asyncio.get_event_loop())
Runs the coroutine to completion with loop.run_until_complete()
Closes the event loop
The asyncio programming model is essentially a message loop: you obtain a reference to the event loop from the asyncio module and throw the coroutines that need to run into it.

Queues in asyncio

I am not sure where this part fits best, but the examples below need it, so it goes here.

Let me emphasize one point at the start: a function defined with async def is always a coroutine, a coroutine is an awaitable object, and await is used on awaitable objects.
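
A tiny sketch of that chain (my own check, not from any documentation):

import asyncio
import inspect

async def answer():
    return 42

coro = answer()                   # calling an async def function returns a coroutine object
print(inspect.isawaitable(coro))  # True: a coroutine is an awaitable

async def main():
    print(await coro)             # await works on any awaitable; prints 42

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()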

I assume you have some understanding of Python's queue module (it doesn't matter much if you don't).

In the asyncio module, queues likewise come in three kinds: Queue (FIFO, first in, first out), PriorityQueue (entries are retrieved in priority order), and LifoQueue (last in, first out). They are not thread safe. Here are a few common methods:

coroutine get() – dequeue, a blocking operation: remove and return an item from the queue. If the queue is empty, block until an item is available. This method is a coroutine.
get_nowait() – dequeue, non-blocking: remove and return an item from the queue. Return an item if one is immediately available, otherwise raise QueueEmpty.
coroutine join() – block until all items in the queue have been processed. The count of unfinished tasks goes up whenever an item is put into the queue. Whenever a consumer calls task_done(), meaning it has retrieved an item and finished working on it, the count goes down. join() unblocks when the count of unfinished tasks drops to zero. This method is a coroutine.
coroutine put(item) – enqueue. If the queue is full, wait until a free slot is available before adding the item. This method is a coroutine.
put_nowait(item) – enqueue without blocking. If the queue is full, raise QueueFull.
task_done() – indicate that a previously enqueued task is complete. Called by queue consumers. For each get() used to fetch a task, a following task_done() call tells the queue that processing of that task is finished. If a join() is currently blocking, it resumes when every item has been processed (that is, a task_done() call was received for every item that had been put() into the queue). Raises ValueError if called more times than there were items placed in the queue.
Here are two examples:

# Example of how the queue works
import asyncio
from asyncio import Queue

async def work(q):
    while True:
        i = await q.get()
        try:
            print(i)
            print('q.qsize():', q.qsize())
        finally:
            q.task_done()

async def run():
    q = Queue()
    await asyncio.wait([q.put(i) for i in range(10)])
    tasks = [asyncio.ensure_future(work(q))]
    print('wait join')
    await q.join()
    print('end join')
    for task in tasks:
        task.cancel()

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(run())
    loop.close()
This is a simple queue consumer, primarily meant to illustrate how functions marked as coroutines are used (note asyncio.wait()). Have you noticed how similar they are?

But what happens when coroutines act as both producers and consumers?

import asyncio
from asyncio import Queue

class Test:
    def __init__(self):
        self.que = Queue()
        self.pue = Queue()

    async def consumer(self):
        while True:
            try:
                print('consumer', await self.que.get())
            finally:
                try:
                    self.que.task_done()
                except ValueError:
                    if self.que.empty():
                        print("que empty")

    async def work(self):
        while True:
            try:
                value = await self.pue.get()
                print('producer', value)
                await self.que.put(value)
            finally:
                try:
                    self.pue.task_done()
                except ValueError:
                    if self.pue.empty():
                        print("pue empty")

    async def run(self):
        await asyncio.wait([self.pue.put(i) for i in range(10)])
        tasks = [asyncio.ensure_future(self.work())]
        tasks.append(asyncio.ensure_future(self.consumer()))
        print('p queue join')
        await self.pue.join()
        print('p queue done & q queue join')
        await self.que.join()
        print('q queue is done')
        for task in tasks:
            task.cancel()

if __name__ == '__main__':
    print('----start----')
    case = Test()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(case.run())
    print('----end----')
There is not much to explain here; just pay attention to the behavior of task.cancel(). When task.cancel() is invoked, a CancelledError is raised inside the coroutine (inside the try block) on the next pass of the event loop. As for why the exception handling around task_done() is needed, remember the description of task_done() above?
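
A sketch of the behavior just described (my own example; the worker and queue names are made up): the worker sees CancelledError at its await point on the next pass of the event loop.

import asyncio

async def worker(q):
    while True:
        try:
            item = await q.get()  # cancel() makes CancelledError appear here
            print('got', item)
            q.task_done()
        except asyncio.CancelledError:
            print('worker cancelled, cleaning up')
            raise  # re-raise so the task really ends as cancelled

loop = asyncio.get_event_loop()
q = asyncio.Queue()
task = asyncio.ensure_future(worker(q))
loop.run_until_complete(asyncio.sleep(0.1))  # let the worker start and block on get()
task.cancel()                                # CancelledError is delivered on the next loop cycle
loop.run_until_complete(asyncio.wait([task]))
loop.close()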

A second look at asyncio

With the simple examples above, I'm sure you have learned to write a coroutine. Here I will demonstrate my understanding of asyncio with a crawler. In the end the crawler will do the following: several workers crawl web pages of IP proxies, extract the proxies, and send them to a queue, while other workers test the crawled IPs. When there is nothing left to do, a worker is paused, but a paused worker wakes up and works again as soon as there is data in the queue that needs work. The program has a fixed running time and ends as soon as that time is up.

Let me put the questions first:

How do I write a coroutine using the aiohttp library?
How can I crawl multiple web pages concurrently in one event loop (to get an effect similar to multithreading/multiprocessing)?
How do the coroutines communicate with each other?
The first question is simple:

import asyncio
import aiohttp
import re

URL = "http://www.ip84.com/gn-http/"

async def fetch_page(url):
    async with aiohttp.get(url) as response:
        try:
            assert response.status == 200
            print("ok!", response.url)
            return await response.text()
        except AssertionError:
            print('error!', response.url, response.status)

async def filter_page(url):
    page = await fetch_page(url)
    if page:
        pattern = re.compile(
            r'<tr>.*?<td>(.*?)</td>.*?<td>(.*?)</td>.*?<td>.*?</td>.*?<td>(.*?)</td>'
            r'.*?<td>(.*?)</td>.*?<td>(.*?)</td>.*?<td>(.*?)</td>.*?</tr>', re.S)
        data = pattern.findall(page)
        for item in data:
            print(item)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    for i in range(1, 21):
        loop.run_until_complete(filter_page(URL + repr(i)))
    loop.close()
For the second question, here is my initial approach:

# only the main function is different
if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    for i in range(1, 21, 4):
        fs = asyncio.wait([filter_page(URL + repr(i + j)) for j in range(4)])
        loop.run_until_complete(fs)
    loop.close()
asyncio.wait(fs): fs is a list of coroutines, which this function wraps into tasks (I don't fully understand the details of what it does), but one of the examples below may be enlightening.

At this point, simple as the plan is, the second question is actually solved. As for communication between coroutines: the crawled IPs need to be tested, so the IPs have to be passed between the coroutines that fetch them and the coroutines that test them. Isn't that exactly a producer/consumer problem?

import asyncio
from asyncio import Queue
import aiohttp
import time
import re

class Crawl:
    def __init__(self, url, test_url, *, number=10, max_tasks=5):
        self.url = url
        self.test_url = test_url
        self.number = number
        self.max_tasks = max_tasks
        self.url_queue = Queue()
        self.raw_proxy_queue = Queue()
        self.proxy_queue = Queue()  # verified proxies; filled by the full crawler (not shown in this excerpt)
        self.session = aiohttp.ClientSession()  # tips: connection pool

    async def fetch_page(self, url):
        async with aiohttp.get(url) as response:
            try:
                assert response.status == 200
                print("ok!", response.url)
                return await response.text()
            except AssertionError:
                print('error!', response.url, response.status)

    async def filter_page(self, url):
        page = await self.fetch_page(url)
        if page:
            pattern = re.compile(
                r'<tr>.*?<td>(.*?)</td>.*?<td>(.*?)</td>.*?<td>.*?</td>.*?<td>(.*?)</td>'
                r'.*?<td>(.*?)</td>.*?<td>(.*?)</td>.*?<td>(.*?)</td>.*?</tr>', re.S)
            data = pattern.findall(page)
            print(len(data))
            for raw in data:
                item = list(map(lambda word: word.lower(), raw))
                await self.raw_proxy_queue.put({'ip': item[0], 'port': item[1],
                                                'anonymous': item[2], 'protocol': item[3],
                                                'speed': item[4], 'checking-time': item[5]})
            if not self.raw_proxy_queue.empty():
                print('ok! raw_proxy_queue size:', self.raw_proxy_queue.qsize())

    async def verify_proxy(self, proxy):
        addr = proxy['protocol'] + '://' + proxy['ip'] + ':' + proxy['port']
        conn = aiohttp.ProxyConnector(proxy=addr)
        try:
            session = aiohttp.ClientSession(connector=conn)
            with aiohttp.Timeout(10):  # timeout in seconds (value garbled in the original; 10 assumed)
                start = time.time()
                # close connection and response, otherwise aiohttp warns about
                # "unclosed connection" and "unclosed response"
                async with session.get(self.test_url) as response:
                    end = time.time()
                    try:
                        assert response.status == 200
                        print('good proxy: {} {}s'.format(proxy['ip'], end - start))
                    except:  # ProxyConnectionError, HttpProxyError and so on?
                        print('bad proxy: {}, {}, {}s'.format(proxy['ip'], response.status, end - start))
        except:
            print('timeout {}, q size: {}'.format(proxy['speed'], self.raw_proxy_queue.qsize()))
        finally:  # close the session on timeout as well
            session.close()

    async def fetch_worker(self):
        while True:
            url = await self.url_queue.get()
            try:
                await self.filter_page(url)
            finally:
                self.url_queue.task_done()

    async def verify_worker(self):
        while True:
            raw_proxy = await self.raw_proxy_queue.get()
            if raw_proxy['protocol'] == 'https':  # only HTTP can be used
                continue
            try:
                await self.verify_proxy(raw_proxy)
            finally:
                try:
                    self.raw_proxy_queue.task_done()
                except ValueError:
                    pass

    async def run(self):
        await asyncio.wait([self.url_queue.put(self.url + repr(i + 1)) for i in range(self.number)])
        fetch_tasks = [asyncio.ensure_future(self.fetch_worker()) for _ in range(self.max_tasks)]
        verify_tasks = [asyncio.ensure_future(self.verify_worker()) for _ in range(10 * self.max_tasks)]
        tasks = fetch_tasks + verify_tasks
        await self.url_queue.join()
        self.session.close()  # close the session, otherwise an error is shown
        print("url_queue done")
        await self.raw_proxy_queue.join()
        print("raw_proxy_queue done")
        await self.proxy_queue.join()
        for task in tasks:
            task.cancel()

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    crawler = Crawl('http://www.ip84.com/gn-http/', test_url='https://www.baidu.com')
    loop.run_until_complete(crawler.run())
    loop.close()
Run it and try it.

The problem is this: the consumers are too fast. While the producers are still processing data, everything already in proxy_queue gets consumed, so the join() block is released even though coroutines still have data in flight... So what do we do? One direction worth thinking about is the task function wait_for().

I'll stop here ~ I won't tell you the answer -_-

I hope you will read the source code of the web spider example built with aiohttp; then you will understand.

Tasks in asyncio

Task functions

Note: in the task functions below, the optional loop argument allows explicitly setting the event loop used by the underlying task or coroutine. If it is not provided, the default event loop is used.

coroutine asyncio.wait(fs, *, loop=None, timeout=None, return_when=ALL_COMPLETED)

Wait for the Futures and coroutine objects given by the sequence fs to complete.
The sequence of futures must not be empty.
Coroutines will be wrapped into Tasks.
Returns two sets of Futures: (done, pending).
timeout can be used to control the maximum number of seconds to wait before returning; it can be an int or a float. If timeout is not specified or is None, there is no limit to the wait time.
return_when indicates when this function should return. It must be one of the constants of the concurrent.futures module:
Constant – description
FIRST_COMPLETED – the function returns when any future finishes or is cancelled
FIRST_EXCEPTION – the function returns when any future finishes by raising an exception; if no future raises an exception, it is equivalent to ALL_COMPLETED
ALL_COMPLETED – the function returns when all futures finish or are cancelled
This function is a coroutine.
Usage:
done, pending = yield from asyncio.wait(fs)
Attention:
This does not raise asyncio.TimeoutError! Futures that are not done when the timeout occurs are simply returned in the second set.
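A sketch of handling the (done, pending) sets when a timeout is given (my own example, written with the 3.5 await spelling instead of yield from):

import asyncio

async def job(delay):
    await asyncio.sleep(delay)
    return delay

async def main():
    # after 1 second the slow job is still pending; no TimeoutError is raised
    done, pending = await asyncio.wait([job(0.1), job(5)], timeout=1)
    print('done:', len(done), 'pending:', len(pending))
    for task in pending:
        task.cancel()  # clean up whatever did not finish in time

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
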
cancel()

Cancels a task.
It arranges for a CancelledError to be thrown into the wrapped coroutine on the next cycle of the event loop. The coroutine then has a chance to clean up or even use try/except/finally to refuse the request.
Unlike Future.cancel(), this does not guarantee that the task will be cancelled: the coroutine may catch the exception and handle it, delaying cancellation of the task or preventing it entirely. The task may also return a value or raise a different exception.
Immediately after this method is called, cancelled() will not return True (unless the task was already cancelled).
coroutine asyncio.wait_for(fut, timeout, *, loop=None)

Wait for a Future or coroutine object to complete within the given time. If timeout is None, block until the future completes.
A coroutine will be wrapped into a Task.
Returns the result of the Future or coroutine. If a timeout occurs, it cancels the task and raises asyncio.TimeoutError. To avoid the task being cancelled, wrap it in shield().
If the wait is cancelled, the future fut is also cancelled.
This function is a coroutine. Usage:
result = yield from asyncio.wait_for(fut, 60.0)
shield() is also one of the task functions, but I have not used it yet ~
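
For completeness, here is a sketch of what shield() is for, based only on the documentation (I haven't used it in the crawler either): the outer wait_for() times out, but the shielded inner task keeps running instead of being cancelled.

import asyncio

async def inner():
    await asyncio.sleep(5)
    return 'inner result'

async def main():
    task = asyncio.ensure_future(inner())
    try:
        result = await asyncio.wait_for(asyncio.shield(task), 0.1)
        print(result)
    except asyncio.TimeoutError:
        # only the shield was cancelled; the inner task is still alive
        print('timed out, inner task cancelled?', task.cancelled())
        task.cancel()               # cancel it ourselves once we are done with it
        await asyncio.wait([task])  # let the cancellation settle

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()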

Some thoughts on crawlers

It is often said that coroutines are much cheaper than threads, and that is indeed true. Even with 1000 simultaneous HTTP requests, the CPU usage on this i5-4200U is only about 5%. But for the regex matching, only 5 coroutines already consume a large amount of CPU. Regex matching is a compute-intensive task, while an HTTP request is an I/O-intensive task. My thought: separating the I/O-intensive work from the compute-intensive work can improve the efficiency of the crawler (a small sketch of this idea follows the diagram below). It is said that Redis is well suited for a cache queue, so the next crawler could look like this:

(coroutine)  HTTP request   <----  Redis URL queue
(coroutine)  HTTP response  ---->  Redis response queue  ---->  multiprocessing filter  ---->  MongoDB
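
A minimal sketch of that separation (my own; Redis and MongoDB are left out, and the pattern and page are placeholders): the regex work is pushed to a process pool so it does not block the event loop.

import asyncio
import re
from concurrent.futures import ProcessPoolExecutor

PATTERN = re.compile(r'<td>(.*?)</td>', re.S)  # placeholder pattern

def cpu_bound_filter(page):
    # compute-intensive regex work runs in a worker process, not in the event loop
    return PATTERN.findall(page)

async def handle(loop, executor, page):
    rows = await loop.run_in_executor(executor, cpu_bound_filter, page)
    print(len(rows), 'fields extracted')

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    executor = ProcessPoolExecutor()
    page = '<td>1.2.3.4</td><td>8080</td>'  # placeholder for a fetched page
    loop.run_until_complete(handle(loop, executor, page))
    executor.shutdown()
    loop.close()
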
A few days ago at an open-source community event, a senior who already works in industry said: "The company has a very interesting anti-crawler system; crawlers that do not affect normal operation are left alone, but excessive crawling gets killed. A not-so-excessive crawler can also get taken out after a system upgrade." As for using proxies: there are only so many public proxy servers worldwide, and there are public blacklists of proxy servers, so those IPs are certainly watched closely.
