Python coroutines (learning notes: come in if you want to learn)


The growth of real knowledge is like the growth of an ear of wheat: while the ear is still empty, the wheat stands tall and holds its head proudly; but when the ear is mature and full, it begins to bow its head humbly.
-- Montaigne, Essays

The previous article discussed whether Python multithreading is worth using. It was well received by a number of readers, though some disagreed, saying that coroutines are so much better that multithreading is worthless by comparison. Well, I agree with that; however, the point of the previous article was not to compare multithreading with coroutines, but to show that multithreading still has a role to play in IO-bound programs.

As for coroutines, I claimed their efficiency beats multithreading, but my understanding of them was not deep, so over the past few days I consulted some material and organized my notes. I share them here for reference only; if there are mistakes, please correct me, and thank you.

Disclaimer: this article is an entry-level introduction to coroutines; experts may want to skip it.

Outline: this article first introduces the concept of coroutines, then their usage in Python 2.x and 3.x, and finally compares coroutines with multithreading and introduces asynchronous crawler modules.

The concept of coroutines

A coroutine, also known as a micro-thread or fiber (English name: coroutine). What a coroutine does: while function A is executing, it can be interrupted at any point to execute function B, after which execution returns to function A where it left off (the switching is free and under the program's control). This is not a function call (no call statement is involved); the whole process looks like multithreading, yet only one thread ever executes.
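To make this concrete, here is a minimal sketch of my own (not from the original article) using plain generators: next() resumes each function exactly where it last paused, so A and B interleave inside a single thread without any call from one to the other.

def func_a():
    print("A step 1")
    yield              # pause A, hand control back to the caller
    print("A step 2")
    yield

def func_b():
    print("B step 1")
    yield
    print("B step 2")
    yield

# a toy "scheduler": alternate between the two paused functions
a, b = func_a(), func_b()
for _ in range(2):
    next(a)
    next(b)

Running this prints A step 1, B step 1, A step 2, B step 2: the two functions take turns, which is the essence of cooperative switching.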

Advantages
    • Extremely efficient execution, because switching between subroutines (functions) is not a thread switch but is controlled by the program itself, so there is no thread-switching overhead. Compared with multithreading, the more threads there are, the more obvious the performance advantage of coroutines becomes.
    • No need for multithreading's locking mechanism: since there is only one thread, there are no conflicting simultaneous writes to shared variables, so shared resources can be manipulated without locks (see the sketch after this list), which also makes execution more efficient.
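As a small illustration of the second point, here is a sketch of my own (it uses the async/await syntax covered later in this article): two coroutines update a shared counter without a lock, and no update is ever lost, because switches can only happen at an await.

import asyncio

count = 0

async def add(n):
    global count
    for _ in range(n):
        count += 1              # a complete read-modify-write; never interrupted,
        await asyncio.sleep(0)  # because switching can only happen here, at the await

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(add(10000), add(10000)))
loop.close()
print(count)                    # always 20000, no lock required

The equivalent threaded program would need a lock around the increment to guarantee the final count.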

Note: coroutines are well suited to IO-bound programs, but CPU-bound work is not their forte. To make full use of the CPU, you can combine multiprocessing with coroutines, as the sketch below illustrates.
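The "multi-process + coroutine" combination could look roughly like this sketch of my own (the names cpu_work and io_work are hypothetical): multiprocessing spreads CPU-bound work across cores, while each worker process runs its own event loop to overlap its IO-bound work.

import asyncio
import multiprocessing

def cpu_work(n):
    # the CPU-bound part: runs in its own process, on a real core
    return sum(i * i for i in range(n))

async def io_work(i):
    await asyncio.sleep(1)      # stands in for a blocking IO call
    return i

def worker(n):
    # each process runs its own event loop for the IO-bound parts
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    results = loop.run_until_complete(
        asyncio.gather(*[io_work(i) for i in range(3)]))
    loop.close()
    return cpu_work(n), results

if __name__ == "__main__":
    pool = multiprocessing.Pool(2)   # roughly one process per core
    print(pool.map(worker, [100000, 200000]))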

The above is just the concept of coroutines, which may sound abstract, so let me illustrate it with code. This article mainly covers coroutine usage in Python. Python 2's support for coroutines is relatively limited: the generator's yield implements part of the idea but not all of it, and the third-party gevent module is a better implementation. Python 3.4 introduced the asyncio module, which supports coroutines well.

Python 2.x coroutines

Coroutine support available in Python 2.x:

    • yield
    • gevent

Python 2.x does not have many modules that support coroutines; gevent is the most commonly used, so here is a brief introduction to its usage.

Gevent

Gevent is a third-party library that implements coroutines through greenlets. The basic idea:
When a greenlet encounters an IO operation, such as a network access, it automatically switches to another greenlet, and switches back at an appropriate time once the IO operation completes. Because IO operations are very time-consuming and often leave a program waiting, having gevent switch coroutines for us automatically guarantees that some greenlet is always running instead of waiting on IO.

Install

pip install gevent
The latest version seems to support Windows; when I tested it earlier, it did not seem to run on Windows...

Usage

Let's start with a simple crawler example:

#! -*- coding:utf-8 -*-
import gevent
from gevent import monkey; monkey.patch_all()
import urllib2

def get_body(i):
    print "start", i
    urllib2.urlopen("http://cn.bing.com")
    print "end", i

tasks = [gevent.spawn(get_body, i) for i in range(3)]
gevent.joinall(tasks)

Output:

start 0
start 1
start 2
end 2
end 0
end 1

Note: From the output you can see the order of execution: get_body first prints "start", and when execution reaches urllib2 and blocks on IO, it automatically switches to the next task (which continues printing "start"); "end" is only printed once urllib2 returns its result. In other words, the program does not sit waiting for urllib2's request to come back; it skips ahead and collects the return value after the request completes. It is worth mentioning that only one thread is executing throughout this process, which is what distinguishes it from multithreading.
Compare this with the multithreaded version:

import threading
import urllib2

def get_body(i):
    print "start", i
    urllib2.urlopen("http://cn.bing.com")
    print "end", i

for i in range(3):
    t = threading.Thread(target=get_body, args=(i,))
    t.start()

Output:

start 0
start 1
start 2
end 1
end 2
end 0

Note: From the output, multithreading achieves the same effect as coroutines: switching away when IO blocks. The difference is that multithreading switches between threads, while coroutines switch between contexts (which you can think of as the functions being executed) within a single thread. The overhead of switching threads is clearly greater than the overhead of switching contexts, so the more threads there are, the more efficient coroutines are relative to multithreading. (I would guess that switching between processes has the largest overhead of all.)

Notes on gevent usage
    • monkey patching can turn some blocking modules into non-blocking ones. Mechanism: switch automatically on IO operations; you can also switch manually with gevent.sleep(0) (applied to the crawler code above, this achieves the same context-switching effect; see the sketch after this list)
    • gevent.spawn starts a coroutine; its arguments are the function followed by that function's arguments
    • gevent.joinall waits for all the coroutines to finish
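
A minimal sketch of the manual switch mentioned above (my own example, in the same Python 2 style as the code in this section): each gevent.sleep(0) hands control to the other greenlet, so the two tasks interleave even though neither performs any real IO.

import gevent

def task(name):
    for step in range(3):
        print name, step
        gevent.sleep(0)      # yield control so the other greenlet can run

gevent.joinall([gevent.spawn(task, "a"),
                gevent.spawn(task, "b")])

This prints a 0, b 0, a 1, b 1, a 2, b 2: the same alternation that automatic switching produces at real IO points.
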
Python 3.x coroutines

For coroutine usage in Python 3.5, you can refer to: Python 3.5 coroutine study notes.

To test coroutine usage under Python 3.x, I installed a Python 3.6 environment in a virtualenv.
Coroutine support available in Python 3.x:

    • asyncio + yield from (Python 3.4)
    • asyncio + async/await (Python 3.5)
    • gevent

Python 3.4 and later include the asyncio module, which supports coroutines well.

asyncio

asyncio is a standard library introduced in Python 3.4; it has built-in support for asynchronous IO. Asynchronous operations in asyncio are performed through yield from inside a coroutine.

Usage

Example (requires Python 3.4 or later):

import asyncio

@asyncio.coroutine
def test(i):
    print("test_1", i)
    r = yield from asyncio.sleep(1)
    print("test_2", i)

loop = asyncio.get_event_loop()
tasks = [test(i) for i in range(5)]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()

Output:

test_1 3
test_1 4
test_1 0
test_1 1
test_1 2
test_2 3
test_2 0
test_2 2
test_2 4
test_2 1

Note: The output shows the same effect achieved with gevent: switching on IO operations (all the test_1 lines are printed before any test_2 line). One thing puzzled me at first: why are the test_1 lines not printed in order, unlike the gevent output? A likely explanation: asyncio.wait puts the tasks into a set before scheduling them, so their start order is not guaranteed.

Notes on asyncio

@asyncio.coroutine marks a generator as a coroutine, which we then hand to the event loop to execute.
test() first prints test_1; the yield from syntax then lets us conveniently call another generator. Since asyncio.sleep() is itself a coroutine, the thread does not wait for it, but is interrupted and moves on to the next iteration of the message loop. When asyncio.sleep() returns, the thread receives its return value (None here) via yield from and executes the next statement.
Think of asyncio.sleep(1) as an IO operation that takes one second. During that second the main thread does not wait; it runs whatever other coroutines in the event loop are ready, which is how concurrent execution is achieved.
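To illustrate the "receives its return value via yield from" part, here is a small sketch of my own (the names fetch and caller are hypothetical): the caller suspends at yield from and resumes with the inner coroutine's return value once it completes.

import asyncio

@asyncio.coroutine
def fetch(i):
    yield from asyncio.sleep(1)   # stands in for a one-second IO operation
    return "result-%d" % i        # becomes the value of the yield from below

@asyncio.coroutine
def caller(i):
    r = yield from fetch(i)       # suspends here, resumes with fetch's result
    print(r)

loop = asyncio.get_event_loop()
loop.run_until_complete(caller(0))
loop.close()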

asyncio + async/await

To simplify asynchronous IO and make it easier to recognize, Python 3.5 introduced the new syntax async and await, which makes coroutine code more concise and readable.
Note that async and await are simply new syntax for coroutines; to adopt it, you only need two simple substitutions:

    • replace @asyncio.coroutine with async;
    • replace yield from with await.
Usage

Example (requires Python 3.5 or later):

import asyncio

async def test(i):
    print("test_1", i)
    await asyncio.sleep(1)
    print("test_2", i)

loop = asyncio.get_event_loop()
tasks = [test(i) for i in range(5)]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()

The output is the same as in the previous run.
Note: compared with the previous section, we merely replaced yield from with await and @asyncio.coroutine with async; everything else is unchanged.

Gevent

Same usage as in Python 2.x.

Coroutines vs. multithreading

If the introduction above has made the difference between multithreading and coroutines clear, then I think a benchmark is hardly necessary: as the number of threads grows, the main overhead of multithreading is spent on thread switching, whereas coroutines switch within a single thread, so the overhead is much smaller. That is perhaps the fundamental performance difference between the two. (Personal view.)
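If you want a rough feel for the numbers, here is a toy measurement of my own (not a rigorous benchmark, and thread creation is not the same thing as a thread context switch, but it gives a sense of scale): resuming a paused generator, which is essentially what a coroutine switch is, versus creating, starting, and joining a thread.

import threading
import timeit

def gen():
    while True:
        yield

g = gen()
print("100k generator resumes:",
      timeit.timeit(lambda: next(g), number=100000))

def one_thread():
    t = threading.Thread(target=lambda: None)
    t.start()
    t.join()

print("1k thread start/join:",
      timeit.timeit(one_thread, number=1000))

On a typical machine the generator resumes, all hundred thousand of them, finish in a few milliseconds, while a mere thousand thread round-trips take considerably longer.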

Asynchronous crawler

Perhaps most readers interested in coroutines care about using them for crawlers (since coroutines handle IO blocking well), but I found that the common urllib and requests libraries cannot be combined with asyncio, probably because those modules are themselves synchronous (or perhaps I simply could not find the right usage). So when we need an asynchronous crawler, how can coroutines help? Or rather, how do we write an asynchronous crawler?
Here are a few approaches as I understand them:

    • grequests (an asynchronous version of the requests module)
    • a crawler module + gevent (the one I would recommend)
    • aiohttp (there does not seem to be much documentation, and I have not mastered it yet; a minimal sketch follows this list)
    • asyncio's built-in fetching facilities (also rather hard to use)
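
For the aiohttp option, here is a minimal sketch of my own, assuming a recent aiohttp is installed (the API shown follows aiohttp's current documentation, which may be newer than what was available when this article was written):

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        # issue the three requests concurrently and gather the pages
        pages = await asyncio.gather(*[fetch(session, "http://cn.bing.com")
                                       for _ in range(3)])
        print([len(page) for page in pages])

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
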
Coroutine pool

Purpose: limit the number of concurrently running coroutines.

from bs4 import BeautifulSoup
import requests
import gevent
from gevent import monkey, pool
monkey.patch_all()

jobs = []
links = []
p = pool.Pool(10)          # at most 10 coroutines run at a time
urls = [
    'http://www.google.com',
    # ... another 100 urls
]

def get_links(url):
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, "html.parser")
        links.extend(soup.find_all('a'))   # collect every <a> tag found

for url in urls:
    jobs.append(p.spawn(get_links, url))
gevent.joinall(jobs)
