Now that Python has asyncio and aiohttp, are multithreading and multiprocessing still necessary for IO-bound tasks such as crawlers?

Last Update:2017-07-17 Source: Internet

Author: User


I've been learning about asynchronous programming in Python. After reading some blogs, I ran a few small tests comparing the efficiency of an asyncio+aiohttp crawler against an asyncio+aiohttp+concurrent.futures (thread pool/process pool) crawler. Note: the crawler does almost no computation. To measure async performance in isolation, it only issues network IO requests, that is, the program finishes as soon as aiohttp has fetched the pages.

The results showed that the former was more efficient than the latter. I asked another blogger (the blogger who provided the code did not reply to my message), and he said there was no need to use concurrent.futures: my tasks are all IO tasks, and spreading them across a thread pool/process pool only adds the overhead of switching between threads/processes, which lowers the crawler's efficiency. I thought about it, and that does seem to be the case.

So my question is this. For the page-fetching part alone, that is, the requests.get part, I think multithreading is definitely unnecessary because of the GIL, which is a big pitfall. A process pool may do better, but it still does not match an async crawler and wastes more resources. In that case, can we from now on rely entirely on the emerging asyncio+aiohttp for the page-fetching stage of a crawler (and for other IO tasks such as database/file reads and writes)?

Of course, the data-processing stage still needs multiple processes, but I think multithreading becomes completely useless: its original advantage over multiprocessing was IO-bound tasks, and that advantage now seems to be taken over entirely by async. (Of course, this question sets aside compatibility with Python 2.x.)

Note: one more question. Some blogs say that the requests library does not support asynchronous programming. What does that mean? They say that to get the full benefit of async you should use aiohttp. I have not read the requests source code, but some results do show that aiohttp performs better. Can you explain?
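What "does not support asynchronous programming" means in practice can be sketched without any network access. In the sketch below, `time.sleep` stands in for a blocking call such as `requests.get`, and `asyncio.sleep` stands in for an awaitable call such as an aiohttp request. The function names and the 0.2-second delay are made up for illustration and are not taken from the blogs mentioned above.

```python
import asyncio
import time

DELAY = 0.2  # hypothetical latency of one "request", in seconds


async def blocking_fetch(n):
    # Stand-in for requests.get(): a blocking call never yields to the
    # event loop, so no other coroutine can run until it returns.
    time.sleep(DELAY)
    return n


async def non_blocking_fetch(n):
    # Stand-in for an aiohttp request: awaiting hands control back to the
    # event loop, so the other "requests" can wait at the same time.
    await asyncio.sleep(DELAY)
    return n


async def timed(fetch, count=5):
    # Run `count` fetches through gather() and measure the wall-clock time.
    start = time.perf_counter()
    results = await asyncio.gather(*(fetch(i) for i in range(count)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, blocking_cost = asyncio.run(timed(blocking_fetch))
    _, async_cost = asyncio.run(timed(non_blocking_fetch))
    print("blocking calls inside coroutines: {:.2f}s".format(blocking_cost))
    print("awaitable calls: {:.2f}s".format(async_cost))
```

With the blocking stand-in the total should come out at roughly five times `DELAY`, because the calls run one after another even though they were passed to `asyncio.gather`. With the awaitable stand-in the total should stay close to a single `DELAY`.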

Code

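asyncio+aiohttp

    import asyncio
    import time

    import aiohttp

    # NUMBERS and URL are the same values used in the requests version.
    NUMBERS = range(12)
    URL = 'http://httpbin.org/get?a={}'


    async def fetch_async(a):
        # Fetch one URL and return the query parameter echoed by httpbin.
        async with aiohttp.request('GET', URL.format(a)) as r:
            data = await r.json()
        return data['args']['a']


    start = time.time()
    # On recent Python versions, asyncio.run() is the preferred entry point.
    event_loop = asyncio.get_event_loop()
    tasks = [fetch_async(num) for num in NUMBERS]
    # gather() runs all the requests concurrently on a single thread.
    results = event_loop.run_until_complete(asyncio.gather(*tasks))

    for num, result in zip(NUMBERS, results):
        print('fetch({}) = {}'.format(num, result))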

asyncio+aiohttp+thread pool: about 1 second slower than the version above.

    import asyncio
    import math
    import time
    from concurrent.futures import ThreadPoolExecutor

    import aiohttp

    NUMBERS = range(12)
    URL = 'http://httpbin.org/get?a={}'
    # Assumption: the original snippet does not show how `executor` and
    # `start` are created; a 3-worker thread pool matches the 3 chunks below.
    executor = ThreadPoolExecutor(3)
    start = time.time()


    async def fetch_async(a):
        async with aiohttp.request('GET', URL.format(a)) as r:
            data = await r.json()
        return a, data['args']['a']


    def sub_loop(numbers):
        # Runs in a worker thread, which needs an event loop of its own.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        tasks = [fetch_async(num) for num in numbers]
        results = loop.run_until_complete(asyncio.gather(*tasks))
        for num, result in results:
            print('fetch({}) = {}'.format(num, result))


    async def run(executor, numbers):
        await asyncio.get_event_loop().run_in_executor(executor, sub_loop, numbers)


    def chunks(l, size):
        # Split l into `size` roughly equal parts.
        n = math.ceil(len(l) / size)
        for i in range(0, len(l), n):
            yield l[i:i + n]


    event_loop = asyncio.get_event_loop()
    tasks = [run(executor, chunked) for chunked in chunks(NUMBERS, 3)]
    results = event_loop.run_until_complete(asyncio.gather(*tasks))

    print('Use asyncio+aiohttp+ThreadPoolExecutor cost: {}'.format(time.time() - start))

The traditional requests+ThreadPoolExecutor version is about 3 times slower than the above.

    import time

    import requests
    from concurrent.futures import ThreadPoolExecutor

    NUMBERS = range(12)
    URL = 'http://httpbin.org/get?a={}'


    def fetch(a):
        # Blocking request; each call occupies one worker thread until it returns.
        r = requests.get(URL.format(a))
        return r.json()['args']['a']


    start = time.time()
    with ThreadPoolExecutor(max_workers=3) as executor:
        for num, result in zip(NUMBERS, executor.map(fetch, NUMBERS)):
            print('fetch({}) = {}'.format(num, result))

    print('Use requests+ThreadPoolExecutor cost: {}'.format(time.time() - start))
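One thing worth keeping in mind when reading these timings: the thread pool above is capped at `max_workers=3`, while `asyncio.gather` in the first snippet starts all 12 requests at once. A minimal sketch of how the asyncio version could be capped for a like-for-like comparison, using `asyncio.Semaphore`, follows. Here `asyncio.sleep` stands in for the aiohttp request, and `fetch_limited`, `crawl` and `LIMIT` are hypothetical names that do not appear in the original code.

```python
import asyncio

NUMBERS = range(12)
LIMIT = 3  # assumed cap, chosen to match max_workers=3 in the thread pool


async def fetch_limited(semaphore, stats, a):
    async with semaphore:  # at most LIMIT coroutines are inside this block
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.05)  # stand-in for the aiohttp request
        stats["running"] -= 1
    return a


async def crawl(numbers, limit=LIMIT):
    semaphore = asyncio.Semaphore(limit)
    stats = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch_limited(semaphore, stats, n) for n in numbers)
    )
    # Return the results plus the highest number of simultaneous "requests".
    return results, stats["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(crawl(NUMBERS))
    print("results:", results)
    print("peak concurrency:", peak)
```

Whether the gap between the async and threaded versions stays the same under an equal cap is something only a real measurement against the target site can show.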

Addendum

The question above assumes CPython. Answers along the lines of "I just like multithreading and don't like the coroutine style" are obviously off topic. My main question is:
If Python cannot get rid of the GIL, I think the ideal model for the future is multi-process + coroutines (asyncio+aiohttp). uvloop, Sanic, and the crawler project in 500 Lines or Less have already started down this road. Setting compatibility aside, is this view correct, or are there scenarios where multithreading cannot be replaced?
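One scenario where threads may still earn a place next to coroutines is a blocking library that has no async counterpart. The sketch below assumes a hypothetical blocking function `legacy_lookup` (here it only calls `time.sleep`) and hands each call to a thread pool through `loop.run_in_executor`, so the event loop stays free for other coroutines. The pool size and thread name prefix are arbitrary choices for illustration.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def legacy_lookup(n):
    # Hypothetical blocking call, e.g. a client library with no async API.
    time.sleep(0.1)
    return n, threading.current_thread().name


async def main(numbers):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=3, thread_name_prefix="worker") as pool:
        # Each call runs in a pool thread; the coroutine only awaits the result.
        futures = [loop.run_in_executor(pool, legacy_lookup, n) for n in numbers]
        return await asyncio.gather(*futures)


if __name__ == "__main__":
    for n, thread_name in asyncio.run(main(range(6))):
        print("lookup({}) ran on {}".format(n, thread_name))
```

This is the same `run_in_executor` call used in the thread pool snippet earlier, applied to blocking work instead of to extra event loops.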

There are many async solutions; Twisted, Tornado, and others each have their own. This question is specifically about coroutine-based async with asyncio+aiohttp.


This article is an English version of an article originally published in Chinese on aliyun.com.
