I've been learning about asynchronous programming in Python. After reading some blog posts, I ran a few experiments comparing the efficiency of an asyncio + aiohttp crawler against asyncio + aiohttp + concurrent.futures (thread pool / process pool). Note: the crawler does almost no computation; to isolate the performance of the async part, every task is pure network IO, i.e., the program is done as soon as aiohttp finishes GETting the page.
The results showed that the former was more efficient than the latter. I asked another blogger (the blogger who provided the code never replied to me), and he said not to use concurrent.futures, because my tasks are all IO: spreading those IO tasks across a thread pool / process pool only adds switching overhead between threads/processes, which actually reduces the crawler's efficiency. I thought about it and that seems right.
So my question is: for the page-crawling stage, which is essentially just the requests.get part, multithreading is surely unnecessary because of the big pit that is the GIL; a process pool may be better, but its performance is still worse than an async crawler's, and it wastes more resources. In that case, can't we completely replace threads/processes in the crawl stage with the rising asyncio + aiohttp, and do the same for other IO tasks such as database/file reads and writes?
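For concreteness, here is a minimal sketch of what a fully asynchronous crawl stage could look like; the file write uses the third-party aiofiles library (my assumption for illustration, not something from the quoted blogs; an async DB driver such as aiomysql would fill the same role):

import asyncio
import aiohttp
import aiofiles  # third-party, assumed installed: pip install aiofiles

URL = 'http://httpbin.org/get?a={}'

async def crawl_and_save(a):
    # Network IO without blocking the event loop
    async with aiohttp.request('GET', URL.format(a)) as r:
        body = await r.text()
    # File IO without blocking the event loop either
    async with aiofiles.open('page_{}.txt'.format(a), 'w') as f:
        await f.write(body)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*(crawl_and_save(n) for n in range(12))))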
Of course, the data-processing stage should still use multiple processes, but I think multithreading has become completely useless: its original advantage over multiprocessing was for IO-bound tasks, and that advantage seems to have been entirely replaced by async. (The question is posed without considering 2.x compatibility.) A sketch of this division of labor follows.
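As a minimal sketch of that model (my own illustration, not from any of the quoted blogs): coroutines handle the network IO, while CPU-bound work is shipped to a process pool via run_in_executor; parse() here is a hypothetical stand-in for real data processing.

import asyncio
import aiohttp
from concurrent.futures import ProcessPoolExecutor

URL = 'http://httpbin.org/get?a={}'

def parse(html):
    # Hypothetical stand-in for real CPU-bound data processing
    return len(html)

async def fetch_and_process(a, loop, pool):
    # IO-bound half: non-blocking fetch on the event loop
    async with aiohttp.request('GET', URL.format(a)) as r:
        html = await r.text()
    # CPU-bound half: runs in another process, so the GIL does not serialize it
    return await loop.run_in_executor(pool, parse, html)

if __name__ == '__main__':  # guard required for multiprocessing on some platforms
    loop = asyncio.get_event_loop()
    with ProcessPoolExecutor() as pool:
        print(loop.run_until_complete(asyncio.gather(
            *(fetch_and_process(n, loop, pool) for n in range(12)))))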
Note: one additional question. Some blogs say the requests library "does not support asynchronous programming"; what does that mean, and why should one use aiohttp to fully realize the advantages of async? I have not read the requests source code, but some results do show that aiohttp performs better. Can you explain?
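My current understanding (my own reading, not from the quoted blogs): requests performs blocking socket IO and exposes no awaitable interface, so calling it inside a coroutine freezes the whole event loop until the response returns; aiohttp uses non-blocking sockets driven by the loop, so other coroutines keep running while a response is in flight. A minimal sketch of the contrast:

import asyncio
import requests

URL = 'http://httpbin.org/get?a={}'

async def fetch_blocking(a):
    # BAD: requests.get blocks the event loop; no other coroutine makes
    # progress until this returns, so "concurrent" fetches run sequentially.
    return requests.get(URL.format(a)).json()['args']['a']

async def fetch_in_thread(a, loop):
    # Workaround: push the blocking call into the default thread pool so the
    # loop stays free -- but then we are back to using threads anyway.
    r = await loop.run_in_executor(None, requests.get, URL.format(a))
    return r.json()['args']['a']

loop = asyncio.get_event_loop()
results = loop.run_until_complete(
    asyncio.gather(*(fetch_in_thread(n, loop) for n in range(4))))
print(results)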
Code
asyncio + aiohttp
import time
import asyncio
import aiohttp

# NUMBERS and URL match the requests example further down
NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'

async def fetch_async(a):
    async with aiohttp.request('GET', URL.format(a)) as r:
        data = await r.json()
        return data['args']['a']

start = time.time()
event_loop = asyncio.get_event_loop()
tasks = [fetch_async(num) for num in NUMBERS]
results = event_loop.run_until_complete(asyncio.gather(*tasks))
for num, result in zip(NUMBERS, results):
    print('fetch({}) = {}'.format(num, result))
print('Use asyncio+aiohttp cost: {}'.format(time.time() - start))  # as in the other two benchmarks
asyncio + aiohttp + thread pool (each worker thread running its own event loop over a chunk of the URLs) is about 1 second slower:
import time
import math
import asyncio
import aiohttp
from concurrent.futures import ThreadPoolExecutor

NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'

async def fetch_async(a):
    async with aiohttp.request('GET', URL.format(a)) as r:
        data = await r.json()
        return a, data['args']['a']

def sub_loop(numbers):
    # Each worker thread gets its own event loop for its chunk of URLs
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    tasks = [fetch_async(num) for num in numbers]
    results = loop.run_until_complete(asyncio.gather(*tasks))
    for num, result in results:
        print('fetch({}) = {}'.format(num, result))

async def run(executor, numbers):
    await asyncio.get_event_loop().run_in_executor(executor, sub_loop, numbers)

def chunks(l, size):
    n = math.ceil(len(l) / size)
    for i in range(0, len(l), n):
        yield l[i:i + n]

start = time.time()
executor = ThreadPoolExecutor(3)  # executor and start were implied by the original snippet
event_loop = asyncio.get_event_loop()
tasks = [run(executor, chunked) for chunked in chunks(NUMBERS, 3)]
results = event_loop.run_until_complete(asyncio.gather(*tasks))
print('Use asyncio+aiohttp+ThreadPoolExecutor cost: {}'.format(time.time() - start))
The traditional requests + ThreadPoolExecutor is about 3 times slower than the above (unsurprising: with max_workers=3 and 12 URLs, each worker thread blocks through 4 sequential requests):
import time
import requests
from concurrent.futures import ThreadPoolExecutor

NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'

def fetch(a):
    r = requests.get(URL.format(a))
    return r.json()['args']['a']

start = time.time()
with ThreadPoolExecutor(max_workers=3) as executor:
    for num, result in zip(NUMBERS, executor.map(fetch, NUMBERS)):
        print('fetch({}) = {}'.format(num, result))
print('Use requests+ThreadPoolExecutor cost: {}'.format(time.time() - start))
Addendum
The above question assumes CPython. Answers along the lines of "I just prefer multithreading and dislike the coroutine style" are obviously off topic. My main points are:
If Python can't get rid of the GIL, I think the ideal model going forward is multi-process + coroutines (asyncio + aiohttp). uvloop, Sanic, and the crawler project in 500 Lines or Less have already started moving this way. Leaving compatibility aside, is this view correct, or are there still scenarios where multithreading cannot be replaced?
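For reference, switching to uvloop (as Sanic does) is a change like the following sketch, assuming uvloop is installed (pip install uvloop; it is Unix-only):

import asyncio
import uvloop  # third-party, drop-in replacement for asyncio's default loop

# All subsequently created event loops are uvloop ones; asyncio code is unchanged
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
loop = asyncio.get_event_loop()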
Async comes in many flavors; Twisted, Tornado, etc. each have their own solutions, but this question is specifically about coroutine-style async built on asyncio + aiohttp.
One more question I would like to ask fellow netizens:
Now that Python has asyncio and aiohttp, are multithreading/multiprocessing still necessary for IO-bound tasks like crawling?