Concurrency Experience: 8 Ways to Download Images with Python


In this article, we use a web-crawler example to compare the execution efficiency of multithreading, multiprocessing, and coroutines.

Suppose we want to download images from the web. An easy way is to use requests + BeautifulSoup. (Note: all examples in this article use Python 3.5.)

Single Thread

Example 1: get_photos.py

import os
import time
import uuid

import requests
from bs4 import BeautifulSoup


def out_wrapper(func):  # simple decorator to record a program's execution time
    def inner_wrapper():
        start_time = time.time()
        func()
        stop_time = time.time()
        print('used time {}'.format(stop_time - start_time))
    return inner_wrapper


def save_flag(img, filename):  # save a picture
    os.makedirs('down_photos', exist_ok=True)  # make sure the target directory exists
    path = os.path.join('down_photos', filename)
    with open(path, 'wb') as fp:
        fp.write(img)


def download_one(url):  # download one picture
    image = requests.get(url)
    save_flag(image.content, str(uuid.uuid4()))


def user_conf():  # return the URLs of 30 images
    url = 'https://unsplash.com/'
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, "lxml")
    zzr = soup.find_all('img')
    ret = []
    num = 0
    for item in zzr:
        # keep the first 30 image srcs (add a suffix check here to filter for a specific format)
        if item.get("src") and num < 30:
            num += 1
            ret.append(item.get("src"))
    return ret


@out_wrapper
def download_many():
    zzr = user_conf()
    for item in zzr:
        download_one(item)


if __name__ == '__main__':
    download_many()

Example 1 downloads sequentially; the average time to download 30 images is around 60s (results vary with the experimental environment).

This code works, but it is not efficient. How can we improve it?

There are three common approaches: multiprocessing, multithreading, and coroutines. Let us go through them:

We all know that the GIL exists in Python (specifically CPython), but the GIL does not hurt IO-bound tasks, so multithreading is a good fit for them (you can open 100 or 1,000 threads, while the number of processes running simultaneously is limited by the number of CPU cores, so opening more is useless).
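To see the point concretely, here is a minimal sketch (not from the original examples) that simulates an IO wait with time.sleep: despite the GIL, ten threads finish in roughly the time of one task, because a thread releases the GIL while it waits.

import time
import threading


def fake_io_task():  # simulate an IO wait; real IO (sockets, disk) releases the GIL the same way
    time.sleep(1)


start = time.time()
threads = [threading.Thread(target=fake_io_task) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('10 IO-bound tasks with threads: {:.1f}s'.format(time.time() - start))  # ~1s, not ~10s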

However, this does not prevent us from experimenting with multiple processes.

Multi-process

Example 2

from multiprocessing import Process

from get_photos import out_wrapper, download_one, user_conf


@out_wrapper
def download_many():
    zzr = user_conf()
    task_list = []
    for item in zzr:
        t = Process(target=download_one, args=(item,))
        t.start()
        task_list.append(t)
    [t.join() for t in task_list]  # wait for all processes to finish (so the decorator records the full time)


if __name__ == '__main__':
    download_many()

This example reuses part of the code from Example 1; we only need to focus on the part that uses multiple processes.

The author tested this 3 times (on a dual-core hyper-threaded machine, i.e. at most 4 download tasks truly running in parallel); the outputs were 19.5s, 17.4s, and 18.6s. The speedup is not large, which suggests that multiprocessing is not well suited to IO-bound tasks.

There is another way to use multiple processes: ProcessPoolExecutor in the standard-library module concurrent.futures.

Example 3

from concurrent import futures

from get_photos import out_wrapper, download_one, user_conf


@out_wrapper
def download_many():
    zzr = user_conf()
    with futures.ProcessPoolExecutor(len(zzr)) as executor:
        res = executor.map(download_one, zzr)
    return len(list(res))


if __name__ == '__main__':
    download_many()

With ProcessPoolExecutor the code is much simpler, and executor.map is similar to the built-in map. The elapsed time is similar to Example 2. That is it for multiprocessing; next, let's try multithreading.
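As a quick illustration of that analogy (the square function here is just a stand-in, not part of the downloader), executor.map takes the same shape of call as the built-in map:

from concurrent import futures


def square(x):
    return x * x


if __name__ == '__main__':  # the guard is required for multiprocessing on Windows
    print(list(map(square, range(5))))  # built-in map: [0, 1, 4, 9, 16]
    with futures.ProcessPoolExecutor() as executor:
        print(list(executor.map(square, range(5))))  # same results, computed in worker processes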

Multithreading

Example 4

import threading

from get_photos import out_wrapper, download_one, user_conf


@out_wrapper
def download_many():
    zzr = user_conf()
    task_list = []
    for item in zzr:
        t = threading.Thread(target=download_one, args=(item,))
        t.start()
        task_list.append(t)
    [t.join() for t in task_list]


if __name__ == '__main__':
    download_many()

The syntax of threading is basically the same as multiprocessing, but the time drops to about 9s, roughly twice as fast as the multiprocessing versions.

Examples 5 and 6 below both use ThreadPoolExecutor from concurrent.futures: Example 5 uses map, while Example 6 uses submit together with as_completed.

Example 5

from concurrent import futures

from get_photos import out_wrapper, download_one, user_conf


@out_wrapper
def download_many():
    zzr = user_conf()
    with futures.ThreadPoolExecutor(len(zzr)) as executor:
        res = executor.map(download_one, zzr)
    return len(list(res))


if __name__ == '__main__':
    download_many()

Example 6

from concurrent import futures

from get_photos import out_wrapper, download_one, user_conf


@out_wrapper
def download_many():
    zzr = user_conf()
    with futures.ThreadPoolExecutor(len(zzr)) as executor:
        to_do = [executor.submit(download_one, item) for item in zzr]
        ret = [future.result() for future in futures.as_completed(to_do)]
    return ret


if __name__ == '__main__':
    download_many()

executor.map is easy to use because it resembles the built-in map, and it has a notable property: it returns results in the same order the calls were submitted. Often, however, it is preferable to get each result as soon as it is ready, regardless of submission order.

To do that, use executor.submit and futures.as_completed together.
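A minimal sketch of the difference (the delays are arbitrary): map yields results in submission order, while as_completed yields each future as it finishes:

import time
from concurrent import futures


def work(delay):
    time.sleep(delay)
    return delay


delays = [3, 1, 2]
with futures.ThreadPoolExecutor(3) as executor:
    print(list(executor.map(work, delays)))  # [3, 1, 2] -- submission order, blocks on the slowest
    to_do = [executor.submit(work, d) for d in delays]
    for future in futures.as_completed(to_do):
        print(future.result())  # 1, then 2, then 3 -- completion order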

Finally, let's look at gevent and asyncio.

Gevent

Example 7

from gevent import monkey
monkey.patch_all()

import gevent
from get_photos import out_wrapper, download_one, user_conf


@out_wrapper
def download_many():
    zzr = user_conf()
    jobs = [gevent.spawn(download_one, item) for item in zzr]
    gevent.joinall(jobs)


if __name__ == '__main__':
    download_many()

Asyncio

Example 8

import uuid
import asyncio

import aiohttp
from get_photos import out_wrapper, user_conf, save_flag


async def download_one(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            save_flag(await resp.read(), str(uuid.uuid4()))


@out_wrapper
def download_many():
    urls = user_conf()
    loop = asyncio.get_event_loop()
    to_do = [download_one(url) for url in urls]
    wait_coro = asyncio.wait(to_do)
    res, _ = loop.run_until_complete(wait_coro)
    loop.close()
    return len(res)


if __name__ == '__main__':
    download_many()

The elapsed time with coroutines is similar to multithreading, except that coroutines run in a single thread. The underlying principles are beyond the scope of this article.

A word about asyncio: it was added to the standard library in Python 3.4, and the async and await keywords were added in 3.5. You can probably pick up the multithreading and multiprocessing examples above with a glance at the code, but understanding asyncio takes considerably more time and effort.

In addition, writing programs with threads is hard because the scheduler can interrupt a thread at any time. Locks must be held to protect the critical sections of the program, preventing data from being left in an invalid state mid-update.
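For example, a classic read-modify-write counter needs a lock under threads; this little sketch (not from the original article) shows the pattern:

import threading

counter = 0
lock = threading.Lock()


def increment():
    global counter
    for _ in range(100000):
        with lock:  # without the lock, the read-modify-write below can interleave and lose updates
            counter += 1


threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; often less without it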

Coroutines, by contrast, are protected by default: we must explicitly yield to let the rest of the program run. With coroutines there is no need to hold locks to synchronize operations between threads; coroutines synchronize themselves, because only one coroutine runs at any given time. To surrender control, use yield or yield from (or await) to hand control back to the scheduler.
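Here is a minimal sketch of that hand-off, in the Python 3.5-era style used above: each coroutine runs uninterrupted until its await, where it explicitly gives control back to the event loop:

import asyncio


async def worker(name):
    for i in range(3):
        print('{} step {}'.format(name, i))  # runs uninterrupted until the await below
        await asyncio.sleep(0)  # explicitly yield control back to the event loop


loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(worker('a'), worker('b')))  # steps of a and b interleave
loop.close()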

Summary

This article introduced the basic usage of Python's concurrency-related modules. Concepts such as processes, threads, asyncio, blocking IO, non-blocking IO, synchronous IO, asynchronous IO, and event-driven programming were not covered. If you are interested, you can look them up (Google or Baidu), or leave a comment below and we can discuss them together.

Python Learning Exchange Group: 125240963

Unknown Little Demon

Reprinted from: 80681775
