Synchronous and asynchronous examples in Python web crawlers
I. Synchronous and asynchronous
# Synchronous programming (only one thing can be done at a time; the next starts only after the current one is done)
<-a_url-><-b_url-><-c_url->
# Asynchronous programming
<-a_url->
  <-b_url->
    <-c_url->
      <-d_url->
        <-e_url->
          <-f_url->
            <-g_url->
              <-h_url->
                <-i_url->
                  <-j_url->
Template
import asyncio

# Function name: do it now, and meanwhile don't wait.
# Other tasks can continue without waiting for the current one.
async def donow_meantime_dontwait(url):
    # Placeholder: whatever is awaited must be awaitable (requests.get is not; see Tips)
    response = await requests.get(url)

# Function name: do your thing quickly and efficiently
async def fast_do_your_thing():
    # urls is the list of URLs to crawl
    await asyncio.gather(*[donow_meantime_dontwait(url) for url in urls])

# The following line is boilerplate; just remember it
asyncio.run(fast_do_your_thing())
Tips:
- The object in the await expression must be awaitable.
- requests is a blocking library and does not support non-blocking (awaitable) calls.
- aiohttp is a library for making asynchronous HTTP requests.
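To make the first tip concrete, here is a minimal sketch that uses only the standard library. The name fake_request and the 0.2 second delay are assumptions for illustration; asyncio.sleep stands in for a real network call because it is awaitable.

```python
import asyncio
import time


async def fake_request(name, delay):
    # asyncio.sleep is awaitable, so the event loop can run other tasks meanwhile.
    await asyncio.sleep(delay)
    return name


async def main():
    start = time.perf_counter()
    # gather runs the coroutines concurrently and returns results in argument order.
    results = await asyncio.gather(
        fake_request("a_url", 0.2),
        fake_request("b_url", 0.2),
        fake_request("c_url", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results, "in {:.2f} seconds".format(elapsed))
```

The three waits overlap, so the total should be close to 0.2 seconds rather than the 0.6 seconds a serial run would need.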
Code
import asyncio
import time

import aiohttp

# Replace with the URLs to crawl
urls = ['https://...', 'https://...', 'https://...']


async def requests_meantime_dont_wait(url):
    print(url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            print(resp.status)
            print("{url} response".format(url=url))


async def fast_requests(urls):
    start = time.time()
    # Run all requests concurrently and wait until every one has finished
    await asyncio.gather(*[requests_meantime_dont_wait(url) for url in urls])
    end = time.time()
    print("Complete in {} seconds".format(end - start))


asyncio.run(fast_requests(urls))
Gevent Introduction
gevent is a Python concurrency library that provides a clean API for a variety of concurrency and network-related tasks.
The greenlet is the main pattern used in gevent. It is a lightweight coroutine provided to Python as a C extension module. All greenlets run inside the operating system process of the main program, but they are scheduled cooperatively.
Monkey patch
The requests library is blocking. To turn synchronous requests into asynchronous ones, its blocking calls must first be made non-blocking.
In gevent, the monkey patch modifies most of the blocking system calls in the standard library so that they become cooperative. In this way the application's blocking calls become asynchronous without any change to the original code.
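The general idea of a monkey patch can be sketched with the standard library alone. This is not gevent's implementation; recording_sleep and slow_job are hypothetical names used only to show that replacing a module attribute at runtime changes the behaviour of code that was never edited.

```python
import time

calls = []


def slow_job():
    # Existing code: it calls time.sleep and is never edited.
    time.sleep(5)
    return "done"


def recording_sleep(seconds):
    # Hypothetical stand-in: records the request instead of blocking.
    calls.append(seconds)


original_sleep = time.sleep
time.sleep = recording_sleep        # apply the patch
try:
    result = slow_job()             # returns immediately, nothing blocks
finally:
    time.sleep = original_sleep     # always restore the original

print(result, calls)
```

gevent's monkey.patch_all() applies the same kind of replacement in bulk, swapping blocking standard-library functions for cooperative versions.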
Code
from gevent import monkey
monkey.patch_all()  # patch before importing requests so that its sockets become cooperative

import time

import gevent
import requests


def req(url):
    print(url)
    resp = requests.get(url)
    print(resp.status_code, url)


def synchronous_times(urls):
    """Run time of synchronous requests"""
    start = time.time()
    for url in urls:
        req(url)
    end = time.time()
    print('synchronous execution time {} s'.format(end - start))


def asynchronous_times(urls):
    """Run time of asynchronous requests"""
    start = time.time()
    gevent.joinall([gevent.spawn(req, url) for url in urls])
    end = time.time()
    print('asynchronous execution time {} s'.format(end - start))


# Replace with the URLs to crawl
urls = ['https://...', 'https://...', 'https://...']
synchronous_times(urls)
asynchronous_times(urls)
Gevent: asynchronous Theory and Practice
The core of gevent is the greenlet, a lightweight coroutine module for Python written in C. At any moment only one greenlet can be running.
When a greenlet encounters an I/O operation, such as a network access, gevent automatically switches to another greenlet. When the I/O operation completes, it switches back at a suitable moment and continues execution. Because I/O is very time-consuming, a program often spends most of its time waiting. With gevent the coroutines are switched automatically, so some greenlet is always running instead of waiting for I/O.
Serial and asynchronous
The core of high concurrency is to divide a large task into a batch of subtasks that the system schedules efficiently, either synchronously or asynchronously. Switching between two subtasks is the context switching that is so often mentioned.
Synchronous execution runs the subtasks serially. Asynchronous execution interleaves them, but at any point in time only one subtask is actually running. The subtasks are not truly parallel; they make full use of fragments of time that would otherwise be wasted on waiting. This is why asynchronous execution is efficient.
Context switching in gevent is done by yielding. In this example, two subtasks use each other's waiting time to do their own work. Here gevent.sleep(0) is used to yield control: the greenlet pauses for 0 seconds and lets the other one run.
import gevent


def foo():
    print('Running in foo')
    gevent.sleep(0)
    print('Explicit context switch to foo again')


def bar():
    print('Explicit context to bar')
    gevent.sleep(0)
    print('Implicit context switch back to bar')


gevent.joinall([
    gevent.spawn(foo),
    gevent.spawn(bar),
])
Running sequence:
Running in foo
Explicit context to bar
Explicit context switch to foo again
Implicit context switch back to bar
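The same interleaving can be imitated with plain Python generators, which may help show what "switching by yielding" means. This is only a sketch of the idea, not how gevent is implemented; run_all is a hypothetical round-robin scheduler, and each bare yield plays the role of gevent.sleep(0).

```python
from collections import deque

log = []


def foo():
    log.append("Running in foo")
    yield  # hand control back to the scheduler, like gevent.sleep(0)
    log.append("Explicit context switch to foo again")


def bar():
    log.append("Explicit context to bar")
    yield
    log.append("Implicit context switch back to bar")


def run_all(tasks):
    # Round-robin: resume each task in turn until all of them have finished.
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # this task is finished, do not reschedule it
        queue.append(task)


run_all([foo(), bar()])
print("\n".join(log))
```

The printed lines come out in the same order as the gevent running sequence above.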
Synchronous and asynchronous order
Synchronous execution is serial (1, 2, 3, 4, 5, 6, ...), whereas the order of asynchronous execution is unpredictable (it depends on how long each subtask takes).
Code
import random

import gevent


def task(pid):
    """Some non-deterministic task"""
    # Sleep for a few milliseconds, chosen at random
    gevent.sleep(random.randint(0, 2) * 0.001)
    print('Task %s done' % pid)


# Synchronous (the results come out in serial order)
def synchronous():
    for i in range(1, 10):
        task(i)


# Asynchronous (the results come out in a shuffled order)
def asynchronous():
    threads = [gevent.spawn(task, i) for i in range(10)]
    gevent.joinall(threads)


print('Synchronous:')
synchronous()
print('Asynchronous:')
asynchronous()
Output
Synchronous:
Task 1 done
Task 2 done
Task 3 done
Task 4 done
Task 5 done
Task 6 done
Task 7 done
Task 8 done
Task 9 done
Asynchronous:
Task 1 done
Task 5 done
Task 6 done
Task 2 done
Task 4 done
Task 7 done
Task 8 done
Task 9 done
Task 0 done
Task 3 done
In the synchronous case all tasks run in sequence, so the main program is blocked while each task executes (blocking suspends the execution of the main program).
gevent.spawn wraps each task in a greenlet and schedules it. gevent.joinall blocks the current program until all of the given greenlets have finished.
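To see that the completion order follows the time each subtask takes rather than the order of submission, here is a small deterministic sketch using asyncio instead of gevent. The delays are assumptions chosen for illustration, and asyncio.gather plays a role similar to gevent.joinall in that it waits until every task has finished.

```python
import asyncio


async def task(pid, delay, finished):
    await asyncio.sleep(delay)
    finished.append(pid)  # record the order in which tasks actually finish


async def main():
    finished = []
    delays = [0.15, 0.05, 0.10]  # task 0 is the slowest, task 1 the fastest
    await asyncio.gather(*(task(pid, d, finished) for pid, d in enumerate(delays)))
    return finished


finished = asyncio.run(main())
print(finished)
```

Tasks are submitted as 0, 1, 2 but finish in the order of their delays, shortest first.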
Practice
How to use gevent and extract the data returned by the asynchronous requests.
Enter "hello" in the search box of Youdao Dictionary and press Enter. Observe the requests that are sent and how the URL is constructed.
Analyze url rules
# Building the URL only requires passing in the word
url = "http://dict.youdao.com/w/eng/{}/".format(word)
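As a small sketch of the URL rule, the helper below wraps the format call in a function. The name build_url is hypothetical, and percent-encoding the word with urllib.parse.quote is an added precaution for words containing spaces or non-ASCII characters; whether the site expects exactly this encoding is an assumption.

```python
from urllib.parse import quote


def build_url(word):
    # quote() percent-encodes characters that are not safe in a URL path.
    return "http://dict.youdao.com/w/eng/{}/".format(quote(word))


print(build_url("hello"))
print(build_url("ice cream"))
```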
Parse webpage data
# Assumed request headers (a browser-like User-Agent); adjust as needed
headers = {'User-Agent': 'Mozilla/5.0'}


def fetch_word_info(word):
    url = "http://dict.youdao.com/w/eng/{}/".format(word)
    resp = requests.get(url, headers=headers)
    doc = pq(resp.text)

    # Collect the pronunciation text
    pros = ''
    for pro in doc.items('.baav .pronounce'):
        pros += pro.text()

    # Collect the definitions
    description = ''
    for li in doc.items('#phrsListTab .trans-container ul li'):
        description += li.text()

    return {'word': word, 'phonetic alphabet': pros, 'annotate': description}
The requests library is blocking: the next request can start only after the current one has completely finished. requests itself offers no official way to run asynchronously, so the monkey patch is used here.
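For comparison, a different technique from the monkey patch is to leave the blocking call untouched and run it in worker threads. The sketch below uses only the standard library; blocking_fetch is a hypothetical stand-in for a call such as requests.get, and the 0.2 second sleep simulates network latency.

```python
import asyncio
import time


def blocking_fetch(word):
    # Stand-in for a blocking call such as requests.get
    time.sleep(0.2)
    return {"word": word}


async def main(words):
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    # Each blocking call runs in the default thread pool, so the waits overlap.
    futures = [loop.run_in_executor(None, blocking_fetch, w) for w in words]
    results = await asyncio.gather(*futures)
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main(["good", "bad", "cool"]))
print([r["word"] for r in results], "in {:.2f} seconds".format(elapsed))
```

This uses operating system threads rather than greenlets, so it is heavier per task than gevent, but it needs no patching.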
Synchronous code
import gevent.monkey
gevent.monkey.patch_all()  # patch before importing requests

import time

import gevent
import requests
from pyquery import PyQuery as pq

words = ['good', 'bad', 'cool', 'hot', 'nice', 'better', 'head', 'up', 'lowdown', 'right', 'left', 'east']


def synchronous():
    start = time.time()
    print('synchronous run started')
    for word in words:
        # fetch_word_info is the function defined in "Parse webpage data"
        print(fetch_word_info(word))
    end = time.time()
    print("synchronous run time: %s seconds" % str(end - start))


# Run synchronously
synchronous()
Asynchronous code
import gevent.monkey
gevent.monkey.patch_all()  # patch before importing requests

import time

import gevent
import requests
from pyquery import PyQuery as pq

words = ['good', 'bad', 'cool', 'hot', 'nice', 'better', 'head', 'up', 'lowdown', 'right', 'left', 'east']


def asynchronous():
    start = time.time()
    print('asynchronous run started')
    # fetch_word_info is the function defined in "Parse webpage data"
    events = [gevent.spawn(fetch_word_info, word) for word in words]
    wordinfos = gevent.joinall(events)
    for wordinfo in wordinfos:
        # Use the greenlet's get() method to obtain the returned data
        print(wordinfo.get())
    end = time.time()
    print("asynchronous run time: %s seconds" % str(end - start))


# Run asynchronously
asynchronous()
Asynchronous access greatly improves crawling speed. Here we crawl information for 12 words, which means we hit the website 12 times almost at once. That is not a problem. But if you crawl more than 10,000 words with gevent and send all those requests within a few seconds, the website may well block the crawler.
Solution
Split the list into several sublists and crawl them in batches. For example, a list of the numbers 0-19 divided evenly into four parts gives sublists of five numbers each. The following are list-chunking solutions I found on Stack Overflow:
Method 1
seq = list(range(20))
size = 5  # length of each sublist
output = [seq[i:i + size] for i in range(0, len(seq), size)]
print(output)
Method 2
seq = list(range(20))
chunks = lambda seq, size: [seq[i:i + size] for i in range(0, len(seq), size)]
print(chunks(seq, 5))
Method 3
seq = list(range(20))


def chunks(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


print(chunks(seq, 5))  # prints a generator object, not the sublists
for x in chunks(seq, 5):
    print(x)
If the data volume is small, any of the methods will do. If it is large, method 3 is recommended, because the generator produces one sublist at a time instead of building them all in memory.
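To illustrate why a generator suits large inputs, here is a sketch of a lazy chunker that also accepts iterables without a length. The name iter_chunks is hypothetical; it is a variant of method 3 built on itertools.islice, not one of the Stack Overflow answers quoted above.

```python
from itertools import islice


def iter_chunks(iterable, size):
    # Pull at most `size` items at a time; nothing is read ahead of need.
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Only the first two chunks are ever built, even though the range is huge.
first_two = list(islice(iter_chunks(range(10**9), 5), 2))
print(first_two)

# A length that is not a multiple of the size leaves a shorter last chunk.
uneven = list(iter_chunks(range(7), 3))
print(uneven)
```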
Hands-on implementation
import gevent.monkey
gevent.monkey.patch_all()  # patch before importing requests

import time

import gevent
import requests
from pyquery import PyQuery as pq

words = ['good', 'bad', 'cool', 'hot', 'nice', 'better', 'head', 'up', 'lowdown', 'right', 'left', 'east']

# Assumed request headers (a browser-like User-Agent); adjust as needed
headers = {'User-Agent': 'Mozilla/5.0'}


def fetch_word_info(word):
    url = "http://dict.youdao.com/w/eng/{}/".format(word)
    resp = requests.get(url, headers=headers)
    doc = pq(resp.text)

    pros = ''
    for pro in doc.items('.baav .pronounce'):
        pros += pro.text()

    description = ''
    for li in doc.items('#phrsListTab .trans-container ul li'):
        description += li.text()

    return {'word': word, 'phonetic alphabet': pros, 'annotate': description}


def asynchronous(words):
    start = time.time()
    print('asynchronous run started')
    chunks = lambda seq, size: [seq[i:i + size] for i in range(0, len(seq), size)]
    # Crawl three words at a time, pausing between batches
    for subwords in chunks(words, 3):
        events = [gevent.spawn(fetch_word_info, word) for word in subwords]
        wordinfos = gevent.joinall(events)
        for wordinfo in wordinfos:
            # Use the greenlet's get() method to obtain the returned data
            print(wordinfo.get())
        time.sleep(1)
    end = time.time()
    print("asynchronous run time: %s seconds" % str(end - start))


asynchronous(words)
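Batching is one way to throttle a crawler. A related technique, shown here only as a sketch with asyncio and not part of the gevent code above, is to cap how many requests are in flight with a semaphore. fake_fetch is a hypothetical stand-in for a real request, and the stats dictionary exists only to show the cap at work.

```python
import asyncio


async def fake_fetch(word, limiter, stats):
    async with limiter:  # wait here if `limit` fetches are already running
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        stats["active"] -= 1
        return {"word": word}


async def crawl(words, limit):
    limiter = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fake_fetch(w, limiter, stats) for w in words))
    return results, stats["peak"]


words = ['good', 'bad', 'cool', 'hot', 'nice', 'better',
         'head', 'up', 'lowdown', 'right', 'left', 'east']
results, peak = asyncio.run(crawl(words, limit=3))
print(len(results), "results, peak concurrency", peak)
```

Unlike fixed batches, a new request starts as soon as any running one finishes, so a single slow request does not hold up the rest of its batch.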
Summary
That is all for this article. I hope it is useful as a reference for your study or work. If you have any questions, please leave a comment. Thank you for your support.