Examples of Synchronous and Asynchronous Python Web Crawlers



I. Synchronous and Asynchronous

# Synchronous programming (only one request at a time; the next starts only after the current one finishes)
<-a_url-> <-b_url-> <-c_url->

# Asynchronous programming (many requests are in flight at once, so far more finish in the same time)
<-a_url-> <-b_url-> <-c_url-> <-d_url-> <-e_url-> <-f_url-> <-g_url-> <-h_url-> <-i_url-> <-j_url->

Template

import asyncio

# Coroutine: keep doing other work instead of waiting on the current task
async def donow_meantime_dontwait(url):
    response = await requests.get(url)

# Coroutine: run all the jobs quickly and efficiently
async def fast_do_your_thing():
    await asyncio.wait([donow_meantime_dontwait(url) for url in urls])

# The following two lines are standard boilerplate; remember them
loop = asyncio.get_event_loop()
loop.run_until_complete(fast_do_your_thing())

Tips:

  • The object in an await expression must be awaitable.
  • requests does not support non-blocking requests.
  • aiohttp is the library to use for asynchronous requests.

Code

import asyncio
import time
import aiohttp

urls = ['https://...', 'https://...', 'https://...']  # fill in the target URLs

async def requests_meantime_dont_wait(url):
    print(url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            print(resp.status)
            print("{url} response".format(url=url))

async def fast_requests(urls):
    start = time.time()
    # On Python 3.11+, wrap the coroutines in asyncio.create_task() before passing them to wait()
    await asyncio.wait([requests_meantime_dont_wait(url) for url in urls])
    end = time.time()
    print("Complete in {} seconds".format(end - start))

loop = asyncio.get_event_loop()
loop.run_until_complete(fast_requests(urls))

Gevent Introduction

Gevent is a Python concurrency library that provides a clean API for a variety of concurrency and network-related tasks.

Greenlet is the main mechanism used in gevent. It is a lightweight coroutine provided to Python as a C extension module. All greenlets run inside the operating-system process of the main program, and they are scheduled cooperatively.
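
As a rough illustration of that cooperative scheduling, here is a minimal sketch using the raw greenlet API that gevent builds on; the ping/pong functions and the g1/g2 names are made up for this example.

from greenlet import greenlet

def ping():
    print('ping')
    g2.switch()      # hand control over to the other greenlet
    print('ping again')

def pong():
    print('pong')
    g1.switch()      # hand control back

g1 = greenlet(ping)
g2 = greenlet(pong)
g1.switch()          # prints: ping, pong, ping again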

Monkey patch

The requests library is blocking. To make requests work asynchronously, its blocking calls must be turned into non-blocking ones; only then can asynchronous operation be achieved.

In the gevent library, the monkey patch lets gevent modify most of the blocking system calls in the standard library. In this way an application's blocking calls become cooperative (asynchronous) ones without changing the original code.

Code

from gevent import monkey
monkey.patch_all()  # patch the standard library before requests is imported

import gevent
import requests
import time

def req(url):
    print(url)
    resp = requests.get(url)
    print(resp.status_code, url)

def synchronous_times(urls):
    """Run time of synchronous requests"""
    start = time.time()
    for url in urls:
        req(url)
    end = time.time()
    print('Synchronous execution time {} s'.format(end - start))

def asynchronous_times(urls):
    """Run time of asynchronous requests"""
    start = time.time()
    gevent.joinall([gevent.spawn(req, url) for url in urls])
    end = time.time()
    print('Asynchronous execution time {} s'.format(end - start))

urls = ['https://...', 'https://...', 'https://...']  # fill in the target URLs
synchronous_times(urls)
asynchronous_times(urls)

Gevent: Asynchronous Theory and Practice

The core of the gevent library is greenlet, a lightweight coroutine module for Python written in C. At any given moment, only one greenlet is allowed to run.

When a greenlet encounters an I/O operation, such as accessing the network, it automatically switches to another greenlet, and when the I/O operation completes it switches back at an appropriate time to continue execution. Because I/O operations are very time-consuming, a program often spends most of its time waiting. With gevent, the coroutines are switched automatically, so some greenlet is always running instead of sitting idle waiting for I/O.
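
As a small sketch of that behaviour, the snippet below uses gevent.sleep as a stand-in for a network wait: while one greenlet is blocked on "I/O", gevent switches to the other, so the two one-second waits overlap instead of adding up. The task names and durations are invented for the example.

import gevent

def download(name, seconds):
    print(name, 'waiting on I/O')
    gevent.sleep(seconds)   # gevent switches to another greenlet while this one waits
    print(name, 'done')

gevent.joinall([
    gevent.spawn(download, 'task-a', 1),
    gevent.spawn(download, 'task-b', 1),
])                          # finishes in about 1 second, not 2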

Serial and asynchronous

The core of high concurrency is dividing a large task into a batch of subtasks that the system can schedule efficiently, either synchronously or asynchronously. Switching between two subtasks is the "context switching" that is often mentioned.

Synchronization runs the subtasks serially, one after another, while asynchronization interleaves them; at any point in time only one subtask is actually executing. The subtasks are not truly parallel, but they make full use of the fragments of time that would otherwise be wasted waiting, which is what makes asynchronous execution efficient.

Context switching in gevent is implemented through yield. In the following example we have two subtasks that use each other's waiting time to do their own work. Here gevent.sleep(0) makes the greenlet pause for 0 seconds, which forces a switch to the other greenlet.

import gevent

def foo():
    print('Running in foo')
    gevent.sleep(0)
    print('Explicit context switch to foo again')

def bar():
    print('Explicit context to bar')
    gevent.sleep(0)
    print('Implicit context switch back to bar')

gevent.joinall([
    gevent.spawn(foo),
    gevent.spawn(bar),
])

Running sequence:

Running in foo
Explicit context to bar
Explicit context switch to foo again
Implicit context switch back to bar

Synchronous and asynchronous order

Synchronous execution is serial (1, 2, 3, 4, 5, 6, ...), while the asynchronous completion order is unpredictable (it depends on how long each subtask takes).

Code

import gevent
import random

def task(pid):
    """Some non-deterministic task"""
    gevent.sleep(random.randint(0, 2) * 0.001)
    print('Task %s done' % pid)

# Synchronous (the results look serial)
def synchronous():
    for i in range(1, 10):
        task(i)

# Asynchronous (the results come back out of order)
def asynchronous():
    threads = [gevent.spawn(task, i) for i in range(10)]
    gevent.joinall(threads)

print('Synchronous:')
synchronous()
print('Asynchronous:')
asynchronous()

Output

Synchronous:
Task 1 done
Task 2 done
Task 3 done
Task 4 done
Task 5 done
Task 6 done
Task 7 done
Task 8 done
Task 9 done
Asynchronous:
Task 1 done
Task 5 done
Task 6 done
Task 2 done
Task 4 done
Task 7 done
Task 8 done
Task 9 done
Task 0 done
Task 3 done

In the synchronous case all tasks are executed in sequence, which blocks the main program (blocking suspends the execution of the main program).

gevent.spawn schedules the given tasks (the set of subtasks), and gevent.joinall blocks the current program until all greenlets have finished; only then does the program move on.
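
A minimal sketch of that pattern, with a made-up square task standing in for a real subtask: gevent.joinall blocks until every spawned greenlet has finished, after which each greenlet's result can be read from its value attribute (or with get, as the later examples do).

import gevent

def square(n):
    gevent.sleep(0)                    # pretend to wait on I/O
    return n * n

jobs = [gevent.spawn(square, n) for n in range(5)]
gevent.joinall(jobs)                   # blocks until all greenlets complete
print([job.value for job in jobs])     # [0, 1, 4, 9, 16]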

Practice

How to use gevent to collect the data returned by asynchronous requests.

Enter "hello" in the search box of youdao dictionary and press Enter. Observe the data request situation and observe the url construction.

Analyze the URL pattern

# Constructing the URL only requires passing in the word
url = "http://dict.youdao.com/w/eng/{}/".format(word)

Parse webpage data

def fetch_word_info(word):
    # headers (e.g. a User-Agent dict) is assumed to be defined beforehand
    url = "http://dict.youdao.com/w/eng/{}/".format(word)
    resp = requests.get(url, headers=headers)
    doc = pq(resp.text)
    pros = ''
    for pro in doc.items('.baav .pronounce'):
        pros += pro.text()
    description = ''
    for li in doc.items('#phrsListTab .trans-container ul li'):
        description += li.text()
    return {'word': word, 'phonetic': pros, 'annotation': description}

This is needed because the requests library only allows the next request after the current one has completely finished, and it cannot be made asynchronous through official means, so the monkey patch is used here.
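
The usual pattern, shown as a minimal sketch, is to apply the patch as early as possible, before requests (and the socket/ssl modules it relies on) is imported:

import gevent.monkey
gevent.monkey.patch_all()   # must run before the blocking libraries are imported

import requests             # from here on, requests' socket calls cooperate with gevent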

Synchronous code

import gevent.monkey
gevent.monkey.patch_all()  # patch before requests is used

import requests
from pyquery import PyQuery as pq
import gevent
import time

words = ['good', 'bad', 'cool', 'hot', 'nice', 'better',
         'head', 'up', 'lowdown', 'right', 'left', 'east']

def synchronous():
    start = time.time()
    print('Synchronous started')
    for word in words:
        print(fetch_word_info(word))
    end = time.time()
    print("Synchronous run time: %s seconds" % str(end - start))

# Run the synchronous version
synchronous()

Asynchronous code

import gevent.monkey
gevent.monkey.patch_all()  # patch before requests is used

import requests
from pyquery import PyQuery as pq
import gevent
import time

words = ['good', 'bad', 'cool', 'hot', 'nice', 'better',
         'head', 'up', 'lowdown', 'right', 'left', 'east']

def asynchronous():
    start = time.time()
    print('Asynchronous started')
    events = [gevent.spawn(fetch_word_info, word) for word in words]
    wordinfos = gevent.joinall(events)
    for wordinfo in wordinfos:
        # read each greenlet's result with the get method
        print(wordinfo.get())
    end = time.time()
    print("Asynchronous run time: %s seconds" % str(end - start))

# Run the asynchronous version
asynchronous()

Asynchronous access greatly improves the speed of real-world crawling. Here we crawl 12 words, which means we hit the website 12 times almost at once, and that is not a problem. But if you crawl more than 10,000 words with gevent, you would fire off thousands of requests within a few seconds, and the website may well block the crawler.

Solution

Split the list into several sub-lists and crawl them in batches. For example, a list of the numbers 0-19 can be evenly divided into four parts, so each sub-list holds five numbers. The following list-chunking solutions are taken from Stack Overflow:

Method 1

sequence = list(range(20))
size = 5  # length of each sub-list
output = [sequence[i:i + size] for i in range(0, len(sequence), size)]
print(output)

Method 2

seq = list(range(20))
chunks = lambda seq, size: [seq[i:i + size] for i in range(0, len(seq), size)]
print(chunks(seq, 5))

Method 3

seq = list(range(20))

def chunks(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

for x in chunks(seq, 5):
    print(x)

If the data volume is small, any of these methods will do. If it is large, method 3 is recommended, because the generator yields one sub-list at a time instead of building them all in memory.

Hands-on implementation

import gevent.monkey
gevent.monkey.patch_all()  # patch before requests is used

import requests
from pyquery import PyQuery as pq
import gevent
import time

words = ['good', 'bad', 'cool', 'hot', 'nice', 'better',
         'head', 'up', 'lowdown', 'right', 'left', 'east']

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed request headers (not defined in the original)

def fetch_word_info(word):
    url = "http://dict.youdao.com/w/eng/{}/".format(word)
    resp = requests.get(url, headers=headers)
    doc = pq(resp.text)
    pros = ''
    for pro in doc.items('.baav .pronounce'):
        pros += pro.text()
    description = ''
    for li in doc.items('#phrsListTab .trans-container ul li'):
        description += li.text()
    return {'word': word, 'phonetic': pros, 'annotation': description}

def asynchronous(words):
    start = time.time()
    print('Asynchronous started')
    chunks = lambda seq, size: [seq[i:i + size] for i in range(0, len(seq), size)]
    for subwords in chunks(words, 3):
        events = [gevent.spawn(fetch_word_info, word) for word in subwords]
        wordinfos = gevent.joinall(events)
        for wordinfo in wordinfos:
            # read each greenlet's result with the get method
            print(wordinfo.get())
        time.sleep(1)  # pause between batches to avoid hammering the site
    end = time.time()
    print("Asynchronous run time: %s seconds" % str(end - start))

asynchronous(words)

Summary

That is all for this article. I hope it offers some useful reference for your study or work. If you have any questions, please leave a message; thank you for your support.
