Scrapy pipeline handles CPU-intensive or blocking operations


The Twisted reactor is well suited to short, non-blocking operations. But what if you need to perform complex or blocking work? Twisted provides a thread pool for running slow operations in threads other than the main (reactor) thread, via the reactor.callInThread() API. This way the reactor can keep running and reacting to events while the calculation is performed. Keep in mind that processing in the thread pool is not thread-safe: as soon as you use global state, you face all the classic synchronization problems of multithreaded programming. Here is a simple example:

import time

from twisted.internet import defer, reactor


class UsingBlocking(object):
    @defer.inlineCallbacks
    def process_item(self, item, spider):
        price = item['price'][0]

        # Create a Deferred that the worker thread will fire with the result
        out = defer.Deferred()
        reactor.callInThread(self._do_calculation, price, out)

        # Suspend here until the worker thread fires out's callback
        item['price'][0] = yield out

        defer.returnValue(item)

    def _do_calculation(self, price, out):
        # Runs in a thread-pool thread, so blocking here does not stall the reactor
        new_price = price + 1
        time.sleep(0.10)
        # Hand the result back to the reactor thread safely
        reactor.callFromThread(out.callback, new_price)

In the pipeline above, for each item we extract its price field and process it in the _do_calculation() method. That method uses time.sleep(), a blocking operation. We call reactor.callInThread() to run it in another thread; the first argument is the function to call, and any subsequent arguments are passed on to that function. Here we pass the price and a newly created Deferred object, out. When _do_calculation() finishes its calculation, it uses out's callback to return the value. Next, we yield the out Deferred, assign the resulting new value to price, and finally return the item.

In the _do_calculation() function we add one to the price and then sleep for 100 ms. That is actually a long time: if this function ran in the reactor thread, we could process no more than 10 pages per second. By running it in another thread, the problem goes away. The compute tasks are queued in the thread pool; as soon as a thread becomes available it picks up a task, executes it, and sleeps for 100 ms. The final step is to fire the out callback. Normally we would do that with out.callback(new_price), but since we are now in another thread it is not safe: if we did it anyway, the code attached to the Deferred, that is Scrapy's own functionality, would execute in that other thread and could corrupt data. Instead we call reactor.callFromThread(); again, it takes a function as its argument and passes any extra arguments on to the called function. That call is queued to run in the main (reactor) thread, which in turn unblocks the yield statement in process_item() and resumes Scrapy's processing of the item.
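As an aside, Twisted also ships a convenience wrapper, twisted.internet.threads.deferToThread(), which bundles the callInThread()/Deferred/callFromThread() dance above into a single call. Here is a minimal sketch of the same pipeline using it; the class and helper names (UsingDeferToThread, _compute_price) are invented for illustration:

import time

from twisted.internet import defer, threads


class UsingDeferToThread(object):
    @defer.inlineCallbacks
    def process_item(self, item, spider):
        price = item['price'][0]
        # deferToThread runs the blocking function in the reactor's thread pool
        # and returns a Deferred that fires with its return value.
        item['price'][0] = yield threads.deferToThread(self._compute_price, price)
        defer.returnValue(item)

    def _compute_price(self, price):
        # Blocking work is fine here; we are in a pool thread.
        time.sleep(0.10)
        return price + 1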

What if our pipeline has global state, for example counters or averages, that we need to use inside _do_calculation()? Say we have the following two variables, beta and delta:

class UsingBlocking(object):
    def __init__(self):
        self.beta, self.delta = 0, 0
        ...

    def _do_calculation(self, price, out):
        self.beta += 1
        time.sleep(0.001)
        self.delta += 1
        new_price = price + self.beta - self.delta + 1
        assert abs(new_price - price - 1) < 0.01

        time.sleep(0.10)
        ...

The code above has a problem and will raise an assertion error when run. The reason: if a thread switch happens between the self.beta += 1 and self.delta += 1 statements, another thread resumes execution and uses the current values of beta and delta to compute the price; that thread finds the two values in an inconsistent state (beta is larger than delta), so the assertion fails. The short sleep in the middle makes a thread switch more likely, but even without it a race condition would eventually occur. To prevent it we must use a lock, such as Python's threading.RLock(). With this re-entrant lock we can ensure that no two threads execute the lock-protected critical section at the same time:

import threading


class UsingBlocking(object):
    def __init__(self):
        ...
        self.lock = threading.RLock()
        ...

    def _do_calculation(self, price, out):
        with self.lock:
            self.beta += 1
            ...
            new_price = price + self.beta - self.delta + 1
        assert abs(new_price - price - 1) < 0.01
        ...

The code now works correctly. Note that we do not need to protect the whole method, only the lines that use the global state.
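To see the race in isolation, here is a small standalone sketch, independent of Scrapy and Twisted, using plain threading; the class name Counter is invented for illustration:

import threading
import time


class Counter(object):
    def __init__(self):
        self.beta, self.delta = 0, 0
        self.lock = threading.RLock()

    def update(self, price):
        with self.lock:            # remove this lock and the assert below can fail
            self.beta += 1
            time.sleep(0.001)      # widens the window for a thread switch
            self.delta += 1
            new_price = price + self.beta - self.delta + 1
        assert abs(new_price - price - 1) < 0.01
        return new_price


counter = Counter()
workers = [threading.Thread(target=counter.update, args=(100,)) for _ in range(20)]
for w in workers:
    w.start()
for w in workers:
    w.join()

With the lock in place, beta and delta are always equal inside the critical section, so new_price is exactly price + 1; without it, interleaved increments can make beta exceed delta at the moment the price is computed.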

Enable the pipeline by adding it to ITEM_PIPELINES in settings.py:

ITEM_PIPELINES = {
    ...
    'properties.pipelines.computation.UsingBlocking': 500,  # pipeline order value (0-1000)
}

If you run this, you will notice the latency increase slightly because of the 100 ms sleep, but the throughput remains unchanged at about 25 items per second.
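Throughput stays the same because the sleeps now overlap in the reactor's thread pool instead of blocking the event loop. If the blocking calls ever become the bottleneck, the pool can be enlarged with reactor.suggestThreadPoolSize(). A minimal sketch, assuming it is done when the spider opens; the pool size of 30 is an arbitrary example:

from twisted.internet import reactor


class UsingBlocking(object):
    def open_spider(self, spider):
        # Ask the reactor for a larger thread pool so more blocking
        # calculations can run concurrently (the default pool is small,
        # typically around 10 threads).
        reactor.suggestThreadPoolSize(30)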
