Scrapy pipeline handles CPU-intensive or blocking operations


The Twisted reactor is well suited to short, non-blocking operations. But what if you need to perform complex or blocking work? Twisted provides a thread pool for running slow operations in threads other than the main (reactor) thread, via the reactor.callInThread() API. This way the reactor can keep running and reacting to events while the calculation is performed. Keep in mind that processing in the thread pool is not thread-safe: as soon as you use global state, you face all the classic synchronization problems of multithreaded programming. Here is a simple example:

import time

from twisted.internet import defer, reactor


class UsingBlocking(object):
    @defer.inlineCallbacks
    def process_item(self, item, spider):
        price = item['price'][0]

        # Create a Deferred that the worker thread will fire with the result
        out = defer.Deferred()
        reactor.callInThread(self._do_calculation, price, out)

        # Suspend here until the worker thread fires out's callback
        item['price'][0] = yield out

        defer.returnValue(item)

    def _do_calculation(self, price, out):
        # Runs in a thread-pool thread, so blocking here does not stall the reactor
        new_price = price + 1
        time.sleep(0.10)
        # Hand the result back to the reactor thread safely
        reactor.callFromThread(out.callback, new_price)

In the pipeline above, for each item we extract its price field and process it in the _do_calculation() method. That method uses time.sleep(), a blocking operation. We call reactor.callInThread() to run it in another thread; the first argument is the function to call, and any subsequent arguments are passed on to that function. Here we pass the price and a newly created Deferred object, out. When _do_calculation() finishes its calculation, it uses out's callback to return the value. Next, we yield the out Deferred, assign the resulting new value to price, and finally return the item.

In the _do_calculation() function we add one to the price and then sleep for 100 ms. That is actually a long time: if this function ran in the reactor thread, we could process no more than 10 pages per second. By running it in another thread, the problem goes away. The compute tasks are queued in the thread pool; as soon as a thread becomes available it picks up a task, executes it, and sleeps for 100 ms. The final step is to fire the out callback. Normally we would do that with out.callback(new_price), but since we are now in another thread it is not safe: if we did it anyway, the code attached to the Deferred, that is Scrapy's own functionality, would execute in that other thread and could corrupt data. Instead we call reactor.callFromThread(); again, it takes a function as its argument and passes any extra arguments on to the called function. That call is queued to run in the main (reactor) thread, which in turn unblocks the yield statement in process_item() and resumes Scrapy's processing of the item.
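As an aside, Twisted also ships a convenience wrapper, twisted.internet.threads.deferToThread(), which bundles the callInThread()/Deferred/callFromThread() dance above into a single call. Here is a minimal sketch of the same pipeline using it; the class and helper names (UsingDeferToThread, _compute_price) are invented for illustration:

import time

from twisted.internet import defer, threads


class UsingDeferToThread(object):
    @defer.inlineCallbacks
    def process_item(self, item, spider):
        price = item['price'][0]
        # deferToThread runs the blocking function in the reactor's thread pool
        # and returns a Deferred that fires with its return value.
        item['price'][0] = yield threads.deferToThread(self._compute_price, price)
        defer.returnValue(item)

    def _compute_price(self, price):
        # Blocking work is fine here; we are in a pool thread.
        time.sleep(0.10)
        return price + 1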

What if our pipeline has global state, for example counters or averages, that we need to use inside _do_calculation()? Say we have the following two variables, beta and delta:

class UsingBlocking(object):
    def __init__(self):
        self.beta, self.delta = 0, 0
        ...

    def _do_calculation(self, price, out):
        self.beta += 1
        time.sleep(0.001)
        self.delta += 1
        new_price = price + self.beta - self.delta + 1
        assert abs(new_price - price - 1) < 0.01

        time.sleep(0.10)
        ...

The code above has a problem and will raise an assertion error when run. The reason: if a thread switch happens between the self.beta += 1 and self.delta += 1 statements, another thread resumes execution and uses the current values of beta and delta to compute the price; that thread finds the two values in an inconsistent state (beta is larger than delta), so the assertion fails. The short sleep in the middle makes a thread switch more likely, but even without it a race condition would eventually occur. To prevent it we must use a lock, such as Python's threading.RLock(). With this re-entrant lock we can ensure that no two threads execute the lock-protected critical section at the same time:

import threading


class UsingBlocking(object):
    def __init__(self):
        ...
        self.lock = threading.RLock()
        ...

    def _do_calculation(self, price, out):
        with self.lock:
            self.beta += 1
            ...
            new_price = price + self.beta - self.delta + 1
        assert abs(new_price - price - 1) < 0.01
        ...

The code now works correctly. Note that we do not need to protect the whole method, only the lines that use the global state.
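To see the race in isolation, here is a small standalone sketch, independent of Scrapy and Twisted, using plain threading; the class name Counter is invented for illustration:

import threading
import time


class Counter(object):
    def __init__(self):
        self.beta, self.delta = 0, 0
        self.lock = threading.RLock()

    def update(self, price):
        with self.lock:            # remove this lock and the assert below can fail
            self.beta += 1
            time.sleep(0.001)      # widens the window for a thread switch
            self.delta += 1
            new_price = price + self.beta - self.delta + 1
        assert abs(new_price - price - 1) < 0.01
        return new_price


counter = Counter()
workers = [threading.Thread(target=counter.update, args=(100,)) for _ in range(20)]
for w in workers:
    w.start()
for w in workers:
    w.join()

With the lock in place, beta and delta are always equal inside the critical section, so new_price is exactly price + 1; without it, interleaved increments can make beta exceed delta at the moment the price is computed.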

Enable the pipeline by adding it to ITEM_PIPELINES in settings.py:

ITEM_PIPELINES = {
    ...
    'properties.pipelines.computation.UsingBlocking': 500,  # pipeline order value (0-1000)
}

If you run this, you will notice the latency increase slightly because of the 100 ms sleep, but the throughput remains unchanged at about 25 items per second.
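Throughput stays the same because the sleeps now overlap in the reactor's thread pool instead of blocking the event loop. If the blocking calls ever become the bottleneck, the pool can be enlarged with reactor.suggestThreadPoolSize(). A minimal sketch, assuming it is done when the spider opens; the pool size of 30 is an arbitrary example:

from twisted.internet import reactor


class UsingBlocking(object):
    def open_spider(self, spider):
        # Ask the reactor for a larger thread pool so more blocking
        # calculations can run concurrently (the default pool is small,
        # typically around 10 threads).
        reactor.suggestThreadPoolSize(30)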
