There is a troublesome problem in WebMagic's multi-threaded crawling: when the scheduler has no URL to hand out, the spider can't exit immediately. It has to wait until the threads that are still crawling have finished and no new URL has been produced, and only then exit. Previously this was implemented with Thread.sleep: when no URL was available, sleep for a while and poll again, and exit once it was clear that no thread was still running.
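The sleep-based approach can be sketched roughly as follows. This is not WebMagic's pre-0.4.0 code: the class name SleepPollingSketch, the crawl method and its parameters are all made up for illustration, and the mocked workers never enqueue new URLs.

```java
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a sleep-and-poll crawl loop; not WebMagic's actual code.
class SleepPollingSketch {
    // Dispatches every queued URL, then sleeps and re-checks until all workers are idle.
    static int crawl(Queue<String> urls, int threads, long sleepMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger threadAlive = new AtomicInteger();
        AtomicInteger processed = new AtomicInteger();
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String url = urls.poll();
                if (url == null) {
                    if (threadAlive.get() == 0) {
                        break; // nothing queued and nothing running: exit
                    }
                    Thread.sleep(sleepMillis); // workers still busy: sleep, then look again
                } else {
                    threadAlive.incrementAndGet();
                    pool.execute(() -> {
                        try {
                            processed.incrementAndGet(); // stands in for download + parse
                        } finally {
                            threadAlive.decrementAndGet();
                        }
                    });
                }
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return processed.get();
    }
}
```

The cost of this design is the sleep itself: the loop can notice that the crawl is finished only after the current sleep interval has elapsed.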
But this approach was never elegant. Java offers a wait/notify mechanism for exactly this kind of synchronization problem, so WebMagic 0.4.0 replaced the previous Thread.sleep approach with it. The code is as follows:
    while (!Thread.currentThread().isInterrupted() && stat.get() == STAT_RUNNING) {
        Request request = scheduler.poll(this);
        if (request == null) {
            if (threadAlive.get() == 0 && exitWhenComplete) {
                break;
            }
            // wait until new url added
            waitNewUrl();
        } else {
            final Request requestFinal = request;
            threadAlive.incrementAndGet();
            executorService.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        processRequest(requestFinal);
                    } catch (Exception e) {
                        logger.error("download " + requestFinal + " error", e);
                    } finally {
                        threadAlive.decrementAndGet();
                        signalNewUrl();
                    }
                }
            });
        }
    }

    private void waitNewUrl() {
        try {
            newUrlLock.lock();
            try {
                newUrlCondition.await();
            } catch (InterruptedException e) {
            }
        } finally {
            newUrlLock.unlock();
        }
    }
Here, when a worker thread finishes, it calls signalNewUrl() to tell the main thread to stop waiting.
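The body of signalNewUrl() is not shown in this post. A minimal sketch of what such a wait/signal pair could look like with a ReentrantLock and a Condition is given here. The field names follow the code above, but the class NewUrlSignal, the signalNewUrl() body and the hasWaiter() helper are assumptions, and unlike the code above the sketch restores the interrupt flag instead of swallowing it.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: field names follow the article; the signalNewUrl() body is an assumption.
class NewUrlSignal {
    final ReentrantLock newUrlLock = new ReentrantLock();
    final Condition newUrlCondition = newUrlLock.newCondition();

    // Called by the main loop when the scheduler has nothing to hand out.
    void waitNewUrl() {
        newUrlLock.lock();
        try {
            newUrlCondition.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
        } finally {
            newUrlLock.unlock();
        }
    }

    // Called by a worker once it is done, after threadAlive.decrementAndGet().
    void signalNewUrl() {
        newUrlLock.lock();
        try {
            newUrlCondition.signalAll(); // wakes only threads that are already waiting
        } finally {
            newUrlLock.unlock();
        }
    }

    // Illustration helper: true if some thread is currently parked in waitNewUrl().
    boolean hasWaiter() {
        newUrlLock.lock();
        try {
            return newUrlLock.hasWaiters(newUrlCondition);
        } finally {
            newUrlLock.unlock();
        }
    }
}
```

Note that signalAll() has no memory: if no thread is waiting at the moment it is called, the signal simply has no effect.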
After 0.4.0 was released, a user asked me why the spider sometimes fails to exit. I began to suspect a thread-safety problem, but I couldn't reproduce it.
Thinking it over, I realized the following interleaving is possible:
- threadAlive > 0, so the main thread's check if (threadAlive.get() == 0 && exitWhenComplete) does not pass, and the main thread is about to enter waitNewUrl();
- at that moment the last worker thread finishes, and threadAlive.decrementAndGet(); and signalNewUrl(); run one after the other;
- only then does the main thread enter waitNewUrl(). The signal has already been sent and nobody is left to notify it, so the thread waits forever ...
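The interleaving above can be replayed by hand rather than left to timing. The sketch below is a hypothetical demo (the class LostWakeupDemo and the method signalWasLost() are not part of WebMagic): it performs the three steps in order and uses a timed await as a stand-in for waiting forever.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo class; it replays the interleaving step by step.
class LostWakeupDemo {
    // Returns true if the main thread's wait timed out, i.e. the signal was lost.
    static boolean signalWasLost() {
        ReentrantLock newUrlLock = new ReentrantLock();
        Condition newUrlCondition = newUrlLock.newCondition();
        AtomicInteger threadAlive = new AtomicInteger(1); // one worker still running

        // Step 1: the main thread checks outside the lock and decides to wait.
        if (threadAlive.get() == 0) {
            return false;
        }

        // Step 2: the last worker finishes and signals before the main thread waits.
        Thread worker = new Thread(() -> {
            threadAlive.decrementAndGet();
            newUrlLock.lock();
            try {
                newUrlCondition.signalAll(); // nobody is waiting yet, so nothing is woken
            } finally {
                newUrlLock.unlock();
            }
        });
        worker.start();
        try {
            worker.join();

            // Step 3: the main thread finally waits; the timeout stands in for "forever".
            newUrlLock.lock();
            try {
                boolean signalled = newUrlCondition.await(200, TimeUnit.MILLISECONDS);
                return !signalled;
            } finally {
                newUrlLock.unlock();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("signal lost: " + signalWasLost());
    }
}
```

Barring a spurious wakeup, the timed wait expires without being signalled and the method returns true. With an untimed await() the thread would stay parked.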
So it seems that adding a double-check inside the lock would fix it? But today I read this article, http://coolshell.cn/articles/4576.html, whose gist is: when something goes wrong, don't rely on guessing. Reproduce it and test!
So I decided to simulate it by hand: start 10 threads, mock every other part, and run the loop 10,000 times. I won't paste the code here; it is at https://github.com/code4craft/webmagic/blob/master/webmagic-core/src/test/java/us/codecraft/webmagic/SpiderTest.java. Sure enough, execution got stuck on the 13th run! jstack showed that it was indeed stuck at newUrlCondition.await();
Then I added the double-check:
    private void waitNewUrl() {
        try {
            newUrlLock.lock();
            //double check
            if (threadAlive.get() == 0 && exitWhenComplete) {
                return;
            }
            try {
                newUrlCondition.await();
            } catch (InterruptedException e) {
            }
        } finally {
            newUrlLock.unlock();
        }
    }
This time the test ran to completion! Problem solved.
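The reproduce-then-verify loop can be sketched as a small harness. This is not the SpiderTest from the repository: the class ExitRaceHarness, the doubleCheck flag, the countHangs method and the watchdog timeout are assumptions made for illustration, the download is mocked away entirely, and exitWhenComplete is treated as always true.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical harness, not the article's SpiderTest: every part of the crawl is mocked.
class ExitRaceHarness {
    final ReentrantLock newUrlLock = new ReentrantLock();
    final Condition newUrlCondition = newUrlLock.newCondition();
    final AtomicInteger threadAlive = new AtomicInteger();
    final Queue<Integer> scheduler = new ConcurrentLinkedQueue<>();
    final ExecutorService pool;
    final boolean doubleCheck;

    ExitRaceHarness(ExecutorService pool, boolean doubleCheck) {
        this.pool = pool;
        this.doubleCheck = doubleCheck;
    }

    // Same shape as the article's main loop, with exitWhenComplete fixed to true.
    void run() {
        while (!Thread.currentThread().isInterrupted()) {
            Integer request = scheduler.poll();
            if (request == null) {
                if (threadAlive.get() == 0) {
                    break;
                }
                waitNewUrl();
            } else {
                threadAlive.incrementAndGet();
                pool.execute(() -> {
                    threadAlive.decrementAndGet(); // mocked download: no work at all
                    signalNewUrl();
                });
            }
        }
    }

    void waitNewUrl() {
        newUrlLock.lock();
        try {
            if (doubleCheck && threadAlive.get() == 0) {
                return; // the last worker finished between the outer check and the lock
            }
            newUrlCondition.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // lets the watchdog end a stuck round
        } finally {
            newUrlLock.unlock();
        }
    }

    void signalNewUrl() {
        newUrlLock.lock();
        try {
            newUrlCondition.signalAll();
        } finally {
            newUrlLock.unlock();
        }
    }

    // Runs the loop `rounds` times and counts the rounds that miss `timeoutMillis`.
    static int countHangs(int rounds, int threads, boolean doubleCheck, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int hangs = 0;
        try {
            for (int i = 0; i < rounds; i++) {
                ExitRaceHarness spider = new ExitRaceHarness(pool, doubleCheck);
                for (int url = 0; url < threads; url++) {
                    spider.scheduler.add(url);
                }
                Thread main = new Thread(spider::run);
                main.start();
                main.join(timeoutMillis);
                if (main.isAlive()) {
                    hangs++;
                    main.interrupt();
                    main.join();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return hangs;
    }
}
```

A call such as ExitRaceHarness.countHangs(10000, 10, false, 2000) exercises the version without the double-check and may report a non-zero count, though how often the race fires depends on the machine. Passing true for doubleCheck runs the fixed version, which should report no hangs.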
This example also gave me a rough understanding of why wait/notify always has to take the lock first. Why exactly? I'll write an article to summarize it when I get the chance!
Simple, isn't it? In fact, this article only wants to make one point: when you hit a bug, don't rely on guessing! Reproduce it, and confirm that the fix really works!