Some things to watch out for when using Elastic-Job (Lite)

Source: Internet
Author: User
Tags: failover, node, server

A while ago our project started using the open-source Elastic-Job, and I ran into a few questions along the way. None of them stopped me from writing code, but as a programmer committed to laying every brick properly, I don't think problems should be dodged. Face the difficulties head-on and keep digging until they are resolved. That is how you come to understand how things really work, and how you lay the next brick better.

--------------------------------------------------------------------------------------------------------------- ---------------------------------------------------

1. The official documentation says: "The same server can only run one instance of the same job, because the job is registered and managed according to IP." So if two instances of the job are deployed on the same machine, will sharding still work correctly?

The test results for this question are below, using the shard parameters shardingItemParameters=0=a,1=b,2=c,3=d,4=e,5=f,6=g,7=h,8=i,9=j as an example:

A: Same machine, two Tomcat instances running on different ports: sharding does not happen. The Tomcat port is not registered in ZK, so the program cannot tell the two Tomcats apart and no sharding is triggered. If only this one machine is deployed, both Tomcats receive the full set of shard parameters: 0=a,1=b,2=c,3=d,4=e,5=f,6=g,7=h,8=i,9=j

B: Same machine, two Docker containers (or any other container that provides its own IP), each running the program in Tomcat: sharding works. The two containers' different IPs are both registered in ZK, which triggers sharding.

The test screenshots are as follows:

Two images: testdocker:v1 and testdocker:v2

Console logs of the two Docker containers

On this point the official note is clear: jobs are registered and managed by IP, and a single IP can run only one job with a given name. Different servers are distinguished in ZK by IP. The information the program registers in ZK is as follows:


As you can see, ZK stores only the server IP and nothing like a container port, which is why the two Tomcats in case A cannot be told apart.
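The two cases can be modeled in a few lines of Python. This is only a sketch of the behavior observed above, not Elastic-Job's own code: the function names, the IP addresses and ports, and the contiguous near-even split are all assumptions made for illustration.

```python
def parse_item_parameters(spec):
    """Turn '0=a,1=b' into {0: 'a', 1: 'b'}."""
    params = {}
    for pair in spec.split(","):
        item, value = pair.split("=", 1)
        params[int(item)] = value
    return params


def registered_servers(instances):
    """Collapse (ip, port) instances into the sorted set of distinct IPs.

    The port is dropped on purpose: the registry is assumed to be keyed by IP only.
    """
    return sorted({ip for ip, _port in instances})


def split_items(servers, item_count):
    """Assumed strategy: contiguous, near-even blocks of shard items per server."""
    base, extra = divmod(item_count, len(servers))
    assignment, start = {}, 0
    for index, server in enumerate(servers):
        size = base + (1 if index < extra else 0)
        assignment[server] = list(range(start, start + size))
        start += size
    return assignment


params = parse_item_parameters("0=a,1=b,2=c,3=d,4=e,5=f,6=g,7=h,8=i,9=j")

# Case A: two Tomcats on one host (hypothetical IP and ports) register as one server.
case_a = split_items(
    registered_servers([("192.168.0.10", 8080), ("192.168.0.10", 8081)]), len(params))

# Case B: two containers with their own (hypothetical) IPs register as two servers.
case_b = split_items(
    registered_servers([("172.17.0.2", 8080), ("172.17.0.3", 8080)]), len(params))
```

In case A there is a single registry entry holding all ten items, and both Tomcats read that same entry, so each of them processes every shard. In case B the items are divided between the two IPs.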

2. In the operations interface, does the "Pause" button actually do anything? The official documentation said that the operations program is only for monitoring and cannot start or stop jobs, so what is the button for? (That part of the documentation has since been removed from the official website pending revision.)

This was probably a version mismatch: the documentation had not been updated. The new version of the operations interface has 4 buttons, two more than before. Pause, recover and so on mean what they say and need no explanation. The local test results for this question are as follows:


As can be seen, the blue server was paused for 19 seconds, from 2016-07-23 15:19:50 to 15:20:09, and during that time only the white server was running. The white server is not affected by the blue server's pause, and the shard parameters are still calculated as if the blue server were running normally.

After clicking "Recover" in the operations interface, the blue server starts to run again.

Note that if you pause the primary node server (the primary node is marked with a checkmark), the job is paused on all servers. The operations interface shows a prompt explaining this.
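The pause rules observed above can be summarized in a small Python model. It is a sketch of the test results only; the function name and the server names are hypothetical, and which of the two servers was the primary in the test is an assumption here.

```python
def next_trigger(servers, primary, paused):
    """Return (servers counted for sharding, servers that actually execute)."""
    # Observed: a paused server still counts when shards are calculated.
    sharded_over = list(servers)
    if primary in paused:
        # Observed: pausing the primary node pauses the job on every server.
        return sharded_over, []
    executing = [server for server in servers if server not in paused]
    return sharded_over, executing


servers = ["white", "blue"]
pause_blue = next_trigger(servers, primary="white", paused={"blue"})
pause_primary = next_trigger(servers, primary="white", paused={"white"})
```

With the blue server paused, its shards are still assigned to it but are not executed until it is recovered, which matches the 19-second gap in the log.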

3. With sharding, if the servers take different amounts of time to execute the job, will they wait for each other?

The test result: with two servers, A and B, both triggered every 5 seconds, A executes very quickly (effectively 0 s) and B executes very slowly (8 seconds). A does not wait for B, as the verification screenshot showed.
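The same effect can be reproduced with two plain Python threads that share no barrier. This is only an analogy for two independent servers, with the durations scaled down (0 s and 0.5 s stand in for 0 s and 8 s); none of the names come from Elastic-Job.

```python
import threading
import time


def run_once(durations):
    """Run one trigger: each server works for its own duration, with no barrier."""
    started = time.monotonic()
    finished = {}
    lock = threading.Lock()

    def work(name, seconds):
        time.sleep(seconds)  # stands in for the server's share of the job
        with lock:
            finished[name] = time.monotonic() - started

    threads = [threading.Thread(target=work, args=item) for item in durations.items()]
    for thread in threads:
        thread.start()
    # Joining here only collects the measurements; the workers never wait on each other.
    for thread in threads:
        thread.join()
    return finished


finished = run_once({"A": 0.0, "B": 0.5})
```

A's finish time is recorded almost immediately, long before B's, because nothing in A's code path depends on B.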


4. How should failover actually be understood, and how does it affect performance?

The official website does not explain much. The behavior can be understood as follows. With failover enabled, if a server loses its connection while the job is executing, the shards assigned to that server are picked up and executed by the healthy servers in the current cluster before the next scheduled run, and the next run then proceeds as usual. With failover disabled, nothing is done when a server is lost and its shards are simply dropped for that run, but the next run is re-sharded.

Test setup: two machines, 10 shards, the job runs every 5 minutes, and one server loses its connection to ZK during execution. The test results are:


Circle 1: the two machines each fetch 5 shards (only one machine's log is captured here; the current machine gets shards 5-9);

Circle 2: the healthy machine re-fetches the shards of the machine that lost its connection, even though the 5-minute interval has not yet elapsed (the lost machine held shards 0-4);

Circle 3: when the 5 minutes are up, the healthy machine gets all the shards.

Failover affects performance only because it requires monitor-execution to be enabled; that monitoring is what degrades performance. With failover=true and monitor-execution=false, failover does not take effect.
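The test above, together with the monitor-execution requirement, can be written as a small Python model. It is a sketch under stated assumptions, not the library's algorithm: the function and machine names are hypothetical, lost shards are handed to the survivors round-robin, and the next run uses a contiguous near-even split.

```python
def split_evenly(servers, item_count):
    """Assumed re-shard rule: contiguous, near-even blocks per server."""
    base, extra = divmod(item_count, len(servers))
    result, start = {}, 0
    for index, server in enumerate(servers):
        size = base + (1 if index < extra else 0)
        result[server] = list(range(start, start + size))
        start += size
    return result


def after_server_loss(assignment, lost, failover, monitor_execution):
    """Return (shards taken over before the next run, assignment at the next run)."""
    survivors = sorted(server for server in assignment if server != lost)
    if not survivors:
        return {}, {}
    takeover = {server: [] for server in survivors}
    # Failover only takes effect when monitor-execution is enabled as well.
    if failover and monitor_execution:
        for position, item in enumerate(assignment[lost]):
            takeover[survivors[position % len(survivors)]].append(item)
    # Either way, the next scheduled run re-shards everything over the survivors.
    item_count = sum(len(items) for items in assignment.values())
    return takeover, split_evenly(survivors, item_count)


before = {"machine-1": [0, 1, 2, 3, 4], "machine-2": [5, 6, 7, 8, 9]}
with_failover = after_server_loss(before, "machine-1", True, True)
without_monitor = after_server_loss(before, "machine-1", True, False)
```

In the first call machine-2 picks up shards 0-4 before the next run (circle 2) and then receives all ten shards at the next run (circle 3). In the second call nothing is picked up, and only the re-shard at the next run happens.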

Early official versions enabled monitor-execution by default; in version 1.07, which we were using, it is disabled by default.

For this attribute, the official view is that it is not recommended for short-running jobs, where it significantly affects performance. It is recommended for jobs with long execution cycles, when the business requires it. If losing a server's shards until the next re-shard does not affect the business, it can be left off.

In short, turning it on causes performance problems. With it turned off, if a server is lost:

1. Disconnected from ZK but still connected to other services: the job still runs normally and the business is not affected;

2. Disconnected from ZK and from other services as well: the job fails (database connection failure, etc.) and the business is affected; the program must handle the exception, or it must be dealt with manually afterwards;

3. Still connected to ZK but disconnected from other services: sharding proceeds normally but the job fails, so the business is affected; the program must handle the exception, or it must be dealt with manually afterwards.
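The three cases reduce to a small decision function. The Python sketch below only restates the list above; the names are hypothetical, and the fully connected combination, which is not one of the three cases, is treated as a healthy server.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    description: str
    business_affected: bool


def lost_server_outcome(zk_connected, services_connected):
    """Classify a server with monitor-execution turned off."""
    if services_connected:
        # Case 1 (ZK lost, other services fine) or a fully healthy server.
        return Outcome("job runs normally", False)
    if zk_connected:
        # Case 3: sharding is normal, but the job itself fails.
        return Outcome("sharding normal, job fails", True)
    # Case 2: both ZK and the other services are unreachable.
    return Outcome("job fails", True)


case_1 = lost_server_outcome(zk_connected=False, services_connected=True)
case_2 = lost_server_outcome(zk_connected=False, services_connected=False)
case_3 = lost_server_outcome(zk_connected=True, services_connected=False)
```

Whenever business_affected is true, the job needs exception handling in code or manual follow-up, as described in cases 2 and 3.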

--------------------------------------------------------------------------------------------------------------- ---------------------------------------------------

The questions above are doubts I ran into during actual development, and the results come from my own local tests. Some cases have inevitably been overlooked. If you spot a mistake, please leave a comment to discuss.

PS: I took a look at the official website today and found that Elastic-job-cloud has been launched! What is this?! It looks very good and very powerful. I will look into it later and try to keep up with the open-source masters.
