A task scheduling system is being developed to handle task management, scheduling, and monitoring on a big data platform. It supports two kinds of triggers: timed triggers and dependency triggers.
System modules:
JobManager: the master of the scheduling system. It provides RPC services, receives and processes all operations submitted by JobClient/Web, talks to the metadata store, maintains job metadata, and is responsible for unified task configuration, triggering, scheduling, and monitoring;
JobMonitor: monitors the status of running jobs, the task pool, and jobs waiting to run;
JobWorker: the slave of the scheduling system. It fetches jobs from the task pool, starts them and collects their execution state, writes that state back to the metadata store, and uses Jetty to serve task run logs over HTTP (a minimal worker loop is sketched after this list);
JobClient/Web: the client layer of the scheduling system, the front-end interface that lets users configure, manage, and monitor tasks;
Task metadata: currently stored in MySQL, holding job configuration, dependencies, run history, resource configuration, alarm configuration, and so on. Relying on MySQL alone is not reliable and will become a bottleneck, so the metadata must be migrated to distributed storage such as ZooKeeper.
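As a rough illustration of the JobWorker behavior described above, here is a minimal fetch-run-report loop. TaskPool, MetaStore, and Job are invented interfaces standing in for the system's real RPC and metadata APIs, not the actual code:

```java
// Illustrative sketch only: TaskPool, MetaStore, and Job are invented
// interfaces standing in for the system's real RPC and metadata APIs.
interface Job { String id(); int execute() throws Exception; }   // runs the task, returns its exit code
interface TaskPool { Job poll(); }                               // fetches the next runnable job
interface MetaStore { void updateState(String jobId, String state); }

public class JobWorkerLoop implements Runnable {
    private final TaskPool pool;
    private final MetaStore meta;

    public JobWorkerLoop(TaskPool pool, MetaStore meta) {
        this.pool = pool;
        this.meta = meta;
    }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            Job job = pool.poll();
            if (job == null) continue;              // nothing to run yet
            meta.updateState(job.id(), "RUNNING");
            try {
                int exit = job.execute();           // start the task, collect its result
                meta.updateState(job.id(), exit == 0 ? "SUCCESS" : "FAILED");
            } catch (Exception e) {
                meta.updateState(job.id(), "FAILED");
            }
        }
    }
}
```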
System Features:
Distributed: capacity and processing load (JobWorkers) can be scaled out linearly;
High availability: an active master and a standby master; once the active master fails, the standby master takes over the service;
High fault tolerance: after the master restarts, previously unfinished tasks are rescheduled and run;
Complete, easy-to-use web user interface: lets users configure, submit, query, and monitor tasks and task dependencies;
Supports any type of task: besides Hadoop-ecosystem tasks such as MapReduce, Hive, and Pig, it also supports tasks developed in any other language or framework, such as Java, shell, Python, Perl, and Spark;
Complete logging: collects and records the standard output and standard error produced while a task runs and exposes them over HTTP, so a user can read a task's log by visiting its log URL;
Flexible dependencies between tasks: any task can be triggered by the completion of its parent tasks (see the sketch after this list);
Flexible and diverse alarm rules: besides failure alarms, it also supports alarms for tasks that have not finished, or have not even started, within a timeout.
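A hedged sketch of the dependency-trigger rule from the feature list: a task becomes runnable only when every parent task has succeeded for the same business date. DependencyStore is an invented placeholder for the metadata lookup:

```java
import java.time.LocalDate;
import java.util.List;

// Sketch of dependency triggering: all parents must have succeeded
// for the same business date before the child may be queued.
// DependencyStore is an invented placeholder for the metadata lookup.
interface DependencyStore {
    List<String> parentsOf(String taskId);
    String stateOf(String taskId, LocalDate bizDate); // e.g. "SUCCESS", "RUNNING"
}

public class DependencyTrigger {
    private final DependencyStore deps;

    public DependencyTrigger(DependencyStore deps) { this.deps = deps; }

    public boolean isRunnable(String taskId, LocalDate bizDate) {
        for (String parent : deps.parentsOf(taskId)) {
            if (!"SUCCESS".equals(deps.stateOf(parent, bizDate))) {
                return false; // at least one parent has not finished successfully
            }
        }
        return true; // no pending parents: the task may be triggered
    }
}
```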
Difficulties:
Dependency triggering: determining the business date and which descendant tasks to trigger, especially when a task is run manually and all of its descendant tasks must be run as well.
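For the manual "run this task and all its descendants" scenario, one plausible approach (not necessarily the author's) is a breadth-first traversal of the dependency graph; childrenOf is an invented placeholder for the metadata query returning direct child tasks:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// BFS over the dependency graph to collect a task plus all of its
// descendants. childrenOf is an invented placeholder for the metadata
// query that returns a task's direct children.
public class DescendantCollector {
    interface DependencyGraph { List<String> childrenOf(String taskId); }

    public static Set<String> withDescendants(DependencyGraph graph, String rootTaskId) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(rootTaskId);
        while (!queue.isEmpty()) {
            String id = queue.poll();
            if (!visited.add(id)) continue;       // already collected
            queue.addAll(graph.childrenOf(id));   // enqueue direct children
        }
        return visited;                           // root first, then descendants
    }
}
```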
Metadata design and storage: the first design borrowed from the MapReduce architecture, persisting metadata only and keeping everything else in memory behind RPC, but the complexity was too high.
Task recovery: after a service restarts from an exception, all previously running tasks should be restored to their original state.
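A sketch of one way to implement that recovery, assuming job states are persisted in the metadata store; the interfaces and method names are illustrative, not the system's real API:

```java
import java.util.List;

// On master restart, reload every job that was not in a terminal state
// and put it back into the scheduling pipeline. MetaStore and Scheduler
// are invented placeholders for the real metadata and scheduler APIs.
public class RecoveryOnStartup {
    interface MetaStore { List<String> jobsInState(String... states); }
    interface Scheduler { void reschedule(String jobId); }

    public static void recover(MetaStore meta, Scheduler scheduler) {
        // WAITING jobs go back to the queue; RUNNING jobs must be
        // reconciled with their workers or rerun.
        for (String jobId : meta.jobsInState("WAITING", "RUNNING")) {
            scheduler.reschedule(jobId);
        }
    }
}
```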
Shared storage between JobWorkers: task programs are staged on HDFS for now; a JobWorker downloads them from HDFS to local disk when it runs a task.
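Fetching a task's program package from HDFS to local disk can be done with the standard Hadoop FileSystem API; the paths here are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Download a task's program package from HDFS to the worker's local
// disk before execution. Paths are illustrative.
public class TaskPackageFetcher {
    public static void fetch(String hdfsPath, String localDir) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);
        fs.copyToLocalFile(new Path(hdfsPath), new Path(localDir));
    }
}
```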
Task timeout alarms: an alarm fires when a task has not started, or has not finished successfully, after a given amount of time; these alarms are scheduled through Quartz.
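One way to wire such a timeout check into Quartz: schedule a one-shot trigger at the task's deadline and check the task's state when it fires. The Quartz calls are the standard API, but the job body and names are assumptions:

```java
import java.util.Date;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

// Fire a one-shot Quartz trigger at the task's deadline; when it fires,
// check the task's state and raise an alarm if the task has still not
// started or finished. checkAndAlarm is an illustrative stub.
public class TimeoutAlarm implements Job {
    @Override
    public void execute(JobExecutionContext ctx) {
        String taskId = ctx.getJobDetail().getJobDataMap().getString("taskId");
        checkAndAlarm(taskId); // placeholder: query metadata, alarm if overdue
    }

    static void checkAndAlarm(String taskId) { /* illustrative stub */ }

    public static void scheduleCheck(String taskId, Date deadline) throws Exception {
        JobDetail detail = JobBuilder.newJob(TimeoutAlarm.class)
                .withIdentity("timeout-" + taskId)
                .usingJobData("taskId", taskId)
                .build();
        Trigger trigger = TriggerBuilder.newTrigger()
                .startAt(deadline)   // fire once at the timeout deadline
                .build();
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.scheduleJob(detail, trigger);
        scheduler.start();
    }
}
```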
A JobWorker can run on any machine that can reach the metadata store. Business programs that are hard to migrate can run a JobWorker on their own machine; such tasks declare a required resource when added, so they are only ever assigned to the designated resource.
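A minimal sketch of that resource-tag matching, with Worker as an invented stand-in for the scheduler's worker descriptor:

```java
import java.util.List;
import java.util.Set;

// Pick a worker that holds the task's required resource tag; a task
// with a resource tag is only ever assigned to matching workers.
// Worker is an invented stand-in for the scheduler's worker descriptor.
public class ResourceMatcher {
    static class Worker {
        final String host;
        final Set<String> resources;
        Worker(String host, Set<String> resources) {
            this.host = host;
            this.resources = resources;
        }
    }

    public static Worker pick(List<Worker> workers, String requiredResource) {
        for (Worker w : workers) {
            // tasks without a resource tag can go to any worker
            if (requiredResource == null || w.resources.contains(requiredResource)) {
                return w;
            }
        }
        return null; // no eligible worker: the task stays queued
    }
}
```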
Different businesses need to run as different users: bind each business type to a user name.
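One plausible realization of that binding (the source does not say how it is done) is to look up the user for the business type and launch the command under that user, for example via sudo; the map entries are invented:

```java
import java.util.Map;

// Launch a task's command as the Linux user bound to its business type,
// e.g. via "sudo -u". The business-to-user bindings are invented examples.
public class UserLauncher {
    private static final Map<String, String> BUSINESS_USER = Map.of(
            "ad_report", "aduser",     // illustrative bindings only
            "log_etl", "etluser");

    public static Process launch(String businessType, String command) throws Exception {
        String user = BUSINESS_USER.getOrDefault(businessType, "default");
        // sudo -u <user> sh -c "<command>"
        return new ProcessBuilder("sudo", "-u", user, "sh", "-c", command)
                .inheritIO()
                .start();
    }
}
```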
Killing tasks: for Hadoop and Hive tasks, simply destroying the execution process is not enough; the Hadoop job id must be parsed out of the log and the Hadoop kill command executed.
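A sketch of that kill path: scan the task's log for Hadoop job ids and kill each one with the standard "hadoop job -kill" command:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parse Hadoop job ids (e.g. job_201410010000_0042) out of the task's
// log text and kill each via the standard "hadoop job -kill" command.
public class HadoopJobKiller {
    private static final Pattern JOB_ID = Pattern.compile("job_\\d+_\\d+");

    public static void killFromLog(String logText) throws Exception {
        Matcher m = JOB_ID.matcher(logText);
        while (m.find()) {
            new ProcessBuilder("hadoop", "job", "-kill", m.group())
                    .inheritIO()
                    .start()
                    .waitFor();
        }
    }
}
```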
Original link: http://superlxw1234.iteye.com/blog/2147630