Learning the Azkaban Workflow Scheduler

Source: Internet
Author: User
Tags: hadoop ecosystem

1. What is Azkaban?
We have probably all run into this scenario at work: there is a task that can be split into several smaller tasks, split because the small pieces can run concurrently. For example, a big task A can be completed by four subtasks (scripts) B, C, D, and E, where B and C can run at the same time, D depends on the output of B and C, and E depends on the output of D. Normally we might open two terminals to run B and C in parallel, wait for both to finish before running D, and then run E. The whole process needs our participation, yet it is essentially a directed acyclic graph: each subtask is a node in the flow, execution can start from every node with no incoming edges, and any two nodes with no path between them can run in parallel. Controlling this by hand is painful (many such tasks have to run late at night, so we usually end up writing scripts and setting up cron), and this is exactly what a workflow scheduler is for. Azkaban does this job (mainly for tasks in the Hadoop ecosystem). It was implemented and open-sourced by LinkedIn and is used to run a set of jobs within a workflow in a specific order. Its configuration is through simple key:value pairs, with the dependencies between jobs declared in the configuration; the dependencies must not form a cycle, otherwise the workflow is considered invalid. Azkaban has the following features:
    • Web user interface
    • Easy workflow uploads
    • Simple setup of dependencies between tasks
    • Workflow scheduling
    • Authentication and authorization (work permissions)
    • Ability to kill and restart workflows
    • Modular, pluggable plug-in mechanism
    • Project workspaces
    • Logging and auditing of workflows and tasks
These are the functions I think any mainstream workflow scheduler should support. Azkaban's web pages are particularly well done, which greatly reduces management cost, and the task types it can schedule are plugin-based, so we can implement our own plug-ins for specific needs. In addition, it can send email when a task completes, fails, or succeeds, and it supports SLA settings and other features. Overall, it is very powerful.
2. Installation and Deployment
Azkaban is divided into three components: a MySQL server, a web server, and an executor server. MySQL stores the projects and execution plans (the property information of all tasks, the execution plans, the results of executions, and their output) as well as information about each execution. The web server uses Jetty to provide the external web service, making it convenient for users to manage everything through web pages. The executor server is responsible for actually submitting and executing workflows; multiple executor servers can be started, and they coordinate task execution through the MySQL database.
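Both servers find the shared MySQL database through their azkaban.properties files. As a rough illustration, the relevant section typically looks something like the following in the 2.5 packages; the host, database name, and credentials are placeholders and should be replaced with your own:
    database.type=mysql
    mysql.port=3306
    mysql.host=localhost
    mysql.database=azkaban
    mysql.user=azkaban
    mysql.password=azkaban
    mysql.numconnections=100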
First, download the individual modules from the official site, which are provided as binary packages (you can also compile from source): http://azkaban.github.io/downloads.html. The subsequent installation process can follow http://blog.javachen.com/2014/08/25/install-azkaban/. Because the web client is accessed over HTTPS, a keystore certificate file has to be created with the command keytool -keystore keystore -alias jetty -genkey -keyalg RSA; follow the prompts to enter the required information, and the final <jetty> key password can be the same as the keystore password. Then modify the Jetty properties in the web server's configuration file azkaban.properties so that they point to the generated certificate file:
jetty.keystore=keystore
jetty.password=redhat
jetty.keypassword=redhat
jetty.truststore=keystore
jetty.trustpassword=redhat
After that you can access Azkaban in a browser at https://ip:8443 (the login username and password are set in the web server's user configuration file; we use admin here).
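Once the certificate and azkaban.properties are in place, starting the services should just be a matter of running the start scripts shipped in each binary package. A rough sketch, assuming the 2.5 binary packages have been unpacked side by side, the table definitions from the azkaban-sql-script package have already been imported into MySQL, and the default script names apply (the directory and script names here are assumptions, so check your own package):
    # start the web server and the executor server from their unpacked directories (names are assumptions)
    (cd azkaban-web-server && bin/azkaban-web-start.sh)
    (cd azkaban-executor-server && bin/azkaban-executor-start.sh)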
3. Testing
Here we do a simple test. Because Azkaban natively supports shell commands (and therefore also shell scripts and other scripting programs such as Python), we can test with simple shell commands. We create four subtasks; each subtask is configured in a file named <task name>.job. They are configured as follows:
test.job
type=command
command=sleep 3
command.1=echo "Hello World"
start.job

type=command
command=sleep 5
command.1=echo "Start Execute"
sleep.job
type=command
dependencies=test,start
command=sleep 10
finish.job
type=command
dependencies=sleep
command=echo "Finish"

Here a job declares the jobs it depends on through the dependencies property; there can be one or more of them, separated by ",". The type of all these jobs is command; Azkaban also supports other job types, some of which require installing plug-ins. We then put the four job files in one directory and compress them into a zip file (a sample packaging command is sketched below). On the home page of the Azkaban web interface, a new workflow can be created with the "Create Project" button; after entering the necessary information we land on the project page, where the task flow to be executed can be uploaded via Upload, and it can be uploaded again repeatedly to overwrite it. The execution results of earlier versions of the task flow are not overwritten, however. If there is a problem with the workflow configuration (for example, circular dependencies), the upload fails, but the error prompt is easy to miss. Once the zip file has been uploaded successfully, we can view the dependency graph of the tasks through the interface.
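For example, packaging the four job files from the current directory could be as simple as the following (the archive name testflow.zip is only an illustration):
    zip testflow.zip test.job start.job sleep.job finish.job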
A single execution of a workflow can be started with the Execute Flow button, which opens a configuration screen with "Flow View", "Notification", "Failure Options", "Concurrent", and "Flow Parameters". Also worth noting is the Schedule button in the lower left corner, where scheduled execution of the workflow can be set up. Note that this sets the time at which each workflow runs and there is no visible history of schedule settings, but if you want to repeat a previous setup you can find that previous execution and run it again (you still have to go through the configuration page, but the configuration of that run is preserved). "Failure Options" and "Concurrent" deserve attention: they configure how the rest of the workflow is handled after a task fails, and how multiple concurrent executions of the same flow of this project are handled. We configure nothing here and execute directly. After submission, the execution ID is displayed (I think a recognizable string would be a better label); it is globally unique, meaning every execution of any project increments it to produce the new exec ID. After execution, the results of the whole task flow and of each subtask can be viewed through the web interface.

In the Graph tab you can see the status of each task and which task is currently executing; the Flow Log tab shows the workflow's running log in real time; and clicking any subtask shows that subtask's running status and its log output in real time. Overall it is very convenient.
A few concepts are worth spelling out: project, flow, and job. A project is the overall unit for executing a task and can contain multiple flows; each project corresponds to one uploaded .zip file. Flows are independent of one another, but there is a top-level flow which may reference other flows as parts of its execution (the embedded flow is equivalent to a sub-job of the top-level flow, except that the job is itself a flow). Each flow contains multiple jobs, themselves independent and wired together by the dependencies set in the job files, and the final job of each flow serves as the flow's identity (its flow name). We can add a flow as a job to another flow like this:
jobgroup.job
type=flow
flow.name=finish
dependencies=realstart

Here finish is the identity of the previously defined flow (because it is its terminal job), and this flow, acting as a job, can declare other dependencies of its own. The web interface then shows a task dependency graph that contains the child flow.

I think it is designed this way to keep flows independent of each other and to make flow reuse easy.
4. User Management
Azkaban has the concepts of users and user groups. The configuration of users, user groups, and their permissions is kept in the configuration file azkaban-users.xml, and authentication is implemented by azkaban.user.XmlUserManager. The relevant settings are configured in azkaban.properties (under the web server's conf directory):
Parameter                  Default
user.manager.class         azkaban.user.XmlUserManager
user.manager.xml.file      azkaban-users.xml
We can configure three kinds of entries in azkaban-users.xml: user, group, and role. A user entry can carry username, password, roles, and groups attributes, which set the user name, the password, the user's permissions, and the groups the user belongs to. A group entry can carry name and roles attributes, which set the group name and the permissions used by that group. A role entry defines permission information and can carry name and permissions attributes, representing the role name and the permissions assigned to it (a sketch of a complete azkaban-users.xml follows the permissions table below). The permissions supported by Azkaban include the following:
Permission        Description
ADMIN             Can do anything, including adding and modifying the permissions of other users
READ              Can only view the content and log information of projects
WRITE             Can upload and modify the properties of tasks in created projects, and can delete any project
EXECUTE           Allows the user to execute any task flow
SCHEDULE          Allows the user to add and remove scheduling information for any task flow
CREATEPROJECTS    Allows the user to create new projects, even when project creation is otherwise locked down
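Here is a minimal sketch of what an azkaban-users.xml could look like, putting the user, group, and role entries described above together; the user names, passwords, and the group name are placeholders, and the exact attribute set should be verified against your Azkaban version:
    <azkaban-users>
      <!-- a full administrator -->
      <user username="admin" password="admin" roles="admin"/>
      <!-- an ordinary user that gets its permissions through a group -->
      <user username="alice" password="alice123" groups="analysts"/>
      <group name="analysts" roles="readexec"/>
      <!-- roles map a name to a list of permissions -->
      <role name="admin" permissions="ADMIN"/>
      <role name="readexec" permissions="READ,EXECUTE"/>
    </azkaban-users>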
The permission settings here are not fine-grained down to each user in each project: a user has the same permissions under every project. The relationship between user permissions and user-group permissions is also unclear; if the user group is meant to be the unit of permission assignment (that is, all users in a group share the same permissions), then having to specify permissions again for each user seems a bit redundant.

5. API
Azkaban also provides an API, which makes it possible to implement your own management on top of Azkaban; it communicates with the web server over HTTPS. Because Azkaban has the concepts of users and permissions, you must log in before calling the API. After a successful login, the user is returned a session ID, and all subsequent operations must carry this ID so the server can decide whether the user has permission. If the session ID is invalid, API calls return "error" : "session"; if the session.id parameter is missing, the HTML content of the login page is returned instead (some invalid session IDs also produce this). The APIs provided by Azkaban are listed below (a small curl sketch follows the list); for details please refer to the official documentation: http://azkaban.github.io/azkaban/docs/2.5/#ajax-api
1. Authenticate: the login operation; it requires a username and password, and on success returns a session ID used by subsequent requests.
2. Create a project: creates a new project, which must be done before any other operation on that project; the input is the project's name, which serves as its unique identifier, plus the project's description. In effect it takes the same input as Create Project on the web page.
3. Delete a project: deletes an existing project; it does not return a response message, and the input is the project's identifier.
4. Upload a project zip: uploads a zip file to a project, typically right after the project has been created; the upload overwrites previously uploaded content.
5. Fetch flows of a project: gets all flow information under a project; the input is the project's identifier. A project may have more than one flow, and the output contains only the flowId that identifies each flow.
6. Fetch jobs of a flow: gets information about all jobs under a flow. Because every call on the API side is independent, you must pass both the project identifier and the flow identifier; the output contains each job's information, including its identifier (id), its job type, and the jobs it directly depends on.
7. Fetch executions of a flow: gets the executions of a flow; a specific project and flow must be given. This interface returns results in pages, so you also set start, the starting index, and length, the number of records to return. Because a flow can be executed on its own or as part of another flow, this returns every execution of the flow within the specified range. Each execution record includes the start time, the submitting user, the execution status, the submission time, the globally incremented exec id of that execution, the projectId, the end time, and the flowId.
8. Fetch running executions of a flow: gets information about the currently running executions of a flow; the input includes the project and flow identifiers, and the output is all exec IDs (global) under which the flow is currently running.
9. Execute a flow: starts one execution of a flow. The input is larger than for the other calls because, just as when a flow is started from the web interface, several settings have to be made; all the start-up configuration apart from scheduling can be supplied here. The input also includes the project and flow identifiers; the output is the flow's id and the exec id.
10. Cancel a flow execution: cancels one execution of a flow. Only the global exec ID is needed; because it is globally unique it identifies the execution by itself, so the project and flow identifiers are not required. If the execution has already finished, an error message is returned.
11. Pause a flow execution: pauses one execution; the input is the exec ID. If the execution is not in the running state, an error message is returned.
12. Resume a flow execution: resumes one execution; the input is the exec ID. If the execution is already running, no error is returned; if it is no longer running, an error message is returned.
13. Fetch a flow execution: gets all the information about one execution; the input is the exec ID, and the output includes the properties of that execution (see item 7) as well as the execution of every job run this time.
14. Fetch execution job logs: gets the execution log of one job within one execution. The job's execution log can be treated as a file: you pass the exec ID, the job identifier, and the range of the file to read (offset + length), and the log content of the specified range is returned.
15. Fetch flow execution updates: does this return the status of each task since the last check? This one is a bit confusing to me; it appears to be for obtaining progress information while a flow is executing.
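As an illustration of the session mechanism and the Execute a flow call above, a rough sketch with curl might look like the following. It assumes the server runs on localhost:8443 with the admin account mentioned earlier; the password, the project name testflow, and other values are placeholders (the flow name finish simply matches the terminal job of the test workflow), and the parameter names follow the 2.5 AJAX API documentation:
    # log in; the JSON response contains a "session.id" field
    curl -k -X POST --data "action=login&username=admin&password=admin" https://localhost:8443
    # start one execution of flow "finish" in project "testflow" with that session ID
    curl -k --get --data "session.id=<session-id-from-login>" --data "ajax=executeFlow" --data "project=testflow" --data "flow=finish" https://localhost:8443/executor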
As this list shows, the APIs Azkaban provides can be used to perform simple operations such as creating projects and flows, viewing projects, flows, and executions, and so on; the web interface is much richer than this. If we want to develop on top of Azkaban, then beyond these interfaces I think we can also analyze Azkaban's database and read the information we want directly from it (the basic write operations can all be done through these APIs, so we only need to read from the database). The disadvantage compared with using the API is that the database schema may change as versions are updated, but it is still a workable approach.
6. Summary
This article has mainly described the installation and use of Azkaban, which is chiefly used to run various operations and Java programs within the Hadoop ecosystem. Even this simple use made me appreciate how powerful the tool is, but I still have a question. The main functions of Azkaban's three modules are: MySQL for data storage, the web server for user interaction and graphical display, and the executor as the server that actually runs the tasks. All jobs therefore run on the executor machine, and a job executes by starting a child process (you can confirm this by checking what is running while a job executes), so the executor needs to have all the tools, jar packages, and so on required by the supported task types installed. If a job is resource-intensive (for example a Java program with 100% CPU utilization), it will affect the execution of other jobs. Is the scalability of this architecture therefore a bit inadequate? Or perhaps, because the tool mainly runs Hadoop tasks where the client-side load is small, this simply is not a concern. In general it is a good tool; at least the web interface makes it very convenient and intuitive to see how tasks executed and what their output was (P.S. how does Azkaban decide whether a task execution succeeded?). Although the documentation says multiple executors are supported, I have not actually found how to use them. I think it could be further improved to parallelize jobs across machines: for example, if several jobs can run in parallel and I have multiple executor servers, any job could be dispatched to any executor so that all hardware resources are fully used, rather like JobTracker and TaskTracker in Hadoop. Never mind, this is purely my own rambling.
