Installing the Azkaban workflow scheduler

Source: Internet
Author: User
Keywords: installation, Azkaban, workflow scheduler


2.1.1 Why do I need a workflow scheduling system?

A complete data analysis system is usually composed of a large number of task units: shell scripts, Java programs, MapReduce programs, Hive scripts, and so on.

These task units depend on one another, both in timing and in execution order.

To organize such a complex execution plan, a workflow scheduling system is needed to drive the execution.

For example, suppose a business system produces 20 GB of raw data every day, and that data has to be processed daily in the following steps:

1. Synchronize the raw data to HDFS;

2. Transform the raw data with the MapReduce framework and store the output in several Hive tables as partitioned tables;

3. JOIN the data from those Hive tables to produce one large detail table in Hive;

4. Run complex statistical analysis on the detail data to produce the report results;

5. Synchronize the results of the statistical analysis back to the business system for business use.

2.1.2 How workflow scheduling is implemented

Simple task scheduling: define the jobs directly with Linux crontab.

Complex task scheduling: develop a scheduling platform in-house,

or use a ready-made open-source scheduling system such as Oozie or Azkaban.
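As a small illustration of the crontab approach, the sketch below writes a crontab fragment for a two-step pipeline into a scratch file. The script paths under /opt/etl are hypothetical and nothing is installed into cron.

```shell
#!/bin/sh
# Sketch: time-based scheduling of a two-step pipeline with cron.
# The script paths are hypothetical; nothing is installed into cron here.
CRON_FILE="${CRON_FILE:-$(mktemp)}"

cat > "$CRON_FILE" <<'EOF'
# minute hour day-of-month month day-of-week command
0 1 * * * /opt/etl/sync_to_hdfs.sh
# Started an hour later in the hope that the sync has finished by then.
0 2 * * * /opt/etl/transform.sh
EOF

cat "$CRON_FILE"
# To install the fragment for the current user: crontab "$CRON_FILE"
```

cron only knows about time: the second entry fires at 02:00 whether or not the first job succeeded, which is the gap a workflow scheduler is meant to fill.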

2.1.3 Common workflow scheduling systems

There are many workflow schedulers available.

In the Hadoop ecosystem, the common ones are Oozie, Azkaban, Cascading, and Hamake.


2.1.4 Comparison of various scheduling tools

The following table compares key features of four Hadoop workflow schedulers. Although they can serve broadly the same scenarios, they differ significantly in design philosophy, target users, and application scenarios, so the table can serve as a reference when choosing a technology.

Feature                        | Hamake               | Oozie               | Azkaban                        | Cascading
Workflow description language  | XML                  | XML (xPDL based)    | text file with key/value pairs | Java API
Dependency mechanism           | data-driven          | explicit            | explicit                       | explicit
Requires a web container       | no                   | yes                 | yes                            | no
Progress tracking              | console/log messages | web page            | web page                       | Java API
Hadoop job scheduling support  | no                   | yes                 | yes                            | yes
Run mode                       | command line utility | daemon              | daemon                         | API
Pig support                    | yes                  | yes                 | yes                            | yes
Event notification             | no                   | no                  | no                             | yes
Requires installation          | no                   | yes                 | yes                            | no
Supported Hadoop version       | 0.18+                | 0.20+               | currently unknown              | 0.18+
Retry support                  | no                   | workflow node level | yes                            | yes
Run any command                | yes                  | yes                 | yes                            | yes
Amazon EMR support             | yes                  | no                  | currently unknown              | yes

2.1.5 Azkaban vs. Oozie

The two most widely used schedulers are compared in more detail below for reference. Overall, Oozie is a heavyweight task scheduling system compared with Azkaban: it is full featured, but more complicated to configure and use. If the missing features are not a concern, the lightweight Azkaban is a very good candidate.

Details are as follows:


Both can schedule MapReduce, Pig, Java, and script workflow tasks

Both can run workflow tasks on a schedule

Workflow definition

Azkaban uses the Properties file to define the workflow

Oozie uses an XML file to define the workflow

Workflow parameter passing

Azkaban supports passing parameters directly

Oozie supports parameters and EL expressions, such as ${fs:dirSize(myInputDir)}

Timed execution

Azkaban's scheduled execution is time-based

Oozie's scheduled execution is based on time and on input data

Resource management

Azkaban has stricter access control, such as per-user read / write / execute permissions on workflows

Oozie currently has no strict access control

Workflow execution

Azkaban has two modes of operation: solo server mode (executor server and web server on the same node) and multi server mode (executor server and web server can be deployed on different nodes)

Oozie operates as a workflow server that supports multi-user and multi-workflow

Workflow management

Azkaban supports operating workflows through the browser and through Ajax calls

Oozie supports operating workflows through the command line, HTTP REST, the Java API, and the browser

2.2 Introduction to Azkaban

Azkaban is a batch workflow task scheduler open sourced by LinkedIn. It is used to run a set of jobs and processes in a specific order within a workflow. Azkaban defines a key/value file format to establish the dependencies between tasks, and provides an easy-to-use web user interface for maintaining and tracking workflows.

Azkaban has the following features:

Web user interface

Easy workflow upload

Easy to set up dependencies between tasks

Workflow scheduling

Authentication and authorization (permission control)

Ability to kill and restart workflows

Modular, pluggable plug-in mechanism

Project workspaces

Logging and auditing of workflows and tasks


Azkaban has three important components:

1. A relational database (currently only MySQL is supported)

2. The web management server, AzkabanWebServer

3. The execution server, AzkabanExecutorServer
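To make the key/value job format mentioned above more concrete, the sketch below writes three job files into a scratch directory. The type, command and dependencies keys belong to Azkaban's command job type; the job names (sync_to_hdfs, transform, report) and the echo commands are placeholders invented for this example.

```shell
#!/bin/sh
# Sketch: three Azkaban job files forming a linear dependency chain.
# Job names and commands are hypothetical placeholders.
FLOW_DIR="${FLOW_DIR:-$(mktemp -d)}"

# First job: no dependencies line, so it can run first.
cat > "$FLOW_DIR/sync_to_hdfs.job" <<'EOF'
type=command
command=echo "sync raw data to HDFS"
EOF

# Second job: runs only after sync_to_hdfs has succeeded.
cat > "$FLOW_DIR/transform.job" <<'EOF'
type=command
command=echo "run MapReduce transformation"
dependencies=sync_to_hdfs
EOF

# Last job: depends on transform, which makes it the end of the flow.
cat > "$FLOW_DIR/report.job" <<'EOF'
type=command
command=echo "build report"
dependencies=transform
EOF

ls "$FLOW_DIR"
```

The job files are then packaged into a zip archive and uploaded to a project through the web user interface.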

3 Azkaban Installation and Deployment

Preparation

Azkaban Web Server

Azkaban Executor Server

Azkaban currently supports only MySQL, so a MySQL server must be installed. This document assumes that MySQL is already installed and that a root user with the password root has been created.

Packages to download:

azkaban-executor-server-2.5.0.tar.gz (Execution Server)

azkaban-web-server-2.5.0.tar.gz (Management Server)

azkaban-sql-script-2.5.0.tar.gz (mysql script)

Azkaban has three modes of operation:

solo server mode: the simplest mode. It uses the built-in H2 database, and the management server and execution server run in a single process. Projects with a small number of tasks can use this mode.

two server mode: the database is MySQL, and the management server and execution server run in separate processes, so they do not affect each other.

multiple executor mode: the execution servers and the management server run on different hosts, and there can be more than one execution server.

Our project's requirements are modest, so this guide uses the second mode: the management server and the execution server run as separate processes on the same host.


Upload the installation files to the cluster, preferably to the machine where Hive and Sqoop are installed, so that their commands are easy to run.

In the current user's home directory, create an azkabantools directory to hold the installation archives, and an azkaban directory to hold the Azkaban programs.
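The directory layout described above can be created as follows. This is a minimal sketch; the BASE_DIR variable is an assumption of this example and defaults to the home directory.

```shell
#!/bin/sh
# Sketch: create the two working directories named in the text.
# BASE_DIR is an assumption of this example; it defaults to the home directory.
BASE_DIR="${BASE_DIR:-$HOME}"

mkdir -p "$BASE_DIR/azkabantools"   # holds the downloaded .tar.gz archives
mkdir -p "$BASE_DIR/azkaban"        # holds the unpacked web and executor servers

ls -d "$BASE_DIR/azkabantools" "$BASE_DIR/azkaban"
```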

Azkaban web server installation

Unpack azkaban-web-server-2.5.0.tar.gz

Command: tar -zxvf azkaban-web-server-2.5.0.tar.gz

Move the unpacked azkaban-web-server-2.5.0 into the azkaban directory and rename it to server

Command: mv azkaban-web-server-2.5.0 ../azkaban

cd ../azkaban

mv azkaban-web-server-2.5.0 server

Azkaban executor server installation

Unpack azkaban-executor-server-2.5.0.tar.gz

Command: tar -zxvf azkaban-executor-server-2.5.0.tar.gz

Move the unpacked azkaban-executor-server-2.5.0 into the azkaban directory and rename it to executor

Command: mv azkaban-executor-server-2.5.0 ../azkaban

cd ../azkaban

mv azkaban-executor-server-2.5.0 executor

Import the Azkaban SQL script

Unpack azkaban-sql-script-2.5.0.tar.gz

Command: tar -zxvf azkaban-sql-script-2.5.0.tar.gz

Import the extracted SQL script into MySQL:

Log in to MySQL

mysql> create database azkaban;

mysql> use azkaban;

Database changed

mysql> source /home/hadoop/azkaban-2.5.0/create-all-sql-2.5.0.sql;

Create an SSL configuration

Reference Address:

Configure SSL KeyStore

Whenever a password is requested, enter the password used in the configuration file; in this guide it is 123456.

keytool -genkey -keystore keystore -alias jetty-azkaban -keyalg RSA -validity 3560

keytool -export -alias jetty-azkaban -keystore keystore -rfc -file selfsignedcert.cer

keytool -import -alias certificatekey -file selfsignedcert.cer -keystore truststore

Move the keystore and truststore files generated above to the $AK_HOME/web path.

To delete an entry, use the following command.

keytool -delete -alias jetty-azkaban -keystore keystore -storepass azkaban

Command: keytool -keystore keystore -alias jetty -genkey -keyalg RSA

After running this command, you are prompted for the keystore password and the certificate details. Make a note of the password. The prompts are as follows:

Enter keystore password:
Re-enter new password:
What is your first and last name?
  [Unknown]:
What is the name of your organizational unit?
  [Unknown]:
What is the name of your organization?
  [Unknown]:
What is the name of your City or Locality?
  [Unknown]:
What is the name of your State or Province?
  [Unknown]:
What is the two-letter country code for this unit?
  [Unknown]: CN
Is CN=Unknown, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=CN correct?
  [no]: y
Enter key password for <jetty>
  (RETURN if same as keystore password):
Re-enter new password:

When this is done, a keystore certificate file is generated in the current directory. Copy it to the root directory of the Azkaban web server, for example: cp keystore /opt/apps/azkaban/azkaban-web-2.5.0/web
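The same keystore can also be generated without interactive prompts. The sketch below assumes the password 123456 used in this guide and passes the certificate details through keytool's standard -dname option; the scratch directory and the distinguished name values are illustrative only.

```shell
#!/bin/sh
# Sketch: generate the Jetty keystore without interactive prompts.
# Assumptions: password 123456 as in the text; dname values are illustrative.
STOREPASS=123456
WORKDIR="${WORKDIR:-$(mktemp -d)}"
KEYSTORE="$WORKDIR/keystore"

if command -v keytool >/dev/null 2>&1; then
  # -genkeypair is the current name of the older -genkey option.
  keytool -genkeypair -keystore "$KEYSTORE" -alias jetty -keyalg RSA \
    -storepass "$STOREPASS" -keypass "$STOREPASS" \
    -dname "CN=Unknown, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=CN"
  # List the keystore to confirm that the jetty entry exists.
  keytool -list -keystore "$KEYSTORE" -storepass "$STOREPASS"
else
  echo "keytool not found; install a JDK first" >&2
fi
```

Passing the password on the command line leaves it in the shell history, so the interactive form may be preferable outside a test environment.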


Note: first configure the time zone on the server node.

1. Generate the Asia/Shanghai time zone configuration; the interactive tzselect command can be used for this.

2. Copy the time zone file over the system's local time zone configuration:

cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

Azkaban web server configuration

Enter the conf directory under the Azkaban web server installation directory

Modify the file

Command vi

The content is as follows. Comments are on their own lines because a Java properties file treats # as a comment marker only at the start of a line:

# Azkaban Personalization Settings
# server UI name, displayed at the top of the server page
[…]=TestAzkaban
# description
azkaban.label=My Local Azkaban
# UI color
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
# default root web directory
web.resource.dir=web/
# default time zone; the shipped default is a United States zone, changed here to Asia/Shanghai
[…]=Asia/Shanghai

# Azkaban UserManager class
# default user rights management class
user.manager.class=azkaban.user.XmlUserManager
# user configuration file, described below
user.manager.xml.file=conf/azkaban-users.xml

# Loader for projects
# location of the global configuration file
[…]=conf/
azkaban.project.dir=projects

# database type
database.type=mysql
# port number
mysql.port=3306
# database host
[…]=www.ljt.cos01
# database instance name
mysql.database=azkaban
# database user name
mysql.user=root
# database password
mysql.password=123456
# maximum number of connections
mysql.numconnections=100

# Velocity dev mode
[…]=false

# Jetty server properties
# maximum number of threads
jetty.maxThreads=25
# Jetty SSL port
jetty.ssl.port=8443
# Jetty port
jetty.port=8081
# SSL keystore file name
jetty.keystore=keystore
# SSL keystore password
jetty.password=123456
# Jetty key password, the same as the keystore password
jetty.keypassword=123456
# SSL truststore file name
jetty.truststore=keystore
# SSL truststore password
jetty.trustpassword=123456

# Executor server properties
# executor server port
executor.port=12321

# E-mail settings
# sender address
[…]=
# SMTP server address
[…]=
# name displayed when sending mail
mail.user=xxxxxxxx
# mail password
mail.password=****
# address notified when a job fails
[…]=
# address notified when a job succeeds
[…]=

lockdown.create.projects=false
# cache directory
[…]=cache

Azkaban executor server configuration

Enter the conf directory under the executor server installation directory and modify the properties file there:

# Azkaban
# time zone
[…]=Asia/Shanghai

# Azkaban JobTypes plugin configuration
# location of the jobtype plug-ins
azkaban.jobtype.plugin.dir=plugins/jobtypes

# Loader for projects
[…]=conf/
azkaban.project.dir=projects

# Database settings
# database type (currently only MySQL is supported)
database.type=mysql
# database port number
mysql.port=3306
# database IP address
[…]=
# database instance name
mysql.database=azkaban
# database user name
mysql.user=root
# database password
mysql.password=root
# maximum number of connections
mysql.numconnections=100

# Executor server configuration
# maximum number of threads
executor.maxThreads=50
# port number (if changed, keep it consistent with the web server configuration)
executor.port=12321
# number of flow threads
executor.flow.threads=30
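Because executor.port has to agree between the two servers, a small check can catch a mismatch before startup. The sketch below is one way to do it; the get_prop helper and the file names web.properties and executor.properties are hypothetical stand-ins for the two real configuration files.

```shell
#!/bin/sh
# Sketch: check that executor.port agrees between the web server and the
# executor server configuration. File names below are placeholders.

# get_prop FILE KEY: print the value of KEY from a key=value properties file.
get_prop() {
  key_re=$(printf '%s' "$2" | sed 's/\./\\./g')   # escape dots for the regex
  sed -n "s/^[[:space:]]*${key_re}[[:space:]]*=[[:space:]]*//p" "$1" |
    sed 's/[[:space:]]*$//' | tail -n 1
}

# Two small demo files standing in for the real configuration files.
DEMO_DIR="${DEMO_DIR:-$(mktemp -d)}"
printf 'jetty.ssl.port=8443\nexecutor.port=12321\n' > "$DEMO_DIR/web.properties"
printf 'executor.maxThreads=50\nexecutor.port = 12321\n' > "$DEMO_DIR/executor.properties"

web_port=$(get_prop "$DEMO_DIR/web.properties" executor.port)
exec_port=$(get_prop "$DEMO_DIR/executor.properties" executor.port)

if [ "$web_port" = "$exec_port" ]; then
  echo "executor.port matches: $web_port"
else
  echo "executor.port mismatch: web=$web_port executor=$exec_port" >&2
fi
```

The helper deliberately does not strip trailing # text, since a properties value may legitimately contain that character, as azkaban.color does.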

User configuration

Enter the conf directory of the Azkaban web server and edit azkaban-users.xml to add an administrator user:

vi azkaban-users.xml
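For orientation, the sketch below writes an azkaban-users.xml with one added administrator account into a scratch directory. The element and attribute names follow the default file as shipped with Azkaban 2.5.0 as far as can be recalled, so compare them with the copy in your own conf directory; the admin/admin account is purely illustrative.

```shell
#!/bin/sh
# Sketch: an azkaban-users.xml with one extra administrator account.
# Assumes the stock XmlUserManager layout; the admin/admin account is illustrative.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/azkaban-users.xml" <<'EOF'
<azkaban-users>
  <user username="azkaban" password="azkaban" roles="admin" groups="azkaban" />
  <user username="metrics" password="metrics" roles="metrics" />
  <!-- Added administrator account; choose a real password before use. -->
  <user username="admin" password="admin" roles="admin,metrics" />
  <role name="admin" permissions="ADMIN" />
  <role name="metrics" permissions="METRICS" />
</azkaban-users>
EOF

# Print the number of user entries as a quick sanity check.
grep -c '<user ' "$CONF_DIR/azkaban-users.xml"
```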

Startup

Web server

Run the start command from the Azkaban web server directory:

bin/[…]

Note: run it from the web server root directory.

Or start it in the background:

nohup bin/[…] 1>/tmp/azstd.out 2>/tmp/azerr.out &

Executor server

Run the start command from the executor server directory:

bin/[…]

Note: it can only be run from the executor server root directory.

After startup, open https://<server IP address>:8443 in a browser (Google Chrome is recommended), enter the newly added user name and password on the login page, and click login.

https://www.ljt.cos01:8443
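Before opening the browser, it can help to confirm that the HTTPS port is answering. The sketch below polls a URL with curl; the wait_for_https function name and the retry count are choices made for this example, not part of Azkaban.

```shell
#!/bin/sh
# Sketch: poll the web server's HTTPS port until it answers.
# -k skips certificate verification because the keystore is self-signed.

# wait_for_https URL TRIES: succeed once curl gets any HTTP response.
wait_for_https() {
  url=$1
  tries=$2
  while [ "$tries" -gt 0 ]; do
    if curl -k -s -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then sleep 1; fi
  done
  return 1
}

# Example call (host name taken from the text above):
# wait_for_https https://www.ljt.cos01:8443 30 && echo "web server is up"
```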

Installing Azkaban plug-ins

Azkaban is designed to keep core functionality separate from extensions, so useful plug-ins are easy to install. The main plug-in types are currently:

Page visualization plug-ins, such as HDFS View and Reportal

Trigger plug-ins

User management plug-ins, which allow custom login authentication

Alarm plug-ins

The following sections go through the installation and configuration of individual plug-ins. Some are installed on the web server side, others on the executor side.

1. HDFS Viewer plug-in

This plug-in is installed on the web server side.

(1) Modify the configuration file

Azkaban looks for new plug-ins under $AK_HOME/plugins/viewer/hdfs. The bundled $AK_HOME/plugins/viewer/hdfs/extlib/hadoop-core-1.0.4.jar must therefore be deleted, otherwise a ClassNotFound error is reported. After the deletion, the plug-in's extlib directory is empty.

Copy the required jars from Hadoop to $AK_HOME/extlib. Then copy $HADOOP_HOME/etc/hadoop/core-site.xml to $AK_HOME/extlib and, in that directory, run jar -uf hadoop-common-2.6.0.jar core-site.xml to add core-site.xml to the jar.

After this step, hadoop-common-2.6.0.jar contains core-site.xml.

Start the web server; the HDFS directory tree is now visible on the page.

Human-readable files on HDFS can also be opened directly.

htrace-core-3.0.4.jar is also required among the copied jars. Without it, java.lang.NoClassDefFoundError: org/htrace/Trace is thrown.

2. Azkaban Jobtype plug-in

This plug-in is installed on the executor server side.

Out of the box, Azkaban can run simple jobs such as Unix command-line operations and simple Java programs. To run HadoopJava, Pig, Hive, and other job types, this plug-in must be installed.

Installing the Jobtype plug-in requires Hadoop and Hive to be installed first. Here Hadoop is version 2.6.0, installed in /usr/local/hadoop, and Hive is version 0.13.1, installed in /usr/local/hive.

(1) Unpack

Upload the previously downloaded azkaban-jobtype-2.5.0.tar.gz to m001 under $AK_HOME/[…] and unpack it. Then, in $AK_HOME/conf/[…], add a new line at the end:

azkaban.jobtype.plugin.dir=plugins/jobtypes

This parameter tells Azkaban to load plug-ins from $AK_HOME/plugins/jobtypes. In $AK_HOME/plugins/jobtypes/[…], set the following three parameters:

hadoop.home=/usr/local/hadoop
hive.home=/usr/local/hive
pig.home=

azkaban.should.proxy=[…]
[…]=${hadoop.home}/share/hadoop/common/,${hadoop.home}/share/hadoop/hdfs/,${hadoop.home}/share/hadoop/yarn/,${hadoop.home}/share/hadoop/mapreduce/

Set hadoop.home and hive.home in $AK_HOME/plugins/jobtypes/[…] as well.

(c) Edit […] so that it contains ${hive.home}/conf,${hive.home}/lib/*

Finally, restart the Azkaban executor server for the new plug-in to take effect.
