Overview
2.1.1 Why do I need a workflow scheduling system?
A complete data analysis system is usually composed of a large number of task units: shell scripts, Java programs, MapReduce jobs, Hive scripts, and so on.
These task units have temporal ordering and dependency relationships between them.
In order to organize such a complex execution plan well, a workflow scheduling system is needed to schedule the execution.
For example, suppose a business system produces 20 GB of raw data every day, which we must process daily. The processing steps are as follows (a shell sketch of the first two steps follows the list):
1. First synchronize the raw data to HDFS via Hadoop;
2. Use the MapReduce computing framework to transform the raw data, storing the generated data as partitioned tables across multiple Hive tables;
3. JOIN the data of the many Hive tables to obtain one detailed-data Hive table;
4. Run complex statistical analysis on the detailed data to obtain the report data;
5. Synchronize the statistical analysis results to the business system database for business use.
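As a sketch, the first two steps could look like the following shell commands (all paths, jar names, and class names here are hypothetical):

# step 1: synchronize today's raw data to HDFS
hdfs dfs -mkdir -p /warehouse/raw/$(date +%F)
hdfs dfs -put /data/business/raw/$(date +%F)/* /warehouse/raw/$(date +%F)/
# step 2: run a MapReduce transform job over the new data
hadoop jar etl-transform.jar com.example.EtlTransform \
    /warehouse/raw/$(date +%F) /warehouse/staged/$(date +%F)

Steps 3-5 would similarly be hive -e/-f queries plus an export back to the business database. The point is that each step may only start after the previous one succeeds, and that dependency chain is exactly what a workflow scheduler manages.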
2.1.2 Implementing workflow scheduling
Simple task scheduling: define it directly with Linux crontab;
Complex task scheduling: develop a custom scheduling platform, or use a ready-made open-source scheduling system such as Oozie, Azkaban, and so on.
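For instance, the simple case is just a crontab entry (the script path is hypothetical). Note that crontab can express times but not dependencies between tasks, which is what pushes complex pipelines toward a real scheduler:

# run the daily ETL driver script at 01:30 every day
30 1 * * * /home/hadoop/bin/daily_etl.sh >> /var/log/daily_etl.log 2>&1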
2.1.3 Common workflow scheduling systems
There are currently many workflow schedulers on the market
In the Hadoop domain, common workflow schedulers are Oozie, Azkaban, Cascading, Hamake, etc.
2.1.4 Comparison of various scheduling tools
The following table compares the key features of four Hadoop workflow schedulers. Although these schedulers can basically serve the same demand scenarios, there are significant differences in design philosophy, target users, and application scenarios, so the table can serve as a reference when making a technical selection.
| Characteristic | Hamake | Oozie | Azkaban | Cascading |
|---|---|---|---|---|
| Workflow description language | XML | XML (xPDL-based) | text file with key/value pairs | Java API |
| Dependency mechanism | data-driven | explicit | explicit | explicit |
| Is it a web container | No | Yes | Yes | No |
| Progress tracking | console/log messages | web page | web page | Java API |
| Hadoop job scheduling support | no | yes | yes | yes |
| Runs as | command line utility | daemon | daemon | API |
| Pig support | yes | yes | yes | yes |
| Event notification | no | no | no | yes |
| Installation required | no | yes | yes | no |
| Supported Hadoop versions | 0.18+ | 0.20+ | currently unknown | 0.18+ |
| Retry support | no | workflow node level | yes | yes |
| Run arbitrary commands | yes | yes | yes | yes |
| Amazon EMR support | yes | no | currently unknown | yes |
2.1.5 Azkaban vs. Oozie
The two most popular schedulers on the market are described below in more detail for technical reference. Overall, Oozie is a heavyweight task scheduling system compared with Azkaban: it is full-featured, but its configuration and use are more complicated. If you can do without certain features, the lightweight scheduler Azkaban is a very good candidate.
Details are as follows:
Features
Both can schedule MapReduce, Pig, Java, and shell-script workflow tasks
Both can execute workflow tasks on a schedule
Workflow definition
Azkaban uses the Properties file to define the workflow
Oozie uses an XML file to define the workflow
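For example, a minimal Azkaban flow could be defined with two .job property files (the file names and commands are illustrative):

start.job:
type=command
command=echo "start step"

report.job:
type=command
dependencies=start
command=sh /home/hadoop/bin/report.sh

In Azkaban, the job that no other job depends on (here report) defines the flow; the .job files are zipped and uploaded through the web UI.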
Workflow parameter passing
Azkaban supports passing parameters directly
Oozie supports parameters and EL expressions, such as ${fs:dirSize(myInputDir)}
Timed execution
Azkaban's scheduled triggers are based only on time
Oozie's scheduled triggers are based on time and on input data availability
Permission management
Azkaban has finer-grained permission control, such as per-user read/write/execute permissions on workflows
Oozie currently does not have strict permission control
Workflow execution
Azkaban has two modes of operation: solo server mode (executor server and web server on the same node) and multi server mode (executor server and web server can be deployed on different nodes)
Oozie operates as a workflow server, supporting multiple users and multiple workflows
Workflow management
Azkaban supports operating on workflows via the browser and Ajax requests
Oozie supports operating on workflows via the command line, HTTP REST, Java API, and the browser
2.2 Introduction to Azkaban
Azkaban is a batch workflow task scheduler open-sourced by LinkedIn, used to run a set of jobs and processes in a specific order within a workflow. Azkaban defines a KV file format for establishing dependencies between tasks and provides an easy-to-use web user interface to maintain and track your workflows.
It has the following features:
Web user interface
Convenient workflow upload
Easy setup of relationships between tasks
Workflow scheduling
Authentication / authorization (permission control over jobs)
Ability to kill and restart workflows
Modular and pluggable plug-in mechanism
Project workspace
Logging and auditing of workflow and tasks
Azkaban official website
It has three important components:
1. relational database (currently only MySQL is supported)
2. Web management server-AzkabanWebServer
3. Execution server-AzkabanExecutorServer
3 Azkaban Installation and Deployment
Preparation
Azkaban Web Server
azkaban-web-server-2.5.0.tar.gz
Azkaban Executive Server
azkaban-executor-server-2.5.0.tar.gz
MySQL
At present, Azkaban only supports MySQL, so a MySQL server needs to be installed. This document assumes a MySQL server has already been installed, with a root user and root password set up.
Download address: http://azkaban.github.io/downloads.html
azkaban-executor-server-2.5.0.tar.gz (Execution Server)
azkaban-web-server-2.5.0.tar.gz (Management Server)
azkaban-sql-script-2.5.0.tar.gz (mysql script)
Azkaban has three deployment modes:
solo server mode: the simplest mode; the database uses the built-in H2 database, and the web server and executor server run in the same process. Projects with light workloads can use this mode.
two server mode: the database is MySQL, and the web server and executor server run in different processes, so the two do not affect each other.
multiple executor mode: in this mode, the executor servers and the web server are on different hosts, and there can be multiple executor servers.
Our project's requirements are not too demanding, so this time I use the second mode: the web server and executor server run as separate processes, but on the same host.
Installation
Upload the installation files to the cluster; it is best to upload them to the machine where hive and sqoop are installed, to make running commands easier.
Create an azkabantools directory under the current user's home directory to store the source installation files, and an azkaban directory for the azkaban runtime programs.
azkaban web server installation
Unzip azkaban-web-server-2.5.0.tar.gz
Command: tar -zxvf azkaban-web-server-2.5.0.tar.gz
Move the unzipped azkaban-web-server-2.5.0 to the azkaban directory and rename it server
Command: mv azkaban-web-server-2.5.0 ../azkaban
cd ../azkaban
mv azkaban-web-server-2.5.0 server
azkaban executor server installation
Unzip azkaban-executor-server-2.5.0.tar.gz
Command: tar -zxvf azkaban-executor-server-2.5.0.tar.gz
Move the unzipped azkaban-executor-server-2.5.0 to the azkaban directory and rename it executor
Command: mv azkaban-executor-server-2.5.0 ../azkaban
cd ../azkaban
mv azkaban-executor-server-2.5.0 executor
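If both moves succeeded, the azkaban directory now contains the two renamed program directories (assuming the home-directory layout created above):

ls ~/azkaban
# server  executor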
Import the azkaban mysql scripts
Unzip: azkaban-sql-script-2.5.0.tar.gz
Command: tar -zxvf azkaban-sql-script-2.5.0.tar.gz
Import the extracted mysql script into mysql:
Log in to mysql:
mysql> create database azkaban;
mysql> use azkaban;
Database changed
mysql> source /home/hadoop/azkaban-2.5.0/create-all-sql-2.5.0.sql;
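To verify the import, you can list the tables the script created (a quick check; the exact table set depends on the script version):

mysql> show tables;
-- should list azkaban's tables, e.g. projects, project_versions, execution_flows, ...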
Create an SSL configuration
Reference Address: http://docs.codehaus.org/display/JETTY/How+to+configure+SSL
Configure the SSL KeyStore
Generate the keystore, certificate, and truststore with keytool. When prompted for a password, enter the same password as the jetty password parameters in the configuration file below; my password is 123456.
keytool -genkey -keystore keystore -alias jetty-azkaban -keyalg RSA -validity 3560
keytool -export -alias jetty-azkaban -keystore keystore -rfc -file selfsignedcert.cer
keytool -import -alias certificatekey -file selfsignedcert.cer -keystore truststore
Move the keystore and truststore files generated above to the $AK_HOME/web path.
If you need to delete an entry, use the following command:
keytool -delete -alias jetty-azkaban -keystore keystore -storepass azkaban
Alternatively, the keystore can be generated with:
Command: keytool -keystore keystore -alias jetty -genkey -keyalg RSA
After running this command, you will be prompted to enter a password for the generated keystore and the corresponding information. Please remember the password you enter; the prompts are as follows:
Enter keystore password:
Enter the new password again:
What is your first and last name?
[Unknown]:
What is the name of your organizational unit?
[Unknown]:
What is the name of your organization?
[Unknown]:
What is the name of your city or region?
[Unknown]:
What is the name of your state or province?
[Unknown]:
What is the two-letter country code for this unit?
[Unknown]: CN
CN = Unknown, OU = Unknown, O = Unknown, L = Unknown, ST = Unknown, C = CN Correct?
[No]: y
Enter the key password for <jetty>
(press Enter if it is the same as the keystore password):
Re-enter the new password:
After completing the above work, the keystore certificate file will be generated in the current directory. Copy the keystore to the azkaban web server root directory, e.g.: cp keystore /opt/apps/azkaban/azkaban-web-2.5.0/web
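To double-check the keystore before configuring Jetty, you can list its contents (use the password you chose above):

keytool -list -keystore keystore -storepass 123456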
Configuration files
Note: first configure the time zone on the server nodes
1. First generate the time zone configuration file Asia/Shanghai; the interactive command tzselect can be used for this
2. Copy the time zone file, overwriting the system's local time zone configuration
cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
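You can confirm the time zone change took effect with the date command (it should now report CST, UTC+8):

date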
azkaban web server configuration
Enter the conf directory under the azkaban web server installation directory
Modify the azkaban.properties file
Command: vi azkaban.properties
The content is as follows:
# Azkaban Personalization Settings
azkaban.name=TestAzkaban            # the name displayed in the server UI
azkaban.label=My Local Azkaban      # description
azkaban.color=#FF3601               # UI color
azkaban.default.servlet.path=/index
web.resource.dir=web/               # the default root web directory
default.timezone.id=Asia/Shanghai   # the default time zone, changed to Asia/Shanghai (the default is the US time zone)
# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager    # default user permission management class
user.manager.xml.file=conf/azkaban-users.xml      # user configuration; see the specific configuration below
# Loader for projects
executor.global.properties=conf/global.properties # location of the global configuration file
azkaban.project.dir=projects
database.type=mysql                 # database type
mysql.port=3306                     # port number
mysql.host=www.ljt.cos01            # database connection address
mysql.database=azkaban              # database instance name
mysql.user=root                     # database user name
mysql.password=123456               # database password
mysql.numconnections=100            # maximum number of connections
# Velocity dev mode
velocity.dev.mode=false
# Jetty server properties
jetty.maxThreads=25                 # maximum number of threads
jetty.ssl.port=8443                 # Jetty SSL port
jetty.port=8081                     # Jetty port
jetty.keystore=keystore             # SSL keystore file name
jetty.password=123456               # SSL keystore password
jetty.keypassword=123456            # Jetty master password, same as the keystore password
jetty.truststore=keystore           # SSL truststore file name
jetty.trustpassword=123456          # SSL truststore password
# Executor server properties
executor.port=12321                 # executor server port
# E-mail settings
mail.sender=1329331182@qq.com       # address that sends the mail
mail.host=smtp.163.com              # smtp address for sending mail
mail.user=xxxxxxxx                  # name displayed when sending mail
mail.password=****                  # email password
job.failure.email=xxxxxxxx@163.com  # address to email when a job fails
job.success.email=xxxxxxxx@163.com  # address to email when a job succeeds
lockdown.create.projects=false
cache.directory=cache               # cache directory
azkaban executor server configuration
Enter the conf directory under the executor server installation directory and modify azkaban.properties
vi azkaban.properties
# Azkaban
default.timezone.id=Asia/Shanghai   # time zone
# Azkaban JobTypes plugin configuration
azkaban.jobtype.plugin.dir=plugins/jobtypes   # jobtype plugin location
# Loader for projects
executor.global.properties=conf/global.properties
azkaban.project.dir=projects
# Database settings
database.type=mysql                 # database type (currently only mysql is supported)
mysql.port=3306                     # database port number
mysql.host=192.168.20.200           # database IP address
mysql.database=azkaban              # database instance name
mysql.user=root                     # database user name
mysql.password=root                 # database password
mysql.numconnections=100            # maximum number of connections
# Executor server settings
executor.maxThreads=50              # maximum number of threads
executor.port=12321                 # port number (if modified, keep it consistent with the web server)
executor.flow.threads=30            # number of flow threads
User configuration
Enter the azkaban web server's conf directory and modify azkaban-users.xml
vi azkaban-users.xml — add an administrator user, as in the sketch below
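A minimal sketch of the result (the user name and password are placeholders; keep the entries already present in the file):

<azkaban-users>
  <user username="admin" password="admin" roles="admin"/>
  <role name="admin" permissions="ADMIN"/>
</azkaban-users>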
Startup
web server
Execute the startup command in the azkaban web server directory
bin/azkaban-web-start.sh
Note: run it from the web server root directory
Or start it in the background:
nohup bin/azkaban-web-start.sh 1>/tmp/azstd.out 2>/tmp/azerr.out &
executor server
Execute the startup command in the executor server directory
bin/azkaban-executor-start.sh
Note: it must be run from the executor server root directory
After starting up, you can access the azkaban service by entering https://<server IP>:8443 in your browser (Google Chrome is recommended); enter your newly created user name and password on the login page and click Login.
https://www.ljt.cos01:8443
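You can also check from the shell that the web server is listening (-k skips verification of the self-signed certificate):

curl -k https://www.ljt.cos01:8443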
Installing Azkaban plugins
Azkaban is designed to separate core functionality from extensions, so it is convenient to install useful plugins for Azkaban. The main types of plugins currently available are:
Page visualization plugins, such as HDFS Viewer and Reportal
Trigger plugins
User management plugins, which let you customize user login authentication
Alarm plugins
Next, we look at the installation and configuration of each plugin; some plugins are installed on the web server side, others need to be installed on the executor side.
1. HDFS Viewer plugin
This plugin is installed on the web server side.
(1) Modify the configuration
Unpack the plugin into the $AK_HOME/plugins/viewer/hdfs path, where the web server looks for new plugins. We then need to delete $AK_HOME/plugins/viewer/hdfs/extlib/hadoop-core-1.0.4.jar, otherwise a ClassNotFound error will be reported; after deletion, the plugin's extlib path is empty.
Copy the required hadoop jars into $AK_HOME/extlib, then add the $HADOOP_HOME/etc/hadoop/core-site.xml file into hadoop-common-2.6.0.jar: under $AK_HOME/extlib, run jar -uf hadoop-common-2.6.0.jar core-site.xml to add the core-site.xml file to the jar package.
The final structure of hadoop-common-2.6.0.jar is shown in the figure.
Start the web server, and you can see the HDFS file directory on the page.
It is also possible to open a human-readable file on HDFS directly.
The htrace-core-3.0.4.jar package mentioned above is also necessary; without it, a java.lang.NoClassDefFoundError: org/htrace/Trace error will be reported.
2. Azkaban Jobtypes plugin
This plugin is installed on the executor server side.
Azkaban can run simple jobs out of the box, such as Unix command-line operations and simple Java programs. If you need to run HadoopJava, Pig, Hive, and other job types, you need to install this plugin.
Installing the Jobtypes plugin requires Hadoop and Hive to be installed in advance. My Hadoop version is 2.6.0, installed at /usr/local/hadoop; my Hive version is 0.13.1, installed at /usr/local/hive.
(1) Decompress
Upload the previously downloaded azkaban-jobtype-2.5.0.tar.gz to m001 and unpack it to the $AK_HOME/plugins/jobtypes path.
(2) Modify the configuration: in $AK_HOME/conf/azkaban.properties, add a new line at the end:
azkaban.jobtype.plugin.dir = plugins / jobtypes
This parameter tells Azkaban where to find the jobtype plugins. Then, in the $AK_HOME/plugins/jobtypes/common.properties file, set the following parameters:
hadoop.home=/usr/local/hadoop
hive.home=/usr/local/hive
pig.home=
azkaban.should.proxy=
In $AK_HOME/plugins/jobtypes/commonprivate.properties, set hadoop.home and hive.home in the same way, and set the global classpath:
jobtype.global.classpath=${hadoop.home}/share/hadoop/common/*,${hadoop.home}/share/hadoop/hdfs/*,${hadoop.home}/share/hadoop/yarn/*,${hadoop.home}/share/hadoop/mapreduce/*
(3) Edit the hive jobtype plugin configuration so that the hive classpath includes ${hive.home}/conf and ${hive.home}/lib/*.
Finally, restart the Azkaban Executor Server for the new plugins to take effect.
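With the plugin active, a Hive job can then be declared with the hive jobtype. A sketch, assuming a query file daily_report.q is bundled with the flow (exact parameter names can vary across plugin versions):

type=hive
user.to.proxy=azkaban
azk.hive.action=execute.query
hive.query.file=daily_report.q

If the hive jobtype is unavailable, the built-in command jobtype can always invoke Hive directly, e.g. command=hive -f daily_report.q.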