Description
Tasks performed in Hadoop sometimes require multiple MapReduce jobs to be chained together to achieve a goal. In the Hadoop ecosystem, Oozie allows us to combine multiple MapReduce jobs into a single logical unit of work in order to accomplish larger tasks.
Principle
Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store the following:
Workflow definitions
Currently running workflow instances, including their state and variables
An Oozie workflow is a set of actions (for example, Hadoop MapReduce jobs, Pig jobs, and so on) arranged in a control-dependency DAG (directed acyclic graph), which specifies the order in which the actions are executed. This graph is described in hPDL (an XML process definition language).
hPDL is a very concise language that uses only a handful of control and action nodes. Control nodes define the flow of execution and include the start and end points of the workflow (the start, end, and kill nodes) and the mechanisms that control its execution path (the decision, fork, and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation or processing task. Oozie provides support for the following action types: Hadoop map-reduce, Hadoop file system, Pig, Java, and Oozie sub-workflow (the ssh action was removed as of Oozie schema 0.2).
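As an illustration, a minimal hPDL workflow definition might look like the following sketch. The application name, action name, and property values here are hypothetical, not taken from the installation that follows:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
    <!-- Control node: entry point of the workflow -->
    <start to="demo-mr"/>
    <!-- Action node: triggers a MapReduce computation -->
    <action name="demo-mr">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputdir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <!-- Control node: abnormal termination -->
    <kill name="fail">
        <message>MR job failed</message>
    </kill>
    <!-- Control node: normal termination -->
    <end name="end"/>
</workflow-app>
```

Note how ${jobTracker}, ${nameNode}, and ${inputdir} parameterize the definition; their values are supplied in a properties file when the workflow job is submitted.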
Computation and processing tasks triggered by action nodes are not run by Oozie itself; they are executed by Hadoop's MapReduce framework. This approach lets Oozie rely on Hadoop's existing mechanisms for load balancing and failover. Most of these tasks are executed asynchronously (file system actions are the exception; they are synchronous). This means that for most computation or processing tasks triggered by a workflow, the workflow must wait until the task is finished before it can transition to its next node. Oozie has two different ways to detect whether a task has completed: callbacks and polling. When Oozie starts a task, it provides the task with a unique callback URL, and the task sends a notification to that URL when it completes. If the task cannot invoke the callback URL (for any reason, such as a transient network failure), or if the type of task cannot invoke the callback URL on completion, Oozie has a polling mechanism to make sure it still detects the task's completion.
Oozie workflows can be parameterized (using variables like ${inputdir} in the workflow definition). When submitting a workflow job, we must supply the parameter values. If parameterized properly (say, with different output directories), several instances of the same workflow can run concurrently.
Some workflows are triggered on demand, but in most cases they need to run on certain schedules and/or on data availability and/or on external events. The Oozie Coordinator system allows users to define workflow execution plans based on these parameters. The Oozie coordinator lets us model workflow execution triggers as predicates, which can reference data, time, and/or external events. A workflow job is started when the predicate is satisfied.
Often we also need to chain workflows that run on schedules with different intervals, where the output of several successively run workflows becomes the input of the next workflow. Workflows chained together like this are referred to as a data application pipeline. The Oozie Coordinator supports the creation of such pipelines.
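A coordinator definition that triggers a workflow once a day might be sketched as follows. The application name, date range, and HDFS path are hypothetical:

```xml
<coordinator-app name="demo-coord" frequency="${coord:days(1)}"
                 start="2016-01-01T00:00Z" end="2016-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS directory containing the workflow.xml to run -->
            <app-path>hdfs://master:9000/user/hadoop/apps/demo-wf</app-path>
        </workflow>
    </action>
</coordinator-app>
```

The frequency attribute uses an Oozie EL function; predicates on data availability would be expressed with additional datasets and input-events elements.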
Installation
Installation environment: Hadoop 2.6.0, Maven 3.3.9, Pig 0.15.0, JDK 1.8, MySQL
1. Unzip the source package to the app directory:
tar -zxf oozie-4.2.0.tar.gz -C app/
Compile:
mvn clean package assembly:single -P hadoop-2 -DskipTests
2. Unzip the compiled distribution:
tar -zxf oozie-4.2.0-distro.tar.gz -C ~/app/oozie/
3. Modify the HDFS configuration
Edit core-site.xml under the Hadoop configuration directory:
<property>
<name>hadoop.proxyuser.[user].hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.[user].groups</name>
<value>*</value>
</property>
[user] needs to be replaced with the user that starts the Oozie Tomcat.
To make the configuration take effect without restarting the Hadoop cluster:
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration
4. Configure Oozie
a. Create a new libext directory under the oozie-4.2.0 directory and copy ext-2.2.zip into it; also copy the Hadoop-related jar packages, including the MySQL JDBC jar, into this directory:
cp $HADOOP_HOME/share/hadoop/*/*.jar libext/
cp $HADOOP_HOME/share/hadoop/*/lib/*.jar libext/
Remove the jar packages that conflict between Hadoop and Tomcat:
mv servlet-api-2.5.jar servlet-api-2.5.jar.bak
mv jsp-api-2.1.jar jsp-api-2.1.jar.bak
mv jasper-compiler-5.5.23.jar jasper-compiler-5.5.23.jar.bak
mv jasper-runtime-5.5.23.jar jasper-runtime-5.5.23.jar.bak
b. Configure the database connection in conf/oozie-site.xml:
<property>
<name>oozie.service.JPAService.create.db.schema</name>
<value>true</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.driver</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.url</name>
<value>jdbc:mysql://node4:3306/oozie?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.username</name>
<value>root</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.password</name>
<value>root</value>
</property>
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/usr/hadoop/hadoop-2.6.0/etc/hadoop</value>
</property>
c. Initialize before the first start
a. Build the war package:
bin/oozie-setup.sh prepare-war
b. Initialize the database:
bin/ooziedb.sh create -sqlfile oozie.sql -run
c. Modify the oozie-4.2.0/oozie-server/conf/server.xml file and comment out the following line:
<!-- <Listener className="org.apache.catalina.mbeans.ServerLifecycleListener" /> -->
d. Upload the sharelib jar packages to HDFS:
bin/oozie-setup.sh sharelib create -fs hdfs://master:9000
5. Start
bin/oozied.sh start
Example
MR task flow
1.
a. Extract oozie-examples.tar.gz into the oozie-4.2.0 directory
b. Edit examples/apps/map-reduce/job.properties:
vim examples/apps/map-reduce/job.properties
nameNode=hdfs://master:9000
jobTracker=master:8032
queueName=default
examplesRoot=examples
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/map-reduce
outputDir=map-reduce
c. Edit examples/apps/map-reduce/workflow.xml:
vim examples/apps/map-reduce/workflow.xml
<property>
<name>mapred.map.tasks</name>
<value>2</value>
</property>
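For context, a property like this lives inside the configuration element of the map-reduce action in workflow.xml. A trimmed sketch of the surrounding structure, with the action name and other properties abbreviated from what the stock example may contain:

```xml
<action name="mr-node">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.map.tasks</name>
                <value>2</value>
            </property>
            <!-- further mapred.* properties from the stock example go here -->
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>
```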
d. Submit the job:
oozie job -oozie http://master:11000/oozie -config examples/apps/map-reduce/job.properties -run
e. View the job in the web UIs:
http://master:11000/oozie/
http://master:8088/cluster
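Besides the web UIs, the job can also be inspected from the command line. A sketch, assuming the server address used in this guide; `<job-id>` stands for the id printed by the -run command above:

```shell
# Query status and progress of the submitted workflow job
oozie job -oozie http://master:11000/oozie -info <job-id>
# Fetch the job's log
oozie job -oozie http://master:11000/oozie -log <job-id>
```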
2.
Configuring Oozie for Hadoop