Authors: Boris Lublinsky and Michael Segel, translated by Surtani. Published August 18, 2011.
Tasks performed in Hadoop sometimes require multiple map/reduce jobs to be chained together in order to achieve a goal. [1] Within the Hadoop ecosystem there is a relatively new component, Oozie [2], which lets us combine multiple map/reduce jobs into a single logical unit of work, accomplishing the larger task. In this article we introduce Oozie and some of the ways it can be used.
What is Oozie?
Oozie is a Java Web application that runs in a Java servlet container (i.e. Tomcat) and uses a database to store the following:
- Workflow definition
- Currently running workflow instances, including the status and variables of the instance
An Oozie workflow is a collection of actions (i.e. Hadoop map/reduce jobs, Pig jobs, and so on) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions must be executed. We use hPDL (an XML process definition language) to describe this graph.
hPDL is a fairly compact language, using a limited number of flow-control and action nodes. Control nodes define the flow of execution and include the start and end points of a workflow (start, end, and fail nodes), as well as mechanisms to control the workflow execution path (decision, fork, and join nodes). Action nodes are the mechanisms by which a workflow triggers the execution of a computation or processing task. Oozie provides support for the following types of actions: Hadoop map-reduce, Hadoop file system, Pig, Java, and Oozie sub-workflows (SSH actions were removed as of Oozie schema 0.2).
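As an illustration of a control node, the fragment below is a minimal, hypothetical sketch of a decision node that routes execution based on a predicate; the node names and the ${inputDir} parameter are our own examples, not part of the workflow built later in this article:

<decision name="checkInput">
    <switch>
        <!-- fs:exists is a built-in Oozie EL function; take the "ingest" path only if input is present -->
        <case to="ingest">${fs:exists(inputDir)}</case>
        <default to="end"/>
    </switch>
</decision>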
All computation and processing tasks triggered by an action node are not executed by Oozie itself; they are executed by Hadoop's map/reduce framework. This approach allows Oozie to leverage existing Hadoop machinery for load balancing and failover. The majority of these tasks are executed asynchronously (the only exception is the file system action, which is synchronous). This means that for most types of computation or processing tasks triggered by a workflow, the workflow must wait until the task completes before transitioning to the next node of the workflow. Oozie can detect completion of a task in two different ways: callbacks and polling. When Oozie starts a task, it provides a unique callback URL to the task; the task sends a notification to that URL when it is complete. For cases where the task fails to invoke the callback URL (for any reason, for example a transient network failure), or when the type of task cannot invoke the callback URL upon completion, Oozie has a mechanism to poll the task for completion.
Oozie workflows can be parameterized (using variables such as ${inputDir} in the workflow definition). When submitting a workflow job, values for the parameters must be provided. If properly parameterized (i.e. using different output directories), several identical workflow jobs can run concurrently.
Some workflows are invoked on demand, but in the majority of cases it is necessary to run them based on regular time intervals and/or data availability and/or external events. The Oozie Coordinator system allows the user to define workflow execution schedules based on these parameters. The Oozie Coordinator lets us model workflow execution triggers in the form of predicates, which can refer to data, time, and/or external events. The workflow job is started once the predicate is satisfied.
It is also often necessary to connect workflow jobs that run regularly but at different time intervals: the outputs of several runs of one workflow become the input to the next workflow. Chaining such workflows together is referred to as a data application pipeline. The Oozie Coordinator supports the creation of such data application pipelines.
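As a sketch of what a coordinator definition looks like, the hypothetical fragment below runs a workflow once a day, as soon as that day's input dataset appears in HDFS; it combines a time-based frequency with a data-availability predicate. All names, paths, and dates here are our own illustrative assumptions, not part of the original deployment:

<coordinator-app name="daily-ingestion" frequency="${coord:days(1)}"
                 start="2011-08-01T00:00Z" end="2011-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <!-- one directory per day; the coordinator waits until the instance exists -->
        <dataset name="rawDrive" frequency="${coord:days(1)}"
                 initial-instance="2011-08-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://sachicn001:8020/data/raw/${YEAR}-${MONTH}-${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="rawDrive">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs://sachicn001:8020/user/blublins/workflows/ipsIngestion</app-path>
        </workflow>
    </action>
</coordinator-app>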
Installing Oozie
Oozie can be installed on an existing Hadoop system, and is available as a tarball, an RPM, or a Debian package. Our Hadoop deployment is Cloudera's CDH3, which already contains Oozie, so we simply used yum to pull it down and performed the installation on an edge node [1]. The Oozie distribution contains two components: oozie-client and oozie-server. Depending on the size of the cluster, the two components can be installed on the same edge server or on different machines. The Oozie server contains the components for triggering and controlling jobs, while the client contains the components that allow a user to trigger Oozie operations and communicate with the Oozie server.
For more detail on the installation process with the Cloudera distribution, see the Cloudera site [2].
Note: in addition to the steps described in the installation process, we recommend adding the following shell variable, OOZIE_URL, to .login, .kshrc, or your shell startup file as appropriate:
export OOZIE_URL=http://localhost:11000/oozie
Simple example
To show you how to use Oozie, let's build a simple example. We have two map/reduce jobs [3]: one that performs an initial ingestion of the data, and one that merges data of a given type. The actual ingestion needs to execute the initial ingestion and then merge two types of data, Lidar and Multicam. To automate this process we need to create a simple Oozie workflow (Listing 1).
<!-- Copyright (c) Navteq Inc. All rights reserved.
     NGMB IPS ingestor Oozie script -->
<workflow-app xmlns="uri:oozie:workflow:0.1" name="NGMB-IPS-ingestion">
    <start to="ingestor"/>
    <action name="ingestor">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <main-class>com.navteq.assetmgmt.mapreduce.ips.IPSLoader</main-class>
            <java-opts>-Xmx2048m</java-opts>
            <arg>${driveID}</arg>
        </java>
        <ok to="merging"/>
        <error to="fail"/>
    </action>
    <fork name="merging">
        <path start="mergeLidar"/>
        <path start="mergeSignage"/>
    </fork>
    <action name="mergeLidar">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <main-class>com.navteq.assetmgmt.hdfs.merge.MergerLoader</main-class>
            <java-opts>-Xmx2048m</java-opts>
            <arg>-drive</arg>
            <arg>${driveID}</arg>
            <arg>-type</arg>
            <arg>Lidar</arg>
            <arg>-chunk</arg>
            <arg>${lidarChunk}</arg>
        </java>
        <ok to="completed"/>
        <error to="fail"/>
    </action>
    <action name="mergeSignage">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <main-class>com.navteq.assetmgmt.hdfs.merge.MergerLoader</main-class>
            <java-opts>-Xmx2048m</java-opts>
            <arg>-drive</arg>
            <arg>${driveID}</arg>
            <arg>-type</arg>
            <arg>MultiCam</arg>
            <arg>-chunk</arg>
            <arg>${signageChunk}</arg>
        </java>
        <ok to="completed"/>
        <error to="fail"/>
    </action>
    <join name="completed" to="end"/>
    <kill name="fail">
        <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
Listing 1: A simple Oozie workflow
This workflow defines three actions: ingestor, mergeLidar, and mergeSignage, each of which is implemented as a map/reduce job [4]. The workflow begins at the start node, which hands control to the ingestor action. Once the ingestor step completes, the fork control node is invoked, which starts mergeLidar and mergeSignage in parallel [5]. When both actions complete, the join control node is invoked [6]. Once the join node completes successfully, control passes to the end node, which ends the process.
Once the workflow is created, it needs to be deployed correctly. A typical Oozie deployment is an HDFS directory containing workflow.xml (Listing 1), config-default.xml, and a lib subdirectory containing the jar files of the classes used by the workflow actions.
Figure 1: Oozie deployment
The config-default.xml file is optional and typically contains workflow parameters common to all instances of the workflow. A simple example of config-default.xml is shown in Listing 2.
<configuration>
    <property>
        <name>jobTracker</name>
        <value>sachicn003:2010</value>
    </property>
    <property>
        <name>nameNode</name>
        <value>hdfs://sachicn001:8020</value>
    </property>
    <property>
        <name>queueName</name>
        <value>default</value>
    </property>
</configuration>
Listing 2: config-default.xml
Once the workflow is deployed, Oozie provides a command-line tool [5] for submitting, starting, and manipulating it. This tool typically runs on an edge node of the Hadoop cluster [7] and requires a job properties file (see the sidebar "Configuring workflow properties" below), shown in Listing 3.
oozie.wf.application.path=hdfs://sachicn001:8020/user/blublins/workflows/ipsIngestion
jobTracker=sachicn003:2010
nameNode=hdfs://sachicn001:8020
Listing 3: Job properties file
With the job properties in place, we can run the Oozie workflow using the command in Listing 4.
oozie job -oozie http://sachidn002.hq.navteq.com:11000/oozie/ -D driveID=729-pp00002-2011-02-08-09-59-34 -D lidarChunk=4 -D signageChunk=20 -config job.properties -run
Listing 4: Command for running a workflow
Configuring workflow properties
There is some overlap among config-default.xml, the job properties file, and the job arguments that can be passed to Oozie as part of the command-line invocation. Although the documentation is not entirely clear on when to use which, the overall recommendations are as follows:
- Use config-default.xml to define parameters that never change for a given workflow.
- Use job properties for parameters that are common to a given deployment of the workflow.
- Use command-line arguments for parameters that are specific to an individual workflow invocation.
Oozie resolves these parameters in the following order:
- Use all of the parameters from the command-line invocation
- For any parameters still unresolved, use the job configuration
- Once all other options are exhausted, fall back to config-default.xml
We can use the Oozie console (Figure 2) to observe the process and results of workflow execution.
Figure 2: Oozie console
We can also use the Oozie console to get details of the execution of individual jobs, for example the job log [8] (Figure 3).
Figure 3: Oozie console - job log
Programmatic Workflow Invocation
Although the command-line interface described above can be used to invoke Oozie manually, it is sometimes advantageous to invoke Oozie programmatically, for example when an Oozie workflow is part of a specific application or a larger enterprise process. Such programmatic invocation can be implemented using the Oozie Web Services API [6] or the Oozie Java client API [7]. Listing 5 shows a very simple Oozie Java client that triggers the process described above.
package com.navteq.assetmgmt.oozie;

import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;
import org.apache.oozie.client.WorkflowJob;
import org.apache.oozie.client.WorkflowJob.Status;

public class WorkflowClient {

    private static String OOZIE_URL = "http://sachidn002.hq.navteq.com:11000/oozie/";
    private static String JOB_PATH = "hdfs://sachicn001:8020/user/blublins/workflows/ipsIngestion";
    private static String JOB_TRACKER = "sachicn003:2010";
    private static String NAME_NODE = "hdfs://sachicn001:8020";

    OozieClient wc = null;

    public WorkflowClient(String url) {
        wc = new OozieClient(url);
    }

    public String startJob(String wfDefinition, List<WorkflowParameter> wfParameters)
            throws OozieClientException {

        // create a workflow job configuration and set the workflow application path
        Properties conf = wc.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, wfDefinition);

        // setting workflow parameters
        conf.setProperty("jobTracker", JOB_TRACKER);
        conf.setProperty("nameNode", NAME_NODE);
        if ((wfParameters != null) && (wfParameters.size() > 0)) {
            for (WorkflowParameter parameter : wfParameters)
                conf.setProperty(parameter.getName(), parameter.getValue());
        }
        // submit and start the workflow job
        return wc.run(conf);
    }

    public Status getJobStatus(String jobID) throws OozieClientException {
        WorkflowJob job = wc.getJobInfo(jobID);
        return job.getStatus();
    }

    public static void main(String[] args) throws OozieClientException, InterruptedException {
        // Create client
        WorkflowClient client = new WorkflowClient(OOZIE_URL);
        // Create parameters
        List<WorkflowParameter> wfParameters = new LinkedList<WorkflowParameter>();
        WorkflowParameter drive = new WorkflowParameter("driveID", "729-pp00004-2010-09-01-09-46");
        WorkflowParameter lidar = new WorkflowParameter("lidarChunk", "4");
        WorkflowParameter signage = new WorkflowParameter("signageChunk", "4");
        wfParameters.add(drive);
        wfParameters.add(lidar);
        wfParameters.add(signage);
        // Start Oozie
        String jobId = client.startJob(JOB_PATH, wfParameters);
        Status status = client.getJobStatus(jobId);
        if (status == Status.RUNNING)
            System.out.println("Workflow job running");
        else
            System.out.println("Problem starting Workflow job");
    }
}
Listing 5: Simple Oozie Java client
Here we first initialize the workflow client with the Oozie server URL. Once initialization is complete, we can use the client to submit and start a job (the startJob method), get the status of a running job (the getJobStatus method), and so on; a caller that needs to wait for completion can simply poll getJobStatus until the status is no longer RUNNING.
Building Java actions: passing parameters from Java actions to workflows
In the previous example we showed how to pass parameters to a Java node using the <arg> tag. Because Java nodes are the primary approach for introducing custom computations into Oozie, it is equally important to be able to pass data from a Java node back to Oozie.
According to the documentation for Java nodes [3], we can use the "capture-output" element to pass values generated by a Java node back to the Oozie context. Other steps of the workflow can then access these values through EL functions. The return values need to be written out in Java properties file format, and the name of this file can be obtained from the system property named by the constant JavaMainMapper.OOZIE_JAVA_MAIN_CAPTURE_OUTPUT_FILE. Listing 6 is a simple example demonstrating how to do this.
package com.navteq.oozie;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Properties;

public class GenerateLookupDirs {

    public static final long dayMillis = 1000 * 60 * 60 * 24;
    private static final String OOZIE_ACTION_OUTPUT_PROPERTIES = "oozie.action.output.properties";

    public static void main(String[] args) throws Exception {
        Calendar curDate = new GregorianCalendar();
        int year, month, date;
        String propKey, propVal;

        String oozieProp = System.getProperty(OOZIE_ACTION_OUTPUT_PROPERTIES);
        if (oozieProp != null) {
            File propFile = new File(oozieProp);
            Properties props = new Properties();

            // build one property per day: today plus the seven preceding days
            for (int i = 0; i < 8; ++i) {
                year = curDate.get(Calendar.YEAR);
                month = curDate.get(Calendar.MONTH) + 1;
                date = curDate.get(Calendar.DATE);
                propKey = "dir" + i;
                propVal = year + "-" +
                        (month < 10 ? "0" + month : month) + "-" +
                        (date < 10 ? "0" + date : date);
                props.setProperty(propKey, propVal);
                curDate.setTimeInMillis(curDate.getTimeInMillis() - dayMillis);
            }
            OutputStream os = new FileOutputStream(propFile);
            props.store(os, "");
            os.close();
        } else
            throw new RuntimeException(OOZIE_ACTION_OUTPUT_PROPERTIES + " system property not defined");
    }
}
Listing 6: Passing parameters to Oozie
In this example we assume that there is a directory in HDFS for every date. The class first obtains the current date, then computes the seven preceding dates (eight dates in total, including today), and passes the corresponding directory names back to Oozie.
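On the workflow side, the Java action that runs this class has to declare the capture-output element so that Oozie collects the properties file; a downstream node can then read the values through the wf:actionData EL function. The fragment below is a hypothetical sketch (the node names and the downstream reference are ours, not taken from the original workflow):

<action name="generateLookupDirs">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>com.navteq.oozie.GenerateLookupDirs</main-class>
        <!-- tells Oozie to collect the properties file written by the main class -->
        <capture-output/>
    </java>
    <ok to="processDirs"/>
    <error to="fail"/>
</action>
<!-- a later node can then reference, e.g., ${wf:actionData('generateLookupDirs')['dir0']} -->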
Conclusion
In this article we described Oozie, a workflow engine for Hadoop, and presented a simple example of its use. In the next article we will look at a more complex example, which will allow us to discuss more of Oozie's features.
Thanks
Many thanks to our Navteq colleague Gregory Titievsky, who provided us with some of the examples.
About the authors
Boris Lublinsky is principal architect at Navteq Corporation, where his work involves defining the architecture vision for large-scale data management and processing and SOA, as well as implementing the architecture of various Navteq projects. He is also an SOA editor for InfoQ and a participant in OASIS's SOA RA working group. Boris is an author and frequent speaker; his most recent book is Applied SOA.
Michael Segel has spent the past 20+ years working with clients to identify and solve their business problems. Michael has worked in multiple industries and in many roles. He is an independent consultant who always looks forward to solving challenging problems. Michael holds a software engineering degree from Ohio State University.
[1] An edge node is a computer with the Hadoop libraries installed, but which is not part of the actual cluster. It is used by applications that connect to the cluster, and it hosts auxiliary services and end-user applications that have direct access to the cluster.
[2] See the Oozie installation link.
[3] The details of these jobs are not relevant to this article, so they are not described here.
[4] A map/reduce job can be implemented in Oozie in two different ways. The first is as a true map-reduce action [2], where you specify the mapper and reducer classes along with their configuration; the second is as a Java action [3], where you use a class that starts the map/reduce job through the Hadoop APIs. Because all of our Java main classes use the Hadoop APIs and also implement some additional functionality, we chose the second approach. A sketch of the first approach appears below.
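For comparison, here is a minimal, hypothetical sketch of the first approach; the mapper and reducer class names are invented for illustration, and a real action would also need input/output path properties:

<action name="ingestor">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- old-API Hadoop configuration keys; class names below are illustrative only -->
            <property>
                <name>mapred.mapper.class</name>
                <value>com.navteq.assetmgmt.mapreduce.ips.IngestMapper</value>
            </property>
            <property>
                <name>mapred.reducer.class</name>
                <value>com.navteq.assetmgmt.mapreduce.ips.IngestReducer</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="merging"/>
    <error to="fail"/>
</action>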
[5] Oozie ensures that the two actions are submitted to the job tracker in parallel. The actual parallelism of the execution is outside of Oozie's control and depends on the requirements of the jobs, the capacity of the cluster, and the scheduler used by the map/reduce deployment.
[6] The role of the join action is to synchronize the multiple threads of concurrent execution created by the fork action. If all of the executing threads created by the fork are successful, the join action waits until every one of them has completed. If at least one of the threads fails, the kill node "kills" the remaining running threads.
[7] This node does not need to be a computer with Oozie installed.
[8] Oozie's job log contains the details of the workflow execution; to see the details of the actions' execution, we need to switch to Hadoop's map/reduce administration page.
Read the original English article: Introduction to Oozie.