Part 3: Using Oozie to Periodically and Automatically Execute ETL
1. Oozie Introduction
(1) What is Oozie?
Oozie is a scalable, extensible, and reliable workflow scheduling system for managing Hadoop jobs. Its workflows are directed acyclic graphs (DAGs) composed of a series of actions, and a coordinator job is an Oozie workflow job triggered periodically at a given time frequency. The job types supported by Oozie are Java map-reduce, streaming map-reduce, Pig, Hive, Sqoop, and DistCp, as well as system-specific jobs such as Java programs and shell scripts.
The first version of Oozie was a server based on a workflow engine that ran workflow jobs made up of Hadoop map/reduce and Pig actions. The second version was a server based on a coordinator engine that triggers workflow execution by time and data; it can run workflows repeatedly based on time (for example, hourly) or on data availability (for example, waiting until input data is complete). The third version was a server based on a bundle engine, which provides a higher level of abstraction by batch-processing a set of coordinator applications. At the bundle level the user can start, stop, suspend, resume, and rerun coordinator jobs, which makes operational control much easier.
(2) Why Oozie is needed
- Tasks performed in Hadoop sometimes require multiple map/reduce jobs to be chained together, or require multiple jobs to be processed in parallel. Oozie can combine multiple map/reduce jobs into a single logical unit of work to accomplish larger tasks.
- From a scheduling point of view, if multiple workflow jobs are driven by crontab, a large number of scripts may be needed to control the execution timing of each job; such scripts are hard to maintain and inconvenient to monitor. Against this background, Oozie introduces the concept of the coordinator: each workflow job runs as an action, equivalent to an execution node in a workflow definition, so that multiple workflow jobs can be composed into a so-called coordinator job, for which the trigger time and frequency can be specified, along with data sets, concurrency, and so on.
(3) Oozie architecture (excerpt from http://www.infoq.com/cn/articles/introductionOozie/)
The architecture of Oozie is shown in the figure.
Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store the following:
- Workflow definition
- Currently running workflow instances, including their state and variables
An Oozie workflow is a set of actions (for example Hadoop map/reduce jobs, Pig jobs, and so on) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions are executed. This graph is described in hPDL (an XML process definition language).
hPDL is a very concise language that uses only a handful of flow-control and action nodes. Control nodes define the flow of execution and include the start and end points of a workflow (the start, end, and fail nodes) and the mechanisms that control the workflow's execution path (the decision, fork, and join nodes). Action nodes are the mechanism through which a workflow triggers the execution of a computation or processing task. Oozie provides support for the following action types: Hadoop map-reduce, Hadoop file system, Pig, Java, and Oozie sub-workflows (SSH actions have been removed since Oozie schema 0.2).
None of the computation and processing tasks triggered by action nodes are executed by Oozie itself; they are executed by Hadoop's map/reduce framework. This approach lets Oozie rely on the existing Hadoop mechanisms for load balancing and failover. Most of these tasks are executed asynchronously (the only exception is the file system action, which is synchronous). This means that for most types of computation or processing tasks triggered by a workflow, the workflow action cannot transition to the next node until the task has finished. Oozie can detect the completion of a computation or processing task in two different ways: callbacks and polling. When Oozie starts a task, it provides a unique callback URL for it, and the task sends a notification to that URL when it finishes. When the task cannot trigger the callback URL (for whatever reason, such as a transient network failure), or when the type of task cannot trigger the callback URL on completion, Oozie has a polling mechanism to make sure it learns that the task has completed.
Oozie workflows can be parameterized (using variables such as ${inputdir} in the workflow definition). The parameter values must be supplied when the workflow job is submitted. With suitable parameterization (for example, using different output directories), several identical workflow jobs can run concurrently.
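As a minimal sketch of this parameterization (the directory and property name below are hypothetical and not part of the ETL example that follows), the value referenced as ${inputdir} can be supplied through the properties file passed at submission time:

# job.properties supplies the value referenced as ${inputdir} in workflow.xml
echo "inputdir=/user/root/input/2016-07-11" >> job.properties
# submit the workflow job with that configuration (server URL as used later in this article)
oozie job -oozie http://cdh2:11000/oozie -config job.properties -run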
Some workflows are triggered on demand, but in most cases they need to run based on certain time periods and/or data availability and/or external events. The Oozie coordinator system allows users to define workflow execution schedules based on these parameters. The Oozie coordinator lets us model the workflow execution trigger as a predicate, which can refer to data, time, and/or external events. The workflow job is started when the predicate is satisfied.
Often we also need to connect workflow jobs that run on schedules but with different intervals. The outputs of several workflows that run in sequence become the input of the next workflow. Chaining these workflows together lets the system treat them as a pipeline of data applications. The Oozie coordinator supports the creation of such data application pipelines.
(4) Oozie in CDH 5.7.0
In CDH 5.7.0 the Oozie version is 4.1.0, and its metadata is stored in MySQL. For the Oozie properties in CDH 5.7.0, refer to the following link:
https://www.cloudera.com/documentation/enterprise/latest/topics/cm_props_cdh570_oozie.html
2. Build a regular load workflow
(1) Modify resource configuration
You need to increase the values of the following two parameters:
yarn.nodemanager.resource.memory-mb = 2000
yarn.scheduler.maximum-allocation-mb = 2000
Otherwise, executing the workflow job will report an error similar to the following:
org.apache.oozie.action.ActionExecutorException: JA009: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=1500
To do this, modify the relevant parameters from the CDH Web console, save the changes, and restart the cluster.
The yarn.nodemanager.resource.memory-mb parameter is in the NodeManager scope of the YARN service, as shown in the figure.
The yarn.scheduler.maximum-allocation-mb parameter is in the ResourceManager scope of the YARN service, as shown in the figure.
Restart the cluster from the CDH web console, as shown in the figure.
(2) Enable Oozie Web Console
The Oozie web console is disabled by default; it needs to be enabled so that the execution of Oozie jobs can be monitored conveniently later. The "Enable Oozie server Web Console" parameter is in the main scope of the Oozie service, as shown in the figure.
The specific steps are:
- Download and install the ext-2.2 library (see the sketch after this list).
- Modify the relevant parameters from the CDH Web console, save the changes, and restart the Oozie service.
For detailed steps, refer to the following link: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/admin_oozie_console.html
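A rough sketch of the ext-2.2 step on the Oozie server host; the archive URL and target directory below are assumptions based on the Cloudera documentation linked above, so verify them against that page before running:

# download the ext-2.2 JavaScript library (URL assumed; check the Cloudera docs)
wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip
# extract it to the Oozie library directory on the Oozie server host (location assumed)
unzip ext-2.2.zip -d /var/lib/oozie/
chown -R oozie:oozie /var/lib/oozie/ext-2.2
# then enable "Enable Oozie server web console" in the CDH web console and restart the Oozie service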
(3) Start the Sqoop shared metastore service
The regular load workflow needs Oozie to invoke Sqoop jobs, which requires the Sqoop shared metastore to be started as follows:
sqoop metastore > /tmp/sqoop_metastore.log 2>&1 &
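The Sqoop metastore listens on port 16000 by default, which matches the jdbc:hsqldb:hsql://cdh2:16000/sqoop URL used below. A quick optional check (not part of the original procedure) that the service is up after starting it:

# confirm the metastore process is running and listening on its default port
ps -ef | grep [s]qoop
netstat -ant | grep 16000
# the service log was redirected to /tmp/sqoop_metastore.log in the command above
tail /tmp/sqoop_metastore.log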
For issues with Oozie failing to run Sqoop jobs, refer to the following link: http://www.lamborryan.com/oozie-sqoop-fail/
(4) Connect to the metastore and rebuild the Sqoop job
The Sqoop job created earlier did not store its metadata in the shared metastore, so it needs to be rebuilt with the following commands.
sqoop job --show myjob_incremental_import | grep incremental.last.value
sqoop job --delete myjob_incremental_import
sqoop job --meta-connect jdbc:hsqldb:hsql://cdh2:16000/sqoop --create myjob_incremental_import -- import --connect "jdbc:mysql://cdh1:3306/source?useSSL=false&user=root&password=mypassword" --table sales_order --columns "order_number, customer_number, product_code, order_date, entry_date, order_amount" --hive-import --hive-table rds.sales_order --incremental append --check-column order_number --last-value 116
Here --last-value is the value recorded after the last ETL execution; it can be seen with the first command above.
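To confirm that the rebuilt job now lives in the shared metastore, it can be queried through the --meta-connect option (an optional verification step, not in the original procedure):

# list the jobs stored in the shared metastore; myjob_incremental_import should appear
sqoop job --meta-connect jdbc:hsqldb:hsql://cdh2:16000/sqoop --list
# show its saved parameters, including the current incremental.last.value
sqoop job --meta-connect jdbc:hsqldb:hsql://cdh2:16000/sqoop --show myjob_incremental_import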
(5) Define Workflow
Create the following workflow.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.1" name="regular_etl">
    <start to="fork-node"/>
    <fork name="fork-node">
        <path start="sqoop-customer"/>
        <path start="sqoop-product"/>
        <path start="sqoop-sales_order"/>
    </fork>

    <action name="sqoop-customer">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>jdbc:mysql://cdh1:3306/source?useSSL=false</arg>
            <arg>--username</arg>
            <arg>root</arg>
            <arg>--password</arg>
            <arg>mypassword</arg>
            <arg>--table</arg>
            <arg>customer</arg>
            <arg>--hive-import</arg>
            <arg>--hive-table</arg>
            <arg>rds.customer</arg>
            <arg>--hive-overwrite</arg>
            <file>/tmp/hive-site.xml#hive-site.xml</file>
            <archive>/tmp/mysql-connector-java-5.1.38-bin.jar#mysql-connector-java-5.1.38-bin.jar</archive>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
    </action>

    <action name="sqoop-product">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>jdbc:mysql://cdh1:3306/source?useSSL=false</arg>
            <arg>--username</arg>
            <arg>root</arg>
            <arg>--password</arg>
            <arg>mypassword</arg>
            <arg>--table</arg>
            <arg>product</arg>
            <arg>--hive-import</arg>
            <arg>--hive-table</arg>
            <arg>rds.product</arg>
            <arg>--hive-overwrite</arg>
            <file>/tmp/hive-site.xml#hive-site.xml</file>
            <archive>/tmp/mysql-connector-java-5.1.38-bin.jar#mysql-connector-java-5.1.38-bin.jar</archive>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
    </action>

    <action name="sqoop-sales_order">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>job --exec myjob_incremental_import --meta-connect jdbc:hsqldb:hsql://cdh2:16000/sqoop</command>
            <file>/tmp/hive-site.xml#hive-site.xml</file>
            <archive>/tmp/mysql-connector-java-5.1.38-bin.jar#mysql-connector-java-5.1.38-bin.jar</archive>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
    </action>

    <join name="joining" to="hive-node"/>

    <!-- Note: the hive action and the closing kill/end nodes are reconstructed here from the node
         description in the following paragraph; the script path assumes the regular_etl.sql file
         deployed to /tmp in the deployment step below. -->
    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/tmp/hive-site.xml</job-xml>
            <script>/tmp/regular_etl.sql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Its DAG is shown in the figure.
The workflow contains 9 nodes: 5 control nodes and 4 action nodes. The control nodes are the workflow start (start), the end point (end), the failure-handling node (fail, not shown in the DAG diagram), and two execution-path control nodes (fork-node and joining; fork and join nodes must be used in pairs). The action nodes are three parallel Sqoop actions (sqoop-customer, sqoop-product, sqoop-sales_order) for data extraction and one Hive action (hive-node) for data transformation and loading.
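Optionally, the file can be checked against the Oozie workflow schema before it is uploaded; the Oozie CLI provides a validate sub-command for this (an extra step not in the original text, which in the 4.x CLI validates the file locally):

# validate workflow.xml against the Oozie workflow schema
oozie validate workflow.xml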
(6) Deploy the workflow
hdfs dfs -put -f workflow.xml /user/root/
hdfs dfs -put /etc/hive/conf.cloudera.hive/hive-site.xml /tmp/
hdfs dfs -put /root/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar /tmp/
hdfs dfs -put /root/regular_etl.sql /tmp/
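A quick check that the four files landed where the workflow expects them (paths taken from the commands above):

# verify the deployed workflow definition and its auxiliary files
hdfs dfs -ls /user/root/workflow.xml
hdfs dfs -ls /tmp/hive-site.xml /tmp/mysql-connector-java-5.1.38-bin.jar /tmp/regular_etl.sql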
(7) Create a Job properties file
Create the following job.properties file:
nameNode=hdfs://cdh2:8020
jobTracker=cdh2:8032
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}
(8) Running a workflow
oozie job -oozie http://cdh2:11000/oozie -config /root/job.properties -run
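Besides the web console, the job can also be followed from the command line. The job ID printed by the -run command can be passed to -info; the ID shown below is only a placeholder:

# list recent workflow jobs
oozie jobs -oozie http://cdh2:11000/oozie -len 10
# show the status of a specific job; replace the ID with the one printed by -run
oozie job -oozie http://cdh2:11000/oozie -info 0000001-160711000000000-oozie-oozi-W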
The running job is also visible from the Oozie web console, as shown in the figure.
Click the job's row to open the job details window, as shown in the figure.
Click an action's row to open the action details window, as shown in the figure.
You can click the icon to the right of the console URL to open the tracking window for the map/reduce job, as shown in the figure.
When the Oozie job finishes executing, you can see on the "All Jobs" tab that the Status column has changed from RUNNING to SUCCEEDED, as shown in the figure.
Querying the cdc_time table shows that the dates have been updated to the current date, as shown in the figure.
3. Create a coordinator job to run the workflow automatically on a schedule
(1) Create the coordinator job properties file
Create the following job-coord.properties file:
nameNode=hdfs://cdh2:8020
jobTracker=cdh2:8032
queueName=default
oozie.use.system.libpath=true
oozie.coord.application.path=${nameNode}/user/${user.name}
timezone=UTC
start=2016-07-11T06:00Z
end=2020-12-31T07:15Z
workflowAppUri=${nameNode}/user/${user.name}
(2) Create the coordinator job configuration file
Create the following Coordinator.xml file:
<coordinator-app name="regular_etl-coord" frequency="${coord:days(1)}" start="${start}" end="${end}" timezone="${timezone}" xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>${workflowAppUri}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>queueName</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
(3) Deploy the coordinator job
hdfs dfs -put -f coordinator.xml /user/root/
(4) Run the coordinator job
oozie job -oozie http://cdh2:11000/oozie -config /root/job-coord.properties -run
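The coordinator job can likewise be inspected from the command line (the coordinator job ID ending in -C below is only a placeholder):

# list coordinator jobs
oozie jobs -oozie http://cdh2:11000/oozie -jobtype coordinator
# show the coordinator job's materialized actions and their status
oozie job -oozie http://cdh2:11000/oozie -info 0000002-160711000000000-oozie-oozi-C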
From the Oozie web console, you can see the coordinator job waiting to run, with the status PREP, as shown in the figure.
This coordinator job starts on July 11, 2016 and runs once a day at 14:00. The end date is set very late, to December 31, 2020. Pay attention to the time zone setting: Oozie's default time zone is UTC, and setting timezone=GMT+0800 in the properties file does not take effect, so the start property is set to 06:00Z, which corresponds to 14:00 Beijing time (UTC+8).
When 14:00 arrives, the coordinator job starts running and its status changes from PREP to RUNNING, as shown in the figure.
Click the coordinator job's row to open the coordinator job details window, as shown in the figure.
In that window, click the row of the coordinator action to open the details window of the corresponding workflow job, as shown in the figure.
Click an action's row to open the action details window, as shown in the figure.
You can click the icon to the right of the console URL to open the tracking window for the map/reduce job, as shown in the figure.
The above is a general approach to automating periodic ETL execution with Oozie. The official documentation for Oozie 4.1.0 is available at: http://oozie.apache.org/docs/4.1.0/index.html