Original link: http://blog.ywheel.cn/post/2016/06/12/hive_in_oozie_workflow/
On the big data platform we build and maintain for the company and provide to data analysts, Hive is the most used (almost the only) service among non-programmers. In daily data processing, to reduce coding effort and reuse the work the analysts have accumulated, we can take the HQL scripts they provide, use them as-is or with minor modifications, and schedule the resulting Hive jobs with Oozie.
This post describes the hive action and records the pitfalls I have run into along the way.
Hive Action
Adding a hive action to the workflow configuration of Oozie is straightforward.
The hive action runs a Hive job, and the Oozie workflow waits for the Hive job to complete before moving on to the next action. In a hive action you need to configure parameters such as the job tracker, the name node, and the Hive script; you can also have the action create or delete HDFS directories before the Hive job starts.
The hive action also supports parameter variables for the Hive script, written as ${VARIABLE} placeholders.
Here is the outline of the hive action syntax from the official documentation:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        ...
    </action>
    ...
</workflow-app>
A few of the parameters in this syntax:
prepare: if you need to create or delete HDFS directories before the Hive job runs, add the prepare element and specify the HDFS paths to create or delete.
job-xml: specifies the HDFS path where hive-site.xml is located. On a CDH-built cluster it can be found in the /etc/hive/conf directory on any Hive gateway machine. If this file path is not specified, the hive action does not work.
configuration: contains the parameters passed to the Hive job. This entry is optional; without it the default configuration is used.
script: specifies the HDFS path of the HQL script, and is required by the hive action. Inside the script you can use ${VARIABLE} placeholders, whose values are taken from the param entries defined in the hive action.
param: defines the values of the variables required by the HQL script.
Here's a sample of the hive action I use in my production environment:
<action name="Hiveaction">
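    <!-- The remainder of this action is sketched here following the element structure described above. -->
    <!-- The hive-site.xml path, queue name, script path, DATE parameter, and transition node names are hypothetical placeholders, not the actual production values. -->
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <!-- clean up the output directory so the job can be re-run safely -->
            <delete path="${nameNode}/user/hive/tmp_output"/>
        </prepare>
        <!-- hive-site.xml copied from a Hive gateway machine and uploaded to HDFS -->
        <job-xml>/user/oozie/share/conf/hive-site.xml</job-xml>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>default</value>
            </property>
        </configuration>
        <!-- the HQL script on HDFS; inside it, ${DATE} is used as a parameter placeholder -->
        <script>/user/oozie/apps/hive/daily_report.hql</script>
        <!-- bind the ${DATE} placeholder used in the script -->
        <param>DATE=${date}</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>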
Output data exceeds its limit
As the example above shows, the hive action in an Oozie workflow does not care about the query logic defined in the HQL file; the workflow side is kept as simple as possible. Guaranteeing the correctness of the Hive logic and the successful execution of the jobs is the responsibility of the HQL script itself.
In my production environment I ran into a strange problem:
org.apache.oozie.action.hadoop.LauncherException: Output data exceeds its limit [2048]
Digging into the cause: Oozie's default maximum output data size is 2K, that is, 2048 bytes. In a hive action, the IDs of the MapReduce jobs submitted while the HQL script executes (such as job_1464936467641_1657) are recorded and returned to Oozie. If an HQL script contains too many Hive query statements, the output data Oozie receives exceeds 2K and this error is thrown. The workaround is to add the following configuration to oozie-site.xml and restart the Oozie service for it to take effect:
<property>
    <name>oozie.action.max.output.data</name>
    <value>204800</value>
</property>
PS: On a CDH-built cluster, go to the Oozie service in the cluster, open its configuration, select the Oozie Server Default Group, and add the configuration above to the oozie-site.xml Advanced Configuration Snippet (Safety Valve).
PS: I also tried adding this configuration to the configuration element of the hive action, but testing showed that it does not work there.
OutOfMemoryError
Another case: an HQL script contained many query statements (several hundred, in fact). Each statement had been tested individually and ran successfully, but when they were put together and invoked through Oozie, the following error appeared:
Launching Job 613 out of 857
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.hdfs.util.ByteArrayManager$NewByteArrayWithoutLimit.newByteArray(ByteArrayManager.java:308)
    at org.apache.hadoop.hdfs.DFSOutputStream.createPacket(DFSOutputStream.java:192)
    at org.apache.hadoop.hdfs.DFSOutputStream.writeChunk(DFSOutputStream.java:1883)
    at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:206)
    at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:124)
    at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:110)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:87)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:338)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1905)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1873)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1838)
    at org.apache.hadoop.mapreduce.JobSubmitter.copyJar(JobSubmitter.java:375)
    at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:256)
    at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:390)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:483)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1306)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1303)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1303)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:564)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:559)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:559)
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Java heap space
MapReduce Jobs Launched:
The log shows that this HQL script launches a large number of jobs and fails partway through execution (Launching Job 613 out of 857). The direct way to solve this problem is to increase memory. Oozie implements the hive action by starting a launcher (a map-only job) that acts as the client and submits the Hive tasks; the jobs that actually process the data are the MapReduce jobs submitted by Hive. In this error, the OutOfMemoryError happened in the launcher.
The workaround is also simple: add the following configuration to the hive action's configuration element to increase the launcher's memory:
<configuration>
    <property>
        <name>oozie.launcher.mapreduce.map.memory.mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>oozie.launcher.mapreduce.map.java.opts</name>
        <value>-Xmx3400m</value>
    </property>
</configuration>
In fact, stripping the oozie.launcher prefix from these parameters leaves ordinary Hadoop parameters; when Oozie submits the launcher job, it passes these settings on to YARN.
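For instance, with the values above the launcher's map task effectively runs with the following plain Hadoop properties (same values, prefix stripped). This is only meant to illustrate the mapping, not something you need to configure separately:
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx3400m</value>
</property>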
Of course, the root cause is that the HQL scripts are not optimized: a single script contains too many queries. After gaining a deeper understanding of the business, the right solution is to streamline the Hive queries, optimize the HQL scripts, and design the Oozie workflow properly.
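For example, instead of one script with hundreds of statements, the queries can be grouped into several smaller scripts and chained as separate hive actions. A rough sketch of that workflow shape follows; the workflow name, node names, script paths, and parameters are hypothetical placeholders:
<workflow-app name="daily-report-wf" xmlns="uri:oozie:workflow:0.1">
    <start to="stage-data"/>
    <!-- first hive action: prepare the intermediate tables -->
    <action name="stage-data">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/user/oozie/share/conf/hive-site.xml</job-xml>
            <script>/user/oozie/apps/hive/stage_data.hql</script>
            <param>DATE=${date}</param>
        </hive>
        <ok to="aggregate"/>
        <error to="fail"/>
    </action>
    <!-- second hive action: run the aggregation queries on the staged data -->
    <action name="aggregate">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/user/oozie/share/conf/hive-site.xml</job-xml>
            <script>/user/oozie/apps/hive/aggregate.hql</script>
            <param>DATE=${date}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
Splitting the work this way keeps each launcher's output and memory footprint small, and lets Oozie retry or resume from the failed action instead of rerunning one huge script.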