Hadoop job submission analysis (1)

Source: Internet
Author: User

http://www.cnblogs.com/spork/archive/2010/04/07/1706162.html

 

 
  bin/hadoop jar xxx.jar mainClass args
......

You have probably run this command countless times. Every time you write or modify a project, you have to build a jar and then use the command above to submit it to the Hadoop cluster, which is extremely tedious during development. Programmers are "lazy": if something is tedious, they look for ways to cut unnecessary keystrokes and extend the life of their keyboards. Some people, for example, write shell scripts that compile, package, and submit to Hadoop automatically, but that is still a bit of a hassle. The most convenient approach is the Hadoop Eclipse plugin, which lets you browse and manage HDFS and generates template files for MR programs. Best of all, it offers "Run on Hadoop" directly. Unfortunately, its version does not keep up with mainline Hadoop releases; the MR template it currently generates is for 0.19. There is also a tool called Hadoop Studio that looks quite powerful, but I have not tried it, so I will not comment on it here.

So how do these tools submit jobs with the command above? Don't know? No problem. Since Hadoop is open source, we can read the source code directly, which is the biggest advantage of open-source software.

First, we analyze the shell script bin/hadoop to see what it does internally and how it submits Hadoop jobs.

Because Hadoop is a Java program, the script ultimately invokes java, so its main job is to assemble the arguments that precede the class name, such as the classpath. We therefore jump straight to the last line of the script to see which parameters it adds, then analyze them one by one. (This article skips the parts of the script that load environment settings, locate Java, and handle Cygwin.)

# run it
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"

From this command we can see that the script ends up supplying these important parameters: JAVA_HEAP_MAX, HADOOP_OPTS, CLASSPATH, and CLASS. Each is analyzed below (this article is based on Cloudera Hadoop 0.20.1+152).

The first is JAVA_HEAP_MAX, which is fairly simple. The relevant code is as follows:

JAVA_HEAP_MAX=-Xmx1000m
# check envvars which might override default args
if [ "$HADOOP_HEAPSIZE" != "" ]; then
  #echo "run with heapsize $HADOOP_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$HADOOP_HEAPSIZE""m"
  #echo $JAVA_HEAP_MAX
fi

The script first assigns the default value -Xmx1000m, then checks whether HADOOP_HEAPSIZE has been set and exported in hadoop-env.sh. If it has, that value overrides the default to give the final JAVA_HEAP_MAX.
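The override described above can be tried in isolation. The following is a minimal sketch, not code from bin/hadoop: resolve_heap_max is a hypothetical helper that takes the value of HADOOP_HEAPSIZE as its argument so the logic can be exercised without touching the environment.

```shell
#!/usr/bin/env bash
# Sketch of the heap-size override. resolve_heap_max is a hypothetical
# helper written for illustration; it is not a function in bin/hadoop.
resolve_heap_max() {
  local heapsize="$1"          # value of HADOOP_HEAPSIZE in MB; may be empty
  local heap_max="-Xmx1000m"   # default assigned by the script
  if [ "$heapsize" != "" ]; then
    heap_max="-Xmx${heapsize}m"
  fi
  echo "$heap_max"
}

resolve_heap_max ""      # prints -Xmx1000m
resolve_heap_max "2000"  # prints -Xmx2000m
```

The value is interpreted as megabytes because the script always appends the "m" suffix.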

Next comes CLASSPATH, one of the key parts of this script. This section adds the required libraries and configuration files to the classpath.

# First, initialize CLASSPATH with the Hadoop configuration directory.
CLASSPATH="${HADOOP_CONF_DIR}"
......
# For a Hadoop release, add the Hadoop core jar and webapps to CLASSPATH.
if [ -d "$HADOOP_HOME/webapps" ]; then
  CLASSPATH=${CLASSPATH}:$HADOOP_HOME
fi
for f in $HADOOP_HOME/hadoop-*-core.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done
# Add the jars in lib.
for f in $HADOOP_HOME/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done
for f in $HADOOP_HOME/lib/jsp-2.1/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done
# TOOL_PATH below is added to CLASSPATH only when the command is "archive".
for f in $HADOOP_HOME/hadoop-*-tools.jar; do
  TOOL_PATH=${TOOL_PATH}:$f;
done
for f in $HADOOP_HOME/build/hadoop-*-tools.jar; do
  TOOL_PATH=${TOOL_PATH}:$f;
done
# Finally, add your custom HADOOP_CLASSPATH.
if [ "$HADOOP_CLASSPATH" != "" ]; then
  CLASSPATH=${CLASSPATH}:${HADOOP_CLASSPATH}
fi

The code above is only part of the classpath logic. Because the script is long, the classpath handling for developer builds is not listed.

Next comes the core of this script: determining CLASS. The script sets CLASS and HADOOP_OPTS according to the command you enter. CLASS names the Java class that actually executes your command.
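To see what the glob loops produce, here is a small sketch that runs the same pattern against a throwaway directory. build_classpath and the file names are hypothetical, made up for the demo. The sketch also adds one thing the listed code does not have: a guard that skips a pattern when it matches nothing, since bash otherwise passes the unmatched pattern through literally.

```shell
#!/usr/bin/env bash
# build_classpath is a hypothetical helper; the directory layout below
# is invented for the demo and is not a real Hadoop installation.
build_classpath() {
  local conf_dir="$1" home="$2" cp f
  cp="$conf_dir"
  for f in "$home"/hadoop-*-core.jar "$home"/lib/*.jar; do
    # Skip patterns that matched no file (a guard added in this sketch;
    # the loops in the listing above do not have it).
    [ -e "$f" ] || continue
    cp="${cp}:${f}"
  done
  echo "$cp"
}

# Demo against a temporary directory.
demo_home="$(mktemp -d)"
mkdir -p "$demo_home/conf" "$demo_home/lib"
touch "$demo_home/hadoop-0.20.1-core.jar" "$demo_home/lib/a.jar" "$demo_home/lib/b.jar"
build_classpath "$demo_home/conf" "$demo_home"
rm -rf "$demo_home"
```

The configuration directory comes first, so files in it take precedence over anything with the same name inside the jars that follow.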

# figure out which class to run
if [ "$COMMAND" = "namenode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.namenode.NameNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_NAMENODE_OPTS"
......
elif [ "$COMMAND" = "fs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
......
elif [ "$COMMAND" = "jar" ] ; then
  CLASS=org.apache.hadoop.util.RunJar
......
elif [ "$COMMAND" = "archive" ] ; then
  CLASS=org.apache.hadoop.tools.HadoopArchives
  CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
......
else
  CLASS=$COMMAND
fi

The class we care about here is the one for the jar command, org.apache.hadoop.util.RunJar. We will analyze this class later; it is the next stop on the way to our final goal.

The script also appends hadoop.log.dir, hadoop.log.file, and other properties to HADOOP_OPTS. Finally, it uses exec to launch Java with the parameters described above.

From the analysis above, we know that to replace this script you must at least add the libraries Hadoop depends on and the configuration directory to the classpath (JAVA_HEAP_MAX and HADOOP_OPTS are optional), and then invoke the org.apache.hadoop.util.RunJar class to submit the jar to Hadoop.
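The command-to-class mapping can also be written as a case statement, which makes the fall-through rule easy to see. This is a sketch only: resolve_class is a hypothetical function covering just the four commands listed in this article, and com.example.MyTool is a made-up class name.

```shell
#!/usr/bin/env bash
# resolve_class is a hypothetical helper; it covers only the commands
# shown in this article, not everything bin/hadoop supports.
resolve_class() {
  case "$1" in
    namenode) echo "org.apache.hadoop.hdfs.server.namenode.NameNode" ;;
    fs)       echo "org.apache.hadoop.fs.FsShell" ;;
    jar)      echo "org.apache.hadoop.util.RunJar" ;;
    archive)  echo "org.apache.hadoop.tools.HadoopArchives" ;;
    *)        echo "$1" ;;   # unknown command: treat it as a class name
  esac
}

resolve_class jar                  # prints org.apache.hadoop.util.RunJar
resolve_class com.example.MyTool   # prints com.example.MyTool
```

The last branch mirrors the script's final else: anything that is not a known command is passed to Java as the class to run.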
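As a rough illustration of that conclusion, the sketch below assembles the minimal java command such a replacement would need. It is a dry run under stated assumptions: submit_command is a hypothetical helper that only prints the command instead of executing it, and /opt/hadoop, xxx.jar, and mainClass are placeholders rather than real paths or names.

```shell
#!/usr/bin/env bash
# submit_command is a hypothetical helper. It prints the java command
# (a dry run) rather than running it.
submit_command() {
  local home="$1" conf_dir="$2" cp f
  shift 2
  cp="$conf_dir"   # configuration directory goes first
  for f in "$home"/hadoop-*-core.jar "$home"/lib/*.jar "$home"/lib/jsp-2.1/*.jar; do
    [ -e "$f" ] || continue   # skip patterns that matched nothing
    cp="${cp}:${f}"
  done
  # Remaining arguments are passed through to RunJar: jar, main class, args.
  echo java -classpath "$cp" org.apache.hadoop.util.RunJar "$@"
}

# /opt/hadoop is an assumed default; set HADOOP_HOME and HADOOP_CONF_DIR to override.
submit_command "${HADOOP_HOME:-/opt/hadoop}" "${HADOOP_CONF_DIR:-/opt/hadoop/conf}" \
  xxx.jar mainClass arg1
```

Replacing echo with exec would turn the dry run into an actual submission, at which point JAVA_HEAP_MAX and HADOOP_OPTS could be added in front of -classpath if needed.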

 

PS: If you are not familiar with bash shell scripting, you may want to read http://learn.akae.cn/media/ch31s05.html first.

 
