Hadoop in the Big Data Era (1): Hadoop Installation
If you want to understand hadoop better, you should first understand how its scripts start and stop it. After all, Hadoop is a distributed storage and computing framework, so how is that distributed environment started and managed? Let's begin with the scripts. To be honest, the hadoop startup scripts are very well written and remarkably thorough; they handle details such as spaces in paths and symbolic links.
1. hadoop script Introduction
The hadoop scripts live in the bin and conf directories under $HADOOP_HOME. The main ones are introduced below:
Under the bin directory
hadoop: the core script; every distributed program is ultimately launched through it.
hadoop-config.sh: a basic script that the other scripts embed (source) in order to parse the optional command-line parameters --config (the path of the hadoop conf directory) and --hosts.
hadoop-daemon.sh: starts or stops, on the local machine, the distributed program named by its command argument, by calling the hadoop script.
hadoop-daemons.sh: starts the hadoop distributed programs on all slave machines by calling slaves.sh.
slaves.sh: runs a given command on all slave machines (over passwordless SSH), for use by the higher-level scripts.
start-dfs.sh: starts the namenode on the local machine, datanodes on the slaves machines, and the secondarynamenode on the masters machines, by calling hadoop-daemon.sh and hadoop-daemons.sh.
start-mapred.sh: starts the jobtracker on the local machine and tasktrackers on the slaves machines, again by calling hadoop-daemon.sh and hadoop-daemons.sh.
start-all.sh: starts all of the hadoop distributed programs by calling start-dfs.sh and start-mapred.sh.
start-balancer.sh: starts the balancer, which rebalances the data stored across the nodes of the distributed environment.
The corresponding stop-* scripts mirror the start scripts, so there is no need to go through them.
Under the conf directory
hadoop-env.sh: sets the environment variables hadoop needs at runtime, such as JAVA_HOME, HADOOP_LOG_DIR, and HADOOP_PID_DIR.
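For illustration, a minimal hadoop-env.sh might contain entries like the following; the values shown here (in particular the JAVA_HOME path) are placeholders, not recommendations:

# illustrative hadoop-env.sh entries (all values are placeholders)
export JAVA_HOME=/usr/lib/jvm/java-6-sun      # hypothetical JDK location
export HADOOP_HEAPSIZE=1000                   # daemon heap size in MB
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs     # where daemon logs are written
export HADOOP_PID_DIR=/var/hadoop/pids        # where daemon pid files are kept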
2. The charm of the scripts (detailed walkthrough)
Hadoop's scripts really are well written; I have to admire them, and I learned a lot from reading them.
2.1 hadoop-config.sh
This script is relatively simple. Almost all of the other scripts embed it with ". $bin/hadoop-config.sh", so it does not need to declare an interpreter (a shebang line) at its top; sourcing it this way is equivalent to copying its contents into the parent script and running them in the same interpreter.
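As a sketch of that calling pattern (based on the 0.20/1.x-era layout this article describes, so details may differ in your version), each script first resolves its own directory and then sources hadoop-config.sh:

# typical prologue of the other bin/ scripts (a sketch)
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
# sourcing runs hadoop-config.sh in the current shell, so the variables it
# exports (HADOOP_HOME, HADOOP_CONF_DIR, ...) remain visible afterwards
. "$bin"/hadoop-config.sh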
This script mainly includes three parts:
1. Symlink resolution and absolute path resolution
# resolve symlinks
this="$0"
while [ -h "$this" ]; do
  ls=`ls -ld "$this"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    this="$link"
  else
    this=`dirname "$this"`/"$link"
  fi
done

# convert relative path to absolute path
bin=`dirname "$this"`
script=`basename "$this"`
bin=`cd "$bin"; pwd`
this="$bin/$script"

# the root of the Hadoop installation
export HADOOP_HOME=`dirname "$this"`/..
2. Parsing and assigning the optional command-line parameter --config
#check to see if the conf dir is given as an optional argument
if [ $# -gt 1 ]
then
  if [ "--config" = "$1" ]
  then
    shift
    confdir=$1
    shift
    HADOOP_CONF_DIR=$confdir
  fi
fi
3. Parsing and assigning the optional command-line parameter --hosts
#check to see it is specified whether to use the slaves or the
# masters file
if [ $# -gt 1 ]
then
  if [ "--hosts" = "$1" ]
  then
    shift
    slavesfile=$1
    shift
    export HADOOP_SLAVES="${HADOOP_CONF_DIR}/$slavesfile"
  fi
fi
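To see how these two options are used in practice, here are two illustrative invocations (the configuration path is hypothetical):

# use an alternative conf directory instead of $HADOOP_HOME/conf
bin/hadoop --config /opt/hadoop/conf.cluster fs -ls /
# run daemons on the hosts listed in conf/masters rather than conf/slaves
bin/hadoop-daemons.sh --config /opt/hadoop/conf.cluster --hosts masters start secondarynamenode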
2.2 hadoop
This script is the core of the whole suite: it sets up the runtime variables and launches the requested program.
1. Declare usage
# if no args specified, show usage
if [ $# = 0 ]; then
  echo "Usage: hadoop [--config confdir] COMMAND"
  echo "where COMMAND is one of:"
  echo "  namenode -format     format the DFS filesystem"
  echo "  secondarynamenode    run the DFS secondary namenode"
  echo "  namenode             run the DFS namenode"
  echo "  datanode             run a DFS datanode"
  echo "  dfsadmin             run a DFS admin client"
  echo "  mradmin              run a Map-Reduce admin client"
  echo "  fsck                 run a DFS filesystem checking utility"
  echo "  fs                   run a generic filesystem user client"
  echo "  balancer             run a cluster balancing utility"
  echo "  jobtracker           run the MapReduce job Tracker node"
  echo "  pipes                run a Pipes job"
  echo "  tasktracker          run a MapReduce task Tracker node"
  echo "  job                  manipulate MapReduce jobs"
  echo "  queue                get information regarding JobQueues"
  echo "  version              print the version"
  echo "  jar <jar>            run a jar file"
  echo "  distcp <srcurl> <desturl> copy file or directories recursively"
  echo "  archive -archiveName NAME <src>* <dest> create a hadoop archive"
  echo "  daemonlog            get/set the log level for each daemon"
  echo " or"
  echo "  CLASSNAME            run the class named CLASSNAME"
  echo "Most commands print help when invoked w/o parameters."
  exit 1
fi
2. Set the Java Runtime Environment
The code is simple, so I won't reproduce it all here; it sets JAVA_HOME, JAVA_HEAP_MAX, CLASSPATH, HADOOP_LOG_DIR, HADOOP_POLICYFILE, and so on.
IFS is the environment variable that defines the shell's field separators; its default value is whitespace (newline, tab, or space). The script clears it so that file names containing spaces are handled correctly in the classpath loops.
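For reference, that part of the script looks roughly like the following; this is paraphrased from memory rather than copied from the source, so treat it as a sketch:

# a sketch of the JVM and classpath setup in bin/hadoop
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

# let HADOOP_HEAPSIZE (set in hadoop-env.sh) override the default heap size
if [ "$HADOOP_HEAPSIZE" != "" ]; then
  JAVA_HEAP_MAX="-Xmx""$HADOOP_HEAPSIZE""m"
fi

# the classpath starts with the conf directory, then picks up the hadoop jars
CLASSPATH="${HADOOP_CONF_DIR}"
for f in $HADOOP_HOME/hadoop-*.jar $HADOOP_HOME/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done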
3. Choose the class to run according to the COMMAND argument
# figure out which class to run
if [ "$COMMAND" = "namenode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.namenode.NameNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_NAMENODE_OPTS"
elif [ "$COMMAND" = "secondarynamenode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_SECONDARYNAMENODE_OPTS"
elif [ "$COMMAND" = "datanode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS"
elif [ "$COMMAND" = "fs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfsadmin" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSAdmin
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "mradmin" ] ; then
  CLASS=org.apache.hadoop.mapred.tools.MRAdmin
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "fsck" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSck
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "balancer" ] ; then
  CLASS=org.apache.hadoop.hdfs.server.balancer.Balancer
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_BALANCER_OPTS"
elif [ "$COMMAND" = "jobtracker" ] ; then
  CLASS=org.apache.hadoop.mapred.JobTracker
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_JOBTRACKER_OPTS"
elif [ "$COMMAND" = "tasktracker" ] ; then
  CLASS=org.apache.hadoop.mapred.TaskTracker
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_TASKTRACKER_OPTS"
elif [ "$COMMAND" = "job" ] ; then
  CLASS=org.apache.hadoop.mapred.JobClient
elif [ "$COMMAND" = "queue" ] ; then
  CLASS=org.apache.hadoop.mapred.JobQueueClient
elif [ "$COMMAND" = "pipes" ] ; then
  CLASS=org.apache.hadoop.mapred.pipes.Submitter
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "version" ] ; then
  CLASS=org.apache.hadoop.util.VersionInfo
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "jar" ] ; then
  CLASS=org.apache.hadoop.util.RunJar
elif [ "$COMMAND" = "distcp" ] ; then
  CLASS=org.apache.hadoop.tools.DistCp
  CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "daemonlog" ] ; then
  CLASS=org.apache.hadoop.log.LogLevel
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "archive" ] ; then
  CLASS=org.apache.hadoop.tools.HadoopArchives
  CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "sampler" ] ; then
  CLASS=org.apache.hadoop.mapred.lib.InputSampler
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
else
  CLASS=$COMMAND
fi
4. Set up the native library path
# setup 'java.library.path' for native-hadoop code if necessary
JAVA_LIBRARY_PATH=''
if [ -d "${HADOOP_HOME}/build/native" -o -d "${HADOOP_HOME}/lib/native" ]; then
  # determine the current platform by running a java class (a nice trick)
  JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`

  if [ -d "$HADOOP_HOME/build/native" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi

  if [ -d "${HADOOP_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
fi
5. Run distributed programs
# run it
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"
2.3 hadoop-daemon.sh
This script starts or stops, on the local machine, the distributed program named by its command argument, by calling the hadoop script. It is actually quite simple.
1. Declare usage
usage="Usage: hadoop-daemon.sh [--config <conf-dir>] [--hosts hostlistfile] (start|stop)
2. Set Environment Variables
First the embedded hadoop-env.sh script is run (sourced), and then environment variables such as HADOOP_PID_DIR are set.
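For context, the pid and log file names that the start/stop logic below relies on are derived roughly as follows (a sketch; the exact defaults vary between versions):

# sketch of the pid/log naming in hadoop-daemon.sh
if [ "$HADOOP_PID_DIR" = "" ]; then
  HADOOP_PID_DIR=/tmp
fi
if [ "$HADOOP_IDENT_STRING" = "" ]; then
  export HADOOP_IDENT_STRING="$USER"
fi
log=$HADOOP_LOG_DIR/hadoop-$HADOOP_IDENT_STRING-$command-$HOSTNAME.out
pid=$HADOOP_PID_DIR/hadoop-$HADOOP_IDENT_STRING-$command.pid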
3. Start or stop the program
case $startStop in

  (start)

    mkdir -p "$HADOOP_PID_DIR"

    if [ -f $pid ]; then
      # if the daemon is already running, refuse to start it again and exit
      if kill -0 `cat $pid` > /dev/null 2>&1; then
        echo $command running as process `cat $pid`.  Stop it first.
        exit 1
      fi
    fi

    if [ "$HADOOP_MASTER" != "" ]; then
      echo rsync from $HADOOP_MASTER
      rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' $HADOOP_MASTER/ "$HADOOP_HOME"
    fi

    # rotate the existing log
    hadoop_rotate_log $log
    echo starting $command, logging to $log
    cd "$HADOOP_HOME"
    # start the program via nohup and the bin/hadoop script
    nohup nice -n $HADOOP_NICENESS "$HADOOP_HOME"/bin/hadoop --config $HADOOP_CONF_DIR $command "$@" > "$log" 2>&1 < /dev/null &
    # capture the pid of the newly started process and write it to the pid file
    echo $! > $pid
    sleep 1; head "$log"
    ;;

  (stop)

    if [ -f $pid ]; then
      if kill -0 `cat $pid` > /dev/null 2>&1; then
        echo stopping $command
        kill `cat $pid`
      else
        echo no $command to stop
      fi
    else
      echo no $command to stop
    fi
    ;;

  (*)
    echo $usage
    exit 1
    ;;

esac
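For example, starting and later stopping a single daemon on the local machine looks like this (an illustrative invocation):

# start and later stop a namenode on the local machine
bin/hadoop-daemon.sh --config conf start namenode
bin/hadoop-daemon.sh --config conf stop namenode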
2.4 slaves.sh
This script runs a given command on all slave machines (logging in via passwordless SSH), for use by the higher-level scripts.
1. Declare usage
usage="Usage: slaves.sh [--config confdir] command..."# if no args specified, show usageif [ $# -le 0 ]; then echo $usage exit 1fi
2. Set the remote host list
# If the slaves file is specified in the command line,
# then it takes precedence over the definition in
# hadoop-env.sh. Save it here.
HOSTLIST=$HADOOP_SLAVES

if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
  . "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi

if [ "$HOSTLIST" = "" ]; then
  if [ "$HADOOP_SLAVES" = "" ]; then
    export HOSTLIST="${HADOOP_CONF_DIR}/slaves"
  else
    export HOSTLIST="${HADOOP_SLAVES}"
  fi
fi
3. Execute the command on the remote hosts
# This part is quite important and fairly sophisticated: it strips comments
# (the sed removes everything after #) and blank lines from the host list file,
# escapes the spaces in the command-line arguments, runs the command on each
# target host over ssh, and finally waits for the command to finish on all
# target hosts before exiting.
for slave in `cat "$HOSTLIST"|sed "s/#.*$//;/^$/d"`; do
  ssh $HADOOP_SSH_OPTS $slave $"${@// /\\ }" 2>&1 | sed "s/^/$slave: /" &
  if [ "$HADOOP_SLAVE_SLEEP" != "" ]; then
    sleep $HADOOP_SLAVE_SLEEP
  fi
done

wait
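A couple of illustrative uses of slaves.sh (the commands and the directory are arbitrary examples):

# print the load of every host listed in the slaves file
bin/slaves.sh --config conf uptime
# create a scratch directory on every slave (hypothetical path)
bin/slaves.sh --config conf mkdir -p /tmp/hadoop-scratch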
2.5 hadoop-daemons.sh
This script starts the hadoop distributed programs on the remote machines; it is implemented by calling slaves.sh.
1. Declare usage
# Run a Hadoop command on all slave hosts.

usage="Usage: hadoop-daemons.sh [--config confdir] [--hosts hostlistfile] [start|stop] command args..."

# if no args specified, show usage
if [ $# -le 1 ]; then
  echo $usage
  exit 1
fi
2. Run the command on the remote hosts
# implemented via slaves.sh
exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_HOME" \; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$@"
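To make the cd "$HADOOP_HOME" \; trick concrete: for each host in the host list, slaves.sh effectively runs something like the following (hostname and paths are hypothetical); the escaped semicolon survives to the remote shell, which then runs the cd and the daemon script as two commands:

# roughly what ends up running on one slave when datanodes are started
ssh slave1 'cd /opt/hadoop ; /opt/hadoop/bin/hadoop-daemon.sh --config /opt/hadoop/conf start datanode'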
2.6 start-dfs.sh
Starts the namenode on the local machine (the host that calls this script), datanodes on the slaves machines, and the secondarynamenode on the masters machines, by calling hadoop-daemon.sh and hadoop-daemons.sh.
1. Declare the usage
# Start hadoop dfs daemons.
# Optionally upgrade or rollback dfs state.
# Run this on master node.

usage="Usage: start-dfs.sh [-upgrade|-rollback]"
2. Start the program
# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts

# start the namenode on the local machine (the host that calls this script)
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
# start datanodes on the slaves machines
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
# start the secondarynamenode on the masters machines
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
2.7 start-mapred.sh
Starts the jobtracker on the local machine (the host that calls this script) and tasktrackers on the slaves machines, by calling hadoop-daemon.sh and hadoop-daemons.sh.
# start mapred daemons
# start jobtracker first to minimize connection errors at startup

# start the jobtracker on the local machine (the host that calls this script)
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
# start tasktrackers on the slaves machines
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker
The remaining scripts are very simple and need no detailed explanation; you can understand them just by reading them.
By the way, a word about the interpreter line at the top of the hadoop scripts:
#!/usr/bin/env bash
This makes the scripts portable across different Linux systems: env looks up bash in the PATH and uses it to interpret and execute the script.
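A quick way to see the effect (an illustrative command, nothing hadoop-specific):

# env searches the PATH for bash, so the scripts also work on systems
# where bash is not installed at /bin/bash
env bash -c 'echo "interpreted by bash $BASH_VERSION"'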
Hadoop in the Big Data Era (2): Hadoop Script Parsing