Tags: hadoop, scripts, installation, startup, analysis
Hadoop in the Big Data Era (2): Hadoop Script Analysis
As the saying goes, "provisions move before the troops." To get a deep understanding of Hadoop, I think the scripts that start and stop it are the first thing worth studying. At its core, Hadoop is a distributed storage and computation framework, but how is that distributed environment actually started and managed? Let me walk you through it starting from the scripts. Honestly, Hadoop's startup scripts are very well written; they cover a lot of corner cases (spaces in paths, symbolic links, and so on).
1. A quick overview of the Hadoop scripts
The Hadoop scripts live under $HADOOP_HOME, in the bin directory and the conf directory. The main ones are introduced below; a short example of how they are typically invoked follows the list.
Under bin:
hadoop: the low-level core script; every distributed daemon and client is ultimately launched through it.
hadoop-config.sh: sourced by almost every other script; its job is to parse the optional command-line arguments (--config, the path to the Hadoop conf directory, and --hosts).
hadoop-daemon.sh: starts or stops, on the local machine, the daemon named by the command argument, by calling the hadoop script.
hadoop-daemons.sh: starts a given Hadoop daemon on all machines, by calling slaves.sh.
slaves.sh: runs a given set of commands on all machines (via passwordless ssh); used by the higher-level scripts.
start-dfs.sh: starts the namenode on the local machine, the datanodes on the slaves machines, and the secondarynamenode on the machines listed in the masters file, by calling hadoop-daemon.sh and hadoop-daemons.sh.
start-mapred.sh: starts the jobtracker on the local machine and the tasktrackers on the slaves machines, by calling hadoop-daemon.sh and hadoop-daemons.sh.
start-all.sh: starts all the Hadoop daemons, by calling start-dfs.sh and start-mapred.sh.
start-balancer.sh: starts the Hadoop cluster balancer, which evens out the amount of data stored on each node.
There are also the corresponding stop scripts, which need no separate explanation.
Under conf:
hadoop-env.sh: sets the environment variables Hadoop needs at runtime, such as JAVA_HOME, HADOOP_LOG_DIR, and HADOOP_PID_DIR.
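To put these scripts in context before diving in, here is how they are typically driven from the master node (the install path /usr/local/hadoop is only an example):

# typical invocations from the master node (paths are illustrative)
cd /usr/local/hadoop

bin/start-all.sh             # start the HDFS and MapReduce daemons on all machines
bin/hadoop dfsadmin -report  # the core hadoop script running a DFS admin client
bin/hadoop fs -ls /          # the core hadoop script running a filesystem client
bin/stop-all.sh              # stop everything again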
2. The charm of the scripts (a detailed walkthrough)
Hadoop's scripts really are well written; I have to hand it to the authors, and I learned a lot from reading them.
2.1 hadoop-config.sh
This script is fairly simple. Almost every other script sources it with ". $bin/hadoop-config.sh", so it does not need an interpreter declaration (shebang) on its first line: sourcing it this way is like copying its contents into the parent script and running them in the same interpreter.
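A tiny illustration of the difference (vars.sh and main.sh are made-up names for this example):

# vars.sh -- meant to be sourced, so it needs no shebang of its own
CONF_DIR=/tmp/demo-conf

# main.sh
#!/usr/bin/env bash
. ./vars.sh        # runs vars.sh inside the current shell, as if pasted in
echo "$CONF_DIR"   # prints /tmp/demo-conf; running ./vars.sh instead would not set it here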
The script does three main things:
1. Resolving symlinks and converting to an absolute path
# resolve symlinks
this="$0"
while [ -h "$this" ]; do
  ls=`ls -ld "$this"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    this="$link"
  else
    this=`dirname "$this"`/"$link"
  fi
done

# convert relative path to absolute path
bin=`dirname "$this"`
script=`basename "$this"`
bin=`cd "$bin"; pwd`
this="$bin/$script"

# the root of the Hadoop installation
export HADOOP_HOME=`dirname "$this"`/..
2. Parsing and assigning the optional --config argument
# check to see if the conf dir is given as an optional argument
if [ $# -gt 1 ]
then
  if [ "--config" = "$1" ]
  then
    shift
    confdir=$1
    shift
    HADOOP_CONF_DIR=$confdir
  fi
fi
3. Parsing and assigning the optional --hosts argument
# check to see it is specified whether to use the slaves or the
# masters file
if [ $# -gt 1 ]
then
  if [ "--hosts" = "$1" ]
  then
    shift
    slavesfile=$1
    shift
    export HADOOP_SLAVES="${HADOOP_CONF_DIR}/$slavesfile"
  fi
fi
2.2 hadoop
This script is the core of the Hadoop scripts: setting up the variables and launching the programs are both done here.
1. Usage declaration
# if no args specified, show usage
if [ $# = 0 ]; then
  echo "Usage: hadoop [--config confdir] COMMAND"
  echo "where COMMAND is one of:"
  echo "  namenode -format     format the DFS filesystem"
  echo "  secondarynamenode    run the DFS secondary namenode"
  echo "  namenode             run the DFS namenode"
  echo "  datanode             run a DFS datanode"
  echo "  dfsadmin             run a DFS admin client"
  echo "  mradmin              run a Map-Reduce admin client"
  echo "  fsck                 run a DFS filesystem checking utility"
  echo "  fs                   run a generic filesystem user client"
  echo "  balancer             run a cluster balancing utility"
  echo "  jobtracker           run the MapReduce job Tracker node"
  echo "  pipes                run a Pipes job"
  echo "  tasktracker          run a MapReduce task Tracker node"
  echo "  job                  manipulate MapReduce jobs"
  echo "  queue                get information regarding JobQueues"
  echo "  version              print the version"
  echo "  jar <jar>            run a jar file"
  echo "  distcp <srcurl> <desturl> copy file or directories recursively"
  echo "  archive -archiveName NAME <src>* <dest> create a hadoop archive"
  echo "  daemonlog            get/set the log level for each daemon"
  echo " or"
  echo "  CLASSNAME            run the class named CLASSNAME"
  echo "Most commands print help when invoked w/o parameters."
  exit 1
fi
2. Setting up the Java runtime environment
The code is straightforward, so I won't reproduce it here. It sets JAVA_HOME, JAVA_HEAP_MAX, CLASSPATH, HADOOP_LOG_DIR, HADOOP_POLICYFILE, and so on. One detail worth noting is its use of IFS, the environment variable that stores the field-separator characters; its default value is whitespace (newline, tab, or space). A simplified sketch is shown below.
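Purely as a simplified sketch (not the verbatim bin/hadoop source), that setup looks roughly like this:

# simplified sketch of the Java setup in bin/hadoop (not the verbatim source)
if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m                      # default heap size
if [ "$HADOOP_HEAPSIZE" != "" ]; then
  JAVA_HEAP_MAX="-Xmx""$HADOOP_HEAPSIZE""m"  # override from hadoop-env.sh
fi

# IFS (the field-separator variable mentioned above) is cleared so that
# filenames containing spaces are handled correctly in the loops below
IFS=

# CLASSPATH starts with the conf dir and is then extended jar by jar
CLASSPATH="${HADOOP_CONF_DIR}"
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
for f in $HADOOP_HOME/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done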
3. Choosing the runtime class based on the command
# figure out which class to run
if [ "$COMMAND" = "namenode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.namenode.NameNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_NAMENODE_OPTS"
elif [ "$COMMAND" = "secondarynamenode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_SECONDARYNAMENODE_OPTS"
elif [ "$COMMAND" = "datanode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS"
elif [ "$COMMAND" = "fs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfsadmin" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSAdmin
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "mradmin" ] ; then
  CLASS=org.apache.hadoop.mapred.tools.MRAdmin
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "fsck" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSck
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "balancer" ] ; then
  CLASS=org.apache.hadoop.hdfs.server.balancer.Balancer
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_BALANCER_OPTS"
elif [ "$COMMAND" = "jobtracker" ] ; then
  CLASS=org.apache.hadoop.mapred.JobTracker
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_JOBTRACKER_OPTS"
elif [ "$COMMAND" = "tasktracker" ] ; then
  CLASS=org.apache.hadoop.mapred.TaskTracker
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_TASKTRACKER_OPTS"
elif [ "$COMMAND" = "job" ] ; then
  CLASS=org.apache.hadoop.mapred.JobClient
elif [ "$COMMAND" = "queue" ] ; then
  CLASS=org.apache.hadoop.mapred.JobQueueClient
elif [ "$COMMAND" = "pipes" ] ; then
  CLASS=org.apache.hadoop.mapred.pipes.Submitter
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "version" ] ; then
  CLASS=org.apache.hadoop.util.VersionInfo
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "jar" ] ; then
  CLASS=org.apache.hadoop.util.RunJar
elif [ "$COMMAND" = "distcp" ] ; then
  CLASS=org.apache.hadoop.tools.DistCp
  CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "daemonlog" ] ; then
  CLASS=org.apache.hadoop.log.LogLevel
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "archive" ] ; then
  CLASS=org.apache.hadoop.tools.HadoopArchives
  CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "sampler" ] ; then
  CLASS=org.apache.hadoop.mapred.lib.InputSampler
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
else
  CLASS=$COMMAND
fi
4. Setting up the native libraries
# setup 'java.library.path' for native-hadoop code if necessary
JAVA_LIBRARY_PATH=''
if [ -d "${HADOOP_HOME}/build/native" -o -d "${HADOOP_HOME}/lib/native" ]; then
  # the current platform is determined by running a Java class -- a neat trick
  JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`
  if [ -d "$HADOOP_HOME/build/native" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi
  if [ -d "${HADOOP_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
fi
5. Running the program
# run it
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"
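One detail worth pointing out: because the line uses exec, the shell process is replaced by the JVM, so the daemon runs directly as the java process rather than as a child of the script. With default settings, starting a namenode amounts to something roughly like the following (the java path and heap size are purely illustrative; the classpath is the one built earlier):

# roughly what the exec line amounts to for "bin/hadoop namenode" (illustrative values)
exec /usr/java/default/bin/java -Xmx1000m $HADOOP_NAMENODE_OPTS \
  -classpath "$CLASSPATH" org.apache.hadoop.hdfs.server.namenode.NameNode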
2.3 hadoop-daemon.sh
Starts or stops, on the local machine, the daemon named by the command argument, by calling the hadoop script. It is actually quite simple.
1. Usage declaration
usage="Usage: hadoop-daemon.sh [--config <conf-dir>] [--hosts hostlistfile] (start|stop) <hadoop-command> <args...>"# if no args specified, show usageif [ $# -le 1 ]; then echo $usage exit 1fi
2. Setting environment variables
It first sources the hadoop-env.sh script and then sets environment variables such as HADOOP_PID_DIR, along with the pid and log file paths used further down; a rough sketch is shown below.
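Roughly, and only as a simplified sketch rather than the verbatim script, that setup looks like this; the $pid and $log variables defined here are the ones used by the case statement below:

# simplified sketch of the environment setup in hadoop-daemon.sh
if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
  . "${HADOOP_CONF_DIR}/hadoop-env.sh"        # source the user's environment settings
fi

if [ "$HADOOP_LOG_DIR" = "" ]; then
  export HADOOP_LOG_DIR="$HADOOP_HOME/logs"   # default log directory
fi
if [ "$HADOOP_PID_DIR" = "" ]; then
  HADOOP_PID_DIR=/tmp                         # default pid directory
fi
if [ "$HADOOP_IDENT_STRING" = "" ]; then
  export HADOOP_IDENT_STRING="$USER"
fi

# per-daemon log and pid file names
log=$HADOOP_LOG_DIR/hadoop-$HADOOP_IDENT_STRING-$command-$HOSTNAME.out
pid=$HADOOP_PID_DIR/hadoop-$HADOOP_IDENT_STRING-$command.pid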
3. Starting or stopping the program
case $startStop in

  (start)

    mkdir -p "$HADOOP_PID_DIR"

    if [ -f $pid ]; then
      # if the daemon is already running, tell the user to stop it first and exit
      if kill -0 `cat $pid` > /dev/null 2>&1; then
        echo $command running as process `cat $pid`. Stop it first.
        exit 1
      fi
    fi

    if [ "$HADOOP_MASTER" != "" ]; then
      echo rsync from $HADOOP_MASTER
      rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' $HADOOP_MASTER/ "$HADOOP_HOME"
    fi

    # rotate any existing log
    hadoop_rotate_log $log
    echo starting $command, logging to $log
    cd "$HADOOP_HOME"
    # start the program through nohup and the bin/hadoop script
    nohup nice -n $HADOOP_NICENESS "$HADOOP_HOME"/bin/hadoop --config $HADOOP_CONF_DIR $command "$@" > "$log" 2>&1 < /dev/null &
    # grab the pid of the newly started process and write it to the pid file
    echo $! > $pid
    sleep 1; head "$log"
    ;;

  (stop)

    if [ -f $pid ]; then
      if kill -0 `cat $pid` > /dev/null 2>&1; then
        echo stopping $command
        kill `cat $pid`
      else
        echo no $command to stop
      fi
    else
      echo no $command to stop
    fi
    ;;

  (*)
    echo $usage
    exit 1
    ;;

esac
2.4 slaves.sh
Runs a given set of commands on all the slave machines (via passwordless ssh); used by the higher-level scripts.
1. Usage declaration
usage="Usage: slaves.sh [--config confdir] command..."# if no args specified, show usageif [ $# -le 0 ]; then echo $usage exit 1fi
2. Setting the list of remote hosts
# If the slaves file is specified in the command line,
# then it takes precedence over the definition in
# hadoop-env.sh. Save it here.
HOSTLIST=$HADOOP_SLAVES

if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
  . "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi

if [ "$HOSTLIST" = "" ]; then
  if [ "$HADOOP_SLAVES" = "" ]; then
    export HOSTLIST="${HADOOP_CONF_DIR}/slaves"
  else
    export HOSTLIST="${HADOOP_SLAVES}"
  fi
fi
3. Running the command on each remote host
# This part matters and is quite clever: it strips comments and blank lines
# from the hosts file, escapes spaces in the command line, runs the command
# on each target host over ssh, and finally waits for the command to finish
# on all hosts before exiting.
for slave in `cat "$HOSTLIST"|sed "s/#.*$//;/^$/d"`; do
 ssh $HADOOP_SSH_OPTS $slave $"${@// /\\ }" \
   2>&1 | sed "s/^/$slave: /" &
 if [ "$HADOOP_SLAVE_SLEEP" != "" ]; then
   sleep $HADOOP_SLAVE_SLEEP
 fi
done

wait
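The "${@// /\\ }" expansion deserves a note: bash replaces every space inside each argument with a backslash-escaped space, so arguments containing spaces survive the extra round of word splitting done by the remote shell that ssh starts. A toy demonstration (not part of the original script):

# toy demonstration of the space-escaping expansion used above
set -- --config "/path/with space/conf" start datanode
echo "${@// /\\ }"
# prints: --config /path/with\ space/conf start datanode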
2.5 hadoop-daemons.sh
Starts the Hadoop daemons on the remote machines, by calling slaves.sh.
1. Usage declaration
# Run a Hadoop command on all slave hosts.

usage="Usage: hadoop-daemons.sh [--config confdir] [--hosts hostlistfile] [start|stop] command args..."

# if no args specified, show usage
if [ $# -le 1 ]; then
  echo $usage
  exit 1
fi
2. Invoking the command on the remote hosts
# implemented through slaves.sh
exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_HOME" \; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$@"
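The escaped semicolon (\;) is the key trick here: it reaches slaves.sh as a literal argument, so what each slave finally runs over ssh is effectively "cd into HADOOP_HOME, then invoke hadoop-daemon.sh". With an install under /usr/local/hadoop (an example path only), that remote command would look like:

# what each slave effectively executes over ssh (paths are illustrative)
cd /usr/local/hadoop ; /usr/local/hadoop/bin/hadoop-daemon.sh --config /usr/local/hadoop/conf start datanode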
2.6 start-dfs.sh
Starts the namenode on the local machine (the one this script is invoked on), the datanodes on the slaves machines, and the secondarynamenode on the machines listed in the masters file, by calling hadoop-daemon.sh and hadoop-daemons.sh.
1. Usage declaration
# Start hadoop dfs daemons.
# Optionally upgrade or rollback dfs state.
# Run this on master node.

usage="Usage: start-dfs.sh [-upgrade|-rollback]"
2. Starting the daemons
# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts

# start the namenode on the local machine (the one invoking this script)
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt

# start the datanodes on the slaves machines
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt

# start the secondarynamenode on the machines listed in the masters file
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
2.7 start-mapred.sh
Starts the jobtracker on the local machine (the one this script is invoked on) and the tasktrackers on the slaves machines, by calling hadoop-daemon.sh and hadoop-daemons.sh.
# start mapred daemons
# start jobtracker first to minimize connection errors at startup

# start the jobtracker on the local machine (the one invoking this script)
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker

# start the tasktrackers on the slaves machines
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker
The remaining scripts are all very simple and need no detailed explanation; a quick read is enough to follow them. See, for example, the sketch of start-all.sh below.
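As that example, start-all.sh boils down to little more than the two calls already described; roughly (a simplified sketch, not the verbatim script):

# simplified sketch of start-all.sh
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
. "$bin"/hadoop-config.sh

# start the HDFS daemons, then the MapReduce daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR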
One last thing: the shell-interpreter declaration used at the top of the Hadoop scripts.
#!/usr/bin/env bash
Its purpose is to adapt to different Linux systems: env looks up bash on the PATH, so the script can always find a bash shell to interpret it, wherever bash happens to be installed. Quite handy.