Hadoop in the Big Data Era (1): Hadoop Installation
If you want to understand hadoop better, you should first understand how its scripts start and stop it. After all, Hadoop is a distributed storage and computing framework, so how is that distributed environment started and managed? Let's begin with the scripts. To be honest, the hadoop startup scripts are very well written and remarkably thorough; they handle details such as spaces in paths and symbolic links.
1. hadoop script Introduction
The hadoop scripts live in the bin and conf directories under $HADOOP_HOME. The main ones are introduced below:
Under the bin directory
hadoop: the core script; every distributed program is ultimately launched through it.
hadoop-config.sh: a basic script that the other scripts embed (source) in order to parse the optional command-line parameters --config (the path of the hadoop conf directory) and --hosts.
hadoop-daemon.sh: starts or stops, on the local machine, the distributed program named by its command argument, by calling the hadoop script.
hadoop-daemons.sh: starts the hadoop distributed programs on all slave machines by calling slaves.sh.
slaves.sh: runs a given command on all slave machines (over passwordless SSH), for use by the higher-level scripts.
start-dfs.sh: starts the namenode on the local machine, datanodes on the slaves machines, and the secondarynamenode on the masters machines, by calling hadoop-daemon.sh and hadoop-daemons.sh.
start-mapred.sh: starts the jobtracker on the local machine and tasktrackers on the slaves machines, again by calling hadoop-daemon.sh and hadoop-daemons.sh.
start-all.sh: starts all of the hadoop distributed programs by calling start-dfs.sh and start-mapred.sh.
start-balancer.sh: starts the balancer, which rebalances the data stored across the nodes of the distributed environment.
The corresponding stop-* scripts mirror the start scripts, so there is no need to go through them.
Under the conf directory
hadoop-env.sh: sets the environment variables hadoop needs at runtime, such as JAVA_HOME, HADOOP_LOG_DIR, and HADOOP_PID_DIR.
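For illustration, a minimal hadoop-env.sh might contain entries like the following; the values shown here (in particular the JAVA_HOME path) are placeholders, not recommendations:

# illustrative hadoop-env.sh entries (all values are placeholders)
export JAVA_HOME=/usr/lib/jvm/java-6-sun      # hypothetical JDK location
export HADOOP_HEAPSIZE=1000                   # daemon heap size in MB
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs     # where daemon logs are written
export HADOOP_PID_DIR=/var/hadoop/pids        # where daemon pid files are kept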
2. The charm of the scripts (detailed walkthrough)
Hadoop's scripts really are well written; I have to admire them, and I learned a lot from reading them.
2.1 hadoop-config.sh
This script is relatively simple. Almost all of the other scripts embed it with ". $bin/hadoop-config.sh", so it does not need to declare an interpreter (a shebang line) at its top; sourcing it this way is equivalent to copying its contents into the parent script and running them in the same interpreter.
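As a sketch of that calling pattern (based on the 0.20/1.x-era layout this article describes, so details may differ in your version), each script first resolves its own directory and then sources hadoop-config.sh:

# typical prologue of the other bin/ scripts (a sketch)
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
# sourcing runs hadoop-config.sh in the current shell, so the variables it
# exports (HADOOP_HOME, HADOOP_CONF_DIR, ...) remain visible afterwards
. "$bin"/hadoop-config.sh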
This script mainly includes three parts:
1. Symlink resolution and absolute path resolution
# resolve symlinks
this="$0"
while [ -h "$this" ]; do
  ls=`ls -ld "$this"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    this="$link"
  else
    this=`dirname "$this"`/"$link"
  fi
done

# convert relative path to absolute path
bin=`dirname "$this"`
script=`basename "$this"`
bin=`cd "$bin"; pwd`
this="$bin/$script"

# the root of the Hadoop installation
export HADOOP_HOME=`dirname "$this"`/..
2. Parsing and assigning the optional command-line parameter --config
#check to see if the conf dir is given as an optional argument
if [ $# -gt 1 ]
then
  if [ "--config" = "$1" ]
  then
    shift
    confdir=$1
    shift
    HADOOP_CONF_DIR=$confdir
  fi
fi
3. Parsing and assigning the optional command-line parameter --hosts
#check to see it is specified whether to use the slaves or the
# masters file
if [ $# -gt 1 ]
then
  if [ "--hosts" = "$1" ]
  then
    shift
    slavesfile=$1
    shift
    export HADOOP_SLAVES="${HADOOP_CONF_DIR}/$slavesfile"
  fi
fi
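To see how these two options are used in practice, here are two illustrative invocations (the configuration path is hypothetical):

# use an alternative conf directory instead of $HADOOP_HOME/conf
bin/hadoop --config /opt/hadoop/conf.cluster fs -ls /
# run daemons on the hosts listed in conf/masters rather than conf/slaves
bin/hadoop-daemons.sh --config /opt/hadoop/conf.cluster --hosts masters start secondarynamenode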
2.2 hadoop
This script is the core of the whole suite: it sets up the runtime variables and launches the requested program.
1. Declare usage
# if no args specified, show usage
if [ $# = 0 ]; then
  echo "Usage: hadoop [--config confdir] COMMAND"
  echo "where COMMAND is one of:"
  echo "  namenode -format     format the DFS filesystem"
  echo "  secondarynamenode    run the DFS secondary namenode"
  echo "  namenode             run the DFS namenode"
  echo "  datanode             run a DFS datanode"
  echo "  dfsadmin             run a DFS admin client"
  echo "  mradmin              run a Map-Reduce admin client"
  echo "  fsck                 run a DFS filesystem checking utility"
  echo "  fs                   run a generic filesystem user client"
  echo "  balancer             run a cluster balancing utility"
  echo "  jobtracker           run the MapReduce job Tracker node"
  echo "  pipes                run a Pipes job"
  echo "  tasktracker          run a MapReduce task Tracker node"
  echo "  job                  manipulate MapReduce jobs"
  echo "  queue                get information regarding JobQueues"
  echo "  version              print the version"
  echo "  jar <jar>            run a jar file"
  echo "  distcp <srcurl> <desturl> copy file or directories recursively"
  echo "  archive -archiveName NAME <src>* <dest> create a hadoop archive"
  echo "  daemonlog            get/set the log level for each daemon"
  echo " or"
  echo "  CLASSNAME            run the class named CLASSNAME"
  echo "Most commands print help when invoked w/o parameters."
  exit 1
fi
2. Set the Java Runtime Environment
The code is simple, so I won't reproduce it all here; it sets JAVA_HOME, JAVA_HEAP_MAX, CLASSPATH, HADOOP_LOG_DIR, HADOOP_POLICYFILE, and so on.
IFS is the environment variable that defines the shell's field separators; its default value is whitespace (newline, tab, or space). The script clears it so that file names containing spaces are handled correctly in the classpath loops.
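For reference, that part of the script looks roughly like the following; this is paraphrased from memory rather than copied from the source, so treat it as a sketch:

# a sketch of the JVM and classpath setup in bin/hadoop
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

# let HADOOP_HEAPSIZE (set in hadoop-env.sh) override the default heap size
if [ "$HADOOP_HEAPSIZE" != "" ]; then
  JAVA_HEAP_MAX="-Xmx""$HADOOP_HEAPSIZE""m"
fi

# the classpath starts with the conf directory, then picks up the hadoop jars
CLASSPATH="${HADOOP_CONF_DIR}"
for f in $HADOOP_HOME/hadoop-*.jar $HADOOP_HOME/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done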
3. Choose the class to run according to the COMMAND argument
# figure out which class to run
if [ "$COMMAND" = "namenode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.namenode.NameNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_NAMENODE_OPTS"
elif [ "$COMMAND" = "secondarynamenode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_SECONDARYNAMENODE_OPTS"
elif [ "$COMMAND" = "datanode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS"
elif [ "$COMMAND" = "fs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfsadmin" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSAdmin
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "mradmin" ] ; then
  CLASS=org.apache.hadoop.mapred.tools.MRAdmin
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "fsck" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSck
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "balancer" ] ; then
  CLASS=org.apache.hadoop.hdfs.server.balancer.Balancer
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_BALANCER_OPTS"
elif [ "$COMMAND" = "jobtracker" ] ; then
  CLASS=org.apache.hadoop.mapred.JobTracker
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_JOBTRACKER_OPTS"
elif [ "$COMMAND" = "tasktracker" ] ; then
  CLASS=org.apache.hadoop.mapred.TaskTracker
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_TASKTRACKER_OPTS"
elif [ "$COMMAND" = "job" ] ; then
  CLASS=org.apache.hadoop.mapred.JobClient
elif [ "$COMMAND" = "queue" ] ; then
  CLASS=org.apache.hadoop.mapred.JobQueueClient
elif [ "$COMMAND" = "pipes" ] ; then
  CLASS=org.apache.hadoop.mapred.pipes.Submitter
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "version" ] ; then
  CLASS=org.apache.hadoop.util.VersionInfo
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "jar" ] ; then
  CLASS=org.apache.hadoop.util.RunJar
elif [ "$COMMAND" = "distcp" ] ; then
  CLASS=org.apache.hadoop.tools.DistCp
  CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "daemonlog" ] ; then
  CLASS=org.apache.hadoop.log.LogLevel
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "archive" ] ; then
  CLASS=org.apache.hadoop.tools.HadoopArchives
  CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "sampler" ] ; then
  CLASS=org.apache.hadoop.mapred.lib.InputSampler
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
else
  CLASS=$COMMAND
fi
4. Set up the native library path
# setup 'java.library.path' for native-hadoop code if necessary
JAVA_LIBRARY_PATH=''
if [ -d "${HADOOP_HOME}/build/native" -o -d "${HADOOP_HOME}/lib/native" ]; then
  # determine the current platform by running a java class (a nice trick)
  JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`

  if [ -d "$HADOOP_HOME/build/native" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi

  if [ -d "${HADOOP_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
fi
5. Run distributed programs
# run it
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"
2.3 hadoop-daemon.sh
This script starts or stops, on the local machine, the distributed program named by its command argument, by calling the hadoop script. It is actually quite simple.
1. Declare usage
usage="Usage: hadoop-daemon.sh [--config <conf-dir>] [--hosts hostlistfile] (start|stop)
2. Set Environment Variables
First the embedded hadoop-env.sh script is run (sourced), and then environment variables such as HADOOP_PID_DIR are set.
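For context, the pid and log file names that the start/stop logic below relies on are derived roughly as follows (a sketch; the exact defaults vary between versions):

# sketch of the pid/log naming in hadoop-daemon.sh
if [ "$HADOOP_PID_DIR" = "" ]; then
  HADOOP_PID_DIR=/tmp
fi
if [ "$HADOOP_IDENT_STRING" = "" ]; then
  export HADOOP_IDENT_STRING="$USER"
fi
log=$HADOOP_LOG_DIR/hadoop-$HADOOP_IDENT_STRING-$command-$HOSTNAME.out
pid=$HADOOP_PID_DIR/hadoop-$HADOOP_IDENT_STRING-$command.pid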
3. Start or stop the program
case $startStop in

  (start)

    mkdir -p "$HADOOP_PID_DIR"

    if [ -f $pid ]; then
      # if the daemon is already running, refuse to start it again and exit
      if kill -0 `cat $pid` > /dev/null 2>&1; then
        echo $command running as process `cat $pid`.  Stop it first.
        exit 1
      fi
    fi

    if [ "$HADOOP_MASTER" != "" ]; then
      echo rsync from $HADOOP_MASTER
      rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' $HADOOP_MASTER/ "$HADOOP_HOME"
    fi

    # rotate the existing log
    hadoop_rotate_log $log
    echo starting $command, logging to $log
    cd "$HADOOP_HOME"
    # start the program via nohup and the bin/hadoop script
    nohup nice -n $HADOOP_NICENESS "$HADOOP_HOME"/bin/hadoop --config $HADOOP_CONF_DIR $command "$@" > "$log" 2>&1 < /dev/null &
    # capture the pid of the newly started process and write it to the pid file
    echo $! > $pid
    sleep 1; head "$log"
    ;;

  (stop)

    if [ -f $pid ]; then
      if kill -0 `cat $pid` > /dev/null 2>&1; then
        echo stopping $command
        kill `cat $pid`
      else
        echo no $command to stop
      fi
    else
      echo no $command to stop
    fi
    ;;

  (*)
    echo $usage
    exit 1
    ;;

esac
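For example, starting and later stopping a single daemon on the local machine looks like this (an illustrative invocation):

# start and later stop a namenode on the local machine
bin/hadoop-daemon.sh --config conf start namenode
bin/hadoop-daemon.sh --config conf stop namenode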
2.4 slaves.sh
This script runs a given command on all slave machines (logging in via passwordless SSH), for use by the higher-level scripts.
1. Declare usage
usage="Usage: slaves.sh [--config confdir] command..."# if no args specified, show usageif [ $# -le 0 ]; then echo $usage exit 1fi
2. Set the remote host list
# If the slaves file is specified in the command line,
# then it takes precedence over the definition in
# hadoop-env.sh. Save it here.
HOSTLIST=$HADOOP_SLAVES

if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
  . "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi

if [ "$HOSTLIST" = "" ]; then
  if [ "$HADOOP_SLAVES" = "" ]; then
    export HOSTLIST="${HADOOP_CONF_DIR}/slaves"
  else
    export HOSTLIST="${HADOOP_SLAVES}"
  fi
fi
3. Execute the command on the remote hosts
# This part is quite important and fairly sophisticated: it strips comments
# (the sed removes everything after #) and blank lines from the host list file,
# escapes the spaces in the command-line arguments, runs the command on each
# target host over ssh, and finally waits for the command to finish on all
# target hosts before exiting.
for slave in `cat "$HOSTLIST"|sed "s/#.*$//;/^$/d"`; do
  ssh $HADOOP_SSH_OPTS $slave $"${@// /\\ }" 2>&1 | sed "s/^/$slave: /" &
  if [ "$HADOOP_SLAVE_SLEEP" != "" ]; then
    sleep $HADOOP_SLAVE_SLEEP
  fi
done

wait
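A couple of illustrative uses of slaves.sh (the commands and the directory are arbitrary examples):

# print the load of every host listed in the slaves file
bin/slaves.sh --config conf uptime
# create a scratch directory on every slave (hypothetical path)
bin/slaves.sh --config conf mkdir -p /tmp/hadoop-scratch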
2.5 hadoop-daemons.sh
This script starts the hadoop distributed programs on the remote machines; it is implemented by calling slaves.sh.
1. Declare usage
# Run a Hadoop command on all slave hosts.

usage="Usage: hadoop-daemons.sh [--config confdir] [--hosts hostlistfile] [start|stop] command args..."

# if no args specified, show usage
if [ $# -le 1 ]; then
  echo $usage
  exit 1
fi
2. Run the command on the remote hosts
# implemented via slaves.sh
exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_HOME" \; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$@"
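To make the cd "$HADOOP_HOME" \; trick concrete: for each host in the host list, slaves.sh effectively runs something like the following (hostname and paths are hypothetical); the escaped semicolon survives to the remote shell, which then runs the cd and the daemon script as two commands:

# roughly what ends up running on one slave when datanodes are started
ssh slave1 'cd /opt/hadoop ; /opt/hadoop/bin/hadoop-daemon.sh --config /opt/hadoop/conf start datanode'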
2.6 start-dfs.sh
Starts the namenode on the local machine (the host that calls this script), datanodes on the slaves machines, and the secondarynamenode on the masters machines, by calling hadoop-daemon.sh and hadoop-daemons.sh.
1. Declare the usage
# Start hadoop dfs daemons.
# Optionally upgrade or rollback dfs state.
# Run this on master node.

usage="Usage: start-dfs.sh [-upgrade|-rollback]"
2. Start the program
# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts

# start the namenode on the local machine (the host that calls this script)
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
# start datanodes on the slaves machines
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
# start the secondarynamenode on the masters machines
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
2.7 start-mapred.sh
Starts the jobtracker on the local machine (the host that calls this script) and tasktrackers on the slaves machines, by calling hadoop-daemon.sh and hadoop-daemons.sh.
# start mapred daemons
# start jobtracker first to minimize connection errors at startup

# start the jobtracker on the local machine (the host that calls this script)
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
# start tasktrackers on the slaves machines
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker
The remaining scripts are very simple and need no detailed explanation; you can understand them just by reading them.
By the way, a word about the interpreter line at the top of the hadoop scripts:
#!/usr/bin/env bash
This makes the scripts portable across different Linux systems: env looks up bash in the PATH and uses it to interpret and execute the script.
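A quick way to see the effect (an illustrative command, nothing hadoop-specific):

# env searches the PATH for bash, so the scripts also work on systems
# where bash is not installed at /bin/bash
env bash -c 'echo "interpreted by bash $BASH_VERSION"'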
Hadoop in the Big Data Era (2): Hadoop Script Parsing