Objective
This article describes how to install, configure, and manage a real-world Hadoop cluster, which can scale from a few nodes up to extremely large clusters with thousands of nodes.
If you want to install Hadoop on a single machine, you can find the details here.
Prerequisites
Ensure that all required software is installed on each node in your cluster.
Get the Hadoop package.
Installation
Installing a Hadoop cluster typically involves unpacking the software onto all the machines in the cluster.
Typically, one machine in the cluster is designated as the NameNode and a different machine as the JobTracker; these are the masters. The remaining machines act as both DataNode and TaskTracker; these are the slaves.
We use HADOOP_HOME to refer to the root of the installation. Typically, all machines in the cluster share the same HADOOP_HOME path.
Configuration
The following sections describe how to configure the Hadoop cluster.
Configuration Files
Hadoop configuration is driven by two important configuration files in the conf/ directory:
hadoop-default.xml - Read-only default configuration.
hadoop-site.xml - Site-specific (cluster-specific) configuration.
To learn more about how these configuration files affect the Hadoop framework, see here.
In addition, you can control the Hadoop scripts in the bin/ directory by setting cluster-specific values in conf/hadoop-env.sh.
Cluster Configuration
To configure the Hadoop cluster, you need to set both the environment in which the Hadoop daemons run and their runtime parameters.
The Hadoop daemons are the NameNode/DataNode and the JobTracker/TaskTracker.
Configuring the Environment of the Hadoop Daemons
Administrators should use the conf/hadoop-env.sh script to set site-specific customizations of the Hadoop daemons' process environment.
At the very least, you must set JAVA_HOME so that it is correctly defined on every remote node.
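For example, a minimal hadoop-env.sh entry might look like the following; the JDK path shown is only an assumption and should be adjusted to wherever Java is installed on your nodes:
# conf/hadoop-env.sh
# Example path only; point this at your actual JDK installation.
export JAVA_HOME=/usr/lib/jvm/java-6-sun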
Administrators can configure each daemon individually via the HADOOP_*_OPTS options. The following table lists the option for each daemon.
Daemon - Configuration option
NameNode - HADOOP_NAMENODE_OPTS
DataNode - HADOOP_DATANODE_OPTS
SecondaryNameNode - HADOOP_SECONDARYNAMENODE_OPTS
JobTracker - HADOOP_JOBTRACKER_OPTS
TaskTracker - HADOOP_TASKTRACKER_OPTS
For example, to configure the NameNode to use the parallel garbage collector (parallelGC), add the following line to hadoop-env.sh:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
Other common parameters that can be customized include:
HADOOP_LOG_DIR - The directory where the daemons' log files are stored. It is created automatically if it does not exist.
HADOOP_HEAPSIZE - The maximum heap size to use, in MB, e.g. 1000. This parameter sets the heap size of the Hadoop daemons; the default is 1000MB.
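As a sketch, these could be set in hadoop-env.sh alongside the other exports (the values shown are arbitrary examples, not recommendations):
# conf/hadoop-env.sh
export HADOOP_LOG_DIR=/var/log/hadoop   # example location; use any writable path
export HADOOP_HEAPSIZE=2000             # heap size in MB for the Hadoop daemons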
Configuring the Runtime Parameters of the Hadoop Daemons
This section covers the important parameters of the Hadoop cluster, which are specified in conf/hadoop-site.xml.
Parameter - Value - Notes
fs.default.name - URI of the NameNode - hdfs://hostname/
mapred.job.tracker - Host (or IP) and port of the JobTracker - host:port
dfs.name.dir - Path on the local file system where the NameNode persistently stores the namespace and transaction logs - If this is a comma-separated list of directories, the name table is replicated to all of them for redundancy.
dfs.data.dir - Comma-separated list of local file system paths where the DataNode stores its blocks - If this is a comma-separated list of directories, data is stored in all of them, typically on different devices.
mapred.system.dir - HDFS path where the Map/Reduce framework stores its system files, e.g. /hadoop/mapred/system/ - This path is on the default file system (HDFS) and must be accessible from both servers and clients.
mapred.local.dir - Comma-separated list of local file system paths where Map/Reduce temporary data is written - Multiple paths help spread disk I/O.
mapred.tasktracker.{map|reduce}.tasks.maximum - The maximum number of Map/Reduce tasks run simultaneously on a TaskTracker - The default is 2 (2 maps and 2 reduces); adjust it to your hardware.
dfs.hosts/dfs.hosts.exclude - List of permitted/excluded DataNodes - If necessary, use these files to control the list of allowed DataNodes.
mapred.hosts/mapred.hosts.exclude - List of permitted/excluded TaskTrackers - If necessary, use these files to control the list of allowed TaskTrackers.
Typically, all of these parameters are marked final to ensure that they cannot be overridden by user applications.
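As an illustrative sketch of what a conf/hadoop-site.xml entry looks like, each parameter is declared as a property and can be marked final; the host names and ports below are placeholders only:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000/</value>  <!-- placeholder host/port -->
    <final>true</final>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>        <!-- placeholder host/port -->
    <final>true</final>
  </property>
</configuration>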
Real-World Cluster Configuration
This section lists some of the non-default configuration values used to run the sort benchmark on a large cluster.
Non-default configuration values used to run sort900, i.e. sorting 9TB of data on a cluster of 900 nodes:
Parameter - Value - Notes
dfs.block.size - 134217728 - An HDFS block size of 128MB for large file systems.
dfs.namenode.handler.count - 40 - More NameNode server threads to handle RPCs from a large number of DataNodes.
mapred.reduce.parallel.copies - 20 - Reduces launch more parallel copies to fetch the outputs of a large number of maps.
mapred.child.java.opts - -Xmx512M - A larger heap for the map/reduce child JVMs.
fs.inmemory.size.mb - 200 - More memory for the in-memory file system used to merge map outputs at the reduces.
io.sort.factor - 100 - More streams merged at once while sorting files.
io.sort.mb - 200 - A higher memory limit while sorting data.
io.file.buffer.size - 131072 - The read/write buffer size used in SequenceFiles.
Configuration values updated to run sort1400 (sorting 14TB of data on 1400 nodes) and sort2000 (sorting 20TB of data on 2000 nodes):
Parameter - Value - Notes
mapred.job.tracker.handler.count - 60 - More JobTracker server threads to handle RPCs from a large number of TaskTrackers.
mapred.reduce.parallel.copies - 50
tasktracker.http.threads - 50 - More worker threads for the TaskTracker's HTTP server; reduces fetch the intermediate map outputs over HTTP.
mapred.child.java.opts - -Xmx1024M - A larger heap for the map/reduce child JVMs.
Slaves
Typically, you choose one machine in the cluster to act as the NameNode and a different machine to act as the JobTracker. The remaining machines, acting as both DataNode and TaskTracker, are referred to as slaves.
List all slave hostnames or IP addresses in your conf/slaves file, one per line.
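For example, a conf/slaves file for a small cluster might look like the following (the hostnames are placeholders):
slave01.example.com
slave02.example.com
slave03.example.com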
Logging
Hadoop uses Apache log4j for logging, via the Apache Commons Logging framework. Edit the conf/log4j.properties file to customize the logging configuration (log format and so on) of the Hadoop daemons.
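For instance, to get more verbose output from a particular daemon you can raise its logger level; this is standard log4j syntax and assumes the stock log4j.properties shipped with Hadoop:
# conf/log4j.properties
# Example only: log the JobTracker at DEBUG level while leaving everything else unchanged.
log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG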
History Logging
The job history files are stored in the location set by hadoop.job.history.location, which can also be a path on the distributed file system; the default value is ${HADOOP_LOG_DIR}/history. The JobTracker web UI links to the history web UI.
The history files are also logged to a user-specified directory, hadoop.job.history.user.location, which defaults to the job's output directory. The files are stored in a "_logs/history/" subdirectory of the specified path, so by default they end up under "mapred.output.dir/_logs/history/". If hadoop.job.history.user.location is set to the value none, the system stops logging history in that location.
Users can view the history logs summary for a given directory using the following command:
$ bin/hadoop job -history output-dir
This command prints job details, as well as details of failed and killed tasks.
More details about the job, such as successful tasks and the task attempts made for each task, can be viewed using the following command:
$ bin/hadoop job -history all output-dir
Once all the necessary configuration is complete, distribute the files to the HADOOP_CONF_DIR directory on all machines, typically ${HADOOP_HOME}/conf.
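One way to do this, assuming passwordless SSH to the slaves and the same ${HADOOP_HOME} everywhere, is a simple rsync loop over the slaves file; this is just a sketch, not part of Hadoop itself:
# Push the local conf/ directory to every host listed in conf/slaves.
for host in $(cat ${HADOOP_HOME}/conf/slaves); do
  rsync -az ${HADOOP_HOME}/conf/ ${host}:${HADOOP_HOME}/conf/
done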
Rack Awareness of Hadoop
Both the HDFS and the Map/Reduce components are rack-aware.
The NameNode and the JobTracker obtain the rack id of each slave in the cluster by invoking an API, resolve, in an administrator-configured module. The API resolves the slave's DNS name (or IP address) to a rack id. Which module to use is specified by the configuration item topology.node.switch.mapping.impl. The default implementation of this module invokes a script/command specified by the topology.script.file.name configuration item. If topology.script.file.name is not set, the module returns /default-rack as the rack id for every IP address passed to it. The Map/Reduce part has one additional configuration item, mapred.cache.task.levels, which determines the number of cache levels (in the network topology). For example, with the default value of 2, two levels of caches are built: one for hosts (the host -> task mapping) and another for racks (the rack -> task mapping).
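The topology script receives a list of IP addresses (or hostnames) as arguments and is expected to print one rack id per argument. A minimal sketch, assuming your nodes live in the 10.1.1.x and 10.1.2.x subnets (the subnets and rack names are purely illustrative):
#!/bin/sh
# Map each argument (IP address or hostname) to a rack id, one per line.
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo /rack1 ;;
    10.1.2.*) echo /rack2 ;;
    *)        echo /default-rack ;;
  esac
done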
Start Hadoop
Starting the Hadoop cluster requires starting the HDFs cluster and the Map/reduce cluster.
Format a new distributed file system:
$ bin/hadoop namenode -format
On the designated NameNode, run the following command to start HDFS:
$ bin/start-dfs.sh
The bin/start-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on every slave listed there.
On the designated JobTracker, run the following command to start Map/Reduce:
$ bin/start-mapred.sh
The bin/start-mapred.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on every slave listed there.
Stop Hadoop
On the designated NameNode, run the following command to stop HDFS:
$ bin/stop-dfs.sh
The bin/stop-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on every slave listed there.
On the designated JobTracker, run the following command to stop Map/Reduce:
$ bin/stop-mapred.sh
The bin/stop-mapred.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on every slave listed there.