Purpose
This article describes how to install, configure, and manage a non-trivial Hadoop cluster, from a small cluster of a few nodes to an extremely large cluster of thousands of nodes.
If you want to install Hadoop on a single machine, you can find the details here.
Prerequisites
Ensure that all required software is installed on every node in your cluster.
Get the Hadoop release package.
Installation
Installing a Hadoop cluster typically involves unpacking the release software on all the machines in the cluster.
Typically, one machine in the cluster is designated as the NameNode and another machine as the JobTracker; these are the masters. The remaining machines act as both DataNode and TaskTracker; these are the slaves.
We use HADOOP_HOME to refer to the root path of the installation. In general, all machines in the cluster share the same HADOOP_HOME path.
Configuration
The next sections describe how to configure a Hadoop cluster.
Configuration files
Hadoop is configured through two important configuration files under the conf/ directory:
hadoop-default.xml - the read-only default configuration.
hadoop-site.xml - the site-specific (cluster-specific) configuration.
To learn more about how these configuration files affect the Hadoop framework, see here.
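For illustration, a minimal conf/hadoop-site.xml that overrides a single default might look like the sketch below; hadoop.tmp.dir is used only as an example property and the path is hypothetical:

<?xml version="1.0"?>
<configuration>
  <property>
    <!-- Overrides the value shipped in hadoop-default.xml. -->
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>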
In addition, you can control the Hadoop scripts under the bin/ directory by setting site-specific values for the variables in conf/hadoop-env.sh.
Cluster configuration
To configure a Hadoop cluster, you need to set both the environment in which the Hadoop daemons run and their runtime parameters.
The Hadoop daemons are the NameNode/DataNode and the JobTracker/TaskTracker.
Configure the runtime environment of the Hadoop daemons
Administrators can specify the environment of the Hadoop daemons in the conf/hadoop-env.sh script.
At the very least, JAVA_HOME must be set correctly on each remote node.
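For example, a single line in conf/hadoop-env.sh is sufficient; the JDK path below is hypothetical and depends on where Java is installed on your nodes:

export JAVA_HOME=/usr/lib/jvm/java-6-sun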
Administrators can configure each daemon individually through the HADOOP_*_OPTS options. The following table lists the option for each daemon.
Daemon | Configuration option
NameNode | HADOOP_NAMENODE_OPTS
DataNode | HADOOP_DATANODE_OPTS
SecondaryNameNode | HADOOP_SECONDARYNAMENODE_OPTS
JobTracker | HADOOP_JOBTRACKER_OPTS
TaskTracker | HADOOP_TASKTRACKER_OPTS
For example, to configure the NameNode to use the parallel garbage collector (parallelGC), add the following line to hadoop-env.sh:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
Other common parameters that can be customized include the following; an example setting both appears after the list.
HADOOP_LOG_DIR - the directory where the daemons' log files are stored. It is created automatically if it does not exist.
HADOOP_HEAPSIZE - the maximum heap size to use, in megabytes, for the Hadoop daemons. The default is 1000 MB.
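Both can be set in conf/hadoop-env.sh; the log directory and heap size below are purely illustrative:

export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_HEAPSIZE=2000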
Configure the runtime parameters of the Hadoop daemons
This section covers the important parameters that are specified in conf/hadoop-site.xml for a Hadoop cluster; a minimal hadoop-site.xml using two of these parameters is sketched after the table.
Parameter | Value | Notes
fs.default.name | The URI of the NameNode. | hdfs://hostname/
mapred.job.tracker | The host (or IP) and port of the JobTracker. | host:port pair.
dfs.name.dir | The path on the local file system where the NameNode persistently stores the namespace and transaction logs. | If this is a comma-delimited list of directories, the name table is replicated in all of them, for redundancy.
dfs.data.dir | Comma-separated list of paths on the local file system of a DataNode where it stores its blocks. | If this is a comma-delimited list of directories, data is stored in all of them, typically on different devices.
mapred.system.dir | The HDFS path where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/. | This path is on the default file system (HDFS) and must be accessible from both the servers and the clients.
mapred.local.dir | Comma-delimited list of paths on the local file system where temporary Map/Reduce data is written. | Multiple paths help spread disk I/O.
mapred.tasktracker.{map|reduce}.tasks.maximum | The maximum number of map/reduce tasks that can run concurrently on a TaskTracker. | The default is 2 (2 maps and 2 reduces); change it to suit your hardware.
dfs.hosts/dfs.hosts.exclude | List of permitted/excluded DataNodes. | If necessary, use these files to control the list of allowed DataNodes.
mapred.hosts/mapred.hosts.exclude | List of permitted/excluded TaskTrackers. | If necessary, use these files to control the list of allowed TaskTrackers.
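As a hedged sketch of how two of the mandatory parameters above might appear in conf/hadoop-site.xml (the hostnames and ports are hypothetical placeholders):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000/</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
    <final>true</final>
  </property>
</configuration>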
Typically, the above parameters are marked final to ensure that they cannot be overridden by user applications.
Real-world cluster configurations
This section lists some of the non-default configuration values used to run the sort benchmark on large clusters.
Some non-default configuration values used to run sort900, i.e. sorting 9TB of data on a 900-node cluster:
Parameter | Value | Notes
dfs.block.size | 134217728 | An HDFS block size of 128MB for large file systems.
dfs.namenode.handler.count | 40 | More NameNode server threads to handle RPCs from a large number of DataNodes.
mapred.reduce.parallel.copies | 20 | Reduces launch more parallel copies to fetch the outputs of a large number of maps.
mapred.child.java.opts | -Xmx512M | A larger heap for the child JVMs of maps/reduces.
fs.inmemory.size.mb | 200 | More memory for the in-memory file system used by the reduces to merge map outputs.
io.sort.factor | 100 | More streams are merged at once while sorting files.
io.sort.mb | 200 | A higher memory limit while sorting data.
io.file.buffer.size | 131072 | The size of the read/write buffer used in SequenceFiles.
The configuration values updated to run sort1400 and sort2000, i.e. sorting 14TB of data on 1400 nodes and 20TB of data on 2000 nodes:
Parameter | Value | Notes
mapred.job.tracker.handler.count | 60 | More JobTracker server threads to handle RPCs from a large number of TaskTrackers.
mapred.reduce.parallel.copies | 50 |
tasktracker.http.threads | 50 | More worker threads for the TaskTracker's HTTP server. Reduces fetch the intermediate map outputs through this HTTP server.
mapred.child.java.opts | -Xmx1024M | A larger heap for the child JVMs of maps/reduces.
Slaves
Typically you choose one machine in the cluster to act as the NameNode and a different machine to act as the JobTracker. The rest of the machines act as both DataNode and TaskTracker and are referred to as slaves.
List the hostname or IP address of every slave in the conf/slaves file, one per line, as in the example below.
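A sketch of a conf/slaves file with hypothetical hostnames:

slave01.example.com
slave02.example.com
slave03.example.com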
Logging
Hadoop uses Apache log4j, via the Apache Commons Logging framework, for logging. Edit the conf/log4j.properties file to customize the logging configuration (log format and so on) of the Hadoop daemons.
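For example, one way to raise the verbosity of a single component is to add a logger line to conf/log4j.properties; treat the class name below as illustrative for your Hadoop version:

log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG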
History logs
The job history files are stored centrally at the location given by hadoop.job.history.location, which may also be a path on the distributed file system; the default value is ${HADOOP_LOG_DIR}/history. The JobTracker web UI contains a link to the history web UI.
The history files are also logged to the user-specified directory hadoop.job.history.user.location, which defaults to the job's output directory. The files are stored in a "_logs/history/" subdirectory of the specified path, so by default they end up under "mapred.output.dir/_logs/history/". If hadoop.job.history.user.location is set to the special value none, these logs are no longer written.
Users can view a summary of the history logs in a given directory with the following command:
$ bin/hadoop job -history output-dir
This command prints job details as well as details of failed and killed tasks.
More details about the job, such as the successful tasks and the attempts made for each task, can be viewed with the following command:
$ bin/hadoop job -history all output-dir
Once all of the necessary configuration is done, distribute these files to the HADOOP_CONF_DIR directory on all machines, typically ${HADOOP_HOME}/conf.
Rack Awareness for Hadoop
The HDFS and Map/Reduce components are rack-aware.
The NameNode and the JobTracker obtain the rack id of each slave in the cluster by invoking the API resolve in an administrator-configured module. The API converts the slave's DNS name (or IP address) into a rack id. Which module to use is specified by the configuration item topology.node.switch.mapping.impl. The default implementation of this module runs a script/command specified by the topology.script.file.name configuration item. If topology.script.file.name is not set, the module returns /default-rack as the rack id for every passed-in IP address. In the Map/Reduce part there is an additional configuration item, mapred.cache.task.levels, which determines the number of levels of caches (in the network topology). For example, with the default value of 2, two levels of caches are built: one for hosts (host-to-task mapping) and another for racks (rack-to-task mapping). A sample topology script is sketched below.
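As a hedged sketch, topology.script.file.name could point at a small shell script like the one below; the subnet-to-rack mapping is entirely made up and would have to reflect your own network layout:

#!/bin/sh
# Receives one or more host names / IP addresses as arguments and
# prints one rack id per argument.
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo "/rack1" ;;
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done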
Start Hadoop
Starting a Hadoop cluster requires starting both the HDFS cluster and the Map/Reduce cluster.
Format a new distributed file system:
$ bin/hadoop namenode -format
On the designated NameNode, run the following command to start HDFS:
$ bin/start-dfs.sh
The bin/start-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on every listed slave.
On the designated JobTracker, run the following command to start Map/Reduce:
$ bin/start-mapred.sh
The bin/start-mapred.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on every listed slave.
Stop Hadoop
On the designated NameNode, run the following command to stop HDFS:
$ bin/stop-dfs.sh
The bin/stop-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on every listed slave.