Hadoop Cluster Building (2)

Purpose

This article describes how to install, configure, and manage a non-trivial Hadoop cluster, ranging from a small cluster of a few nodes to a very large cluster of thousands of nodes.

If you want to install Hadoop on a single machine, you can find the details here.

Prerequisites

Ensure that all of the required software is installed on every node in your cluster, and obtain the Hadoop distribution.
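As a quick sanity check on each node (assuming, as in the single-node guide, that Java and a running ssh daemon with passwordless login are the required software; adjust to your Hadoop release):
$ java -version                                             # JDK must be installed and on the PATH
$ ps -e | grep -w sshd || echo "sshd does not appear to be running"
$ ssh -o BatchMode=yes localhost echo "passwordless ssh OK" # needed by the start/stop scripts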

Installation

Installing a Hadoop cluster typically involves unpacking the software onto all of the machines in the cluster.

Typically, one machine in the cluster is designated as the NameNode and another machine as the JobTracker; these are the masters. The remaining machines act as both DataNode and TaskTracker; these are the slaves.

We use HADOOP_HOME to refer to the root path of the installation. In general, all machines in the cluster share the same HADOOP_HOME path.

Configuration

The following sections describe how to configure a Hadoop cluster.

Configuration Files

Hadoop is configured through two important configuration files in the conf/ directory: hadoop-default.xml, the read-only default configuration, and hadoop-site.xml, the site-specific configuration for your cluster.

To learn more about how these configuration files affect the Hadoop framework, see here.

In addition, you can control the Hadoop scripts in the bin/ directory by setting site-specific values in conf/hadoop-env.sh.

Cluster Configuration

To configure a Hadoop cluster, you need to set both the environment in which the Hadoop daemons run and the configuration parameters of the daemons themselves.

The Hadoop daemons are the NameNode/DataNode and the JobTracker/TaskTracker.

Configuring the Environment of the Hadoop Daemons

Administrators can specify the environment of the Hadoop daemons in the conf/hadoop-env.sh script.

At the very least, you should set JAVA_HOME so that it is correctly defined on each remote node.

Administrators can configure each daemon individually through the corresponding HADOOP_*_OPTS variable. The available variables are listed in the following table.

Daemon               Configuration option
NameNode             HADOOP_NAMENODE_OPTS
DataNode             HADOOP_DATANODE_OPTS
SecondaryNameNode    HADOOP_SECONDARYNAMENODE_OPTS
JobTracker           HADOOP_JOBTRACKER_OPTS
TaskTracker          HADOOP_TASKTRACKER_OPTS

For example, to configure the NameNode to use the parallel garbage collector (parallelGC), add the following line to hadoop-env.sh:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"

Other useful configuration parameters that can be customized include:
HADOOP_LOG_DIR: the directory where the daemons' log files are stored. It is created automatically if it does not exist.
HADOOP_HEAPSIZE: the maximum heap size to use, in megabytes, e.g. 1000. This sets the heap size of the Hadoop daemons and defaults to 1000MB.
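Putting these together, a minimal conf/hadoop-env.sh sketch might look like the following; the Java path, log directory, and heap size are illustrative assumptions, not values taken from this article:
# conf/hadoop-env.sh (excerpt)
export JAVA_HOME=/usr/lib/jvm/java        # hypothetical JDK path; must be valid on every node
export HADOOP_LOG_DIR=/var/log/hadoop     # hypothetical log directory; created if missing
export HADOOP_HEAPSIZE=2000               # daemon heap size in MB (default is 1000)
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"   # per-daemon JVM options, as above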

Configuring the Hadoop Daemons

This section covers the important parameters of a Hadoop cluster, which are specified in conf/hadoop-site.xml.

fs.default.name: The URI of the NameNode, e.g. hdfs://hostname/.
mapred.job.tracker: The host (or IP) and port of the JobTracker, in host:port form.
dfs.name.dir: The path on the NameNode's local file system where the namespace and transaction log are stored persistently. If this is a comma-delimited list of directories, the name table is replicated in all of the directories for redundancy.
dfs.data.dir: A comma-separated list of paths on a DataNode's local file system where block data is stored. When multiple directories are given, data is stored in all of them, typically on different devices.
mapred.system.dir: The HDFS path where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/. This path is in the default file system (HDFS) and must be accessible from both the servers and the clients.
mapred.local.dir: A comma-delimited list of paths on the local file system where temporary Map/Reduce data is written. Multiple paths help spread disk I/O.
mapred.tasktracker.{map|reduce}.tasks.maximum: The maximum number of map/reduce tasks run simultaneously on a given TaskTracker. The default is 2 (2 maps and 2 reduces), which can be adjusted to suit the hardware.
dfs.hosts / dfs.hosts.exclude: Lists of permitted/excluded DataNodes. If necessary, use these files to control the set of allowed DataNodes.
mapred.hosts / mapred.hosts.exclude: Lists of permitted/excluded TaskTrackers. If necessary, use these files to control the set of allowed TaskTrackers.
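As a rough sketch, a conf/hadoop-site.xml covering a few of these parameters might look like the following; the hostnames, ports, and paths are placeholders, not values from this article, and each property carries the final flag discussed just below:
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical NameNode URI -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000/</value>
    <final>true</final>
  </property>
  <!-- Hypothetical JobTracker host:port -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
    <final>true</final>
  </property>
  <!-- Placeholder comma-separated local directories for redundancy -->
  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/hdfs/name,/disk2/hdfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
    <final>true</final>
  </property>
</configuration>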

Typically, the above parameters are marked final to ensure that they cannot be overridden by user applications.

Real-World Cluster Configuration

This section lists some of the non-default configuration values used when running the sort benchmark on very large clusters.

Some non-default configuration values used to run sort900, that is, to sort 9TB of data on a 900-node cluster:

dfs.block.size = 134217728: An HDFS block size of 128MB for large file systems.
dfs.namenode.handler.count: Increased, to start more NameNode server threads to handle RPCs from a large number of DataNodes.
mapred.reduce.parallel.copies: Increased, so that reduces launch more parallel copies to fetch the outputs of a large number of maps.
mapred.child.java.opts = -Xmx512M: A larger heap for the map/reduce child JVMs.
fs.inmemory.size.mb: Increased, to allocate more memory for the in-memory file system used to merge map outputs during the reduce phase.
io.sort.factor: Increased, to merge more streams at once while sorting files.
io.sort.mb: Increased, to raise the memory limit while sorting.
io.file.buffer.size = 131072: The read/write buffer size used in SequenceFiles.

Configuration values updated to run sort1400 and sort2000, that is, to sort 14TB of data on 1400 nodes and 20TB of data on 2000 nodes:

mapred.job.tracker.handler.count: Increased, to start more JobTracker server threads to handle RPCs from a large number of TaskTrackers.
mapred.reduce.parallel.copies: As above, with an updated value.
tasktracker.http.threads: Increased, to run more worker threads for the TaskTracker's HTTP server; reduces fetch the intermediate map outputs over HTTP.
mapred.child.java.opts = -Xmx1024M: A larger heap for the map/reduce child JVMs.
Slaves

Typically, you choose one machine in the cluster to act as the NameNode and a different machine to act as the JobTracker. The rest of the machines act as both DataNode and TaskTracker, and are referred to as slaves.

List the hostname or IP address of every slave in the conf/slaves file, one per line.
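For example, a conf/slaves file for a three-slave cluster would contain just the slave hostnames, one per line (these names are placeholders):
slave01.example.com
slave02.example.com
slave03.example.com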

Logging

Hadoop uses Apache log4j, via the Apache Commons Logging framework, for logging. Edit the conf/log4j.properties file to customize the Hadoop daemons' logging configuration (log format and so on).
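For instance, assuming you want more detailed output from one particular Hadoop package without changing the root logger, a conf/log4j.properties tweak could look like this (the package name and level are only illustrative):
# conf/log4j.properties (excerpt): raise the log level for a single package
log4j.logger.org.apache.hadoop.mapred=DEBUG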

History Logging

The job history files are stored centrally at the location specified by hadoop.job.history.location, which can also be a path on the distributed file system. The default value is ${HADOOP_LOG_DIR}/history. The JobTracker web UI contains a link to the history web UI.

History files are also logged to the user-specified directory hadoop.job.history.user.location, which defaults to the job's output directory. The files are stored in the "_logs/history/" subdirectory of the specified path, so by default they end up under "mapred.output.dir/_logs/history/". If hadoop.job.history.user.location is set to the special value none, these files are not logged at all.
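As a sketch, assuming you want to disable this per-job user history via the configuration file (the setting can equally be supplied per job), the corresponding property entry would look like this:
<!-- Use the special value "none" described above to skip user-directory history files -->
<property>
  <name>hadoop.job.history.user.location</name>
  <value>none</value>
</property>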

Users can view a summary of the history logs in a given directory with the following command:
$ bin/hadoop job -history output-dir
This command prints job details plus details of failed and killed tasks.
More details about the job, such as successful tasks and the task attempts made for each task, can be viewed with the following command:
$ bin/hadoop job -history all output-dir

Once all of the necessary configuration is done, distribute the files to the HADOOP_CONF_DIR directory on all of the machines, typically ${HADOOP_HOME}/conf.
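One way to do this, assuming rsync and passwordless ssh are available from the master (assumptions, not requirements stated in this article), is a small loop over the slaves file:
# Push the configuration directory to every slave listed in conf/slaves.
# Assumes HADOOP_HOME is identical on all machines.
for host in $(cat ${HADOOP_HOME}/conf/slaves); do
  rsync -az ${HADOOP_HOME}/conf/ ${host}:${HADOOP_HOME}/conf/
done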

Rack Awareness for Hadoop

The HDFS and Map/Reduce components are rack-aware.

The NameNode and the JobTracker obtain the rack id of each slave in the cluster by invoking an API, resolve, in an administrator-configured module. The API converts the slave's DNS name (or IP address) into a rack id. The module to use is specified by the configuration item topology.node.switch.mapping.impl. The default implementation of this module runs a script or command specified by the configuration item topology.script.file.name. If topology.script.file.name is not set, /default-rack is returned as the rack id for every passed-in IP address. In the Map/Reduce part there is an additional configuration item, mapred.cache.task.levels, which determines the number of levels of caches (in the network topology). For example, with the default value of 2, two levels of caches are built: one for hosts (host-to-task mapping) and one for racks (rack-to-task mapping).
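A minimal sketch of such a topology script (make it executable and point topology.script.file.name at it); the subnet-to-rack mapping below is entirely hypothetical, and the script just has to print one rack id per argument it receives:
#!/bin/sh
# Hypothetical rack-mapping script: Hadoop passes IP addresses/hostnames
# as arguments and expects one rack id printed per argument.
for node in "$@"; do
  case "$node" in
    10.1.*) echo "/rack1" ;;          # example subnet-to-rack mapping
    10.2.*) echo "/rack2" ;;
    *)      echo "/default-rack" ;;   # fall back to the default rack
  esac
done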

Starting Hadoop

Starting a Hadoop cluster means starting both the HDFS cluster and the Map/Reduce cluster.

Format a new distributed file system:
$ bin/hadoop namenode -format

Run the following command on the designated NameNode to start HDFS:
$ bin/start-dfs.sh

The bin/start-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all of the listed slaves.

Run the following command on the designated JobTracker to start Map/Reduce:
$ bin/start-mapred.sh

The bin/start-mapred.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all of the listed slaves.

Stopping Hadoop

Run the following command on the designated NameNode to stop HDFS:
$ bin/stop-dfs.sh

The bin/stop-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all of the listed slaves.
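The Map/Reduce daemons are stopped in the same fashion: the companion bin/stop-mapred.sh script, run on the JobTracker, reads the same slaves file and stops the TaskTracker daemons:
$ bin/stop-mapred.sh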
