Purpose
This article describes how to install, configure, and manage a non-trivial Hadoop cluster, from a small cluster of a few nodes to an extremely large cluster of thousands of nodes.
If you want to install Hadoop on a single machine, you can find the details here.
Prerequisites
Ensure that all required software is installed on every node in your cluster.
Get the Hadoop release package.
Installation
Installing a Hadoop cluster typically involves unpacking the release software on all the machines in the cluster.
Typically, one machine in the cluster is designated as the NameNode and another machine as the JobTracker; these are the masters. The remaining machines act as both DataNode and TaskTracker; these are the slaves.
We use HADOOP_HOME to refer to the root path of the installation. In general, all machines in the cluster share the same HADOOP_HOME path.
Configuration
The next sections describe how to configure a Hadoop cluster.
Configuration files
Hadoop is configured through two important configuration files under the conf/ directory:
hadoop-default.xml - the read-only default configuration.
hadoop-site.xml - the site-specific (cluster-specific) configuration.
To learn more about how these configuration files affect the Hadoop framework, see here.
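For illustration, a minimal conf/hadoop-site.xml that overrides a single default might look like the sketch below; hadoop.tmp.dir is used only as an example property and the path is hypothetical:

<?xml version="1.0"?>
<configuration>
  <property>
    <!-- Overrides the value shipped in hadoop-default.xml. -->
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>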
In addition, you can control the Hadoop scripts under the bin/ directory by setting site-specific values for the variables in conf/hadoop-env.sh.
Cluster configuration
To configure a Hadoop cluster, you need to set both the environment in which the Hadoop daemons run and their runtime parameters.
The Hadoop daemons are the NameNode/DataNode and the JobTracker/TaskTracker.
Configure the runtime environment of the Hadoop daemons
Administrators can specify the environment of the Hadoop daemons in the conf/hadoop-env.sh script.
At the very least, JAVA_HOME must be set correctly on each remote node.
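For example, a single line in conf/hadoop-env.sh is sufficient; the JDK path below is hypothetical and depends on where Java is installed on your nodes:

export JAVA_HOME=/usr/lib/jvm/java-6-sun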
Administrators can configure each daemon individually through the HADOOP_*_OPTS options. The following table lists the option for each daemon.
Daemon | Configuration option
NameNode | HADOOP_NAMENODE_OPTS
DataNode | HADOOP_DATANODE_OPTS
SecondaryNameNode | HADOOP_SECONDARYNAMENODE_OPTS
JobTracker | HADOOP_JOBTRACKER_OPTS
TaskTracker | HADOOP_TASKTRACKER_OPTS
For example, to configure the NameNode to use the parallel garbage collector (parallelGC), add the following line to hadoop-env.sh:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
Other common parameters that can be customized include the following; an example setting both appears after the list.
HADOOP_LOG_DIR - the directory where the daemons' log files are stored. It is created automatically if it does not exist.
HADOOP_HEAPSIZE - the maximum heap size to use, in megabytes, for the Hadoop daemons. The default is 1000 MB.
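Both can be set in conf/hadoop-env.sh; the log directory and heap size below are purely illustrative:

export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_HEAPSIZE=2000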
Configure the runtime parameters of the Hadoop daemons
This section covers the important parameters that are specified in conf/hadoop-site.xml for a Hadoop cluster; a minimal hadoop-site.xml using two of these parameters is sketched after the table.
Parameter | Value | Notes
fs.default.name | The URI of the NameNode. | hdfs://hostname/
mapred.job.tracker | The host (or IP) and port of the JobTracker. | host:port pair.
dfs.name.dir | The path on the local file system where the NameNode persistently stores the namespace and transaction logs. | If this is a comma-delimited list of directories, the name table is replicated in all of them, for redundancy.
dfs.data.dir | Comma-separated list of paths on the local file system of a DataNode where it stores its blocks. | If this is a comma-delimited list of directories, data is stored in all of them, typically on different devices.
mapred.system.dir | The HDFS path where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/. | This path is on the default file system (HDFS) and must be accessible from both the servers and the clients.
mapred.local.dir | Comma-delimited list of paths on the local file system where temporary Map/Reduce data is written. | Multiple paths help spread disk I/O.
mapred.tasktracker.{map|reduce}.tasks.maximum | The maximum number of map/reduce tasks that can run concurrently on a TaskTracker. | The default is 2 (2 maps and 2 reduces); change it to suit your hardware.
dfs.hosts/dfs.hosts.exclude | List of permitted/excluded DataNodes. | If necessary, use these files to control the list of allowed DataNodes.
mapred.hosts/mapred.hosts.exclude | List of permitted/excluded TaskTrackers. | If necessary, use these files to control the list of allowed TaskTrackers.
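As a hedged sketch of how two of the mandatory parameters above might appear in conf/hadoop-site.xml (the hostnames and ports are hypothetical placeholders):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000/</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
    <final>true</final>
  </property>
</configuration>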
Typically, the above parameters are marked final to ensure that they cannot be overridden by user applications.
Real-world cluster configurations
This section lists some of the non-default configuration values used to run the sort benchmark on large clusters.
Some non-default configuration values used to run sort900, i.e. sorting 9TB of data on a 900-node cluster:
Parameter | Value | Notes
dfs.block.size | 134217728 | An HDFS block size of 128MB for large file systems.
dfs.namenode.handler.count | 40 | More NameNode server threads to handle RPCs from a large number of DataNodes.
mapred.reduce.parallel.copies | 20 | Reduces launch more parallel copies to fetch the outputs of a large number of maps.
mapred.child.java.opts | -Xmx512M | A larger heap for the child JVMs of maps/reduces.
fs.inmemory.size.mb | 200 | More memory for the in-memory file system used by the reduces to merge map outputs.
io.sort.factor | 100 | More streams are merged at once while sorting files.
io.sort.mb | 200 | A higher memory limit while sorting data.
io.file.buffer.size | 131072 | The size of the read/write buffer used in SequenceFiles.
The configuration values updated to run sort1400 and sort2000, i.e. sorting 14TB of data on 1400 nodes and 20TB of data on 2000 nodes:
Parameter | Value | Notes
mapred.job.tracker.handler.count | 60 | More JobTracker server threads to handle RPCs from a large number of TaskTrackers.
mapred.reduce.parallel.copies | 50 |
tasktracker.http.threads | 50 | More worker threads for the TaskTracker's HTTP server. Reduces fetch the intermediate map outputs through this HTTP server.
mapred.child.java.opts | -Xmx1024M | A larger heap for the child JVMs of maps/reduces.
Slaves
Typically you choose one machine in the cluster to act as the NameNode and a different machine to act as the JobTracker. The rest of the machines act as both DataNode and TaskTracker and are referred to as slaves.
List the hostname or IP address of every slave in the conf/slaves file, one per line, as in the example below.
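A sketch of a conf/slaves file with hypothetical hostnames:

slave01.example.com
slave02.example.com
slave03.example.com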
Logging
Hadoop uses Apache log4j, via the Apache Commons Logging framework, for logging. Edit the conf/log4j.properties file to customize the logging configuration (log format and so on) of the Hadoop daemons.
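For example, one way to raise the verbosity of a single component is to add a logger line to conf/log4j.properties; treat the class name below as illustrative for your Hadoop version:

log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG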
History logs
The job history files are stored centrally at the location given by hadoop.job.history.location, which may also be a path on the distributed file system; the default value is ${HADOOP_LOG_DIR}/history. The JobTracker web UI contains a link to the history web UI.
The history files are also logged to the user-specified directory hadoop.job.history.user.location, which defaults to the job's output directory. The files are stored in a "_logs/history/" subdirectory of the specified path, so by default they end up under "mapred.output.dir/_logs/history/". If hadoop.job.history.user.location is set to the special value none, these logs are no longer written.
Users can view a summary of the history logs in a given directory with the following command:
$ bin/hadoop job -history output-dir
This command prints job details as well as details of failed and killed tasks.
More details about the job, such as the successful tasks and the attempts made for each task, can be viewed with the following command:
$ bin/hadoop job -history all output-dir
Once all of the necessary configuration is done, distribute these files to the HADOOP_CONF_DIR directory on all machines, typically ${HADOOP_HOME}/conf.
Rack Awareness for Hadoop
The HDFS and Map/Reduce components are rack-aware.
The NameNode and the JobTracker obtain the rack id of each slave in the cluster by invoking the API resolve in an administrator-configured module. The API converts the slave's DNS name (or IP address) into a rack id. Which module to use is specified by the configuration item topology.node.switch.mapping.impl. The default implementation of this module runs a script/command specified by the topology.script.file.name configuration item. If topology.script.file.name is not set, the module returns /default-rack as the rack id for every passed-in IP address. In the Map/Reduce part there is an additional configuration item, mapred.cache.task.levels, which determines the number of levels of caches (in the network topology). For example, with the default value of 2, two levels of caches are built: one for hosts (host-to-task mapping) and another for racks (rack-to-task mapping). A sample topology script is sketched below.
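As a hedged sketch, topology.script.file.name could point at a small shell script like the one below; the subnet-to-rack mapping is entirely made up and would have to reflect your own network layout:

#!/bin/sh
# Receives one or more host names / IP addresses as arguments and
# prints one rack id per argument.
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo "/rack1" ;;
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done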
Start Hadoop
Starting a Hadoop cluster requires starting both the HDFS cluster and the Map/Reduce cluster.
Format a new distributed file system:
$ bin/hadoop namenode -format
On the designated NameNode, run the following command to start HDFS:
$ bin/start-dfs.sh
The bin/start-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on every listed slave.
On the designated JobTracker, run the following command to start Map/Reduce:
$ bin/start-mapred.sh
The bin/start-mapred.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on every listed slave.
Stop Hadoop
On the designated NameNode, run the following command to stop HDFS:
$ bin/stop-dfs.sh
The bin/stop-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on every listed slave.