Hadoop Cluster Setup: Summary


Generally, one machine in the cluster is designated as the namenode and another as the jobtracker; these machines are the masters. The remaining machines act as both datanode and tasktracker; these are the slaves.

Official address: http://hadoop.apache.org/common/docs/r0.19.2/cn/cluster_setup.html

1. Prerequisites
    1. Make sure that all required software is installed on each node of your cluster: Sun JDK, ssh, and hadoop.

    2. Java 1.5.x must be installed. We recommend the Java version released by Sun.

    3. SSH must be installed and sshd must be running at all times, so that the hadoop scripts can manage the remote hadoop daemons.

2. Experiment environment setup

2.1 Preparations

Operating system: Ubuntu
Deployment: VMware
After one Ubuntu virtual machine is installed in VMware, you can export or clone it to create the other two virtual machines.
Note:
Ensure that the IP addresses of the virtual machines and the host are in the same IP segment, so that the virtual machines and the host can communicate with each other.
To keep the virtual machine IP addresses and the host IP address in the same segment, set the virtual machine network connection to bridged mode.

Prepare the machines: one master and several slaves. Configure /etc/hosts on every machine so that the machines can reach each other by host name. For example:
10.64.56.76 node1 (master)
10.64.56.77 node2 (slave1)
10.64.56.78 node3 (slave2)
Host information:

 

Machine name IP address Function
Node1 10.64.56.76 Namenode and jobtracker
Node2 10.64.56.77 Datanode and tasktracker
Node3 10.64.56.78 Datanode and tasktracker
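A quick way to confirm that the host-name mapping works (a minimal check, assuming the node names above) is to ping each node by name from the others:

# Run on node1; repeat analogously on node2 and node3
$ ping -c 1 node2
$ ping -c 1 node3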

Install JDK and SSH to ensure environment consistency: 

2.2 Install JDK

# Install the JDK
$ sudo apt-get install sun-java6-jdk
After installation, the Java executables are automatically added to /usr/bin/.
Verify with the shell command java -version and check that the version matches the one you installed.
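For example, the check might look like the following (a minimal sketch; the exact version string depends on the sun-java6-jdk build that was installed):

$ java -version
java version "1.6.0_xx"   # the exact build number depends on the installed package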

2.3 Download and create a user

$ useradd hadoop
$ cd /home/hadoop

Create the same directory on all machines, and preferably the same user as well. It is best to use that user's home directory as the hadoop installation path.
For example, use /home/hadoop/hadoop-0.20.203 as the installation path on all machines (no mkdir is needed; decompressing the hadoop package in /home/hadoop creates it automatically).
(You can also install under /usr/local/, for example /usr/local/hadoop-0.20.203/, and then:
chown -R hadoop /usr/local/hadoop-0.20.203/
chgrp -R hadoop /usr/local/hadoop-0.20.203/
)
(We recommend not installing as root, because root SSH access between machines is not recommended.)
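Note that a plain useradd does not create a home directory on Ubuntu by default; a minimal sketch that creates the hadoop user together with /home/hadoop (adjust the shell to your preference) is:

$ sudo useradd -m -s /bin/bash hadoop   # -m creates /home/hadoop
$ sudo passwd hadoop                    # set a password for the new account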

2.4 Install SSH and configure

1) Install: sudo apt-get install ssh

After installation, you can use the ssh command directly.
Run $ netstat -nat to check whether port 22 is open.
Test: ssh localhost.
Enter the current user's password and press Enter. If the login succeeds, the installation worked; at this point SSH login still requires a password.
(With this default installation, the configuration files are placed under /etc/ssh/; the sshd configuration file is /etc/ssh/sshd_config.)

Note: SSH must be installed on all hosts.

2) Configuration:

After hadoop is started, the namenode starts and stops the various daemons on each datanode over SSH (Secure Shell). This requires that commands executed between nodes do not prompt for a password, so SSH must be configured to use password-less public key authentication.
Taking the three machines in this article as an example, node1 is the master node and needs to connect to node2 and node3. Make sure SSH is installed on every machine and that the sshd service is running on the datanode machines.

(Explanation: [hadoop@hadoop ~]$ ssh-keygen -t rsa
This command generates a key pair for the hadoop user. When asked for the storage path, press Enter to accept the default. When prompted for a passphrase for the generated key, press Enter again, i.e. set an empty passphrase. The generated key pair id_rsa and id_rsa.pub is stored under /home/hadoop/.ssh/ by default. Copy the contents of id_rsa.pub into the /home/hadoop/.ssh/authorized_keys file on every machine, including the local one: if an authorized_keys file already exists on a machine, append the contents of id_rsa.pub to the end of that file; if it does not exist, simply copy the file there.)

3) First, configure password-less SSH login for the namenode.

Switch to the hadoop user (make sure password-less login works for the hadoop user, because the hadoop installation we set up later is owned by the hadoop user).

$ su hadoop

$ cd /home/hadoop

$ ssh-keygen -t rsa

Then press Enter at each prompt.
Afterwards, the hidden folder .ssh is created in the home directory.

$ cd .ssh

Then run ls to list the files.

$ cp id_rsa.pub authorized_keys

Test:

$ ssh localhost

Or:

$ ssh node1

The first time you SSH to a host, a message like the following is displayed:

The authenticity of host 'node1 (10.64.56.76)' can't be established.
RSA key fingerprint is 03:e0:30:cb:6e:13:a8:70:c9:7e:cf:ff:33:2a:67:30.
Are you sure you want to continue connecting (yes/no)?

Enter yes to continue. This adds the server to the list of known hosts.

If the login succeeds without asking for a password, the configuration works.

4) Copy authorized_keys to node2 and node3

So that node1 can log in to node2 and node3 without a password, first run the following on node2 and node3:

$ su hadoop

$ cd /home/hadoop

$ ssh-keygen -t rsa

Press Enter at each prompt.
Then return to node1 and copy authorized_keys to node2 and node3:

[hadoop@hadoop .ssh]$ scp authorized_keys node2:/home/hadoop/.ssh/

[hadoop@hadoop .ssh]$ scp authorized_keys node3:/home/hadoop/.ssh/

Enter the hadoop account password on the remote machine when prompted.
Then change the permissions of the authorized_keys file:

[hadoop@hadoop .ssh]$ chmod 644 authorized_keys

Test: ssh node2 or ssh node3 (enter yes the first time).
If no password is required, the configuration succeeded. If a password is still required, check whether the configuration above is correct.
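Putting the whole key setup together, a condensed sketch of the password-less SSH configuration (run as the hadoop user; it assumes the node names used in this article) looks like this:

# On node1, node2 and node3: generate a key pair with an empty passphrase
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# On node1: authorize its own key, then push the file to the slaves
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 644 ~/.ssh/authorized_keys
$ scp ~/.ssh/authorized_keys node2:/home/hadoop/.ssh/
$ scp ~/.ssh/authorized_keys node3:/home/hadoop/.ssh/
# Verify: these commands should no longer ask for a password
$ ssh node2 hostname
$ ssh node3 hostname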

2.5 Install hadoop

# Switch to the hadoop user

$ su hadoop

$ wget http://... (download the hadoop-0.20.203.0rc1.tar.gz release from an Apache mirror)

After downloading the package, decompress it to install:

$ tar -zxvf hadoop-0.20.203.0rc1.tar.gz

1) To install a hadoop cluster, extract the installation software onto every machine in the cluster, using the same installation path everywhere. If we use HADOOP_HOME to refer to the installation root, then HADOOP_HOME is normally the same path on every machine.
2) If the machines in the cluster have identical environments, you can configure hadoop on one machine and then copy the configured software, i.e. the whole hadoop-0.20.203 folder, to the same location on the other machines.
3) The hadoop installation on the master can be copied to the same directory on each slave via scp; each slave's conf/hadoop-env.sh can then be adjusted to that slave's JAVA_HOME.
4) For convenience, so that commands such as hadoop or start-all.sh can be run directly, modify /etc/profile on the master and add:
export HADOOP_HOME=/home/hadoop/hadoop-0.20.203
export PATH=$PATH:$HADOOP_HOME/bin
After the change, run source /etc/profile to make it take effect.

5) Configure the conf/hadoop-env.sh file

Add:
export JAVA_HOME=/usr/lib/jvm/java-6-sun/

Change this to your own JDK installation location.

Test the hadoop installation: bin/hadoop jar hadoop-0.20.2-examples.jar wordcount conf /tmp/out

3. Cluster configuration (identical on all nodes)

3.1 Configuration file: conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://node1:49000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop_home/var</value>
</property>
</configuration>

1) fs.default.name is the URI of the namenode, in the form hdfs://<host name>:<port>/
2) hadoop.tmp.dir is hadoop's default temporary directory. If a newly added node or some other datanode cannot start for unknown reasons, it is recommended to delete the tmp directory under this path on that node. However, if you delete this directory on the namenode machine, you must re-run the namenode format command.

3.2 Configuration file: conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>node1:49001</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/hadoop_home/var</value>
</property>
</configuration>

1) mapred.job.tracker is the host (or IP) and port of the jobtracker, in the form host:port.

3.3 Configuration file: conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/name1,/home/hadoop/name2</value>  <!-- hadoop name directory path -->
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/data1,/home/hadoop/data2</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>

1) dfs.name.dir is the path on the local file system where the namenode persistently stores the namespace and transaction logs. When this value is a comma-separated list of directories, the name table data is replicated to all of the directories for redundancy.
2) dfs.data.dir is the path on the local file system where a datanode stores its block data. It can also be a comma-separated list of directories; in that case data is stored in all of the directories, which are usually on different devices.
3) dfs.replication is the number of replicas for each block. The default is 3. If it is greater than the number of machines in the cluster, errors occur.

Note: do not create the name1, name2, data1 and data2 directories in advance; they are created automatically when hadoop is formatted. Creating them beforehand may cause problems.

3.4 Configure the masters and slaves files (master and slave nodes)

Configure conf/masters and conf/slaves to set the master and slave nodes. Note that it is best to use host names, and to make sure the machines can reach each other by host name. One host name per line.

vi masters:
Input:

node1

vi slaves:

Input:
node2
node3

When the configuration is finished, copy the configured hadoop folder to the other machines in the cluster, and make sure the configuration above is correct for them; for example, if the Java installation path differs on another machine, adjust its conf/hadoop-env.sh.

$ scp -r /home/hadoop/hadoop-0.20.203 root@node2:/home/hadoop/
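With more slaves, the copy can be scripted; a minimal sketch (assuming the hadoop user and the same /home/hadoop path on every node, in line with the earlier recommendation to avoid root) is:

$ for host in node2 node3; do scp -r /home/hadoop/hadoop-0.20.203 hadoop@$host:/home/hadoop/; done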

4. Starting hadoop

4.1 Format a new distributed file system

First, format a new distributed file system:

$ cd hadoop-0.20.203
$ bin/hadoop namenode -format
If successful, the system outputs something like:

12/02/06 00:46:50 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.203.0
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203 -r 1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
************************************************************/

12/02/06 00:46:50 INFO namenode.FSNamesystem: fsOwner=root,root
12/02/06 00:46:50 INFO namenode.FSNamesystem: supergroup=supergroup
12/02/06 00:46:50 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/02/06 00:46:50 INFO common.Storage: Image file of size 94 saved in 0 seconds.
12/02/06 00:46:50 INFO common.Storage: Storage directory /opt/hadoop/hadoopfs/name1 has been successfully formatted.
12/02/06 00:46:50 INFO common.Storage: Image file of size 94 saved in 0 seconds.
12/02/06 00:46:50 INFO common.Storage: Storage directory /opt/hadoop/hadoopfs/name2 has been successfully formatted.
12/02/06 00:46:50 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at v-jiwan-ubuntu-0/127.0.0.1
************************************************************/

Check the output to ensure that the Distributed File System is formatted successfully.
After the format completes, the /home/hadoop/name1 and /home/hadoop/name2 directories exist on the master machine. Hadoop is started on the master node, and the master node starts hadoop on all slave nodes.
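As a quick sanity check (paths as configured in conf/hdfs-site.xml above), the freshly formatted name directories can be listed:

$ ls /home/hadoop/name1 /home/hadoop/name2
# After a successful format, each directory should contain a current/ subdirectory with the fsimage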

4.2 Start all nodes

Startup method 1:

$ bin/start-all.sh (starts both HDFS and MapReduce)
System output:

Starting namenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-namenode-ubuntu.out
node2: Starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-ubuntu.out
node3: Starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-ubuntu.out
node1: Starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-secondarynamenode-ubuntu.out
Starting jobtracker, logging to /usr/local/hadoop/logs/hadoop-hadoop-jobtracker-ubuntu.out
node2: Starting tasktracker, logging to /usr/local/hadoop/logs/hadoop-hadoop-tasktracker-ubuntu.out
node3: Starting tasktracker, logging to /usr/local/hadoop/logs/hadoop-hadoop-tasktracker-ubuntu.out
As you can see from the slave output above, a datanode automatically formats its storage directory (specified by dfs.data.dir) if it has not been formatted already, and creates the directory if it does not yet exist.

After startup, the /home/hadoop/data1 and /home/hadoop/data2 directories exist on the slave machines (node2, node3).

It can happen that the datanodes fail to connect after startup: the page http://node1:50070/dfshealth.jsp shows DFS Used at 100% and the number of live nodes as zero. In that case, check whether the /etc/hosts files on the master and slaves contain entries mapping localhost or the machine's host name to 127.0.0.1. If so, delete those entries and add the actual IP address / host name pairs (and from then on do not use localhost in hadoop settings, since its entry has been removed; use the actual host name instead). To leave safe mode, run: hadoop dfsadmin -safemode leave.
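When datanodes do not show up, a few commands help narrow the problem down (a hedged checklist, not an exhaustive one):

# On the master: how many datanodes have registered?
$ bin/hadoop dfsadmin -report
# On a slave: is the DataNode process actually running?
$ jps
# Check the datanode log on the slave for connection errors
$ tail -n 50 logs/hadoop-hadoop-datanode-*.log
# If HDFS stays in safe mode after /etc/hosts has been fixed:
$ bin/hadoop dfsadmin -safemode leave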

Startup method 2:

To start a hadoop cluster, you must start the HDFS cluster and the map/reduce cluster.

On the designated namenode, run the following command to start HDFS:
$ bin/start-dfs.sh (starts only the HDFS cluster)

The bin/start-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the namenode and starts the datanode daemon on all listed slaves.

On the designated jobtracker, run the following command to start MapReduce:
$ bin/start-mapred.sh (starts only MapReduce)

The bin/start-mapred.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the jobtracker and starts the tasktracker daemon on all listed slaves.

4.3 Stop all nodes

Hadoop is stopped from the master node; the master node stops hadoop on all slave nodes.

$ bin/stop-all.sh

Logs of the hadoop daemons are written to the ${HADOOP_LOG_DIR} directory (default: ${HADOOP_HOME}/logs).

${HADOOP_HOME} is the installation path.

5. Test

1) Browse the web interfaces of the namenode and the jobtracker. Their addresses are:

Namenode - http://node1:50070/
Jobtracker - http://node1:50030/

2) Use netstat -nat to check whether ports 49000 and 49001 are listening.

3) Use jps to view the processes

To check whether the daemons are running, use the jps command (a ps for JVM processes). It lists the five hadoop daemons and their process identifiers; a sample of the expected output is shown below.
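For reference, on a cluster like the one above the output might look roughly as follows (process IDs will differ; this is an illustrative sketch, not verbatim output):

# On node1 (master)
$ jps
1234 NameNode
1345 SecondaryNameNode
1456 JobTracker
1567 Jps
# On node2 / node3 (slaves)
$ jps
2234 DataNode
2345 TaskTracker
2456 Jps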

4) Copy the input files to the distributed file system:
$ bin/hadoop fs -mkdir input
$ bin/hadoop fs -put conf/core-site.xml input

Run the example program provided with the release:
$ bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'

6. Supplement
Q: What does bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+' mean?
A: bin/hadoop jar (run a jar with hadoop) hadoop-0.20.2-examples.jar (the name of the jar) grep (the class to run, followed by its arguments) input output 'dfs[a-z.]+'
The whole command runs the grep example program shipped with hadoop; the input directory on HDFS is input and the output directory is output.
Q: What is grep?
A: A map/reduce program that counts the matches of a regex in the input.

View the output file:

Copy the output files from the distributed file system to the local file system and examine them:
$ bin/hadoop fs -get output output
$ cat output/*

Or

View the output files directly on the distributed file system:
$ bin/hadoop fs -cat output/*

Sample result:
root@v-jiwan-ubuntu-0:~/hadoop/hadoop-0.20.2-bak/hadoop-0.20.2# bin/hadoop fs -cat output/part-00000
3       dfs.class
2       dfs.period
1       dfs.file
1       dfs.replication
1       dfs.servers
1       dfsadmin

7. Common HDFS operations

hadoop dfs -ls                   lists the files in HDFS
hadoop dfs -ls in                lists the files inside the in directory on HDFS
hadoop dfs -put test1.txt test   uploads the file to HDFS under the given name; the upload only succeeds once all datanodes have received the data
hadoop dfs -get in getin         fetches the in directory from HDFS and stores it locally under the name getin
hadoop dfs -rmr out              deletes the specified directory out from HDFS
hadoop dfs -cat in/*             prints the contents of the in directory on HDFS
hadoop dfsadmin -report          shows basic HDFS statistics
hadoop dfsadmin -safemode leave  leaves safe mode
hadoop dfsadmin -safemode enter  enters safe mode
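As a small worked example that combines a few of these commands (the file names are illustrative):

$ echo "hello hadoop" > test1.txt        # create a local sample file
$ hadoop dfs -put test1.txt test         # upload it to HDFS as "test"
$ hadoop dfs -ls                         # the new file should be listed
$ hadoop dfs -cat test                   # print its contents
$ hadoop dfs -get test getback.txt       # download it again under a new name
$ hadoop dfs -rmr test                   # remove it from HDFS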

8. Adding nodes

Scalability is an important feature of HDFS. First install hadoop on the newly added node and add the namenode host name to its $HADOOP_HOME/conf/masters file. Then, on the namenode, add the host name of the new node to $HADOOP_HOME/conf/slaves and set up a password-less SSH connection to the new node.

Run the startup command:

start-all.sh

You can then see the newly added datanode at http://<master host name>:50070
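Summarized as commands, the procedure might look roughly like this (the host name newnode is hypothetical, and paths follow the installation used in this article):

# On the namenode: register the new slave and push the SSH key
$ echo newnode >> $HADOOP_HOME/conf/slaves
$ scp ~/.ssh/authorized_keys newnode:/home/hadoop/.ssh/
# Copy the configured installation to the new node
$ scp -r $HADOOP_HOME hadoop@newnode:/home/hadoop/
# Run the start scripts again; daemons that are already running just report so,
# and the new node's datanode/tasktracker are started
$ bin/start-all.sh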

9. Rebalancing (balancer)

Run start-balancer.sh to rebalance the distribution of data blocks across the datanodes according to the chosen policy.
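The balancer also accepts a threshold argument (the allowed deviation, in percent, of each datanode's disk usage from the cluster average); a hedged example:

$ bin/start-balancer.sh -threshold 5   # rebalance until utilization differs by at most 5% (the default threshold is 10)
$ bin/stop-balancer.sh                 # the balancer can be stopped at any time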

 

Conclusion: when you run into a problem, it helps to check logs/*.log first.
