Install and deploy Apache Hadoop 2.6.0

Note: refer to the official Apache Hadoop documentation for the original reference.

1. Hardware environment

There are three machines in the cluster (plus a fourth, hadoop4, used later for online node addition and removal), all running Linux with Java JDK 1.6.0. The configuration is as follows:
hadoop1.example.com: 172.20.115.1 (NameNode)
hadoop2.example.com: 172.20.115.2 (DataNode)
hadoop3.example.com: 172.20.115.3 (DataNode)
hadoop4.example.com: 172.20.115.4 (spare node, used later for online addition/removal)
Make sure hostnames and IP addresses resolve correctly on every machine.
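For reference, hostname resolution can be handled with /etc/hosts entries like the following on every node (a minimal sketch based on the addresses above; the short aliases are an assumption, adjust to your own network):

172.20.115.1   hadoop1.example.com   hadoop1
172.20.115.2   hadoop2.example.com   hadoop2
172.20.115.3   hadoop3.example.com   hadoop3
172.20.115.4   hadoop4.example.com   hadoop4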
In HDFS, Hadoop nodes are divided into a NameNode and DataNodes: there is only one NameNode, while there can be many DataNodes. In MapReduce, nodes are divided into a JobTracker and TaskTrackers (in Hadoop 2.x with YARN these roles are taken by the ResourceManager and NodeManagers): there is only one JobTracker, while there can be many TaskTrackers. Here, the NameNode and JobTracker are deployed on hadoop1, with hadoop2 and hadoop3 as DataNodes and TaskTrackers. You can also deploy the NameNode, DataNode, JobTracker and TaskTracker all on one machine (this is pseudo-distributed mode).

2. Directory structure


Hadoop requires the deployment directory structure to be the same on all machines, and each machine must have an account with the same user name.
On all three of my servers this is the case: there is a hadoop account whose home directory is /home/hadoop.
Add the user hadoop:
# useradd -u 800 hadoop
# passwd hadoop    (set a password for hadoop)
Download hadoop-2.6.0.tar.gz and unpack it:
# tar zxf hadoop-2.6.0.tar.gz
# mv hadoop-2.6.0/ /home/hadoop/
# cd /home/hadoop
# ln -s hadoop-2.6.0/ hadoop
Switch to the hadoop user:
# su - hadoop
Download jdk-6u32-linux-x64.bin to the home directory and install it there:
$ sh jdk-6u32-linux-x64.bin
$ cd /home/hadoop/
Create a soft link so that future JDK updates and upgrades only require repointing the link:
$ ln -s jdk1.6.0_32 jdk

Switch back to root:
# chown -R hadoop.hadoop hadoop-2.6.0/
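A quick sanity check of the layout at this point (a sketch, assuming the paths above):

# su - hadoop
$ ls -l /home/hadoop        (should list hadoop-2.6.0/, jdk1.6.0_32/ and the hadoop and jdk links)
$ /home/hadoop/jdk/bin/java -version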

3. SSH settings

After Hadoop is started, the NameNode starts and stops the various daemons on each node via SSH (Secure Shell), so commands executed between nodes must not prompt for a password. We therefore configure SSH to use password-less public key authentication.
First, ensure that an SSH server is installed and running on every machine. In practice we use OpenSSH, a free open-source implementation of the SSH protocol.
Take the three machines in this article as an example. hadoop1 is the master node and needs to initiate SSH connections to hadoop2 and hadoop3. For the SSH service, hadoop1 is the client, while hadoop2 and hadoop3 are servers, so make sure the sshd service is running on hadoop2 and hadoop3. In short, a key pair (a private key and a public key) is generated on hadoop1 and the public key is copied to hadoop2. When hadoop1 initiates an SSH connection to hadoop2, hadoop2 generates a random number, encrypts it with hadoop1's public key and sends it to hadoop1; hadoop1 decrypts it with its private key and sends the number back, and hadoop2 allows the connection once it confirms the number is correct. This completes one public key authentication exchange.
For the three machines in this article, first generate the key pair on hadoop1:

# su - hadoop
$ ssh-keygen

This command generates a key pair for the hadoop user on hadoop1. The generated keys, id_rsa and id_rsa.pub, are placed in the /home/hadoop/.ssh directory.

$ ssh-copy-id localhost
$ ssh-copy-id 172.20.115.2
$ ssh-copy-id 172.20.115.3

This publishes the public key to the local machine and to hadoop2 and hadoop3.
Try logging in to the local machine, hadoop2 and hadoop3 to check whether a password is still requested. If no password is required, the setup is successful.
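A minimal way to verify password-less login from hadoop1 (hostnames as above; each command should return without prompting for a password):

$ for h in localhost hadoop2.example.com hadoop3.example.com; do ssh $h hostname; done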

4. Environment variables (the configuration directory layout has changed a lot in this version; please pay attention!)

Set the environment variables Hadoop needs in /home/hadoop/hadoop-2.6.0/etc/hadoop/hadoop-env.sh; among them, JAVA_HOME is the one variable that must be set.
The HADOOP_HOME variable may be set or left unset. If it is not set, HADOOP_HOME defaults to the parent directory of the bin directory, i.e. /home/hadoop/hadoop-2.6.0 (linked as /home/hadoop/hadoop) in this article.
$ vim /home/hadoop/hadoop-2.6.0/etc/hadoop/hadoop-env.sh


export JAVA_HOME=/home/hadoop/jdk    (around line 25)
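A quick check that the JAVA_HOME setting is picked up (a sketch; run from the Hadoop directory):

$ cd /home/hadoop/hadoop/
$ bin/hadoop version        (should print Hadoop 2.6.0 without Java-related errors)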


Run a simple local test first:

$ cd /home/hadoop/hadoop/
$ mkdir input
$ cp etc/hadoop/*.xml input/
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
$ cd output
$ cat *

Count the words in the files:

$ cd ..
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input test
$ cd test/
$ cat *

5. Hadoop configuration files

$ cd /home/hadoop/hadoop/etc/hadoop/

Configure HDFS

core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop1.example.com:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>


Pseudo-distributed test:

$ mkdir /home/hadoop/bin
$ ln -s /home/hadoop/jdk/bin/jps /home/hadoop/bin/
$ cd /home/hadoop/hadoop/
$ bin/hdfs namenode -format    (initialize first)
$ sbin/start-dfs.sh

The Hadoop daemon logs are written to the $HADOOP_LOG_DIR directory (which defaults to $HADOOP_HOME/logs).

Web test: http://172.20.115.1:50070/

$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
$ bin/hdfs dfs -put etc/hadoop input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
$ bin/hdfs dfs -get output output
$ cat output/*
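Alternatively, the output files can be viewed directly on HDFS instead of copying them back:

$ bin/hdfs dfs -cat output/*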

Configure YARN

$ cd /home/hadoop/hadoop/

etc/hadoop/mapred-site.xml (copy it from mapred-site.xml.template if it does not exist yet):

 

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

etc/hadoop/yarn-site.xml:


<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>


Start the ResourceManager and NodeManager daemons:

$ sbin/start-yarn.sh

Access: http://172.20.115.1:8088

If you see this page, you have successfully deployed the pseudo-distributed setup.
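As an optional check that jobs are now actually scheduled through YARN, the example jar can be run again; wc-out here is just an example output path:

$ bin/yarn node -list
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input wc-out
$ bin/hdfs dfs -cat wc-out/*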


6. Deploy a Hadoop Cluster

As mentioned above, Hadoop's environment variables and configuration files are all on the hadoop1 machine. Now we need to deploy Hadoop to the other machines, keeping the directory structure consistent.

$ scp -r /home/hadoop/hadoop-2.6.0 /home/hadoop/jdk1.6.0_32 hadoop2.example.com:/home/hadoop/

$ scp -r /home/hadoop/hadoop-2.6.0 /home/hadoop/jdk1.6.0_32 hadoop3.example.com:/home/hadoop/

$ scp -r ~/.ssh hadoop2.example.com:

$ scp -r ~/.ssh hadoop3.example.com:

Note: on hadoop1, edit the master and slave node lists:

$ cd /home/hadoop/hadoop/etc/hadoop

masters:

hadoop1.example.com

slaves:

hadoop2.example.com
hadoop3.example.com

Then recreate the links and the bin directory on hadoop2 and hadoop3:

$ ln -s hadoop-2.6.0/ hadoop
$ ln -s jdk1.6.0_32 jdk
$ mkdir /home/hadoop/bin
$ ln -s /home/hadoop/jdk/bin/jps /home/hadoop/bin/
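A small sketch to confirm that every node ends up with the same layout (run from hadoop1, using the hostnames above):

$ for h in hadoop2.example.com hadoop3.example.com; do echo == $h ==; ssh $h 'ls -l /home/hadoop'; done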
So far, we can say that Hadoop has been deployed on various machines. Let's start Hadoop.

7. Start Hadoop

Before starting, we need to format the NameNode. Enter the ~/hadoop/ directory and execute the following command:

$ bin/hdfs namenode -format

If the format succeeds, continue; if it fails, check the log files under hadoop/logs/.
Now we can officially start Hadoop. There are quite a few startup scripts in sbin/; they can be run as needed:
* start-all.sh starts all the Hadoop daemons: NameNode, DataNodes, SecondaryNameNode, ResourceManager and NodeManagers.
* stop-all.sh stops all the Hadoop daemons.
* start-yarn.sh starts the YARN daemons: ResourceManager and NodeManagers.
* stop-yarn.sh stops the YARN daemons.

* start-dfs.sh starts the HDFS daemons: NameNode, SecondaryNameNode and DataNodes.
* stop-dfs.sh stops the HDFS daemons.

Here, we simply start all the daemons:
[hadoop@hadoop1 hadoop]$ sbin/start-all.sh

$ jps

Check whether the NameNode, SecondaryNameNode, ResourceManager and Jps processes have started successfully on hadoop1.
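The datanodes can be checked the same way over SSH (a sketch; on hadoop2 and hadoop3, jps should report a DataNode and a NodeManager):

$ for h in hadoop2.example.com hadoop3.example.com; do echo == $h ==; ssh $h /home/hadoop/jdk/bin/jps; done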
Similarly, to stop Hadoop:

[hadoop@hadoop1 hadoop]$ sbin/stop-all.sh

8. HDFS operations

Run the hadoop command in the bin/ directory to see all the operations Hadoop supports and how to use them. Here are a few simple operations as examples.
Create a directory:

[hadoop@hadoop1 hadoop]$ bin/hdfs dfs -mkdir testdir

This creates a directory named testdir in HDFS. Copy a file:

[hadoop@hadoop1 hadoop]$ bin/hdfs dfs -put /home/large.zip testfile.zip

This copies the local file large.zip into the HDFS home directory /user/hadoop/ under the name testfile.zip. View the existing files:

[hadoop@hadoop1 hadoop]$ bin/hdfs dfs -ls
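A few more everyday HDFS operations, for reference (the paths here are only examples):

[hadoop@hadoop1 hadoop]$ bin/hdfs dfs -get testfile.zip /tmp/        (copy a file back to the local filesystem)
[hadoop@hadoop1 hadoop]$ bin/hdfs dfs -du -h testdir                 (show the disk usage of a directory)
[hadoop@hadoop1 hadoop]$ bin/hdfs dfs -rm -r testdir                 (remove a directory recursively)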

9. Hadoop online node updates

Add nodes:

1). Install the JDK on the new node and create the same hadoop user with the same uid to keep things consistent.
2). Add the IP address (or hostname) of the new node to the etc/hadoop/slaves file.
3). Synchronize all the Hadoop data on the master to the new node, keeping the paths consistent.
4). Start the services on the new node:

$ sbin/hadoop-daemon.sh start datanode
$ sbin/yarn-daemon.sh start nodemanager    (the NodeManager takes the place of the 1.x tasktracker)

5). Balance the data:

$ sbin/start-balancer.sh

(1) If you do not rebalance, the cluster will place all new data on the new datanode, which lowers MapReduce efficiency.
(2) You can also set the balancing threshold. The default is 10%; the lower the value, the more evenly the nodes are balanced, but the longer balancing takes.

$ sbin/start-balancer.sh -threshold 5

Deleting a datanode node online:

1). On the master, edit etc/hadoop/hdfs-site.xml:

<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/hadoop-2.6.0/etc/hadoop/datanode-excludes</value>
</property>
 

2). Create the datanode-excludes file and add the hosts to be deleted, one per line:

172.20.115.4

3). Refresh the nodes online on the master:

$ bin/hdfs dfsadmin -refreshNodes

This operation migrates data in the background. When the node's status is shown as Decommissioned, it can be shut down safely.

4). You can use the following command to check the datanode status:

$ bin/hdfs dfsadmin -report
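Decommissioning progress can be watched by filtering the report for the node being removed, for example (a sketch):

$ bin/hdfs dfsadmin -report | grep -A 6 172.20.115.4        (look for "Decommission Status : Decommissioned")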

During data migration, this node should not also be running compute tasks (a NodeManager); otherwise problems tend to appear.

Deleting a NodeManager (the 1.x tasktracker) node online:

1). On the master, edit etc/hadoop/yarn-site.xml:

<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/home/hadoop/hadoop-2.6.0/etc/hadoop/tasktracker-excludes</value>
</property>
 

2). Create the tasktracker-excludes file and add the host name to be deleted:

hadoop4.example.com

3). Refresh the nodes online on the master node:

$ bin/yarn rmadmin -refreshNodes

4). Log in to the ResourceManager web interface to view the node information.
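The same information can also be checked from the command line; after decommissioning, hadoop4 should no longer appear as a RUNNING node (a sketch):

$ bin/yarn node -list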

 
