Hadoop Installation, Configuration, and Troubleshooting


Many new users run into problems the first time they install, configure, deploy, and use Hadoop. This article is both a summary of my own testing and a reference for beginners (of course, there is plenty of related information online).

 

Hardware environment
There are two physical machines in total: one serves as the master, and the other runs two guest systems in VMs (as slaves). All three systems run Ubuntu 11.04, and Java is JDK 1.6.0_25. The IP configuration is as follows:

 

VM network connection mode: both VMs use 'bridged' networking.

Frank-1 (hostname): 192.168.0.100 master ----- namenode
Frank-2 (hostname): 192.168.0.102 slave ------- datanode
Frank-3 (hostname): 192.168.0.103 slave ------- datanode

 

Changing the hostname in Linux:

A method that does not persist:

$ hostname <new-hostname> // this change is only temporary; after a reboot the original hostname is restored

Some online guides mention smit, but that appears to be an AIX tool rather than a Linux command (not covered here).

The simplest, most direct method: edit /etc/hostname.
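As a minimal sketch on Ubuntu (assuming sudo privileges; the name Frank-2 is just an example):

$ sudo hostname Frank-2                      # temporary change, lost on reboot
$ echo "Frank-2" | sudo tee /etc/hostname    # persistent, takes effect after the next reboot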

 

In addition, we need to make sure that the host name and IP address of each machine can be correctly resolved.


 

A simple test is to ping by hostname; for example, ping Frank-2 from Frank-1. If the ping succeeds, you're OK. If hostnames cannot be resolved, edit the /etc/hosts file. On the machine that acts as the namenode, the hosts file needs the IP addresses and hostnames of every host in the cluster; on a machine that acts only as a datanode, the hosts file only needs its own IP address and the namenode's IP address.

 

Taking this article as an example, the /etc/hosts file on Frank-1 looks like this:
127.0.0.1 localhost
192.168.0.100 Frank-1
192.168.0.102 Frank-2
192.168.0.103 Frank-3

The /etc/hosts file on Frank-2 looks like this:
127.0.0.1 localhost
192.168.0.100 Frank-1
192.168.0.102 Frank-2

The /etc/hosts file on Frank-3 looks like this:
127.0.0.1 localhost
192.168.0.100 Frank-1
192.168.0.103 Frank-3

 

In Hadoop, HDFS nodes are divided into the namenode and datanodes: there is exactly one namenode, and there can be many datanodes. In MapReduce, nodes are divided into the jobtracker and tasktrackers: there is exactly one jobtracker, and there can be many tasktrackers.
I deployed the namenode and jobtracker on Frank-1, with Frank-2 and Frank-3 as datanodes and tasktrackers. You can also deploy the namenode, datanode, jobtracker, and tasktracker all on one machine.

 

Required software:

jdk-6u25-linux-i586.bin

hadoop-0.21.0.tar.gz (for integration with HBase you must choose a hadoop-0.20.x release: in my tests hadoop-0.20.0 did not work, hadoop-0.20.1 was not tested, and hadoop-0.20.2 worked, so hadoop-0.20.2 is recommended)

OpenSSH

 

 

 

Directory structure


Hadoop requires that the deployment directory structure be identical on every machine and that every machine has an account with the same user name.
All three machines have a frank account whose home directory is /home/Frank.
The Hadoop deployment directory is /home/Frank/hadoopinstall; all Hadoop versions are stored there.
Unpack the hadoop-0.21.0 tarball into hadoopinstall. To make future upgrades easier, it is recommended to create a symlink pointing to the Hadoop version in use:
$ ln -s hadoop-0.21.0 hadoop
In this way, all configuration files are under hadoop/conf/ and all executables are under hadoop/bin/.
However, because the configuration files in that layout sit inside the Hadoop installation directory, they would all be overwritten by a future Hadoop upgrade. It is therefore recommended to separate the configuration files from the installation directory. A better approach is to create a dedicated configuration directory, /home/Frank/hadoopinstall/hadoop-config/, copy the core-site.xml, hdfs-site.xml, mapred-site.xml, masters, slaves, and hadoop-env.sh files from hadoop/conf/ into hadoop-config/, and point the environment variable $HADOOP_CONF_DIR at this directory. The environment variable is set in /home/Frank/.bashrc and /etc/profile.
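As a minimal sketch of these steps (run as user frank on Frank-1; the paths follow the layout described above):

$ cd /home/Frank/hadoopinstall
$ tar zxf hadoop-0.21.0.tar.gz
$ ln -s hadoop-0.21.0 hadoop
$ mkdir hadoop-config
$ cp hadoop/conf/core-site.xml hadoop/conf/hdfs-site.xml hadoop/conf/mapred-site.xml \
     hadoop/conf/masters hadoop/conf/slaves hadoop/conf/hadoop-env.sh hadoop-config/
$ echo 'export HADOOP_CONF_DIR=/home/Frank/hadoopinstall/hadoop-config' >> /home/Frank/.bashrc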

 

Separating the configuration files from the installation directory and using a symlink to the Hadoop version in use makes future upgrades easier and reduces configuration maintenance. You will see the benefits of this separation and linking in the following sections.

 

 

SSH installation and Setup


After Hadoop starts, the namenode starts and stops the various daemons on each node via SSH (Secure Shell). Commands executed between nodes must therefore not prompt for a password, so SSH has to be configured for passwordless public-key authentication.
First, make sure an SSH server is installed and running properly on every machine. In practice we use OpenSSH, a free, open-source implementation of the SSH protocol. Installing it is simple: $ sudo apt-get install openssh-server
Taking the three machines in this article as an example: Frank-1 is the master node, and it needs to initiate SSH connections to Frank-2 and Frank-3. For the SSH service, Frank-1 is the SSH client, while Frank-2 and Frank-3 are SSH servers. So on Frank-2 and Frank-3, make sure the sshd service is running and starts at boot. You can check the listening sockets with $ netstat -ntl (SSH listens on port 22 by default).

Briefly: a key pair (a private key and a public key) is generated on Frank-1, and the public key is copied to Frank-2 and Frank-3. When Frank-1 initiates an SSH connection to Frank-2, Frank-2 generates a random number, encrypts it with Frank-1's public key, and sends it to Frank-1. Frank-1 decrypts it with its private key and sends the result back; once Frank-2 confirms the decrypted number is correct, it allows Frank-1 to connect. That completes one round of public-key authentication.

For the three machines in this article, first generate a key pair on Frank-1 that does not require a passphrase:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
This command generates a key pair for user frank on Frank-1. The generated keys, id_dsa and id_dsa.pub, are stored in /home/Frank/.ssh by default. Then append the content of id_dsa.pub to the /home/Frank/.ssh/authorized_keys file on every machine (including the local machine), that is:

On Frank-1:
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

$ scp ~/.ssh/id_dsa.pub Frank-3:/home/Frank

For Frank-2 (Frank-3 is handled the same way):
$ mkdir .ssh ------------------ run this on Frank-2 first, to create the .ssh directory
$ scp ~/.ssh/authorized_keys Frank-2:/home/Frank/.ssh
scp performs a remote copy over SSH; you will be asked for the remote host's password, i.e., the password of the frank account on Frank-2. Of course, you can also copy the authorized_keys file to the other machines by any other means.
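As an alternative sketch, recent OpenSSH installations include an ssh-copy-id helper that appends the public key to the remote authorized_keys file in one step (not used in the original setup; it assumes the frank account exists on the target hosts):

$ ssh-copy-id -i ~/.ssh/id_dsa.pub frank@Frank-2
$ ssh-copy-id -i ~/.ssh/id_dsa.pub frank@Frank-3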

Then, configure the sshd service on all three machines (strictly speaking this is optional; after the steps above, passwordless SSH already works). Edit /etc/ssh/sshd_config on the three hosts:
# disable password authentication #
PasswordAuthentication no
AuthorizedKeysFile %h/.ssh/authorized_keys
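If you do edit sshd_config, restart the SSH service so the change takes effect (the exact command depends on the distribution; on Ubuntu either of the following should work):

$ sudo service ssh restart
$ sudo /etc/init.d/ssh restart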

Now the SSH configuration on each machine is complete. Test it; for example, initiate an SSH connection from Frank-1 to Frank-2:
$ ssh Frank-2
If SSH is configured correctly, a message like the following is displayed:
The authenticity of host 'Frank-2' can't be established.
Key fingerprint is 1024 5f:a0:0b:65:d3:82:df:ab:44:62:6d:98:9c:fe:e9:52.
Are you sure you want to continue connecting (yes/no)?
OpenSSH is telling you that it does not know this host, which is nothing to worry about since this is the first time you are logging in to it. Type "yes"; the host's identification is then added to the ~/.ssh/known_hosts file, and the prompt is not shown on subsequent connections.
You will then find that the SSH connection is established without entering a password. Congratulations, the configuration succeeded.
However, do not forget to test SSH to the local machine as well: $ ssh Frank-1

 

JDK installation is simple and is not described here. The key is not to forget to configure the environment variables in /etc/profile (the JDK must be configured on the namenode and on every datanode):

 

# set Java environment
export JAVA_HOME=/home/Frank/javainstall/JDK
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=.:$JAVA_HOME/bin:$JRE_HOME/bin:$PATH

 

 

Add these lines above the "umask 022" line in /etc/profile.
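To make the new variables take effect in the current shell and sanity-check them (the exact version string depends on your JDK build):

$ source /etc/profile
$ echo $JAVA_HOME
$ java -version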

 

 

Hadoop environment variables (must be configured on the namenode and on every datanode)

 


Add to /etc/profile:

# set Hadoop environment
export HADOOP_HOME=/home/Frank/hadoopinstall/hadoop
export PATH=.:$HADOOP_HOME/bin:$PATH

 

In /home/Frank/hadoopinstall/hadoop-config/hadoop-env.sh

uncomment the JAVA_HOME line (remove the leading #) and set it:

# The java implementation to use. Required.
export JAVA_HOME=/home/Frank/javainstall/JDK
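A quick check that the shell can now find the Hadoop scripts (output depends on the tarball you installed):

$ source /etc/profile
$ hadoop version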

 

 

 

 

Hadoop configuration file

 

Only the namenode (Frank-1 in this article) needs the masters and slaves files modified:

 

masters:

Frank-1

 

slaves:

Frank-2

Frank-3

 

 

core-site.xml, hdfs-site.xml, mapred-site.xml, and the other files can be configured according to your specific needs; below is my simple configuration.

core-site.xml is configured as follows:

 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <!-- global properties -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/Frank/hadoopinstall/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <!-- file system properties -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://Frank-1:9000</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for DFS.</description>
  </property>
</configuration>

 

 

hdfs-site.xml is configured as follows:

 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>
</configuration>

 

 

mapred-site.xml is configured as follows:

 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>Frank-1:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>

 

 


Now deploy Hadoop to the other machines, making sure the directory structure is identical:
$ scp -r /home/Frank/hadoopinstall Frank-2:/home/Frank/
$ scp -r /home/Frank/hadoopinstall Frank-3:/home/Frank/

After copying, modify the masters and slaves files as appropriate on each machine, delete the copied hadoop link, and recreate it.
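As a minimal sketch of recreating the link on a slave (shown for Frank-2; Frank-3 is handled the same way, and the paths assume the layout used above):

$ ssh Frank-2
$ cd /home/Frank/hadoopinstall
$ rm -rf hadoop
$ ln -s hadoop-0.21.0 hadoop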

At this point, Hadoop has been deployed on all machines. Now let's start it.

Start Hadoop
Before starting, format the namenode first. Go to the ~/hadoopinstall/hadoop directory and execute:
$ hadoop namenode -format

Here we simply start all the daemons:
$ start-all.sh

Similarly, to stop Hadoop:
$ stop-all.sh
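To check which daemons actually started, you can run the JDK's jps tool on each node (this check is my addition, not part of the original walkthrough):
$ jps
On Frank-1 this should list processes such as NameNode, SecondaryNameNode, and JobTracker; on Frank-2 and Frank-3 it should list DataNode and TaskTracker (the exact set may vary with the Hadoop version).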

 

HDFS operations

View existing files:
$ hadoop dfs -ls /
Other operations can be found online.
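A few more common operations, as a sketch (the paths here are examples only, not directories created elsewhere in this article):

$ hadoop dfs -mkdir /user/frank/input
$ hadoop dfs -put localfile.txt /user/frank/input
$ hadoop dfs -cat /user/frank/input/localfile.txt
$ hadoop dfs -rmr /user/frank/input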

 

 

 

 

 
