Hadoop Installation and Configuration

Recently, the company took over a new project that needs to crawl the company's entire wireless network in a distributed way, update the web page index, and compute PR values. Because the data volume is very large (tens of millions of records), the processing has to be distributed, and the new version will adopt the Hadoop architecture. This post describes the general process of configuring Hadoop and the points to watch out for; it is reposted from someone else's article (in later posts I will focus on the problems I ran into during configuration, as a small summary): http://www.cnblogs.com/wayne1017/archive/2007/03/20/678724.html

This document walks through a Hadoop installation as a working example, describing the problems you may encounter while deploying Hadoop and how to solve them.

Hardware environment
There are three machines in total, all running the FC5 (Fedora Core 5) system. Java is JDK 1.6.0. The IP configuration is as follows:
dbrg-1: 202.197.18.72
dbrg-2: 202.197.18.73
dbrg-3: 202.197.18.74

It is important to make sure that the host name and IP address of each machine can be correctly resolved.

A very simple test is to ping a host name; for example, on dbrg-1 run ping dbrg-2 and check that it succeeds. If host names cannot be resolved correctly, modify the /etc/hosts file. If the machine is the namenode, you need to add the IP addresses and corresponding host names of all machines in the cluster to its hosts file; if the machine is a datanode, you only need to add its own IP address and the IP address of the namenode machine.
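For instance, the check from dbrg-1 looks like this (the -c 3 option merely limits ping to three packets and is my own habit):
[dbrg@dbrg-1:~]$ ping -c 3 dbrg-2
[dbrg@dbrg-1:~]$ ping -c 3 dbrg-3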

Taking this article as an example, the /etc/hosts file on dbrg-1 should look like this:
127.0.0.1        localhost    localhost
202.197.18.72    dbrg-1       dbrg-1
202.197.18.73    dbrg-2       dbrg-2
202.197.18.74    dbrg-3       dbrg-3

The /etc/hosts file on dbrg-2 should look like this:
127.0.0.1        localhost    localhost
202.197.18.72    dbrg-1       dbrg-1
202.197.18.73    dbrg-2       dbrg-2
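Following the same rule, dbrg-3 is only a datanode, so its /etc/hosts needs just its own entry plus the namenode's (my extrapolation; the original article does not show this file):
127.0.0.1        localhost    localhost
202.197.18.72    dbrg-1       dbrg-1
202.197.18.74    dbrg-3       dbrg-3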

As mentioned in my previous study notes, in HDFS the nodes are divided into a namenode and datanodes; there is only one namenode, while there can be many datanodes. From MapReduce's point of view, nodes are divided into a jobtracker and tasktrackers; there is only one jobtracker, while there can be many tasktrackers.
I deploy the namenode and jobtracker on dbrg-1, and use dbrg-2 and dbrg-3 as datanodes and tasktrackers. You can also deploy the namenode, datanode, jobtracker, and tasktracker all on a single machine.

Directory structure
Hadoop requires that the deployment directory structure be the same on all machines and that each machine has an account with the same user name.
On all three of my machines there is a dbrg account whose home directory is /home/dbrg.
The Hadoop deployment directory structure is as follows: /home/dbrg/hadoopinstall; all Hadoop versions are kept in this directory.
Decompress the hadoop0.12.0 package into hadoopinstall. To make future upgrades easier, it is recommended to set a link pointing to the Hadoop version you want to use:
[dbrg@dbrg-1:hadoopinstall]$ ln -s hadoop0.12.0 hadoop
This way, all the configuration files are in the hadoop/conf/ directory and all the executables are in the hadoop/bin directory.
However, because the configuration files in that directory sit inside the Hadoop installation directory, they would all be overwritten the next time the Hadoop version is upgraded. It is therefore recommended to separate the configuration files from the installation directory. A better approach is to create a directory just for the configuration files, /home/dbrg/hadoopinstall/hadoop-config/, and copy the hadoop-site.xml, slaves, and hadoop-env.sh files from the hadoop/conf/ directory into hadoop-config/.
When it starts, Hadoop only needs these three files in the directory you created. In practice, however, I found that the masters file must also be copied into the hadoop-config/ directory, otherwise an error is reported when Hadoop starts saying that the file "masters" cannot be found. Then point the environment variable $HADOOP_CONF_DIR at this directory. Environment variables can be set in /home/dbrg/.bashrc and /etc/profile.
To make later version upgrades easier, we separate the configuration files from the installation directory and set a link to the Hadoop version we want to use, which reduces configuration file maintenance. In the following sections you will see the benefits of this separation and linking.
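A concrete sketch of the steps just described (file names follow this article, with masters included as noted above; appending to .bashrc is only one way to set the variable):
[dbrg@dbrg-1:hadoopinstall]$ mkdir hadoop-config
[dbrg@dbrg-1:hadoopinstall]$ cp hadoop/conf/hadoop-site.xml hadoop/conf/slaves hadoop/conf/hadoop-env.sh hadoop/conf/masters hadoop-config/
[dbrg@dbrg-1:hadoopinstall]$ echo 'export HADOOP_CONF_DIR=/home/dbrg/hadoopinstall/hadoop-config' >> /home/dbrg/.bashrc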

SSH settings
After Hadoop starts, the namenode uses SSH (Secure Shell) to start and stop the various daemon processes on each node. Commands executed between nodes must therefore not require a password, so we need to configure SSH to use password-free public key authentication.
First, ensure that an SSH server is installed and running properly on every machine. In practice we use OpenSSH, a free open-source implementation of the SSH protocol. The OpenSSH version installed by default on FC5 is OpenSSH 4.3p2.
Taking the three machines in this article as an example: dbrg-1 is the master node and needs to actively initiate SSH connections to dbrg-2 and dbrg-3. For the SSH service, dbrg-1 is the SSH client while dbrg-2 and dbrg-3 are SSH servers, so you must make sure the sshd service is running on dbrg-2 and dbrg-3. Simply put, a key pair (a private key and a public key) needs to be generated on dbrg-1, and the public key is copied to dbrg-2 and dbrg-3. Then, when dbrg-1 initiates an SSH connection to dbrg-2, dbrg-2 generates a random number, encrypts it with dbrg-1's public key, and sends it to dbrg-1; dbrg-1 decrypts it with its private key and sends the decrypted number back to dbrg-2; dbrg-2 confirms that the number is correct and allows dbrg-1 to connect. This completes one public key authentication exchange.

For the three machines in this article, first generate the key pair on dbrg-1:
[dbrg@dbrg-1:~]$ ssh-keygen -t rsa
This command generates a key pair for the user dbrg on dbrg-1. When asked for the path to save the key, press Enter to use the default path; when prompted for a passphrase for the generated key, press Enter again, i.e. set an empty passphrase. The generated key pair, id_rsa and id_rsa.pub, is stored in the /home/dbrg/.ssh directory by default. Then append the content of id_rsa.pub to the /home/dbrg/.ssh/authorized_keys file on every machine, including the local one. If a machine already has an authorized_keys file, append the content of id_rsa.pub to the end of that file; if it does not, a simple cp or scp is enough. The following steps assume that no authorized_keys file exists on any machine yet.
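For completeness, if a machine did already have an authorized_keys file, the append mentioned above could be done in one line (a sketch of my own; ssh will still ask for the dbrg password once at this stage). Locally on dbrg-1:
[dbrg@dbrg-1:.ssh]$ cat id_rsa.pub >> authorized_keys
And for a remote machine such as dbrg-2:
[dbrg@dbrg-1:.ssh]$ cat id_rsa.pub | ssh dbrg-2 'cat >> /home/dbrg/.ssh/authorized_keys'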

For dbrg-1:
[dbrg@dbrg-1:.ssh]$ cp id_rsa.pub authorized_keys

For dbrg-2 (dbrg-3 is handled in the same way as dbrg-2):
[dbrg@dbrg-2:~]$ mkdir .ssh
[dbrg@dbrg-1:.ssh]$ scp authorized_keys dbrg-2:/home/dbrg/.ssh/
scp here is a remote copy over SSH. At this point you still need to enter the password of the remote host, i.e. the password of the dbrg account on the dbrg-2 machine. Of course, you can also copy the authorized_keys file to the other machines in some other way.

[dbrg@dbrg-2:.ssh]$ chmod 644 authorized_keys
This step is critical. You must ensure that only the owner has write permission on authorized_keys; nobody else may be able to write to it, otherwise SSH will simply not use it. This had me stuck for quite a while when I was configuring SSH.
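In the same spirit (my addition, not one of the original steps), sshd is also strict about the .ssh directory and the home directory themselves: neither may be writable by group or others. If key authentication still fails after the steps above, tightening those permissions usually helps:
[dbrg@dbrg-2:~]$ chmod 700 /home/dbrg/.ssh
[dbrg@dbrg-2:~]$ chmod 755 /home/dbrg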

[dbrg@dbrg-2:.ssh]$ ls -la
drwx------  2 dbrg  .
drwx------  3 dbrg  ..
-rw-r--r--  1 dbrg  authorized_keys
Note that the output of ls -la in the .ssh directory should look like the above on every machine.

Next, you can configure the sshd service on all three machines (in fact this is not required; after the steps above, SSH is already able to work). Modify the file /etc/ssh/sshd_config on the three machines:
# Disable password authentication
PasswordAuthentication no
AuthorizedKeysFile .ssh/authorized_keys
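If you do edit sshd_config, keep in mind (this step is my addition) that the change only takes effect after the sshd service is restarted; on an FC5-era system this is typically done as root with the init script:
[root@dbrg-2 ~]# /etc/init.d/sshd restart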

Now that the SSH configuration on each machine is complete, you can test it, for example by initiating an SSH connection from dbrg-1 to dbrg-2:
[dbrg@dbrg-1:~]$ ssh dbrg-2
If SSH is configured correctly, a message like the following is displayed:
The authenticity of host [dbrg-2] can't be established.
Key fingerprint is 1024 5f:a0:0b:65:d3:82:df:ab:44:62:6d:98:9c:fe:e9:52.
Are you sure you want to continue connecting (yes/no)?
OpenSSH is telling you that it does not know this host, but you do not need to worry, because this is the first time you log on to it. Type "yes". This adds the host's "identification" to the ~/.ssh/known_hosts file, and the prompt will not appear again the next time you access this host.
Then you will find that you can establish an SSH connection without entering the password. Congratulations, the configuration is successful.
But don't forget to test local SSH as well: run ssh dbrg-1 on dbrg-1 itself.

Hadoop Environment Variables
Set the environment variables required by Hadoop in hadoop-env.sh under the /home/dbrg/hadoopinstall/hadoop-config directory. JAVA_HOME is a required variable; HADOOP_HOME may be set or left unset. If it is not set, HADOOP_HOME defaults to the parent directory of the bin directory, i.e. /home/dbrg/hadoopinstall/hadoop in this article. My settings are as follows:
export HADOOP_HOME=/home/dbrg/hadoopinstall/hadoop
export JAVA_HOME=/usr/java/jdk1.6.0
Here you can see the advantage of the hadoop link created for hadoop0.12.0 earlier: when you upgrade the Hadoop version later, you do not need to modify the configuration files; you only need to change the link.
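For example, when a newer release arrives later (hadoop0.13.0 below is just a hypothetical placeholder for whatever version you unpack into hadoopinstall), only the link changes, while $HADOOP_CONF_DIR keeps pointing at the same hadoop-config/ directory:
[dbrg@dbrg-1:hadoopinstall]$ rm hadoop
[dbrg@dbrg-1:hadoopinstall]$ ln -s hadoop0.13.0 hadoop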

Hadoop configuration file
As mentioned above, open the slaves file in the hadoop-config/ directory. This file specifies all the slave nodes, one host name per line: in this article, that is dbrg-2 and dbrg-3, so the slaves file looks like this:
dbrg-2
dbrg-3
The hadoop-default.xml file in the conf/ directory contains all of Hadoop's configuration items, but it must not be modified directly! Define the items you need in hadoop-site.xml under the hadoop-config/ directory; its values override the defaults in hadoop-default.xml. Customize it according to your actual needs. Below is my configuration file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>dbrg-1:9000</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for DFS.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>dbrg-1:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/dbrg/hadoopinstall/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/dbrg/hadoopinstall/filesystem/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/dbrg/hadoopinstall/filesystem/data</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>
</configuration>


Deploy hadoop
As mentioned above, Hadoop's environment variables and configuration files are all on the dbrg-1 machine. Now we need to deploy Hadoop to the other machines, keeping the directory structure consistent.
[dbrg@dbrg-1:~]$ scp -r /home/dbrg/hadoopinstall dbrg-2:/home/dbrg/
[dbrg@dbrg-1:~]$ scp -r /home/dbrg/hadoopinstall dbrg-3:/home/dbrg/
So far, we can say that hadoop has been deployed on various machines. Now let's start hadoop.

Start hadoop
Before starting, we need to format the namenode. Enter the ~/hadoopinstall/hadoop directory and execute the following command:
[dbrg@dbrg-1:hadoop]$ bin/hadoop namenode -format
The format should succeed without errors. If it fails, go to the hadoop/logs/ directory and check the log files.
Now we can officially start Hadoop. There are many startup scripts under bin/, which you can run as needed.
* start-all.sh starts all the Hadoop daemons, including the namenode, datanodes, jobtracker, and tasktrackers
* stop-all.sh stops all the Hadoop daemons
* start-mapred.sh starts the Map/Reduce daemons, i.e. the jobtracker and tasktrackers
* stop-mapred.sh stops the Map/Reduce daemons
* start-dfs.sh starts the Hadoop DFS daemons, i.e. the namenode and datanodes
* stop-dfs.sh stops the DFS daemons

Here, we simply start all the daemons
[dbrg@dbrg-1:hadoop]$ bin/start-all.sh
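A quick way to check that the daemons actually came up (my own habit, not part of the original article) is the JDK's jps tool, which lists the running Java processes; on dbrg-1 you should see processes such as NameNode and JobTracker, and on the slaves DataNode and TaskTracker:
[dbrg@dbrg-1:hadoop]$ /usr/java/jdk1.6.0/bin/jps
[dbrg@dbrg-2:~]$ /usr/java/jdk1.6.0/bin/jps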

Similarly, if you want to stop hadoop
[dbrg@dbrg-1:hadoop]$ bin/stop-all.sh

HDFS operations
Run the hadoop command in the bin/ directory to see all the operations Hadoop supports and how to use them. Here we take a few simple operations as examples.
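For example, running the commands below with no further arguments just prints their usage text, which is a convenient way to discover the available sub-commands and DFS operations (my own tip, not part of the original article):
[dbrg@dbrg-1:hadoop]$ bin/hadoop
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs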

Create directory
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -mkdir testdir
Create a directory named testdir in HDFS

Copy a file
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -put /home/dbrg/large.zip testfile.zip
This copies the local file /home/dbrg/large.zip into the HDFS user directory /user/dbrg/ under the name testfile.zip.

View existing files
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -ls
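A couple of other everyday DFS shell operations that I find handy (these are my additions, not from the original article; the local destination path is just an example):
Copy a file from HDFS back to the local filesystem:
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -get testfile.zip /home/dbrg/large-copy.zip
Delete a file in HDFS:
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -rm testfile.zip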
