This article uses hadoop-0.12.0 as an example to walk through installing and using Hadoop, pointing out the problems you are likely to run into when deploying Hadoop and how to solve them.
Hardware environment
There are three machines in total, all running FC5, with Java jdk1.6.0. The IP configuration is as follows:
dbrg-1:202.197.18.72
dbrg-2:202.197.18.73
dbrg-3:202.197.18.74
One thing to emphasize here: make sure that every machine's hostname and IP address resolve correctly.
A very simple test is to ping the hostname, for example ping dbrg-2 from dbrg-1; if the name resolves and the ping succeeds, you are fine. If resolution is not correct, edit the /etc/hosts file. On the machine that will act as the NameNode, the hosts file needs to contain the IP addresses and corresponding hostnames of all machines in the cluster; on a DataNode, the hosts file only needs the machine's own IP address and the IP address of the NameNode machine.
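For instance, a quick resolution check from dbrg-1 might look like this (a minimal sketch; the -c 1 option just limits ping to a single packet):
[dbrg@dbrg-1:~]$ ping -c 1 dbrg-2
If dbrg-2 cannot be resolved, ping reports an unknown host error, which means the hosts file still needs fixing as described above.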
For example, the /etc/hosts file on dbrg-1 should look like this:
127.0.0.1 localhost localhost
202.197.18.72 dbrg-1 dbrg-1
202.197.18.73 dbrg-2 dbrg-2
202.197.18.74 dbrg-3 dbrg-3
The /etc/hosts file on dbrg-2 should look like this:
127.0.0.1 localhost localhost
202.197.18.72 dbrg-1 dbrg-1
202.197.18.73 dbrg-2 dbrg-2
As mentioned in the previous study note, Hadoop divides nodes in two ways. From HDFS's point of view, nodes are divided into a NameNode and DataNodes: there is only one NameNode, but there can be many DataNodes. From MapReduce's point of view, nodes are divided into a JobTracker and TaskTrackers: there is only one JobTracker, but there can be many TaskTrackers.
I deployed the NameNode and JobTracker on dbrg-1, with dbrg-2 and dbrg-3 serving as DataNodes and TaskTrackers. Of course, you can also deploy the NameNode, DataNode, JobTracker and TaskTracker all on a single machine.
Directory Structure
Hadoop requires the same deployment directory structure on every machine, as well as an account with the same user name on every machine.
On my three machines there is a dbrg account whose home directory is /home/dbrg.
The Hadoop deployment directory structure is as follows: /home/dbrg/HadoopInstall, and all versions of Hadoop are placed in this directory.
Extract the hadoop-0.12.0 tarball into HadoopInstall. To make later upgrades easier, it is recommended to create a link named hadoop pointing to the Hadoop version you want to use:
[dbrg@dbrg-1:HadoopInstall]$ ln -s hadoop-0.12.0 hadoop
This way, all configuration files live under hadoop/conf/ and all executables live under hadoop/bin/.
However, because the Hadoop configuration files above sit inside the Hadoop installation directory, they would all be overwritten when the Hadoop version is upgraded later, so it is recommended to separate the configuration files from the installation directory. A better approach is to create a configuration directory, /home/dbrg/HadoopInstall/hadoop-conf/, and copy the hadoop-site.xml, slaves and hadoop-env.sh files from hadoop/conf/ into it (a strange point: the official Hadoop getting-started guide says only these three files need to be copied into the newly created directory, but in practice I found that the masters file must also be copied into hadoop-conf/, otherwise Hadoop complains at startup that it cannot find the masters file), and then point the environment variable $HADOOP_CONF_DIR at that directory. The environment variable is set in /home/dbrg/.bashrc and /etc/profile.
To sum up: to make later upgrades easier, we separate the configuration files from the installation directory, and by pointing a link at the Hadoop version we want to use we reduce the maintenance burden of the configuration files. In the following sections you will see the benefits of this separation and of the link.
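As a concrete illustration, the separation described above could be carried out roughly like this (a sketch only, assuming the layout used in this article and that you are logged in as dbrg on dbrg-1):
[dbrg@dbrg-1:~]$ cd /home/dbrg/HadoopInstall
[dbrg@dbrg-1:HadoopInstall]$ mkdir hadoop-conf
[dbrg@dbrg-1:HadoopInstall]$ cp hadoop/conf/hadoop-site.xml hadoop/conf/slaves hadoop/conf/hadoop-env.sh hadoop/conf/masters hadoop-conf/
and then add the following line to /home/dbrg/.bashrc (and /etc/profile):
export HADOOP_CONF_DIR=/home/dbrg/HadoopInstall/hadoop-conf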
SSH Settings
When Hadoop starts, the NameNode starts and stops the various daemons on each node via SSH (Secure Shell), which requires that commands executed between nodes do not prompt for a password. So we need to configure SSH to use password-less public key authentication.
First make sure that an SSH server is installed and running normally on each machine. Here we use OpenSSH, a free open-source implementation of the SSH protocol. The OpenSSH version installed by default on FC5 is OpenSSH 4.3p2.
Take the three machines in this article as an example. dbrg-1 is the master node and needs to initiate SSH connections to dbrg-2 and dbrg-3. For the SSH service, dbrg-1 is the SSH client while dbrg-2 and dbrg-3 are SSH servers, so make sure the sshd service has been started on dbrg-2 and dbrg-3. Simply put, a key pair (a private key and a public key) needs to be generated on dbrg-1 and the public key copied to dbrg-2 and dbrg-3. Then, when dbrg-1 initiates an SSH connection to dbrg-2, dbrg-2 generates a random number, encrypts it with dbrg-1's public key and sends it to dbrg-1; dbrg-1 decrypts it with its private key and sends the decrypted number back to dbrg-2; once dbrg-2 confirms the number is correct, it allows dbrg-1 to connect. This completes one round of public key authentication.
For the three machines in this article, first generate the key pair on the dbrg-1:
[dbrg@dbrg-1:~]$ ssh-keygen -t rsa
This command generates a key pair for the user dbrg on dbrg-1. Just press Enter at the prompt for the save path to accept the default, and press Enter again when asked for a passphrase, i.e. leave the passphrase empty. The generated key pair, id_rsa and id_rsa.pub, is stored in the /home/dbrg/.ssh directory by default. The contents of id_rsa.pub then have to be copied into the /home/dbrg/.ssh/authorized_keys file on each machine (including dbrg-1 itself): if an authorized_keys file already exists on a machine, append the contents of id_rsa.pub to the end of that file; if there is no authorized_keys file, simply cp or scp it into place. The steps below assume that none of the machines has an authorized_keys file yet.
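For the case where authorized_keys already exists on a target machine, the append step might look roughly like this (only a sketch; id_rsa_dbrg-1.pub is a hypothetical name for the public key after it has been copied over):
[dbrg@dbrg-2:~]$ cat /home/dbrg/id_rsa_dbrg-1.pub >> /home/dbrg/.ssh/authorized_keys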
For dbrg-1
[dbrg@dbrg-1:.ssh]$ cp id_rsa.pub authorized_keys
For dbrg-2 (dbrg-3 is handled the same way as dbrg-2)
[dbrg@dbrg-2:~]$ mkdir .ssh
[dbrg@dbrg-1:.ssh]$ scp authorized_keys dbrg-2:/home/dbrg/.ssh/
The scp here is a remote copy over SSH; you will need to enter the password of the remote host, i.e. the password of the dbrg account on the dbrg-2 machine. Of course, you can also copy the authorized_keys file to the other machines by other means.
[dbrg@dbrg-2:.ssh]$ chmod 644 authorized_keys
This step is critical: you must make sure that only the owner of authorized_keys has read and write access to it and that nobody else has write permission, otherwise SSH will refuse to use it. This kept me stuck for quite a while when I was configuring SSH.
[dbrg@dbrg-2:.ssh]$ ls -la
drwx------  2 dbrg dbrg .
drwx------  3 dbrg dbrg ..
-rw-r--r--  1 dbrg dbrg authorized_keys
Note that the ls -la output of the .ssh directory on each machine should look the same as the above.
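If the permissions on some machine do not match, one way to tighten them is along these lines (a sketch, assuming the dbrg account owns the files):
[dbrg@dbrg-2:~]$ chmod 700 /home/dbrg/.ssh
[dbrg@dbrg-2:~]$ chmod 644 /home/dbrg/.ssh/authorized_keys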
Then the sshd service can be configured on all three machines (strictly speaking this is optional; SSH already works after the steps above). On all three machines, modify the file /etc/ssh/sshd_config:
# Disable password authentication
PasswordAuthentication no
AuthorizedKeysFile .ssh/authorized_keys
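After modifying sshd_config, the sshd service has to be restarted for the change to take effect; on FC5 something like the following should work (run as root):
[root@dbrg-2:~]# service sshd restart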
At this point the SSH configuration on each machine is complete and can be tested, for example by connecting from dbrg-1 to dbrg-2 via SSH:
[dbrg@dbrg-1:~]$ ssh dbrg-2
If the SSH configuration is OK, the following message appears
The authenticity of host 'dbrg-2' can't be established.
Key fingerprint is 1024 5f:a0:0b:65:d3:82:df:ab:44:62:6d:98:9c:fe:e9:52.
Are you sure you want to continue connecting (yes/no)?
OpenSSH is telling you that it does not know this host, but you don't need to worry, because this is the first time you are logging into it. Type yes. This adds the host's identification to the ~/.ssh/known_hosts file, and the message will not be shown again the next time you connect to this host.
Then you will find that you can establish the SSH connection without entering a password. Congratulations, the configuration was successful.
But don't forget to also test SSH to the local machine, i.e. ssh dbrg-1.
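That local test is simply:
[dbrg@dbrg-1:~]$ ssh dbrg-1
and it should likewise log you in without asking for a password.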
Hadoop Environment variables
Set the environment variables Hadoop needs in hadoop-env.sh in the /home/dbrg/HadoopInstall/hadoop-conf directory. JAVA_HOME is the only variable that must be set. HADOOP_HOME may be set or left unset; if it is not set, HADOOP_HOME defaults to the parent directory of the bin directory, which in this article is /home/dbrg/HadoopInstall/hadoop. This is how I set it up:
export HADOOP_HOME=/home/dbrg/HadoopInstall/hadoop
export JAVA_HOME=/usr/java/jdk1.6.0
Here you can see the advantage of the hadoop link to hadoop-0.12.0 created earlier: when you upgrade the Hadoop version later, you don't need to change the configuration file, just update the link.
Hadoop configuration file
As mentioned earlier, in the hadoop-conf/ directory, open the slaves file, which lists all the slave nodes, one hostname per line. In this article these are dbrg-2 and dbrg-3, so the slaves file should look like this:
dbrg-2
dbrg-3
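For comparison, the masters file mentioned earlier (the one that also has to be copied into hadoop-conf/) just lists the master node, so in this cluster it would presumably contain a single line:
dbrg-1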
hadoop-default.xml in the conf/ directory contains all of Hadoop's configuration items, but it must not be modified directly! Instead, define the items we need in hadoop-site.xml in the hadoop-conf/ directory; their values override the defaults in hadoop-default.xml and can be customized according to your actual needs. Here is my configuration file:
<?xml version= "1.0"?>
<?xml-stylesheet type= "text/xsl" href= configuration.xsl "?>"
<!--put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>dbrg-1:9000</value>
<description>The name of the default file system. Either the literal string "local" or a host:port for DFS.</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>dbrg-1:9001</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/dbrg/HadoopInstall/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/dbrg/HadoopInstall/filesystem/name</value>
<description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/dbrg/HadoopInstall/filesystem/data</value>
<description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
</property>
</configuration>
Deploying Hadoop
Now that all these Hadoop environment variables and configuration files have been set up on the dbrg-1 machine, it is time to deploy Hadoop to the other machines, keeping the directory structure consistent.
[dbrg@dbrg-1:~]$ scp -r /home/dbrg/HadoopInstall dbrg-2:/home/dbrg/
[dbrg@dbrg-1:~]$ scp -r /home/dbrg/HadoopInstall dbrg-3:/home/dbrg/
At this point Hadoop has been deployed on every machine, so let's start it up.
Start Hadoop
Before starting, we need to format the NameNode. Enter the ~/HadoopInstall/hadoop directory and execute the following command:
[dbrg@dbrg-1:hadoop]$ bin/hadoop namenode -format
Barring surprises, you should see a message that the format succeeded. If it doesn't work, check the log files in the hadoop/logs/ directory.
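For example, the NameNode log could be inspected with something like the following; the exact file name depends on the user and host names, so the name below is only what it would presumably look like in this article's setup:
[dbrg@dbrg-1:hadoop]$ tail -n 50 logs/hadoop-dbrg-namenode-dbrg-1.log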
Now it is time to actually start Hadoop. There are a number of startup scripts under bin/ that you can run according to your needs:
* start-all.sh starts all the Hadoop daemons, including the NameNode, DataNodes, JobTracker and TaskTrackers
* stop-all.sh stops all the Hadoop daemons
* start-mapred.sh starts the Map/Reduce daemons, i.e. the JobTracker and TaskTrackers
* stop-mapred.sh stops the Map/Reduce daemons
* start-dfs.sh starts the Hadoop DFS daemons, i.e. the NameNode and DataNodes
* stop-dfs.sh stops the DFS daemons
Here, we simply start all the daemons:
[dbrg@dbrg-1:hadoop]$ bin/start-all.sh
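A quick way to verify that the daemons actually came up is to run the JDK's jps tool on each node (a rough check, assuming jps from jdk1.6.0 is on the PATH); on dbrg-1 you would expect to see the NameNode and JobTracker listed, and on dbrg-2/dbrg-3 the DataNode and TaskTracker:
[dbrg@dbrg-1:~]$ jps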
Similarly, if you want to stop Hadoop:
[dbrg@dbrg-1:hadoop]$ bin/stop-all.sh
HDFS operations
Running the hadoop command in the bin/ directory shows all the operations Hadoop supports and how to use them. Here are a few simple examples.
Create a Directory
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -mkdir TestDir
Creates a directory named TestDir in HDFS.
Copying files
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -put /home/dbrg/large.zip testfile.zip
Copies the local file large.zip into the HDFS directory /user/dbrg/ under the name testfile.zip.
View Existing Files
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -ls
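Other everyday operations follow the same pattern. For instance, copying a file back out of HDFS and then deleting it from HDFS should look roughly like this (a sketch using the dfs -get and dfs -rm subcommands):
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -get testfile.zip /home/dbrg/large-copy.zip
[dbrg@dbrg-1:hadoop]$ bin/hadoop dfs -rm testfile.zip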