This series of articles describes how to install and configure Hadoop in fully distributed mode and covers some basic operations in that mode. The plan is to start with a single machine and add nodes later; this article only describes how to install and configure a single node.
1. Install Namenode and JobTracker
This is the first and most important machine of the cluster in fully distributed mode. I use VMware to run Ubuntu Linux 11.10 Server as a virtual machine; this article does not cover the Linux installation itself. By default the installer creates a user named abc with sudo permission; the root password is random and root privileges can only be gained temporarily through sudo. To be safe, the first thing to do after installing the system is to set the root password with sudo passwd root; the system will not ask for the original password, just enter the new password twice. With the root password in hand, you will not be helpless if an operation goes wrong.
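For reference, the exchange looks roughly like this (the exact prompts may vary by release):
sudo passwd root
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully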
1.1 install JDK
Ubuntu itself offers a quick way to install the JDK: sudo apt-get install sun-java6-jdk. I tried it and it did not succeed; as I recall it reported that the package could not be found. I do not know whether the network was the problem or the package name is no longer valid, so I gave up and used another method.
Go to the Oracle website and find the latest JDK 1.6 release, 1.6.0_31. The download link is obtained as follows.
Because this Ubuntu Linux installation is 32-bit, choose the 32-bit JDK. Click "Accept License Agreement", right-click "jdk-6u31-linux-i586.bin", and copy its link from the properties: http://download.oracle.com/otn-pub/java/jdk/6u31-b04/jdk-6u31-linux-i586.bin. Then return to the virtual machine, log on as abc, and enter the command:
wget http://download.oracle.com/otn-pub/java/jdk/6u31-b04/jdk-6u31-linux-i586.bin
This download takes some time. After it completes, there will be a jdk-6u31-linux-i586.bin file under /home/abc.
sudo mkdir /usr/lib/jvm
cd /usr/lib/jvm
sudo mkdir java
cd java
sudo cp /home/abc/jdk-6u31-linux-i586.bin .
sudo chmod 777 jdk-6u31-linux-i586.bin
./jdk-6u31-linux-i586.bin
This runs the self-extracting installer; when it finishes, the JDK is installed under /usr/lib/jvm/java/jdk1.6.0_31. Next, edit the environment file:
sudo vi /etc/environment
Modify the file as follows:
Append :/usr/lib/jvm/java/jdk1.6.0_31/bin to the PATH line. Note that the colon before /usr is required.
Add these two lines:
CLASSPATH=.:/usr/lib/jvm/java/jdk1.6.0_31/lib
JAVA_HOME=/usr/lib/jvm/java/jdk1.6.0_31
Save
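For reference, after these edits /etc/environment looks roughly like this (a sketch; the existing entries in PATH may differ on your system):
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/lib/jvm/java/jdk1.6.0_31/bin"
CLASSPATH=.:/usr/lib/jvm/java/jdk1.6.0_31/lib
JAVA_HOME=/usr/lib/jvm/java/jdk1.6.0_31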
Note: some Linux systems install packages such as OpenJDK by default, which leads to multiple JVMs coexisting. In that case you also need the update-alternatives command to point the default JVM at the JDK directory just installed.
As it turns out, Ubuntu Linux 11.10 Server installs no other JDK packages by default (the java command was not available before this), so update-alternatives is not needed here.
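If you do run into the multiple-JVM situation, the commands look roughly like this (a sketch; the priority value 1 is arbitrary):
sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/java/jdk1.6.0_31/bin/java 1
sudo update-alternatives --config java
Afterwards, java -version should report the newly installed JDK.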
sudo reboot
1.2 create hadoop users and hadoop groups
After the system restarts, log on as the abc user.
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
Enter the new password twice, then accept or skip the remaining prompts (full name and so on) until the command finishes. The hadoop user is now created.
su
Enter the root password to switch to the root user.
Continue to enter the command:
chmod u+w /etc/sudoers
vi /etc/sudoers
After the line root ALL=(ALL:ALL) ALL, add this line:
hadoop ALL=(ALL:ALL) ALL
This allows the hadoop user to run any command through sudo.
Save
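The relevant part of /etc/sudoers should now read:
root ALL=(ALL:ALL) ALL
hadoop ALL=(ALL:ALL) ALL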
chmod u-w /etc/sudoers
This changes the permission of the sudoers file back to 440, i.e. read-only even for root. When sudo runs on Ubuntu Linux, it checks that this file has mode 440; if it does not, sudo refuses to work, so the permission must be changed back to the original 440.
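You can verify with ls -l /etc/sudoers; the mode should show as -r--r----- (that is, 440).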
The work as root is done; enter exit to leave the root shell.
Enter exit again to log out of the abc user.
1.3 configure the SSH Key so that hadoop users can log on to the cluster without a password
Log on as the hadoop user just created.
sudo apt-get install ssh
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Run the ssh localhost command to test whether ssh works. If you do not need to enter a password, it is correct.
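On the very first connection, SSH will ask whether to continue connecting because the host key is unknown; type yes. If a shell then opens with no password prompt (type exit to leave it), the key setup is working.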
1.4 check the host name, modify /etc/hostname and /etc/hosts
sudo vi /etc/hostname
Check whether the automatically assigned host name is appropriate. If not, change it to a meaningful name such as namenode, then save.
ifconfig
Check the current IP address and record it.
sudo vi /etc/hosts
The two lines starting with 127 do not need to be changed;
add a line with the recorded IP address and the new host name, then save.
This is important; otherwise the reduce step under the JobTracker may misbehave.
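For example, if ifconfig reported 192.168.1.100 (a made-up address; use the one you recorded), the added line would be:
192.168.1.100 namenode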
1.5 install the hadoop package
Go to http://hadoop.apache.org/common/releases.html and find a stable release. Pick 0.20.203.0, choose a mirror site, and download the package into the /home/hadoop/ directory.
Continue entering commands as the hadoop user:
sudo mkdir /usr/local/hadoop
sudo chown hadoop:hadoop /usr/local/hadoop
cp /home/hadoop/hadoop-0.20.203.0rc1.tar.gz /usr/local/hadoop
cd /usr/local/hadoop
tar zxvf hadoop-0.20.203.0rc1.tar.gz
cd hadoop-0.20.203.0/conf
vi hadoop-env.sh
Change the JAVA_HOME line to: export JAVA_HOME=/usr/lib/jvm/java/jdk1.6.0_31
vi core-site.xml
Its configuration section is empty; change the content to:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp</value>
</property>
</configuration>
vi hdfs-site.xml and add:
<property>
<name>fs.default.name</name>
<value>hdfs://namenode:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
dfs.replication is the number of copies kept of each data block; in a production environment it must not be 1, it should of course be greater than 1.
vi mapred-site.xml and change it to:
<property>
<name>fs.default.name</name>
<value>hdfs://namenode:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>namenode:9001</value>
</property>
Note that the JobTracker and NameNode use the same host here, i.e. they run on the same machine; in a production environment the NameNode and JobTracker can be split onto two separate machines.
That completes the configuration files. Next, modify the PATH variable:
sudo vi /etc/environment
Append :/usr/local/hadoop/hadoop-0.20.203.0/bin to the PATH line and save, so that the hadoop commands are available everywhere.
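The PATH line should then end with both bin directories, roughly like this (the leading "..." stands for whatever entries were already there):
PATH="...:/usr/lib/jvm/java/jdk1.6.0_31/bin:/usr/local/hadoop/hadoop-0.20.203.0/bin"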
Restart the system: sudo reboot
1.6 format hdfs
Log on as the hadoop user.
hadoop namenode -format
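If formatting succeeds, the output should end with a line similar to this (timestamp omitted; the path follows from the hadoop.tmp.dir set earlier):
INFO common.Storage: Storage directory /home/hadoop/tmp/dfs/name has been successfully formatted.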
1.7 start and verify this single machine
start-all.sh
A single hadoop node is started.
Verification can be performed using:
jps
A correct result looks something like this:
3156 NameNode
2743 SecondaryNameNode
3447 Jps
2807 JobTracker
2909 TaskTracker
2638 DataNode
hadoop dfsadmin -report
Displays hdfs information.
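For a healthy single-node setup the report should list one live DataNode, i.e. a line roughly like:
Datanodes available: 1 (1 total, 0 dead)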
Access http://namenode:50070/ to display HDFS information.
http://namenode:50030/ displays JobTracker information.
You can also use the usual commands to put files onto HDFS, for example:
hadoop fs -put test.txt /user/hadoop/test.txt
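You can then confirm the file arrived with:
hadoop fs -ls /user/hadoop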
The above shows that HDFS is basically working. Next, verify that the JobTracker and TaskTracker work correctly by running the wordcount program from the Hadoop examples.
cd /usr/local/hadoop/hadoop-0.20.203.0
hadoop fs -put conf input
This copies the conf directory into HDFS as the input directory.
hadoop jar hadoop-examples-0.20.203.0.jar wordcount input output
A correct run looks roughly like the following: map climbs to 100%, and reduce also climbs to 100%:
12/03/05 07:52:09 INFO input.FileInputFormat: Total input paths to process: 15
12/03/05 07:52:09 INFO mapred.JobClient: Running job: job_201203050735_0001
12/03/05 07:52:10 INFO mapred.JobClient: map 0% reduce 0%
12/03/05 07:52:24 INFO mapred.JobClient: map 13% reduce 0%
12/03/05 07:52:25 INFO mapred.JobClient: map 26% reduce 0%
12/03/05 07:52:30 INFO mapred.JobClient: map 40% reduce 0%
12/03/05 07:52:31 INFO mapred.JobClient: map 53% reduce 0%
12/03/05 07:52:36 INFO mapred.JobClient: map 66% reduce 13%
12/03/05 07:52:37 INFO mapred.JobClient: map 80% reduce 13%
12/03/05 07:52:39 INFO mapred.JobClient: map 80% reduce 17%
12/03/05 07:52:42 INFO mapred.JobClient: map 100% reduce 17%
12/03/05 07:52:51 INFO mapred.JobClient: map 100% reduce 100%
12/03/05 07:52:56 INFO mapred.JobClient: Job complete: job_201203050735_0001
12/03/05 07:52:56 INFO mapred.JobClient: Counters: 26
12/03/05 07:52:56 INFO mapred.JobClient: Job Counters
12/03/05 07:52:56 INFO mapred.JobClient: Launched reduce tasks=1
12/03/05 07:52:56 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=68532
12/03/05 07:52:56 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/05 07:52:56 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/05 07:52:56 INFO mapred.JobClient: Rack-local map tasks=7
12/03/05 07:52:56 INFO mapred.JobClient: Launched map tasks=15
12/03/05 07:52:56 INFO mapred.JobClient: Data-local map tasks=8
12/03/05 07:52:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=25151
12/03/05 07:52:56 INFO mapred.JobClient: File Output Format Counters
12/03/05 07:52:56 INFO mapred.JobClient: Bytes Written=14249
12/03/05 07:52:56 INFO mapred.JobClient: FileSystemCounters
12/03/05 07:52:56 INFO mapred.JobClient: FILE_BYTES_READ=21493
12/03/05 07:52:56 INFO mapred.JobClient: HDFS_BYTES_READ=27707
12/03/05 07:52:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=384596
12/03/05 07:52:56 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=14249
12/03/05 07:52:56 INFO mapred.JobClient: File Input Format Counters
12/03/05 07:52:56 INFO mapred.JobClient: Bytes Read=25869
12/03/05 07:52:56 INFO mapred.JobClient: Map-Reduce Framework
12/03/05 07:52:56 INFO mapred.JobClient: Reduce input groups=754
12/03/05 07:52:56 INFO mapred.JobClient: Map output materialized bytes=21577
12/03/05 07:52:56 INFO mapred.JobClient: Combine output records=1047
12/03/05 07:52:56 INFO mapred.JobClient: Map input records=734
12/03/05 07:52:56 INFO mapred.JobClient: Reduce shuffle bytes=21577
12/03/05 07:52:56 INFO mapred.JobClient: Reduce output records=754
12/03/05 07:52:56 INFO mapred.JobClient: Spilled Records=2094
12/03/05 07:52:56 INFO mapred.JobClient: Map output bytes=34601
12/03/05 07:52:56 INFO mapred.JobClient: Combine input records=2526
12/03/05 07:52:56 INFO mapred.JobClient: Map output records=2526
12/03/05 07:52:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=1838
12/03/05 07:52:56 INFO mapred.JobClient: Reduce input records=1047
Finally, hadoop fs -get output /home/hadoop fetches the output directory to the local file system so you can view the results.
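Then view the counts, for example:
cat /home/hadoop/output/part-r-00000
(part-r-00000 is the usual name of the wordcount output file; the exact name may differ.)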
1.8 single-host stop
stop-all.sh
1.9 Problems
Too many fetch-failures
When running the wordcount example, the reduce task cannot reach 100% and always gets stuck at 0%.
Analyzing the logs showed the "Too many fetch-failures" message. Searching online, some people suggest writing the IP address and host name into /etc/hosts.
I did that and found it still did not work. After much effort I finally found the crux: the host name Ubuntu Linux had originally assigned was 192, and that host name 192 had become baked into HDFS. Even though I later changed the host name to a meaningful one, it was too late: the XML files of each task under the logs directory still carried the old host name, and the new one had no effect. Where did the old host name live? It turned out to be stored inside the files of the HDFS file system itself, so the formatting and startup steps (sections 1.6 and 1.7) had to be repeated. After doing that, the wordcount sample program ran successfully.
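For reference, one way to redo those steps is roughly the following (warning: this wipes everything stored in HDFS, including the tmp directory configured earlier):
stop-all.sh
rm -rf /home/hadoop/tmp
hadoop namenode -format
start-all.sh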
A question from a colleague in the QQ group that I have answered:
Q: Running hadoop namenode -format reports that the main class cannot be found.
A: The CLASSPATH settings are incorrect.
That is all there is to installing and configuring a single machine, and that is what this article covers. The next article will discuss how to add new Hadoop nodes to turn several machines into a fully distributed cluster.
If you found this article helpful, please recommend it. Thank you.