I recently started learning hadoop, and the first problem I ran into was distributed deployment. There are many tutorials on the Internet, but every one I followed fell short in some way, so I am recording my installation process in detail here.
The servers run CentOS 6, and the hadoop version deployed is 0.20.2. There are 10 servers in total, node1 through node10: node1 serves as the namenode, node10 as the secondary namenode, and node2 through node9 as datanodes.
Since I had never really worked with Linux before, I ran into many problems that seem trivial now but plagued me for a long time.
1. Before installing hadoop, you need to install Java (version 1.6 or later). You also need to install SSH.
After Java is installed, add JAVA_HOME, PATH, and CLASSPATH to ~/.bash_profile. Note that multiple values in the PATH variable are separated by a colon (:).
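For reference, the additions to ~/.bash_profile might look like the sketch below; the JDK path is the one used later in hadoop-env.sh, and the CLASSPATH line is just a common convention, so adjust both to your machine:

export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin    # note the colon, not a semicolon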
2. Configure SSH so that the namenode can start processes on the other servers without logging in. Before configuring SSH, create a dedicated user for the hadoop system; perform the same operation on every node.
groupadd hadoopcluster
useradd -G hadoopcluster hadoop
passwd hadoop    # set the password
After creating the hadoop user, switch to it and create a folder under the home directory (~):
mkdir .ssh
Then set up passwordless SSH login. Note that this step is performed only on the namenode:
ssh-keygen -t rsa                # press Enter at each prompt
cp id_rsa.pub authorized_keys    # run inside the .ssh directory
Next, copy the generated authorized_keys to all other nodes. Note: node2 below is a hostname and can be replaced with the node's IP address. For future convenience, add the IP address and hostname of each node to /etc/hosts, one line per node, e.g. 192.168.1.12 node2
scp authorized_keys node2:/home/hadoop/.ssh
scp will prompt for the password; once the key is in place, SSH logins to that node no longer require one, and passwordless SSH is complete. After the copy is complete, fix the permissions on authorized_keys:
chmod 644 authorized_keys    # run with root permission if necessary
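Because the same copy has to be repeated for every other node, a small loop saves typing. This is only a sketch, and it assumes the hostnames node2 through node10 are already mapped in /etc/hosts:

for i in 2 3 4 5 6 7 8 9 10; do
    scp authorized_keys node$i:/home/hadoop/.ssh    # prompts for that node's password once
done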
3. Now deploy hadoop itself. On the namenode, as the hadoop user, unpack the hadoop archive. Because the servers cannot reach the network, I had to download hadoop first and then upload the file to the server through Xftp.
After unpacking, enter the conf directory. The main files to configure are core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, masters, and slaves.
In the hadoop-env.sh file, the main task is to set the Java environment variable:
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64
In the core-site.xml file:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://node1:54310</value>
</property>
In the hdfs-site.xml file:
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/opt/hadoop/hdfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/opt/hadoop/hdfs/data</value>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>node10:50090</value>
</property>
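A side note on the paths above: before formatting, make sure these directories exist on each node and are writable by the hadoop user, or the daemons will fail to start. A minimal sketch, run as root, with paths taken from the configuration above (the hadoop:hadoopcluster owner matches the user and group created earlier):

mkdir -p /opt/hadoop/tmp /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data
chown -R hadoop:hadoopcluster /opt/hadoop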
In the mapred-site.xml file:
<property>
  <name>mapred.job.tracker</name>
  <value>node1:54311</value>
</property>
In the slaves file, add the IP address (or hostname) of each datanode, one per line, for example: 192.168.1.1
In the masters file, note that despite its name, the address it holds is that of the secondary namenode; there is one entry per secondary namenode, and there can be more than one such node.
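With this cluster's layout, the two files would look something like the following (hostnames are used here, which assumes they resolve via /etc/hosts):

# conf/slaves: the datanodes, one per line
node2
node3
node4
node5
node6
node7
node8
node9

# conf/masters: the secondary namenode
node10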
After these configurations are complete, add hadoop's bin directory to the environment: just as with JAVA_HOME and PATH earlier, set HADOOP_HOME and extend PATH in ~/.bash_profile. Then scp the entire hadoop directory to all the other nodes, the same way authorized_keys was copied earlier.
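As a sketch, assuming hadoop was unpacked to /home/hadoop/hadoop-0.20.2 (the path depends on where you extracted it):

export HADOOP_HOME=/home/hadoop/hadoop-0.20.2
export PATH=$PATH:$HADOOP_HOME/bin

# then distribute the tree to every other node, as with authorized_keys
for i in 2 3 4 5 6 7 8 9 10; do
    scp -r /home/hadoop/hadoop-0.20.2 node$i:/home/hadoop/
done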
With that, the basic configuration of the hadoop cluster is complete; the next step is to start the whole cluster and see what your hard work has produced.
4. Time to witness the miracle.
./hadoop namenode -format    # run in hadoop's bin directory; formats the namenode, and prints a success message when it works
./start-all.sh               # then starts all the nodes
If it starts normally, you can enter http://node1:50070 (or the namenode's IP address) in a browser to view information about the whole cluster.
Note that throughout this process I have only mentioned the namenode and datanodes. In fact, the output of start-all also mentions the tasktracker and jobtracker; those daemons belong to MapReduce.
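A quick way to check which daemons actually came up on a node is the JDK's jps tool, which lists running Java processes. Roughly what one would expect, per node type:

jps
# on node1:       NameNode, JobTracker, Jps
# on a datanode:  DataNode, TaskTracker, Jps
# on node10:      SecondaryNameNode, Jps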
Since this was my first hadoop deployment, problems still came up. When the hosts file was not configured, after start-all every datanode kept printing the same log line, which quickly filled gigabytes of disk; on the next start the system complained there was no space left to write.
2012-02-21 00:00:01,074 ERROR org.apache.hadoop.mapred.TaskTracker:
Caught exception: java.net.UnknownHostException: unknown host: node1
This happens because /etc/hosts is not configured on the datanodes, so the hostname node1 cannot be resolved to its IP address.
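For reference, a minimal /etc/hosts on every node might look like the following; the addresses here are illustrative, extrapolated from the 192.168.1.12 node2 example above:

127.0.0.1     localhost
192.168.1.11  node1
192.168.1.12  node2
192.168.1.13  node3
# ...one line per node, through node10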
Because configuring hadoop was also my first real contact with Linux, I hit plenty of problems around the root user versus the hadoop user: files created by different users carry different permissions, so if root creates a file, the hadoop user may not have permission to operate on it.
At one point I wrote the PATH entries separated by semicolons (;) instead of colons. When switching to the hadoop user, bash sources ~/.bash_profile; it stops parsing the value at the semicolon and treats everything after it as a new statement, which then fails with an error.
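To make the failure concrete, here is the difference in a minimal made-up example:

export PATH=$PATH;$JAVA_HOME/bin    # wrong: bash ends the command at ';' and tries to execute $JAVA_HOME/bin
export PATH=$PATH:$JAVA_HOME/bin    # right: PATH entries are colon-separated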
When setting up hadoop, it is best to work from the official documentation first. If something goes wrong at runtime, read the logs and think it through; the problem will always yield, and you will understand it better for it. The next time it appears, you will know immediately where the error is and how to fix it.
As the saying goes: heaven rewards the diligent.