I recently started learning hadoop, and the first problem I ran into was distributed deployment. There are many tutorials on the Internet, but every one I followed fell short in some way, so I am recording my installation process in detail here.
The servers run CentOS 6, and the hadoop version deployed is 0.20.2. There are 10 servers in total, node1 through node10: node1 serves as the namenode, node10 as the secondary namenode, and node2 through node9 as datanodes.
Since I had never really worked with Linux before, I ran into many problems that seem trivial now but plagued me for a long time.
1. Before installing hadoop, you need to install Java (version 1.6 or later). You also need to install SSH.
After Java is installed, add JAVA_HOME, PATH, and CLASSPATH to ~/.bash_profile. Note that multiple values in the PATH variable are separated by a colon (:).
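For reference, the additions to ~/.bash_profile might look like the sketch below; the JDK path is the one used later in hadoop-env.sh, and the CLASSPATH line is just a common convention, so adjust both to your machine:

export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin    # note the colon, not a semicolon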
2. Configure SSH so that the namenode can start processes on the other servers without logging in. Before configuring SSH, create a dedicated user for the hadoop system; perform the same operation on every node.
groupadd hadoopcluster
useradd -G hadoopcluster hadoop
passwd hadoop    # set the password
After creating the hadoop user, switch to it and create a folder under the home directory (~):
mkdir .ssh
Then set up passwordless SSH login. Note that this step is performed only on the namenode:
ssh-keygen -t rsa                # press Enter at each prompt
cp id_rsa.pub authorized_keys    # run inside the .ssh directory
Next, copy the generated authorized_keys to all other nodes. Note: node2 below is a hostname and can be replaced with the node's IP address. For future convenience, add the IP address and hostname of each node to /etc/hosts, one line per node, e.g. 192.168.1.12 node2
scp authorized_keys node2:/home/hadoop/.ssh
scp will prompt for the password; once the key is in place, SSH logins to that node no longer require one, and passwordless SSH is complete. After the copy is complete, fix the permissions on authorized_keys:
chmod 644 authorized_keys    # run with root permission if necessary
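Because the same copy has to be repeated for every other node, a small loop saves typing. This is only a sketch, and it assumes the hostnames node2 through node10 are already mapped in /etc/hosts:

for i in 2 3 4 5 6 7 8 9 10; do
    scp authorized_keys node$i:/home/hadoop/.ssh    # prompts for that node's password once
done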
3. Now deploy hadoop itself. On the namenode, as the hadoop user, unpack the hadoop archive. Because the servers cannot reach the network, I had to download hadoop first and then upload the file to the server through Xftp.
After unpacking, enter the conf directory. The main files to configure are core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, masters, and slaves.
In the hadoop-env.sh file, the main task is to set the Java environment variable:
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64
In the core-site.xml file:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://node1:54310</value>
</property>
In the hdfs-site.xml file:
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/opt/hadoop/hdfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/opt/hadoop/hdfs/data</value>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>node10:50090</value>
</property>
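A side note on the paths above: before formatting, make sure these directories exist on each node and are writable by the hadoop user, or the daemons will fail to start. A minimal sketch, run as root, with paths taken from the configuration above (the hadoop:hadoopcluster owner matches the user and group created earlier):

mkdir -p /opt/hadoop/tmp /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data
chown -R hadoop:hadoopcluster /opt/hadoop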
In the mapred-site.xml file:
<property>
  <name>mapred.job.tracker</name>
  <value>node1:54311</value>
</property>
In the slaves file, add the IP address (or hostname) of each datanode, one per line, for example: 192.168.1.1
In the masters file, note that despite its name, the address it holds is that of the secondary namenode; there is one entry per secondary namenode, and there can be more than one such node.
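With this cluster's layout, the two files would look something like the following (hostnames are used here, which assumes they resolve via /etc/hosts):

# conf/slaves: the datanodes, one per line
node2
node3
node4
node5
node6
node7
node8
node9

# conf/masters: the secondary namenode
node10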
After these configurations are complete, add hadoop's bin directory to the environment: just as with JAVA_HOME and PATH earlier, set HADOOP_HOME and extend PATH in ~/.bash_profile. Then scp the entire hadoop directory to all the other nodes, the same way authorized_keys was copied earlier.
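As a sketch, assuming hadoop was unpacked to /home/hadoop/hadoop-0.20.2 (the path depends on where you extracted it):

export HADOOP_HOME=/home/hadoop/hadoop-0.20.2
export PATH=$PATH:$HADOOP_HOME/bin

# then distribute the tree to every other node, as with authorized_keys
for i in 2 3 4 5 6 7 8 9 10; do
    scp -r /home/hadoop/hadoop-0.20.2 node$i:/home/hadoop/
done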
With that, the basic configuration of the hadoop cluster is complete; the next step is to start the whole cluster and see what your hard work has produced.
4. Time to witness the miracle.
./hadoop namenode -format    # run in hadoop's bin directory; formats the namenode, and prints a success message when it works
./start-all.sh               # then starts all the nodes
If it starts normally, you can enter http://node1:50070 (or the namenode's IP address) in a browser to view information about the whole cluster.
Note that throughout this process I have only mentioned the namenode and datanodes. In fact, the output of start-all also mentions the tasktracker and jobtracker; those daemons belong to MapReduce.
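A quick way to check which daemons actually came up on a node is the JDK's jps tool, which lists running Java processes. Roughly what one would expect, per node type:

jps
# on node1:       NameNode, JobTracker, Jps
# on a datanode:  DataNode, TaskTracker, Jps
# on node10:      SecondaryNameNode, Jps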
Since this was my first hadoop deployment, problems still came up. When the hosts file was not configured, after start-all every datanode kept printing the same log line, which quickly filled gigabytes of disk; on the next start the system complained there was no space left to write.
2012-02-21 00:00:01,074 ERROR org.apache.hadoop.mapred.TaskTracker:
Caught exception: java.net.UnknownHostException: unknown host: node1
This happens because /etc/hosts is not configured on the datanodes, so the hostname node1 cannot be resolved to its IP address.
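For reference, a minimal /etc/hosts on every node might look like the following; the addresses here are illustrative, extrapolated from the 192.168.1.12 node2 example above:

127.0.0.1     localhost
192.168.1.11  node1
192.168.1.12  node2
192.168.1.13  node3
# ...one line per node, through node10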
Because configuring hadoop was also my first real contact with Linux, I hit plenty of problems around the root user versus the hadoop user: files created by different users carry different permissions, so if root creates a file, the hadoop user may not have permission to operate on it.
At one point I wrote the PATH entries separated by semicolons (;) instead of colons. When switching to the hadoop user, bash sources ~/.bash_profile; it stops parsing the value at the semicolon and treats everything after it as a new statement, which then fails with an error.
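To make the failure concrete, here is the difference in a minimal made-up example:

export PATH=$PATH;$JAVA_HOME/bin    # wrong: bash ends the command at ';' and tries to execute $JAVA_HOME/bin
export PATH=$PATH:$JAVA_HOME/bin    # right: PATH entries are colon-separated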
When setting up hadoop, it is best to work from the official documentation first. If something goes wrong at runtime, read the logs and think it through; the problem will always yield, and you will understand it better for it. The next time it appears, you will know immediately where the error is and how to fix it.
As the saying goes: heaven rewards the diligent.