Hadoop cluster configuration experience (low-spec cluster + automatic configuration synchronization)


This article is my own original work. It was first published on Dataguru: http://f.dataguru.cn/thread-138720-1-1.html




Here is the situation: I don't have a powerful computer on which to build a high-performance server cluster. I suspect many people are in the same position, so I have written down my experience with a low-spec cluster.


My equipment: 1) an old notebook with a single-core Celeron processor and 2 GB of memory; 2) a company ThinkPad T420 with a dual-core i5 processor and 4 GB of memory (only 3.4 GB is usable, because a 32-bit system is installed...)


It is obviously unrealistic to run three virtual machines with 1 GB of memory each on the company-issued computer. A company laptop runs a lot of software, and it has less than 3 GB of memory to spare. So my idea is:

Use my own notebook (PC1) as the master node to run jobtracker, namenode, and secondarynamenode; use the company notebook to run two virtual machines (VM1 and VM2) as slave nodes running tasktracker and datanode. For this to work, PC1, VM1, and VM2 must be on the same network segment so that they can communicate with each other.


Network environment: both of my computers reach the Internet through a wireless router.


Procedure for putting a virtual machine on the same network segment as an external computer:

Preparation: To build a cluster, every server must have a fixed IP address before any later step can be carried out. So first give both laptops fixed IP addresses (note: if you reach the Internet through a wireless router as I do, set the fixed address on the wireless network card). PC1 is the master node, and the other notebook is the VM host. Target: set the IP addresses of VM1 and VM2 to ... and 152 respectively.



1) Create the virtual machine VM1.

2) Open the NIC settings of VM1 and select Bridged as the connection mode.

For more about bridging, search Baidu. What we need to know is that bridged mode lets a virtual machine reach the network through the host machine's gateway, so it sits on the same network segment as the host and the two can communicate.

3) I use VMware Workstation 8; taking it as the example, open the menu Edit > Virtual Network Editor.

4) Select VMnet0 (the network that VMware reserves for bridged mode by default at installation), then set "Bridged to" below it to the network card that actually connects to the network. I am not sure why, but when I left it on Automatic the VM could not reach the network, so I pick the connected card manually. (For example, since I reach the Internet through a wireless card connected to a router, I select my wireless card.)

5) Click OK, then boot the guest system and give it a fixed IP address. I did this through the graphical interface, which is easy. The settings page is shown in the following figure:

6) Now the machines should be able to ping each other. If ping succeeds, they are connected.

If not, run ifconfig on each machine to check whether its fixed IP address was actually applied and whether the network needs to be reconnected.
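
Before reaching for ping, it can help to confirm that the addresses you assigned really share a network segment. The following is a minimal sketch, assuming a 255.255.255.0 netmask (so the first three octets must match). The three addresses and the function names prefix24 and same_segment are made up for illustration and are not the author's actual values.

```shell
#!/bin/sh
# Hypothetical addresses; substitute the fixed IPs you actually assigned.
PC1=192.168.1.150
VM1=192.168.1.151
VM2=192.168.1.152

# Print the first three octets of an IPv4 address (the /24 prefix).
prefix24() {
    echo "$1" | cut -d. -f1-3
}

# Succeed if both addresses share the same /24 prefix.
same_segment() {
    [ "$(prefix24 "$1")" = "$(prefix24 "$2")" ]
}

for node in "$VM1" "$VM2"; do
    if same_segment "$PC1" "$node"; then
        echo "$node: same segment as $PC1"
    else
        echo "$node: different segment from $PC1"
    fi
done
```

If the prefixes match but ping still fails, the bridged network card selection in step 4 is the next thing to check.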


If you follow these steps, there should be no problem... If you run into a problem you cannot figure out, tell me and we can discuss it.



After the virtual machines are set up, you can configure SSH. Before that, create a user account; all three servers must have the same account.


Ubuntu does not install an SSH server by default (CentOS installs one by default), so after setting up the SSH keys as in the instructor's video, the machines still cannot SSH to each other. (You can verify this in several ways. For example, run telnet localhost 22 to see whether any program is listening on port 22 of the local machine; if the connection fails, no SSH server is running. Or run a command such as netstat -tln | grep ':22 '.)


As for installing SSH, OpenSSH is the de facto standard SSH implementation on Linux, and it is easy to install on Ubuntu:

sudo apt-get install openssh-server


One command is enough. After installation, try telnet again to verify that it succeeded.
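
The port-22 check can also be scripted instead of read by eye. This is a minimal sketch that scans netstat -tln style output for a listening socket on a given port. The function name listening_on is hypothetical, and the sample line stands in for real netstat output so the snippet runs anywhere.

```shell
#!/bin/sh
# Read `netstat -tln` style output on stdin and succeed if some socket is
# in LISTEN state on the given TCP port.
listening_on() {
    awk -v port="$1" '
        $NF == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# Canned sample line, shaped like what netstat prints when sshd is running.
sample='tcp        0      0 0.0.0.0:22      0.0.0.0:*      LISTEN'

if printf '%s\n' "$sample" | listening_on 22; then
    echo "port 22: listening"
else
    echo "port 22: nothing listening"
fi
```

On a real machine you would pipe the live output in, for example netstat -tln | listening_on 22.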

Then generate a public/private key pair on each node following the method in the instructor's video, gather all the public keys into one file, and distribute that file to every server. This gives you password-free login.
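
The "gather all the public keys" step can be sketched as a small helper. This assumes you have already copied each node's public key into one directory as a .pub file; the function name merge_pubkeys and the directory layout are made up for illustration, and the video's exact procedure may differ.

```shell
#!/bin/sh
# Typical surrounding steps (not run here; paths are the OpenSSH defaults):
#   on every node:  ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
#   collect each node's ~/.ssh/id_rsa.pub into one directory as <host>.pub
#   after merging:  copy the result to ~/.ssh/authorized_keys on every node

# Merge every *.pub file in a directory into one authorized_keys file,
# dropping duplicate keys. Both paths are supplied by the caller.
merge_pubkeys() {
    keydir=$1
    out=$2
    cat "$keydir"/*.pub | sort -u > "$out"
    # sshd may ignore an authorized_keys file that others can write to.
    chmod 600 "$out"
}
```

Calling merge_pubkeys ./keys ./authorized_keys would then produce a single file to copy to each of the three machines.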


Note! After placing authorized_keys in the specified location, do not forget to manually SSH once to every node!

For example, from PC1 run ssh <VM1 ip> and ssh <VM2 ip> by hand; when asked whether you trust the target host, enter yes. Only then is the setup complete. This step is necessary: it is the only way to add all the other servers to the local machine's list of trusted hosts, so that later logins need neither a password nor any prompt.

Otherwise, when you start the daemons on the other nodes, you will see errors such as being unable to connect to, or start, the corresponding node.


With that, my low-spec cluster environment is more or less ready.


Next, configure Hadoop.


My thinking about Hadoop configuration was that a large cluster must have some synchronization mechanism. Otherwise every configuration change would have to be repeated on every node, and how much work would that be... So I read the configuration material that starts on page 269 of Hadoop: The Definitive Guide, and such a setting does exist. There are also some settings the instructor did not mention, which I will write out here along with my own understanding and practice.


The main configuration files are hadoop-env.sh, core-site.xml, hdfs-site.xml, and mapred-site.xml.


First, hadoop-env.sh:

Start by setting JAVA_HOME. Needless to say, download the latest .tar.gz package from Oracle, unpack it, and point JAVA_HOME at that path.

Then I think these configurations are very useful:


export HADOOP_HEAPSIZE=400

# This sets the memory allocated to each daemon. All the daemons are Java processes, and this value sets the JVM maximum heap size. I set 400 MB here; the default is 1000 MB.


Let me explain my design. My machines are low-spec, so if everything followed the defaults, jobtracker, namenode, and secondarynamenode on my 2 GB notebook would each be allowed 1000 MB. If a large amount of data came in and they filled their allocations, my computer would certainly crash.

Therefore, my design is as follows:

My old notebook (2 GB memory): jobtracker: 400 MB, namenode: 800 MB, secondarynamenode: 800 MB. (As I understand it, namenode centrally manages memory and I/O, so it needs a lot of space to operate.)

Virtual machines VM1 and VM2 (1 GB memory each): tasktracker: 400 MB, datanode: 400 MB, plus one map task and one reduce task child process started by tasktracker, each allocated 200 MB.


I think this will at least keep the computers from crashing... Once real data processing starts I will see how it behaves. For now, it is configured according to my own rough understanding.
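
The budget above can be checked with a little arithmetic. The sketch below simply adds up the heap ceilings from the scheme; the 200 MB child heap is taken from the mapred-site.xml setting later in this article, and the function name heap_total is made up for illustration.

```shell
#!/bin/sh
# Sum the maximum JVM heap sizes (in MB) passed as arguments.
heap_total() {
    total=0
    for mb in "$@"; do
        total=$((total + mb))
    done
    echo "$total"
}

# Master: jobtracker 400, namenode 800, secondarynamenode 800.
master_total=$(heap_total 400 800 800)
# Each slave VM: tasktracker 400, datanode 400, one map child and one reduce child at 200 each.
slave_total=$(heap_total 400 400 200 200)

echo "master heap ceiling: ${master_total} MB of 2048 MB"
echo "slave heap ceiling:  ${slave_total} MB of 1024 MB"
```

These figures are ceilings, not reservations: a JVM only grows toward its maximum heap as it needs the memory. Note that the slave ceiling adds up to more than the 1 GB given to each VM, so the scheme relies on the daemons staying well below their caps.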


export HADOOP_NAMENODE_OPTS="-Xmx800m -Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx800m -Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"

# These two settings give namenode and secondarynamenode a maximum heap size of 800 MB each. I have verified that this works.

export HADOOP_MASTER=dellypc-master:/home/$USER/hadoop-0.20.2
# host:path where hadoop code should be rsync'd from.  Unset by default.
# This is where the Hadoop code is synchronized from; by default nothing is synchronized. Once it is set, you only need to edit the configuration files in one place each time you change them. The startup output shows every node synchronizing from the configured location.
For example, my master node's hostname is dellypc-master, and all Hadoop configuration files are kept in /home/delly/hadoop-0.20.2 ($USER expands to the current user name). So every time a Hadoop daemon is started, Hadoop first synchronizes all the configuration files from /home/delly/hadoop-0.20.2 on dellypc-master and then starts running.
It is worth noting that the machine holding these configuration files does not have to be a node of the Hadoop cluster. According to the documentation, it can be an external server. This is actually quite good. In real use you can set up a machine dedicated to storing the configuration, isolate the cluster in a DMZ behind a firewall, and open only the port used for file synchronization between the configuration machine and the cluster machines. Whenever the configuration needs changing, you change it on that machine outside the cluster, and it is synchronized to all nodes when the cluster daemons start. That is both safe and convenient.
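
To make the mechanism concrete: as far as I recall, the daemon start script in 0.20.x runs an rsync over SSH from HADOOP_MASTER into the local Hadoop directory before launching the daemon. The sketch below only prints a command of that shape rather than running it. The function name sync_command is hypothetical and the exact options should be checked against bin/hadoop-daemon.sh in your own copy.

```shell
#!/bin/sh
# Print an rsync command that would pull the Hadoop tree from the location
# named by HADOOP_MASTER (first argument) into the local Hadoop directory
# (second argument). Prints nothing when no master is configured.
sync_command() {
    master=$1
    home=$2
    if [ -n "$master" ]; then
        echo "rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' $master/ $home"
    fi
}

sync_command "dellypc-master:/home/delly/hadoop-0.20.2" "/home/delly/hadoop-0.20.2"
```

Because --delete removes local files that are absent on the source, it is worth checking that nothing node-specific, such as a data directory, lives under the synchronized tree.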

export HADOOP_SLAVE_SLEEP=0.1
# Seconds to sleep between slave commands.  Unset by default.  This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.

# I don't know exactly how it works, but it seems that at startup the master pauses for 0.1 seconds between the commands it sends to successive slaves, so that their requests do not all reach the master at the same moment and overload it.

That is all for hadoop-env.sh. There are other settings, such as the log location and the PID location; set them as you see fit. Some of the settings I only partly understand, though they look useful. You may understand them more deeply and know how to set them correctly.

Next, core-site.xml:
In addition to the HDFS location and port set in the video tutorial (I believe fs.default.name also determines the server on which namenode runs), I configured the following:
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
  <!-- Buffer size used for I/O operations. The default is 4 KB; here it is set to 128 KB. -->
  <final>true</final>
</property>

Then, hdfs-site.xml:
In addition to the settings in the video, I also set the following:
<property>
  <name>dfs.name.dir</name>
  <value>/home/delly/hadoop-0.20.2/name,/media/backup/hadoop-backup/namedata</value>
  <final>true</final>
</property>
(The directory where namenode stores its persistent metadata. Two directories are set here, so that if the data in one is damaged, the other serves as a redundant copy to restore from.)

<property>
  <name>fs.checkpoint.dir</name>
  <value>/home/delly/hadoop-0.20.2/namesecondary,/media/backup/hadoop-backup/namesecondarydata</value>
  <final>true</final>
</property>

(Same as above, but for the location where secondarynamenode stores its data.)

<property>
  <name>dfs.data.dir</name>
  <value>/home/delly/hadoop-0.20.2/data</value>
</property>
(Where datanode stores data blocks. The default is under the tmp folder, which is not safe...)
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
(The number of replicas for data replication. I do not fully understand what data replication is; if you do, please advise.)
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <!-- 128 MB -->
</property>
(HDFS block size. The default is 64 MB. Setting it to 128 MB reduces memory pressure on namenode.)
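
The link between block size and namenode memory comes down to block counts: namenode tracks every block, so fewer blocks per file means less to keep in memory. A rough illustration, using a hypothetical 1 GB file and a made-up helper named block_count:

```shell
#!/bin/sh
# Number of HDFS blocks needed for a file: ceil(file_bytes / block_bytes).
block_count() {
    file_bytes=$1
    block_bytes=$2
    echo $(( (file_bytes + block_bytes - 1) / block_bytes ))
}

MB=$((1024 * 1024))
file=$((1024 * MB))    # a hypothetical 1 GB file

echo "dfs.block.size for 128 MB: $((128 * MB))"
echo "blocks at 64 MB:  $(block_count "$file" $((64 * MB)))"
echo "blocks at 128 MB: $(block_count "$file" $((128 * MB)))"
```

Doubling the block size halves the number of blocks for the same data, which is the effect the setting above is after.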

Then, mapred-site.xml:
The following settings specify the maximum number of map task and reduce task child processes per tasktracker, and the memory of each (I allocated 200 MB):
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
  <final>true</final>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
  <final>true</final>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
</property>

That covers the main configuration.

I have one big question: how does Hadoop decide which process runs on which node?

In the configuration files I only see the HDFS host (which should be the host where namenode runs?) and the host where jobtracker runs. Nothing in these files says which nodes secondarynamenode, tasktracker, and datanode should run on. Yet secondarynamenode does run on the expected master node, and datanode and tasktracker do run on the two slave nodes. Why?

Can anyone answer this? Thank you very much!


After testing and reading up, I have resolved this question: http://f.dataguru.cn/thread-138720-1-1.html
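
For readers who do not follow the link: in Hadoop 0.20.x the placement is generally decided by the start scripts together with two plain-text files, conf/masters and conf/slaves. start-dfs.sh starts namenode on the machine where it is run, datanode on every host listed in conf/slaves, and secondarynamenode on every host listed in conf/masters. start-mapred.sh starts jobtracker locally and tasktracker on the slaves. The sketch below only prints that mapping. The function name placement and the host names vm1 and vm2 are made up for illustration.

```shell
#!/bin/sh
# Print which daemons the start scripts would launch on which host, given
# the host where the scripts are run and the masters and slaves files.
placement() {
    local_host=$1
    masters_file=$2
    slaves_file=$3
    echo "$local_host: namenode jobtracker"
    while read -r host; do
        if [ -n "$host" ]; then echo "$host: secondarynamenode"; fi
    done < "$masters_file"
    while read -r host; do
        if [ -n "$host" ]; then echo "$host: datanode tasktracker"; fi
    done < "$slaves_file"
}

# Stand-in conf directory with one host name per line in each file.
conf=$(mktemp -d)
printf '%s\n' dellypc-master > "$conf/masters"
printf '%s\n' vm1 vm2 > "$conf/slaves"

placement dellypc-master "$conf/masters" "$conf/slaves"
rm -rf "$conf"
```

The fs.default.name and mapred.job.tracker settings, by contrast, tell the other nodes and clients where to find namenode and jobtracker; they do not start anything by themselves.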
