Big Data Series (3) -- Hadoop Cluster Fully Distributed Environment Construction


Objective

In the previous article we covered the installation of a single-node Hadoop instance, and we installed a CentOS 6.8 Linux system through VMware. Our goal now is to configure a truly fully distributed Hadoop cluster. Enough chatter, let's get into the topic.

Technical preparation

VMware virtual machines, CentOS 6.8 64-bit

Installation process

Let's recap what we completed last time with the single-node Hadoop environment: we already configured a CentOS 6.8 machine and finished the Java runtime environment, the Hosts file, the computer name and many other details.

In fact, we have already done half of the Hadoop cluster setup by finishing that step, because the benefit of building a virtual machine is that we can copy the machine directly and work on multiple copies at once, reducing the time wasted on configuring each one separately. This is also the benefit of virtualization technology.

Now, let's get into the detailed steps of building the distributed system.

1. First, copy the previously created single-node virtual machine in VMware.

Here, based on the planning in the first article, we need to clone at least three more machines to serve as DataNode data storage nodes. The earlier machine will serve as the Master node.

Let's first go over the physical architecture of the entire Hadoop cluster so that we have a clear picture of it. As the planning table above made clear, a total of 5 servers are used: four to build the Hadoop cluster, and one more (optional) to host peripheral tools for managing the Hadoop cluster, such as MySQL.

During development, we generally manage the entire Hadoop cluster by connecting to it directly from this peripheral machine.

With the physical planning diagram above you should have a clear understanding of the overall architecture. OK, let's do the actual exercise.

Copying virtual machines in VMware is a relatively straightforward process, as follows:

Then just keep clicking Next; the important thing to remember is that you must choose to create a full clone, not a snapshot-based linked clone.

Then you can enter each machine's name according to the planned computer names. After cloning, the machines look as follows:

2. Configure the machine information of each slave node.

The configuration of each slave server basically consists of the following parts:

    • First, manually change the computer name and Hosts file for each slave node (this is mandatory!).
    • Then configure the memory of each slave node. As analyzed in the first article, the memory here can be set lower than that of the master node (if budget is no object, ignore this!).
    • Finally, configure the storage; you can work this out yourself from the calculation formula given earlier.

First, log in to each machine and change its Hosts file and computer name. I introduced this in a previous article, which you can review; here are the commands directly:

vim /etc/sysconfig/network
vim /etc/hosts

Set the computer name and the hosts configuration file on each machine as planned, and set the IP address of each machine on the same network segment to a fixed (static) address according to the plan.
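For illustration only, here is a minimal sketch of what these two files might contain on the first slave node. The master address 192.168.1.50 and the slave address 192.168.1.51 with hostname Slave01.hadoop appear later in this article; the remaining addresses and the name Master.Hadoop are assumed placeholders to be replaced with your own plan.

# /etc/sysconfig/network (sets the hostname on CentOS 6)
NETWORKING=yes
HOSTNAME=Slave01.hadoop

# /etc/hosts (same entries on every node; the .52/.53 lines are assumptions)
192.168.1.50   Master.Hadoop
192.168.1.51   Slave01.hadoop
192.168.1.52   Slave02.hadoop
192.168.1.53   Slave03.hadoop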

After all of this is configured and each machine has been restarted, make sure that every node can ping every other node (this is critical!).
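For example, from the master you might check the first slave like this (adjust the hostname to your own plan):

ping -c 3 Slave01.hadoop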

The remaining memory configuration is done by simply shutting down the virtual machine and adjusting it in the VMware settings; very simple.

Adjust it as needed, and if possible, try to give the Master node's CPU multiple cores. I have already analyzed the reason for this setup in detail in the first article.

At this point, the basic configuration of each server has been completed.

After the series of steps above, you can see that copying virtual machines saves a lot of additional configuration time, such as installing the operating system, downloading the Hadoop installation package, and setting up the Java environment.

3. Configure passwordless SSH.

First, let's explain the concept and purpose of SSH.

SSH is the abbreviation of Secure Shell, a security protocol developed by the IETF Network Working Group that operates on top of the application and transport layers. SSH is currently one of the more reliable protocols for providing security for remote login sessions and other network services. Using the SSH protocol can effectively prevent information leakage during remote management. SSH was originally a program on UNIX systems and later quickly spread to other operating platforms. Used correctly, SSH can compensate for vulnerabilities in the network. SSH clients are available on a variety of platforms; almost all UNIX platforms, including HP-UX, Linux, AIX, Solaris, Digital UNIX, and IRIX, as well as other platforms, can run SSH.

The above is the official definition of SSH, taken from Baidu Encyclopedia.

Let me summarize the purpose of SSH in the Hadoop cluster.

In a nutshell, what SSH gives us here is: the same user can log in to every machine without a password. Hadoop, as a distributed computing framework, needs to operate services on every node, and those operations need to be carried out by one and the same user; but the same user logging in to different servers would normally require a password or key for authentication. To avoid that verification on every login, a unified security protocol is used: SSH.

The principle is actually very simple: generate a key pair for this user in advance, distribute the public key to each server, and have each server add that public key to the authorized keys of the corresponding local user, so that the user can then log in and operate without entering a password.

I hope the explanation above makes the idea clear.

Now let's put it into practice:

    • First, modify the sshd configuration file: remove the default comments and turn on SSH key authentication (operate as the root user).
vim /etc/ssh/sshd_config

Remove the comment character "#" from the three lines of configuration shown above and save the file. Remember: all machines must be set up this way.

A brief explanation of the meaning of those three lines: 1. RSAAuthentication turns on RSA key authentication for SSH; 2. PubkeyAuthentication allows public key authentication; 3. AuthorizedKeysFile specifies where the authorized public keys are stored.
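For reference, once the leading "#" is removed the three lines typically look like this (the AuthorizedKeysFile path shown is the usual default and may differ on your system):

RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys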

Remember, after the configuration is complete, restart the sshd service.
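On CentOS 6 this is typically done with:

service sshd restart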

You can verify it now; for example, here I SSH directly into the local machine:

ssh localhost

Here you can see that it still asks me to enter a password. So far we have only turned on SSH key authentication; no key has been generated or installed yet.

    • Generate the public/private key pair and distribute the public key to each server (operate as the hadoop user).

This step is the process I analyzed above: we need to generate a key pair for the hadoop user on the master node and then distribute the public key to each slave node; after that, the hadoop user on the master machine can log in to the slave machines without a password.

The steps are as follows:

ssh-keygen -t rsa -P ''

Note that the 'P' in -P is uppercase, and it is followed by an empty string (two single quotes), meaning an empty passphrase.

The path highlighted with the red box above is the default path where the public and private keys are generated (~/.ssh under the current user's home directory).

The next step is to copy the public key to each slave node.

To copy a file to a remote machine, use the following Linux command; the general form is:

scp ~/.ssh/id_rsa.pub <remote username>@<remote server IP>:~/

The public key file to be copied is in the default path "/home/hadoop/.ssh", so the command we execute is:

scp ~/.ssh/id_rsa.pub hadoop@192.168.1.51:~/

Then we log in to the slave01 machine at 192.168.1.51 and add the public key we just copied to the local authorized keys.

cat ~/id_rsa.pub >> ~/.ssh/authorized_keys

The above command must be executed on the slave01 machine, and this time as the hadoop user.
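One detail that is often needed but not shown here: sshd usually rejects a key file with loose permissions, so if the passwordless login below still prompts for a password, tighten the permissions on the slave node, for example:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys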

Finally, let's go back to the master machine and verify the SSH login.

The command for SSH verification is simple; its general format is:

ssh <remote IP or hostname>

So, let's try it from the master machine against the slave01 machine and see whether we still need to enter a password.

ssh Slave01.hadoop

As you can see from the command window, we have successfully logged in to the slave01 machine from the master machine without a password. The configuration has taken effect.

    • Follow the steps above to configure the remaining slave nodes.

The remaining two slave nodes need to be configured for passwordless login as well; for the detailed process just follow the steps above. One thing to note: the key on the master only needs to be generated once; do not generate it again! Every time you regenerate it, all nodes have to be reconfigured.

The effect of this configuration is that the hadoop user on the master machine can log in to each slave node without a password.

Through the operations above, we have ensured that our master machine can operate each slave node without hindrance.

    • Following the steps above, also configure passwordless SSH from each slave node to the master machine.

We know that after the series of operations above, our master node can successfully operate each slave node. However, to ensure two-way communication between each slave machine and the master machine, one more thing is needed.

Each slave node must also be able to log in to the master machine without a password; the operation steps are the same as above.

The reason is simple: after each slave node finishes the tasks assigned by the master, it needs permission to report the results back to its boss, the master!
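As a minimal sketch of this reverse direction, run as the hadoop user on each slave node; the master hostname Master.Hadoop is an assumption based on the Slave01.hadoop naming above, so substitute your own planned name:

ssh-keygen -t rsa -P ''                        # generate the slave's own key pair
scp ~/.ssh/id_rsa.pub hadoop@192.168.1.50:~/   # copy its public key to the master
# then, on the master:
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
# and verify from the slave:
ssh Master.Hadoop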

Well, here we have completed the SSH configuration for the entire cluster.

One more reminder: be sure to verify every step above, otherwise later Hadoop operations will run into all sorts of strange problems and catch you unprepared. This is experience!

4. Configure the Hadoop cluster.

Now we need to configure Hadoop on each machine. We know that all the machines here were copied from one machine, and we had already installed a single-instance Hadoop on that machine, as described in the previous article.

The next step is to turn this single-node setup into a truly distributed cluster that takes advantage of the several servers we just prepared.

There isn't much to configure here; we only need to change a few files.

    • First, configure the slaves file, which specifies the slave nodes of the cluster (operate as the hadoop user).

This only needs to be done on the master machine. Of course, if you don't mind, you can keep the configuration identical on all machines. Execute the command as follows:

vim /usr/hadoop/hadoop-2.6.4/etc/hadoop/slaves

Then write the IP or machine name of each slave into it, one machine per line. Here I am using IPs.
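For illustration, assuming the three slave nodes were planned as 192.168.1.51 through 192.168.1.53 (only .51 is confirmed elsewhere in this article), the slaves file would simply contain:

192.168.1.51
192.168.1.52
192.168.1.53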

That's all there is to this step.

    • Then, change the value of the dfs.replication property in the hdfs-site.xml file.

I explained this value before. Since we are no longer on a single machine, we change it to 3 or greater; because we have four machines, it is configured as 3 here. Remember: use an odd number!

vim /usr/hadoop/hadoop-2.6.4/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

Note that all machines must be configured in this way.
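One way to push the same file to every node is scp; a sketch, assuming the same install path on every machine and the illustrative 192.168.1.51-53 addresses used above:

scp /usr/hadoop/hadoop-2.6.4/etc/hadoop/hdfs-site.xml hadoop@192.168.1.51:/usr/hadoop/hadoop-2.6.4/etc/hadoop/
# repeat for 192.168.1.52 and 192.168.1.53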

5. Start the Hadoop cluster and verify that it works.

At this point, we have basically completed the configuration of a fully distributed Hadoop cluster. Next, let's verify that it is usable.

The verification method is very simple. First we have to run an HDFS format command, which we analyzed in the previous article; because we have changed to a fully distributed cluster, we need to format again.

bin/hadoop namenode -format

    • First, let's verify that HDFS for the entire cluster is available.

Start HDFS for the entire cluster on the master machine as the hadoop user, with the following command:

start-dfs.sh

We view the HDFS status of the entire cluster in a browser at: http://192.168.1.50:50070/dfshealth.html#tab-overview

As you can see, the HDFS service of our Hadoop cluster has started successfully. Next, let's look at the storage and the number of nodes of the whole cluster.

As we can see above, there are four DataNode nodes in the current cluster, corresponding to the IPs we just configured in the slaves file. This means that the HDFS we configured for the cluster is working correctly.
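Besides the web UI, you can also do a quick sanity check from the command line on the master (a sketch, run from the Hadoop install directory used above):

bin/hdfs dfsadmin -report    # lists the live DataNodes and their capacity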

    • Then, let's verify that the YARN distributed computing framework of the entire cluster is available.

In the same way, we first start YARN. The command is as follows:

start-yarn.sh

We use a browser to view the YARN status of the entire cluster at: http://192.168.1.50:8088/

As you can see, the current Hadoop cluster already has four running nodes, and it is running quite happily. In later articles I will analyze how to use this Hadoop cluster.
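You can likewise confirm the NodeManagers from the command line (a sketch, run from the Hadoop install directory):

bin/yarn node -list    # lists the running NodeManager nodes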

Conclusion

In the following articles of this series, I will introduce more of the Hadoop big data ecosystem, such as using ZooKeeper to build a highly available Hadoop platform, MapReduce program development, data analysis with Hive, Spark application development, Hue integration and operation in the cluster environment, data extraction with Sqoop2, and so on. Interested readers can look forward to them.

This article mainly introduced how to build a fully distributed Hadoop cluster. We will gradually improve it, and I will then teach you how to use this fully distributed Hadoop cluster. Young friends... don't rush... let your mind fly for a while...

If you have questions, feel free to leave a comment or send me a private message. I am always happy to have interested readers join in studying the data platform in depth. Let's learn and make progress together.

At the end of this article, here are the previous articles in this series:

Big Data Series (1) -- Hadoop Cluster Environment Configuration

Big Data Series (2) -- Hadoop Cluster Environment CentOS Installation

If you read this blog and feel you have gained something, please don't skimp on your "Recommend" click.
