Configuring the Spark cluster on top of Hadoop YARN (I)

Preface

I recently started learning Spark and wanted to experiment with a small-scale Spark distributed cluster in the lab. A pseudo-distributed (standalone) setup on a single machine can also be used for experiments, but that felt less meaningful, and I also wanted to reproduce a realistic production environment. After reading some material, I learned that Spark needs an external resource scheduler to run on a cluster; the main options are standalone deploy mode, Amazon EC2, Apache Mesos, and Hadoop YARN. Because YARN is popular, I decided to configure the Spark distributed cluster on top of Hadoop YARN.

For a preliminary, fairly superficial overview of Hadoop and Spark, you can read my earlier blog post, "Some superficial understanding of Hadoop and Spark." If anything there is wrong or incomplete, please point it out; I will keep revising and updating it as I continue to learn.

This article mainly covers configuring a Hadoop 2.7.2 cluster on Ubuntu 14.04.

Environment:
System: Ubuntu 14.04, 64-bit
JDK version: JDK 1.7
Hadoop version: Hadoop 2.7.2

Cluster Environment:

Role     Hostname   IP
Master   wlw        192.168.1.103
Slave    zcq-pc     192.168.1.105
Create a Hadoop user

It is important to note that a Hadoop cluster requires the same user name on every master and slave node. Here I use a single user named "hadoop" on all nodes.

If the user names on your nodes are not uniform, you can create a new user with the following commands:

Create the user, named "hadoop":
sudo adduser hadoop

Set its password:
sudo passwd hadoop

Create a home directory for the hadoop user:
sudo mkdir /home/hadoop

Change the owner of that directory to the hadoop user:
sudo chown hadoop /home/hadoop

Consider giving the hadoop user administrator privileges, which makes deployment easier and avoids some insufficient-permission problems:
sudo adduser hadoop sudo

Switch to the hadoop user and log in.

Install SSH server and configure passwordless SSH login

Ubuntu has the SSH client installed by default; we also need to install the SSH server:
sudo apt-get install openssh-server

The Hadoop cluster needs passwordless SSH login, so we set it up:
cd ~/.ssh
ssh-keygen -t rsa   # just keep pressing Enter
cp id_rsa.pub authorized_keys

After setup, test that we can log in to this machine without a password:
ssh localhost

Network configuration

In /etc/hosts, add the following cluster information:

192.168.1.103 wlw
192.168.1.105 zcq-pc

Note that this cluster information needs to be added on every host (both master and slave).

Configuring the JDK is routine, so I will not cover it; the only thing to note is to add JAVA_HOME in /etc/environment, otherwise you will get errors:

export JAVA_HOME=/opt/jdk1.7.0_75
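
As a quick sanity check (my own addition, not part of the original steps), confirm that the variable is visible after logging in again:

echo $JAVA_HOME    # should print /opt/jdk1.7.0_75
java -version      # should report a 1.7 JVM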

Similarly, this needs to be added on all hosts (master and slave).

Passwordless SSH login between nodes

Only when passwordless login is set up between the nodes can Hadoop dispatch map tasks from the master node to the slave nodes for distributed computing.
A public key was already generated on the master node above; the steps are repeated here anyway.

First generate the master's public key by executing the following in a terminal on the master node:
cd ~/.ssh
ssh-keygen -t rsa   # just keep pressing Enter

The master node also needs to be able to SSH into itself without a password, so execute this on the master node:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
You can test with ssh localhost when done.

Transfer the public key from the master (wlw) node to the slave (zcq-pc) node:
scp ~/.ssh/id_rsa.pub hadoop@zcq-pc:/home/hadoop/

On the slave (zcq-pc) node, append the key to the authorized keys:
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys

If you have other slave nodes, the master node's public key needs to be transferred to each of them in the same way, with the same steps as above. Alternatively, you can also generate a key pair on a slave node and transfer the slave's public key to the master, so that the nodes can log in to each other without a password. A one-command shortcut is sketched below.
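
As an aside, and not part of the original steps: the OpenSSH client ships an ssh-copy-id helper that does the copy and the append in one step, assuming the same hadoop user and slave hostname as above:

ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@zcq-pc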

Finally, test on the master (wlw) node whether you can log in to the slave (zcq-pc) node without a password:
ssh zcq-pc

Configuring the cluster/distributed environment (critical steps)

Cluster/distributed mode requires modifying 5 configuration files under the Hadoop installation's etc/hadoop directory: slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml. The latter four have official documentation describing their default settings; here only the settings necessary for a normal start are configured.

slaves file
cd /opt/hadoop-2.7.2/etc/hadoop
vim slaves

Delete the original localhost and write each slave's hostname, one per line. Because I have only one slave (zcq-pc) node, the file ends up with a single line, as shown below.
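
For reference, the slaves file on this cluster then contains just:

zcq-pc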

core-site.xml file

Inside the <configuration> element, change the (initially empty) property list to:

<property>
<name>fs.defaultFS</name>
<value>hdfs://wlw:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/opt/hadoop-2.7.2/tmp</value>
<description>A base for other temporary directories.</description>
</property>

Please use my configuration as a reference and adapt the paths and the master node's hostname to your own setup.

hdfs-site.xml file

<property>
<name>dfs.namenode.secondary.http-address</name>
<value>wlw:50090</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/hadoop-2.7.2/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hadoop-2.7.2/tmp/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>

Here dfs.replication sets the number of replicas kept for each file block; the default is 3, and I set it to 2.

Again, use my configuration as a reference and adapt the paths and the master hostname to your own environment.

mapred-site.xml file
This file does not exist by default; first copy it from the template:
cp mapred-site.xml.template mapred-site.xml

Then the configuration changes are as follows:

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

This makes Hadoop use YARN as its resource scheduling system.

yarn-site.xml file

<property>
<name>yarn.resourcemanager.hostname</name>
<value>wlw</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
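
Once these files are edited, one optional way to confirm that a key property was picked up (my own addition; the getconf helper should be available in Hadoop 2.7.x) is to run, from the Hadoop installation directory:

./bin/hdfs getconf -confKey fs.defaultFS   # should print hdfs://wlw:9000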

Once configured, package the hadoop-2.7.2 directory on the master (wlw) node and copy it to each slave node:
cd /opt
sudo tar -zcf ./hadoop-2.7.2.tar.gz ./hadoop-2.7.2
scp ./hadoop-2.7.2.tar.gz zcq-pc:/home/hadoop

Execute on the slave (zcq-pc) node:
sudo tar -zxf ~/hadoop-2.7.2.tar.gz -C /opt/
sudo chown -R hadoop:hadoop /opt/hadoop-2.7.2

After the configuration is complete, Hadoop can be started on the master node:
cd /opt/hadoop-2.7.2
./bin/hdfs namenode -format   # only needed for the first run, not afterwards
./sbin/start-dfs.sh
./sbin/start-yarn.sh

Check the Hadoop processes on the master (wlw) node with the jps command:
You can see that the master node has started the NameNode, SecondaryNameNode, and ResourceManager processes, plus the Jps process of the jps command itself.

NameNode refers to the master node.
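
As a rough illustration (the process IDs here are made up and will differ on your machine), the jps output on the master should list something like:

3567 NameNode
3721 SecondaryNameNode
3889 ResourceManager
4012 Jps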

Check the Hadoop processes on the slave (zcq-pc) node with the jps command:
The slave node has started the DataNode and NodeManager processes, in addition to the Jps process of the jps command itself.

DataNode refers to the slave node.
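
Likewise, an illustrative jps listing on the slave (again, the PIDs are made up):

2901 DataNode
3054 NodeManager
3188 Jps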

You can also check the status of the NameNode and DataNodes through a web page:

http://wlw:50070/

Or you can also check the ResourceManager web UI at http://wlw:8088.

Executing a WordCount example

Create the file folder on your local hard disk:
mkdir ~/file

Enter the directory and create a file1.txt file:
cd ~/file
echo "Hello Hadoop" > file1.txt

Create an input directory on HDFS:
cd /opt/hadoop-2.7.2
./bin/hadoop fs -mkdir /input

Upload the file1.txt just created on the local disk into /input:
./bin/hadoop fs -put ~/file/file1.txt /input
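
Optionally (my own addition), confirm that the upload landed where expected:

./bin/hadoop fs -ls /input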

Run the WordCount example with the example jar that ships with Hadoop:
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input/ /output/wordcount1

View the results of the run:
./bin/hdfs dfs -cat /output/wordcount1/*
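
Since file1.txt contains only "Hello Hadoop", the output should look roughly like:

Hadoop	1
Hello	1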

Finally, shutting down the Hadoop cluster is also done on the master node:
./sbin/stop-dfs.sh
./sbin/stop-yarn.sh

Note: after changing /etc/profile, make sure the change takes effect:
source /etc/profile
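
For reference, a sketch of what the relevant /etc/profile additions might look like; the JDK path is the one used in this article, while the Hadoop entries are my own convenience additions:

export JAVA_HOME=/opt/jdk1.7.0_75
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/opt/hadoop-2.7.2
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH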
Because my JDK is placed under the /opt directory, it needs to be given execute permission:
sudo chmod -R u+x /opt/jdk1.7.0_75

Issues that occur when starting Hadoop

(1) Warning: the ECDSA host key for "zcq-pc" differs from the key for the IP address "192.168.1.105"
Solution: remove the cached key for "192.168.1.105" on the master machine:

ssh-keygen -R 192.168.1.105

(2) Warning: POSSIBLE DNS SPOOFING DETECTED!
Solution: delete the offending line in the ~/.ssh/known_hosts file (the line number is given in the message "Offending key in /home/wlw/.ssh/known_hosts:<line number>").

To display line numbers in vim:
:set number

PS: In the next blog post I will actually get to configuring the Spark cluster on top of Hadoop YARN.
