Preface
I recently started looking into Spark and wanted to set up a small distributed Spark cluster in the lab. A single-machine pseudo-distributed (standalone) setup would also work for experiments, but it felt less meaningful, and I wanted to reproduce a real production environment more faithfully. After reading some material, I learned that Spark relies on an external resource-scheduling system, mainly one of: standalone deploy mode, Amazon EC2, Apache Mesos, or Hadoop YARN. Because YARN is popular, I decided to set up the Spark distributed cluster on top of Hadoop YARN.
For a preliminary, admittedly superficial introduction to Hadoop and Spark, you can read my earlier post, "Some superficial understanding of Hadoop and Spark." If anything there is wrong or missing, please point it out; I will keep revising and updating it as I continue to learn.
This article is mainly about configuring a Hadoop 2.7.2 cluster on Ubuntu 14.04.
Environment introduction
System: Ubuntu 14.04 64-bit
JDK version: JDK 1.7
Hadoop version: Hadoop 2.7.2
Cluster environment:
Role | Hostname | IP
Master | wlw | 192.168.1.103
Slave | zcq-pc | 192.168.1.105
Create a Hadoop user
Note that a Hadoop cluster requires the same user name on the master and on every slave node. Here I use a single user named "hadoop" on all nodes.
If the user names on your nodes are not consistent, you can create a new user with the following commands:
Create a user named "hadoop"
sudo adduser hadoop
Set its password
sudo passwd hadoop
Create a home directory for the hadoop user
sudo mkdir /home/hadoop
Change the owner of that directory to the hadoop user
sudo chown hadoop /home/hadoop
Consider giving the hadoop user administrator privileges, which simplifies deployment and avoids problems caused by insufficient permissions
sudo adduser hadoop sudo
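A quick way to confirm the new account works (my own check, not part of the original steps) is to switch to it and look at its identity and home directory:
su - hadoop
id                # should list the hadoop user and, after the step above, the sudo group
echo $HOME        # should print /home/hadoop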
Switch to the hadoop user for all of the following steps.
Install SSH server and configure password-free SSH login
Ubuntu installs the SSH client by default; we also need to install the SSH server.
sudo apt-get install openssh-server
The Hadoop cluster needs password-free SSH login, so we set it up:
cd ~/.ssh           # if this directory does not exist yet, run "ssh localhost" once first
ssh-keygen -t rsa   # just keep pressing Enter
cp id_rsa.pub authorized_keys
After setup, test that we can log in to this machine without a password:
ssh localhost
Network configuration
In /etc/hosts, add the following cluster information:
192.168.1.103 wlw
192.168.1.105 zcq-pc
Note that this cluster information needs to be added on every host (both master and slave).
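As a quick sanity check (my own addition, assuming the hostnames above), each node should now be able to resolve and reach the other by name:
ping -c 3 zcq-pc    # run on the master
ping -c 3 wlw       # run on the slave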
Configuring the JDK is routine, so I won't go through it; the only thing to note is that JAVA_HOME must be added in /etc/environment (without the export keyword there) or exported in /etc/profile, otherwise Hadoop reports an error:
export JAVA_HOME=/opt/jdk1.7.0_75
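To confirm the variable is picked up (a check I am adding here, assuming the JDK path above):
source /etc/profile            # or log out and back in
echo $JAVA_HOME                # should print /opt/jdk1.7.0_75
$JAVA_HOME/bin/java -version
Many setups also set JAVA_HOME explicitly in etc/hadoop/hadoop-env.sh, which avoids environment-related startup errors.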
The JAVA_HOME configuration likewise needs to be done on all hosts (master and slave).
Password-free SSH login between nodes
Only with password-free login between the nodes can the master node dispatch work to the slave nodes for distributed computation.
A public key was already generated on the master node above; the steps are repeated here anyway.
First generate the master's public key by executing, in a terminal on the master node:
cd ~/.ssh
ssh-keygen -t rsa   # just keep pressing Enter
The master node also needs password-free SSH login to itself, so execute on the master node:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
You can test this with ssh localhost when done.
Transfer the public key from the master (wlw) node to the slave (zcq-pc) node:
scp ~/.ssh/id_rsa.pub hadoop@zcq-pc:/home/hadoop/
On the slave (zcq-pc) node, append the key to the authorized keys:
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
If you have other slave nodes, the master's public key needs to be transferred to each of them in the same way. You can also generate a key pair on each slave node and transfer its public key to the master, so that the nodes can log in to each other without a password.
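One thing that often trips this step up (my own note, not part of the original): if ~/.ssh does not yet exist on the slave, or its permissions are too open, sshd ignores authorized_keys and still asks for a password. On the slave:
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys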
Finally, test on the master (wlw) node whether you can log in to the slave (zcq-pc) node without a password:
ssh zcq-pc
Configuring the cluster/distributed environment (critical steps)
Cluster/distributed mode requires modifying 5 configuration files under etc/hadoop (relative to the Hadoop installation directory). For the latter four files you can look up the official default settings; here I only set what is needed for a normal start: slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.
slaves file
cd /opt/hadoop-2.7.2/etc/hadoop
vim slaves
Delete the original localhost and write each slave's hostname on its own line. Because I have only one slave (zcq-pc) node, the file contains a single line: zcq-pc
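For reference, this is what the file looks like afterwards in my two-node setup:
cat slaves
# output: zcq-pc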
core-site.xml file
Change the empty
<property>
</property>
block to:
<property>
<name>fs.defaultFS</name>
<value>hdfs://wlw:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/opt/hadoop-2.7.2/tmp</value>
<description>A base for other temporary directories.</description>
</property>
Refer to my configuration above and adjust the paths and the master node's hostname to match your own setup.
hdfs-site.xml file
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>wlw:50090</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/hadoop-2.7.2/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hadoop-2.7.2/tmp/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
Here dfs.replication sets the number of replicas kept for HDFS data; the default is 3, and I set it to 2.
Again, refer to my configuration and adjust the paths and the master node's hostname to your own environment.
mapred-site.xml file
This file does not exist by default; first copy it from the template: cp mapred-site.xml.template mapred-site.xml
Then configure it as follows:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
This makes Hadoop use YARN as its resource-scheduling system.
yarn-site.xml file
<property>
<name>yarn.resourcemanager.hostname</name>
<value>wlw</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Once configured, package the hadoop-2.7.2 directory on the master (wlw) node and copy it to each slave node:
cd /opt
sudo tar -zcf ./hadoop-2.7.2.tar.gz ./hadoop-2.7.2
scp ./hadoop-2.7.2.tar.gz zcq-pc:/home/hadoop
Execute on the slave (zcq-pc) node:
sudo tar -zxf ~/hadoop-2.7.2.tar.gz -C /opt/
sudo chown -R hadoop:hadoop /opt/hadoop-2.7.2
After the configuration is complete, Hadoop can be started on the master node:
cd /opt/hadoop-2.7.2
./bin/hdfs namenode -format   # only needed the first time; later starts do not require it
./sbin/start-dfs.sh
./sbin/start-yarn.sh
View the Hadoop processes on the master (wlw) node with the jps command:
You can see that the master node has started the NameNode, SecondaryNameNode, and ResourceManager processes, in addition to the Jps process itself.
A running NameNode indicates that this is the master node.
View the Hadoop processes on the slave (zcq-pc) node with the jps command:
The slave node has started the DataNode and NodeManager processes, in addition to the Jps process itself.
A running DataNode indicates that this is a slave node.
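For illustration (my own sketch; the PIDs are made up), the jps output on the two nodes looks roughly like this:
On the master (wlw):
2481 NameNode
2675 SecondaryNameNode
2839 ResourceManager
3105 Jps
On the slave (zcq-pc):
1893 DataNode
2010 NodeManager
2247 Jps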
You can also check the status of the NameNode and DataNodes through the web UI:
http://wlw:50070/
You can also check the YARN resource manager via http://wlw:8088:
Running the WordCount example
Create a folder named file on the local disk:
mkdir ~/file
Enter the directory and create a file named file1.txt:
cd ~/file
echo "Hello Hadoop" > file1.txt
Create an input directory on HDFS:
cd /opt/hadoop-2.7.2
./bin/hadoop fs -mkdir /input
Upload the file1.txt file created on the local disk into /input:
./bin/hadoop fs -put ~/file/file1.txt /input
Run the WordCount example using the examples jar that ships with Hadoop:
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input /output/wordcount1
View the results of the run:
./bin/hdfs dfs -cat /output/wordcount1/*
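Since file1.txt contains only the line "Hello Hadoop", the output should simply show each word with a count of 1:
Hadoop	1
Hello	1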
Finally, shutting down the Hadoop cluster is also done on the master node:
./sbin/stop-dfs.sh
./sbin/stop-yarn.sh
Note: after modifying /etc/profile, make sure it takes effect:
source /etc/profile
Because my JDK is placed under the /opt directory, it needs to be given execute permission:
sudo chmod -R u+x /opt/jdk1.7.0_75
Issues encountered when starting Hadoop
(1) Warning: the ECDSA host key for "zcq-pc" differs from the key for the IP address "192.168.1.105"
Solution: remove the cached key for 192.168.1.105 on the master machine:
ssh-keygen -R 192.168.1.105
(2) Warning: POSSIBLE DNS SPOOFING DETECTED!
Solution: delete the offending line in the ~/.ssh/known_hosts file (the line number is given in the message "Offending key in /home/wlw/.ssh/known_hosts:<line number>").
To display line numbers in vim:
:set number
PS: In the next blog post I will get to the real topic: configuring the Spark cluster on top of Hadoop YARN.