Build a Spark + HDFS cluster under Docker

Source: Internet
Author: User
Tags: docker run


1. Install the Ubuntu OS in the VM and enable root login

(http://jingyan.baidu.com/article/148a1921a06bcb4d71c3b1af.html)

Installing the VM Enhancement tool

http://www.jb51.net/softjc/189149.html

2. Installing Docker

Docker installation method one

Ubuntu 14.04 and later ship a Docker package in their repositories, so it can be installed directly, although this is not the latest version:

sudo apt-get update
sudo apt-get install docker.io


Docker installation method two: install the latest version
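One common way to install a recent Docker release (an assumption here, not necessarily the method the author used) is Docker's official convenience script:

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh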



The first step of the deployment: prepare the required software.

Download the JDK, Scala, Spark, and Hadoop archives and upload them to the virtual machine.

Extract the files into the /opt/ directory.

Note: simply extract Java and the other software into /opt, four packages in total: Java, Hadoop, Scala, and Spark. Nothing needs to be changed on the host itself (no edits to /etc/hosts, /etc/profile, or similar), because the software is only used inside the containers, not on the host. After extraction, /opt looks like this:

root@host:/opt# ll
total 32
drwxr-xr-x  7 root root 4096 Dec 22:12 ./
drwxr-xr-x    root root 4096 Nov 30 19:35 ../
drwxr-xr-x    root root 4096 Dec 22:07 hadoop-2.6.0/
drwxr-xr-x  8 root root 4096 Apr       jdk1.7.0_79/
drwxr-xr-x  9 root root 4096 Dec 13:54 scala-2.10.5/
drwxrwxr-x    root root 4096 Dec 22:19 spark-1.2.0-bin-hadoop2.4/


Next come the modifications to the Hadoop and Spark configuration files. This step relies mostly on prior experience; you can follow the two links above to modify the configuration files. I simply copied the configuration files from my previous cluster, dropped them in as replacements, and then adjusted host names, IPs, and similar values as described below. If you have never deployed a cluster before, this step is quite a lot of work.

One thing that needs special attention is dfs.datanode.data.dir in Hadoop's hdfs-site.xml, which sets the DataNode data directory of HDFS. As a small trick it is changed to file:/root/data; the reason is explained later. The ultimate goal is to link this path to the Hadoop directory on the /opt data volume, so that the data is stored outside the container and is not lost when the container is deleted. The modification is as follows:
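The property, as quoted again later in this article, reads:

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/root/data</value>
</property>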



The second step: build the base image (this is the main work).

The idea behind this cluster is to keep extra work to a minimum by using a fixed network layout, which may conflict with Docker's own networking, so a small technical trick is used to adjust the network inside the containers. This is also why the approach cannot be used at large scale, which is one of its drawbacks. I have seen people use dynamic IP registration, but I have not yet understood that part and will improve this gradually as I learn. The host names and IPs of the node containers are planned as follows:

Master node container: hostname node0, IP 172.17.0.150.

Worker node container: hostname node1, IP 172.17.0.151.

Worker node container: hostname node2, IP 172.17.0.152.

Here is how to set it up, step by step:

1. Check the local images and use ubuntu:14.04 as the base image; if it is not present, pull it.
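For example, to check for the image and pull it if needed:

docker images
docker pull ubuntu:14.04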



2. Start a container and install vim and SSH. (--net=host should be added when starting it, otherwise the container cannot access the Internet.)

root@host:/opt# docker run -it --net=host ubuntu:14.04 /bin/bash
root@container:/# apt-get update && apt-get -y install vim openssh-server


3. Modify the SSH configuration file to allow root login.

root@container:/# vim /etc/ssh/sshd_config


Find the line: PermitRootLogin without-password

Change it to: PermitRootLogin yes

4. Generate the SSH key pair: run ssh-keygen and just press Enter through the prompts. Note that all the node containers end up sharing the same key pair; this is another small trick used for simplicity. If you already know SSH well this is stating the obvious, and there is more on it later.

root@container:/# ssh-keygen


At this point the /root/.ssh folder looks like this:
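The folder holds the generated key pair; since run.sh later edits /root/.ssh/authorized_keys, the public key presumably also gets appended to authorized_keys, roughly like this (a sketch, not shown explicitly in the original):

cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
ls /root/.ssh
authorized_keys  id_rsa  id_rsa.pub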



5. Now come the key steps.

Write the required environment variables into /root/.bashrc. Why not /etc/profile? Because I tried it: variables written to /etc/profile do not take effect when the container is started.

Looking at them, you should find the following variables very familiar:
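Based on the software unpacked under /opt earlier, the entries in /root/.bashrc would look roughly like this (a sketch; the author's exact list may differ):

export JAVA_HOME=/opt/jdk1.7.0_79
export SCALA_HOME=/opt/scala-2.10.5
export HADOOP_HOME=/opt/hadoop-2.6.0
export SPARK_HOME=/opt/spark-1.2.0-bin-hadoop2.4
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin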

6. The last step: create a run.sh script under /root. All the changes that have to be made to the container at startup are written into this script. The script comes first, followed by the explanation.

 1 #!/bin/bash
 2
 3 echo "172.17.0.150 master" >  /etc/hosts
 4 echo "172.17.0.151 work1" >> /etc/hosts
 5 echo "172.17.0.152 work2" >> /etc/hosts
 6 echo "172.17.0.153 work3" >> /etc/hosts
 7 case $HOSTNAME in
 8   "master")
 9     ifconfig eth0 172.17.0.150
10     sed -i 's/root@.*$/root@master/g' /root/.ssh/authorized_keys
11     ;;
12   "work1")
13     ifconfig eth0 172.17.0.151
14     sed -i 's/root@.*$/root@work1/g' /root/.ssh/authorized_keys
15     ln -s /opt/hadoop-2.6.0/dfs/node1 /root/data
16     ;;
17   "work2")
18     ifconfig eth0 172.17.0.152
19     sed -i 's/root@.*$/root@work2/g' /root/.ssh/authorized_keys
20     ln -s /opt/hadoop-2.6.0/dfs/node2 /root/data
21     ;;
22   "work3")
23     ifconfig eth0 172.17.0.153
24     sed -i 's/root@.*$/root@work3/g' /root/.ssh/authorized_keys
25     ln -s /opt/hadoop-2.6.0/dfs/node3 /root/data
26     ;;
27   *)
28     echo "null"
29     ;;
30 esac
31
32 /etc/init.d/ssh start -d


1) Lines 3-6 rewrite /etc/hosts. When starting a cluster it is customary to use host names rather than IPs, hence this change. Another reason is that the hosts file and the IP change after a container restart, so they are rewritten on every boot.

2) Lines 7 to 30 use the container's host name to make three changes.

First, set the container's IP, so that our nodes have fixed IPs; this command requires the --privileged flag.

Second, set up passwordless SSH login: the last field of the entry in authorized_keys, root@<old hostname>, is rewritten to root@<current hostname>, so that the master node can log in to work1, work2, work3, and master itself without a password.

Third, use a symbolic link so that the HDFS data is saved into the corresponding directory of the data volume, that is, stored outside the container. With the link in place, the HDFS configuration file still points at /root/data, but the data is actually saved in a different directory. The special modification made in the Hadoop configuration above, <name>dfs.datanode.data.dir</name> <value>file:/root/data</value>, sets the directory where HDFS stores its data; through the soft link it ends up in the data-volume directory and therefore outside the container. When the container is deleted, the data in HDFS does not disappear, and a new container can use the data again.

3) Line 32 starts SSH. The key is the -d parameter: without it, a container started with docker run -d stops right away instead of staying up.

4) Finally, save the script, make it executable, and exit the container:

root@container:~# chmod 744 /root/run.sh
root@container:~# exit


7. Commit the container as an image.

root@host:~/docker# docker commit 67 ubuntu:base-spark
35341d63645cb5c23f88a6f4ac51d1000dc4431646ac3a948bd9c9f171dcbeeb
root@host:~/docker# docker images
REPOSITORY   TAG          IMAGE ID       CREATED         VIRTUAL SIZE
ubuntu       base-spark   35341d63645c   4 minutes ago   261.1 MB


As you can see above, the image is only about 260 MB, which is quite small.

That completes the base image. The part where mistakes are most likely is the modification of the Hadoop and Spark configuration files, which is "preparation" work unrelated to Docker itself.

----------

The third step: start the containers, start the cluster, and test it.

One command, and the cluster is up.

In fact, most of the space below is spent explaining my reasoning; starting the cluster itself is just plain Hadoop and Spark knowledge.

----------

First, start the container cluster.

The docker_start.sh script contains one line per container, each line being the command that starts that container. Take a look first:

root@host:~/docker# cat docker_start.sh
#!/bin/bash
# The master node has a web console, so the container's exposed ports have to be
# configured (format: -p <host port>:<container port>).
docker run -d --name master -h master -v /opt:/opt --privileged ubuntu:base-spark /root/run.sh
docker run -d --name work1  -h work1  -v /opt:/opt --privileged ubuntu:base-spark /root/run.sh
docker run -d --name work2  -h work2  -v /opt:/opt --privileged ubuntu:base-spark /root/run.sh
docker run -d --name work3  -h work3  -v /opt:/opt --privileged ubuntu:base-spark /root/run.sh


Here is an explanation of each parameter of these startup commands:

1) -d: the container can stay up in the background only because of the -d parameter on the /etc/init.d/ssh start -d line in the run.sh script.

2) --name master: the container's name.

3) -h master: sets the container's host name, i.e. its hostname, to master.

4) -v /opt:/opt: the data volume. The four directories (Java, Hadoop, Scala, Spark) live here, and the HDFS storage directories live under hadoop-2.6.0: the dfs folder contains the name, node1, node2, and node3 directories, which are best created by hand in advance (see the sketch after the listing below). They could also be created inside the run.sh script, but I did not want to go back and modify run.sh.

root@host:/opt/hadoop-2.6.0/dfs# pwd
/opt/hadoop-2.6.0/dfs
root@host:/opt/hadoop-2.6.0/dfs# ls
name  node1  node2  node3
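If these directories do not exist yet, they can be created on the host in advance, for example:

mkdir -p /opt/hadoop-2.6.0/dfs/name
mkdir -p /opt/hadoop-2.6.0/dfs/node1
mkdir -p /opt/hadoop-2.6.0/dfs/node2
mkdir -p /opt/hadoop-2.6.0/dfs/node3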


The name folder is specified by the Hadoop configuration file:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/opt/hadoop-2.6.0/dfs/name</value>
</property>


node1, node2, and node3 are the directories where HDFS actually stores its data; the run.sh script reaches them from /root/data through symbolic links:

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/root/data</value>
</property>


ln -s /opt/hadoop-2.6.0/dfs/node1 /root/data
ln -s /opt/hadoop-2.6.0/dfs/node2 /root/data


5) --privileged: grants the container full privileges so that the command in run.sh that changes the IP can run:

ifconfig eth0 172.17.0.150


6) /root/run.sh: the command run when the container starts, i.e. the script we prepared earlier that applies all the modifications to the container. These changes distort some of Docker's features, but for a small local environment they still have some practical value.

----------

Second, enter the master container, then start and test HDFS.

In fact, at this point it is almost over; what follows is Hadoop and Spark knowledge.

First, take a look at the containers you just started.
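For example, docker ps lists the running containers (master, work1, work2, and work3):

docker ps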



Enter the master container:

root@host:/# docker exec -it master /bin/bash
root@master:/#


At this point the container has already been modified by run.sh; you can check the relevant information, for example ifconfig, /etc/hosts, hostname, /root/.ssh/authorized_keys, and so on.

Next, start Hadoop HDFS. Since only HDFS is used here, YARN is ignored. HDFS has to be formatted before the first start, and you will have to type "yes" a number of times; these confirmations are required for the first SSH login to each node. I will not post the formatting output and related commands.

Then start HDFS:
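On a standard Hadoop 2.6 layout, formatting and starting HDFS would look roughly like this (a sketch, not the author's exact session):

/opt/hadoop-2.6.0/bin/hdfs namenode -format
/opt/hadoop-2.6.0/sbin/start-dfs.sh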



Run jps to see the processes on node0 (the master):

root@master:/# jps
1310 Jps
843 NameNode
1025 SecondaryNameNode


HDFS can now be used: try uploading a few files, or look at its status through the web UI in a browser. In short, this is all HDFS knowledge, so I will not ramble on about it.
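For example, a quick test might look like this (the /test path is hypothetical, purely for illustration):

/opt/hadoop-2.6.0/bin/hdfs dfs -mkdir -p /test
/opt/hadoop-2.6.0/bin/hdfs dfs -put /etc/hosts /test
/opt/hadoop-2.6.0/bin/hdfs dfs -ls /test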

----------

Third, start the Spark cluster in standalone mode.

Here you can start the Spark processes directly:

root@master:/# /opt/spark-1.2.0-bin-hadoop2.4/sbin/start-all.sh

Run jps again to see what has started.



If everything is OK, you can start spark-shell in standalone mode to test:

root@master:/# /opt/spark-1.2.0-bin-hadoop2.4/bin/spark-shell --master spark://master:7077


That is basically it. You can run a WordCount example, and you can also view Spark's status through the web UI.


Reference: Using Docker to build Hadoop and Spark clusters locally

http://dockone.io/article/944
