One of Docker's core features is the ability to package any application, including Hadoop, into a Docker image. This tutorial describes the detailed steps to quickly build a multi-node Hadoop cluster on a single machine using Docker. After examining the problems with existing Hadoop-on-Docker projects, the author developed a near-minimal Hadoop image and support for quickly building a Hadoop cluster with any number of nodes.
I. Introduction to the Project
GitHub: kiwenlau/hadoop-cluster-docker
Building a Hadoop cluster directly on physical machines is a painful process, especially for beginners: before they even get WordCount running, they may be worn down by configuration problems. And not everyone has several machines at hand, right? You could try building the cluster with multiple virtual machines, but only if you have a machine powerful enough.
My goal is to run the Hadoop cluster in Docker containers, so that Hadoop developers can quickly and easily build a multi-node Hadoop cluster on their local machine. In fact, there are many implementations of this idea, but none of them is ideal: the images are either too large or too slow to use, or they rely on third-party tools that make them overly complex. The following table lists some known Hadoop-on-Docker projects and their problems.
| Project | Image size | Problem |
| --- | --- | --- |
| sequenceiq/hadoop-docker:latest | 1.491 GB | image too large; supports only a single node |
| sequenceiq/hadoop-docker:2.7.0 | 1.76 GB | |
| sequenceiq/hadoop-docker:2.60 | 1.624 GB | |
| sequenceiq/ambari:latest | 1.782 GB | image too large; too slow; complex to use |
| sequenceiq/ambari:2.0.0 | 4.804 GB | |
| sequenceiq/ambari:1.70 | 4.761 GB | |
| alvinhenrick/hadoop-mutinode | 4.331 GB | image too large; slow to build; adding nodes is troublesome; has a bug |
My project is based on the alvinhenrick/hadoop-mutinode project, but with a lot of optimization and refactoring. The GitHub homepage of the alvinhenrick/hadoop-mutinode project and the author's blog are as follows:
- GitHub: Hadoop (YARN) Multinode Cluster with Docker
- Blog: Hadoop (YARN) Multinode Cluster with Docker
The following two tables compare the parameters of the alvinhenrick/hadoop-mutinode project and my kiwenlau/hadoop-cluster-docker project:
alvinhenrick/hadoop-mutinode:

| Image name | Build time | Image size |
| --- | --- | --- |
| alvinhenrick/serf | 258.213 s | 239.4 MB |
| alvinhenrick/hadoop-base | 2236.055 s | 4.328 GB |
| alvinhenrick/hadoop-dn | 51.959 s | 4.331 GB |
| alvinhenrick/hadoop-nn-dn | 49.548 s | 4.331 GB |

kiwenlau/hadoop-cluster-docker:

| Image name | Build time | Layers | Image size |
| --- | --- | --- | --- |
| kiwenlau/serf-dnsmasq | 509.46 s | 8 | 206.6 MB |
| kiwenlau/hadoop-base | 400.29 s | 7 | 775.4 MB |
| kiwenlau/hadoop-master | 5.41 s | 9 | 775.4 MB |
| kiwenlau/hadoop-slave | 2.41 s | 8 | 775.4 MB |
I have mainly optimized these points:

- Smaller image size
- Faster build time
- Fewer image layers
- Faster and easier changes to the number of Hadoop cluster nodes
In addition, adding nodes in the alvinhenrick/hadoop-mutinode project requires manually modifying the Hadoop configuration file, rebuilding the hadoop-nn-dn image, and then modifying the container startup script. I automated this process with a shell script: in less than 1 minute, the hadoop-master image can be rebuilt and running! This project launches a 3-node Hadoop cluster by default and supports a Hadoop cluster with any number of nodes.
In addition, starting Hadoop, running WordCount, and rebuilding the image are all automated with shell scripts, which makes the whole project very convenient and quick to use and develop.
Development and test environment
- Operating systems: Ubuntu 14.04 and Ubuntu 12.04
- Kernel version: 3.13.0-32-generic
- Docker version: 1.5.0 and 1.6.2
Note: insufficient disk space or memory, and especially a kernel version that is too old, can cause the cluster to fail to run.
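As a quick check before starting, the following sketch (my own suggestion, not part of the project) prints the kernel version and warns if it is older than the 3.13 kernel the author tested on:

```shell
# Quick environment sanity check: warn if the kernel is older than 3.13.
kernel=$(uname -r)
echo "kernel: $kernel"
major=${kernel%%.*}          # e.g. "3" from "3.13.0-32-generic"
rest=${kernel#*.}
minor=${rest%%.*}            # e.g. "13"
if [ "$major" -lt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -lt 13 ]; }; then
    echo "warning: kernel may be too old to run Docker reliably" >&2
fi
# docker --version           # the author tested Docker 1.5.0 and 1.6.2
```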
II. Introduction to the Images
A total of 4 images were developed for this project:
- serf-dnsmasq
- hadoop-base
- hadoop-master
- hadoop-slave
The serf-dnsmasq image
- Based on ubuntu:15.04 (chosen because it is the smallest, not because it is the newest)
- Install Serf: Serf is a distributed machine-node management tool that can dynamically discover all Hadoop cluster nodes.
- Install dnsmasq: dnsmasq serves as a lightweight DNS server, providing domain-name resolution for the Hadoop cluster.
When the containers start, the master node's IP is passed to all slave nodes, and Serf starts immediately. The Serf agents on the slave nodes quickly discover the master node (whose IP they know), and the master node in turn discovers all the slave nodes. By exchanging information with one another, every node comes to know about every other node. Whenever Serf discovers a new node, it reconfigures and restarts dnsmasq, so dnsmasq can resolve the domain names of all the nodes in the cluster. This process takes longer as the number of nodes grows, so if you configure more Hadoop nodes, you should check after the containers start whether Serf has discovered all the nodes and whether dnsmasq can resolve all of their domain names, and wait a few moments before starting Hadoop. This approach was proposed by SequenceIQ, a company focused on running Hadoop in Docker; refer to their talk.
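The discovery-to-DNS step above can be sketched as a Serf event handler that regenerates a dnsmasq hosts file from the membership list. This is a hypothetical illustration, not the project's actual handler; `serf members` prints lines like `master.kiwenlau.com 172.17.0.65:7946 alive`:

```shell
# Convert `serf members` output into "IP hostname" lines for dnsmasq.
members_to_hosts() {
    # stdin: `serf members` output; stdout: dnsmasq hosts-file lines
    awk '$3 == "alive" { split($2, a, ":"); print a[1], $1 }'
}

# Demonstrated with canned output instead of a live `serf members` call:
printf '%s\n' \
    'master.kiwenlau.com 172.17.0.65:7946 alive' \
    'slave1.kiwenlau.com 172.17.0.66:7946 alive' |
    members_to_hosts > hosts.serf
cat hosts.serf

# In the real image the handler would then restart dnsmasq, e.g.:
# service dnsmasq restart
```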
The hadoop-base image
- Based on the serf-dnsmasq image
- Install the JDK (OpenJDK)
- Install openssh-server and configure passwordless SSH
- Install Vim: so you can happily edit code inside the container
- Install Hadoop 2.3.0: a pre-compiled Hadoop is installed (2.5.2, 2.6.0, and 2.7.0 are bigger than 2.3.0, so I was too lazy to upgrade)
For the steps to compile Hadoop, please refer to my blog.
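The passwordless-SSH configuration mentioned above can be sketched as follows. This is a minimal illustration of the technique; the project's actual Dockerfile commands may differ, and it is demonstrated in a local `./ssh-demo` directory rather than the real `/root/.ssh`:

```shell
# Minimal passwordless-SSH setup: generate a key with an empty passphrase,
# authorize it for ourselves, and disable host-key prompts inside the
# trusted cluster network.
rm -rf ssh-demo && mkdir -p ssh-demo
ssh-keygen -t rsa -f ssh-demo/id_rsa -N '' -q           # empty passphrase
cat ssh-demo/id_rsa.pub >> ssh-demo/authorized_keys     # authorize our own key
printf 'StrictHostKeyChecking no\n' > ssh-demo/config   # skip host-key prompts
chmod 600 ssh-demo/authorized_keys ssh-demo/config
```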
If you want to further develop my hadoop-base image, you need to download the compiled hadoop-2.3.0 package and put it in the hadoop-cluster-docker/hadoop-base/files directory. I compiled 64-bit hadoop-2.3.0:
http://pan.baidu.com/s/1sjFRaFz
In addition, I compiled 64-bit Hadoop 2.5.2, 2.6.0, and 2.7.0; the download links are as follows:
- hadoop-2.3.0: http://pan.baidu.com/s/1sjfrafz
- hadoop-2.5.2: http://pan.baidu.com/s/1jgw24aa
- hadoop-2.6.0: http://pan.baidu.com/s/1eqgvf2m
- hadoop-2.7.0: http://pan.baidu.com/s/1c0hd0nu
The hadoop-master image
- Based on the hadoop-base image
- Configure the Hadoop master node
- Format the NameNode
This step requires configuring the slaves file, which lists the domain names or IPs of all the nodes. Since the slaves file naturally differs with the number of Hadoop nodes, you need to modify it and then rebuild the hadoop-master image whenever you change the cluster size. I wrote a resize-cluster.sh script to automate this process: changing the number of Hadoop cluster nodes is as easy as passing the desired node count as a script argument. Because the hadoop-master image only does some configuration work and does not need to download any files, the whole process is very fast; 1 minute is enough.
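The core of that automation can be sketched as follows. The function and file names here are my assumptions for illustration, not the project's exact code: regenerate the slaves file for N nodes, then rebuild the hadoop-master image.

```shell
# Sketch of a resize-cluster.sh-style script: write one slave hostname per
# line into the slaves file (every node except the master is a slave).
N=${1:-3}                       # total number of nodes, default 3

generate_slaves() {
    : > slaves                  # truncate the file
    i=1
    while [ "$i" -lt "$1" ]; do
        echo "slave${i}.kiwenlau.com" >> slaves
        i=$((i + 1))
    done
}

generate_slaves "$N"
cat slaves

# Rebuild the hadoop-master image with the new slaves file
# (requires Docker, so commented out in this sketch):
# sudo docker build -t kiwenlau/hadoop-master:0.1.0 hadoop-master
```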
The hadoop-slave image
- Based on the hadoop-base image
- Configure the Hadoop slave node
Image size analysis
The following table shows the results of the sudo docker images operation:
REPOSITORY                              TAG    IMAGE ID      CREATED       VIRTUAL SIZE
index.alauda.cn/kiwenlau/hadoop-slave   0.1.0  d63869855c03  17 hours ago  777.4 MB
index.alauda.cn/kiwenlau/hadoop-master  0.1.0  7c9d32ede450  17 hours ago  777.4 MB
index.alauda.cn/kiwenlau/hadoop-base    0.1.0  5571bd5de58e  17 hours ago  777.4 MB
index.alauda.cn/kiwenlau/serf-dnsmasq   0.1.0  09ed89c24ee8  17 hours ago  206.7 MB
ubuntu                                  15.04  bd94ae587483  3 weeks ago   131.3 MB
The following conclusions are readily drawn:
- The serf-dnsmasq image adds 75.4 MB on top of the ubuntu:15.04 image
- The hadoop-base image adds 570.7 MB on top of the serf-dnsmasq image
- The hadoop-master and hadoop-slave images add very little on top of the hadoop-base image
The following table shows part of the result of running sudo docker history index.alauda.cn/kiwenlau/hadoop-base:0.1.0:
IMAGE         CREATED       CREATED BY                                     SIZE
2039b9b81146  44 hours ago  /bin/sh -c #(nop) ADD multi:a93c971a49514e787  158.5 MB
cdb620312f30  44 hours ago  /bin/sh -c apt-get install -y openjdk-7-jdk    324.6 MB
da7d10c790c1  44 hours ago  /bin/sh -c apt-get install -y openssh-server   87.58 MB
c65cb568defc  44 hours ago  /bin/sh -c curl -Lso serf.zip https://dl.bint  14.46 MB
3e22b3d72e33  44 hours ago  /bin/sh -c apt-get update && apt-get install   60.89 MB
b68f8c8d2140  3 weeks ago   /bin/sh -c #(nop) ADD file:d90f7467c470bfa9a3  131.3 MB
From this we know:
- The base image ubuntu:15.04 is 131.3 MB
- Installing OpenJDK requires 324.6 MB
- Installing Hadoop requires 158.5 MB
- Ubuntu, OpenJDK, and Hadoop are all necessary for the image, and together they take up 614.4 MB
Thus, the Hadoop image I developed is close to minimal, and there is little room left for optimization.
III. 3-Node Hadoop Cluster Setup Steps
1. Pull the image
sudo docker pull index.alauda.cn/kiwenlau/hadoop-master:0.1.0
sudo docker pull index.alauda.cn/kiwenlau/hadoop-slave:0.1.0
sudo docker pull index.alauda.cn/kiwenlau/hadoop-base:0.1.0
sudo docker pull index.alauda.cn/kiwenlau/serf-dnsmasq:0.1.0
This takes 3 to 5 minutes. Alternatively, you can pull the images directly from my Docker Hub repository, which lets you skip step 2:
sudo docker pull kiwenlau/hadoop-master:0.1.0
sudo docker pull kiwenlau/hadoop-slave:0.1.0
sudo docker pull kiwenlau/hadoop-base:0.1.0
sudo docker pull kiwenlau/serf-dnsmasq:0.1.0
To view the downloaded image:
sudo docker images
Operation Result:
REPOSITORY                              TAG    IMAGE ID      CREATED       VIRTUAL SIZE
index.alauda.cn/kiwenlau/hadoop-slave   0.1.0  d63869855c03  17 hours ago  777.4 MB
index.alauda.cn/kiwenlau/hadoop-master  0.1.0  7c9d32ede450  17 hours ago  777.4 MB
index.alauda.cn/kiwenlau/hadoop-base    0.1.0  5571bd5de58e  17 hours ago  777.4 MB
index.alauda.cn/kiwenlau/serf-dnsmasq   0.1.0  09ed89c24ee8  17 hours ago  206.7 MB
Since the hadoop-base image is based on the serf-dnsmasq image, and the hadoop-slave and hadoop-master images are based on the hadoop-base image, all four images together take up only 777.4 MB.
2. Modify the image tag
sudo docker tag d63869855c03 kiwenlau/hadoop-slave:0.1.0
sudo docker tag 7c9d32ede450 kiwenlau/hadoop-master:0.1.0
sudo docker tag 5571bd5de58e kiwenlau/hadoop-base:0.1.0
sudo docker tag 09ed89c24ee8 kiwenlau/serf-dnsmasq:0.1.0
View the images after modifying the tags:
sudo docker images
Operation Result:
REPOSITORY                              TAG    IMAGE ID      CREATED       VIRTUAL SIZE
index.alauda.cn/kiwenlau/hadoop-slave   0.1.0  d63869855c03  17 hours ago  777.4 MB
kiwenlau/hadoop-slave                   0.1.0  d63869855c03  17 hours ago  777.4 MB
index.alauda.cn/kiwenlau/hadoop-master  0.1.0  7c9d32ede450  17 hours ago  777.4 MB
kiwenlau/hadoop-master                  0.1.0  7c9d32ede450  17 hours ago  777.4 MB
kiwenlau/hadoop-base                    0.1.0  5571bd5de58e  17 hours ago  777.4 MB
index.alauda.cn/kiwenlau/hadoop-base    0.1.0  5571bd5de58e  17 hours ago  777.4 MB
kiwenlau/serf-dnsmasq                   0.1.0  09ed89c24ee8  17 hours ago  206.7 MB
index.alauda.cn/kiwenlau/serf-dnsmasq   0.1.0  09ed89c24ee8  17 hours ago  206.7 MB
The tags need to be modified because I upload the images to Docker Hub by default, so the image names in the Dockerfiles and shell scripts do not carry the alauda prefix; sorry about that. Changing the tags is very quick, though. If you download the images directly from my Docker Hub repository, there is naturally no need to modify the tags. That said, image downloads from Alauda are very fast!
3. Download the source code
git clone https://github.com/kiwenlau/hadoop-cluster-docker
In case GitHub is inaccessible, I also imported the code into the Open Source China git repository:
git clone http://git.oschina.net/kiwenlau/hadoop-cluster-docker
4. Running the container
cd hadoop-cluster-docker
./start-container.sh
Operation Result:
start master container...
start slave1 container...
start slave2 container...
root@master:~#
A total of 3 containers are started: 1 master and 2 slaves. After the containers start, you are placed in the home directory (/root) of the master container's root user.
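The startup logic can be sketched roughly as follows. The `docker run` flags shown are my assumptions, not the project's exact script: start one master container and N-1 slave containers, then attach to the master. The docker commands themselves are commented out so the control flow can be read on its own:

```shell
# Sketch of a start-container.sh-style script for an N-node cluster.
N=${1:-3}                        # total number of nodes, default 3

echo "start master container..."
# sudo docker run -d -t --dns 127.0.0.1 -P --name master \
#     -h master.kiwenlau.com kiwenlau/hadoop-master:0.1.0

started="master"
i=1
while [ "$i" -lt "$N" ]; do
    echo "start slave${i} container..."
    # sudo docker run -d -t --dns 127.0.0.1 -P --name "slave${i}" \
    #     -h "slave${i}.kiwenlau.com" kiwenlau/hadoop-slave:0.1.0
    started="$started slave${i}"
    i=$((i + 1))
done

# sudo docker attach master      # enter the master container's root shell
```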
To view the files of master's root user home directory:
ls
Operation Result:
hdfs  run-wordcount.sh  serf_log  start-hadoop.sh  start-ssh-serf.sh
start-hadoop.sh is a shell script that starts Hadoop, and run-wordcount.sh is a shell script that runs WordCount to test whether the image works properly.
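A hedged sketch of what run-wordcount.sh plausibly does is shown below. The jar path follows Hadoop 2.3.0's standard layout and is an assumption, not the project's exact script; only the input preparation is executed here, since the Hadoop commands need a live cluster:

```shell
# Prepare the two sample input files that produce the WordCount output
# shown later in this tutorial.
mkdir -p input
echo "Hello Hadoop" > input/file1.txt
echo "Hello Docker" > input/file2.txt

# On a running cluster the script would then do roughly:
# hdfs dfs -mkdir -p /user/root/input
# hdfs dfs -put input/* /user/root/input
# hadoop jar \
#     $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar \
#     wordcount input output
# hdfs dfs -cat output/part-r-00000
```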
5. Test whether the container starts normally (at this point it has entered the master container)
To view Hadoop cluster members:
serf members
Operation Result:
master.kiwenlau.com  172.17.0.65:7946  alive
slave1.kiwenlau.com  172.17.0.66:7946  alive
slave2.kiwenlau.com  172.17.0.67:7946  alive
If a node is missing from the result, wait a moment and run the serf members command again, since the Serf agent needs time to discover all the nodes.
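The waiting can be automated with a small polling helper. This is my own addition, not part of the project, and the `serf` function below is a stub standing in for the real binary so the logic can run outside the container:

```shell
# Stub standing in for the real `serf` binary (prints a full member list).
serf() {
    printf '%s\n' \
        'master.kiwenlau.com 172.17.0.65:7946 alive' \
        'slave1.kiwenlau.com 172.17.0.66:7946 alive' \
        'slave2.kiwenlau.com 172.17.0.67:7946 alive'
}

# Poll `serf members` until the expected number of alive nodes appears,
# retrying up to 30 times with a 1-second pause between attempts.
wait_for_nodes() {
    expected=$1
    tries=0
    while [ "$tries" -lt 30 ]; do
        alive=$(serf members | grep -c 'alive$')
        [ "$alive" -ge "$expected" ] && return 0
        tries=$((tries + 1))
        sleep 1
    done
    return 1
}

wait_for_nodes 3 && echo "all nodes discovered"
```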
To test SSH:
ssh slave2.kiwenlau.com
Operation Result:
Warning: Permanently added 'slave2.kiwenlau.com,172.17.0.67' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 15.04 (GNU/Linux 3.13.0-53-generic x86_64)
 * Documentation: https://help.ubuntu.com/
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
root@slave2:~#
Exit Slave2:
exit
Operation Result:
logout
Connection to slave2.kiwenlau.com closed.
If SSH fails, wait a moment and test again, because the dnsmasq DNS server takes time to start. Once the test succeeds, you can start the Hadoop cluster! In fact, you can also skip the test entirely: just start the containers and wait patiently for a minute.
6. Turn on Hadoop
./start-hadoop.sh
If you SSHed to slave2 in the previous step, remember to return to the master first! The output is too long to include here; Hadoop startup speed depends on machine performance.
7. Running WordCount
./run-wordcount.sh
Operation Result:
input file1.txt:
Hello Hadoop
input file2.txt:
Hello Docker
wordcount output:
Docker 1
Hadoop 1
Hello 2
WordCount execution speed depends on machine performance.
IV. N-Node Hadoop Cluster Setup Steps
1. Preparatory work
- Refer to steps 1 to 3 of Part III: pull the images, modify the tags, and download the source code
- Note that you do not have to download serf-dnsmasq, but it is best to download hadoop-base, because hadoop-master is built on top of hadoop-base.
2. Rebuilding the Hadoop-master image
./resize-cluster.sh 5
- Don't worry, it will be done within 1 minute.
- You can pass any positive integer to the resize-cluster.sh script as the node count: 1, 2, 3, 4, 5, 6 ...
3. Start the container
./start-container.sh 5
- You can pass any positive integer to the start-container.sh script as the node count: 1, 2, 3, 4, 5, 6 ...
- This parameter is best kept consistent with the one in the previous step :)
- If this parameter is larger than the one in the previous step, the extra nodes you start will not be recognized by Hadoop.
- If this parameter is smaller than the one in the previous step, Hadoop will think the missing nodes have crashed.
4. Test work
- Refer to steps 5 to 7 of Part III: test the containers, start Hadoop, and run WordCount
- Note that if you increased the number of nodes, be sure to test the containers before starting Hadoop, because Serf may not have discovered all the nodes yet and dnsmasq's DNS service may not be configured yet.
- The wait time for testing depends on machine performance.