6. Modify the Hadoop configuration files
File location: /home/hadoop/hadoop/etc/hadoop
File names: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
(1) Configure the hadoop-env.sh file
# In the Hadoop installation path, go to the hadoop/etc/hadoop/ directory, edit hadoop-env.sh, and set JAVA_HOME to the Java installation path.
[hadoop@linux-node1 home/hadoop]$ cd hadoop/etc/hadoop/
[hadoop@linux-node1 hadoop]$ egrep JAVA_HOME hadoop-env.sh
# The only required environment variable is JAVA_HOME. All others are
# set JAVA_HOME in this file, so that it is correctly defined on
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/java/jdk1.8.0_101/
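If you prefer to script this change, a sed one-liner works. The sketch below runs against a throwaway copy in /tmp so it is safe to try; on a real node, point HADOOP_ENV at /home/hadoop/hadoop/etc/hadoop/hadoop-env.sh instead:

```shell
# Demo copy of hadoop-env.sh (assumption: the real file sets JAVA_HOME
# with an "export JAVA_HOME=..." line, as shown above).
HADOOP_ENV=/tmp/hadoop-env.sh
printf 'export JAVA_HOME=${JAVA_HOME}\n' > "$HADOOP_ENV"

# Point JAVA_HOME at the actual JDK install path:
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/java/jdk1.8.0_101/|' "$HADOOP_ENV"
grep JAVA_HOME "$HADOOP_ENV"
```

The same command works on every node once hadoop is copied over in step 7.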
(2) Configure the yarn-env.sh file
Specify the Java runtime environment for the yarn framework. This file configures the runtime environment of the yarn framework; you need to set JAVA_HOME here as well.
[hadoop@linux-node1 hadoop]$ grep JAVA_HOME yarn-env.sh
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
export JAVA_HOME=/usr/java/jdk1.8.0_101/
(3) Configure the slaves file
Specify the DataNode storage servers by writing the host names of all DataNode machines into this file, as shown below:
[hadoop@linux-node1 hadoop]$ cat slaves
linux-node2
linux-node3
linux-node4
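The slaves file can also be generated in one command. Shown here against a temp path so it can be run anywhere; the real file is /home/hadoop/hadoop/etc/hadoop/slaves:

```shell
# Write one DataNode host name per line (host names from this tutorial).
SLAVES=/tmp/slaves   # real path: /home/hadoop/hadoop/etc/hadoop/slaves
printf '%s\n' linux-node2 linux-node3 linux-node4 > "$SLAVES"
cat "$SLAVES"
```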
The three Hadoop operating modes
Local (standalone) mode: all Hadoop components, such as the NameNode, DataNode, JobTracker, and TaskTracker, run in a single Java process.
Pseudo-distributed mode: each Hadoop component runs in its own Java virtual machine, and the components communicate with each other over network sockets.
Fully distributed mode: Hadoop is spread across multiple hosts, and different components are installed on different hosts according to the nature of their work.
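As a rough rule of thumb, the mode can often be read off the fs.default.name value in core-site.xml. The helper below is only a sketch encoding that heuristic; it is not an official Hadoop tool:

```shell
# Heuristic only: file:/// (or unset) suggests local mode, hdfs://localhost
# suggests pseudo-distributed, any other hdfs:// URI suggests fully distributed.
classify_mode() {
  case "$1" in
    ''|file:*)         echo "local" ;;
    hdfs://localhost*) echo "pseudo-distributed" ;;
    hdfs://*)          echo "fully-distributed" ;;
    *)                 echo "unknown" ;;
  esac
}
classify_mode "hdfs://linux-node1:9000" > /tmp/mode.txt
cat /tmp/mode.txt
```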
# Configure the fully distributed mode
(4) Modify the core-site.xml file and add the properties below; pay particular attention to the NameNode host and port in the fs.default.name value.
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://linux-node1:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
</configuration>
(5) Modify the hdfs-site.xml file
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>linux-node1:9001</value>
        <description># view HDFS status on the web interface</description>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description># each Block has 2 backups</description>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>
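The directories referenced by hadoop.tmp.dir, dfs.namenode.name.dir, and dfs.datanode.data.dir must exist and be writable by the hadoop user before the NameNode is formatted. The sketch below uses a /tmp prefix so it is safe to run as-is; on the cluster the real paths are /home/hadoop/tmp, /home/hadoop/dfs/name, and /home/hadoop/dfs/data:

```shell
# BASE is a stand-in prefix for testing; use /home/hadoop on a real node.
BASE=${BASE:-/tmp/hadoop-demo}
mkdir -p "$BASE/tmp" "$BASE/dfs/name" "$BASE/dfs/data"
ls -ld "$BASE/tmp" "$BASE/dfs/name" "$BASE/dfs/data"
```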
(6) Modify mapred-site.xml
This file configures MapReduce tasks. Because hadoop 2.x uses the yarn framework, for a distributed deployment the mapreduce.framework.name property must be set to yarn. The mapred.map.tasks and mapred.reduce.tasks properties set the number of map and reduce tasks, respectively.
[hadoop@linux-node1 hadoop]$ cp mapred-site.xml.template mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>linux-node1:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>linux-node1:19888</value>
    </property>
</configuration>
(7) Configure the yarn-site.xml file
# This file is related to the yarn architecture configuration
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx400m</value>
        <!-- Not marked as final so jobs can include JVM debugging options -->
    </property>
</configuration>

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>linux-node1:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>linux-node1:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>linux-node1:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>linux-node1:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>linux-node1:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
</configuration>
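Hand-edited XML is easy to break. A cheap pure-shell sanity check is to confirm that every <property> has a matching </property>. Demonstrated on a throwaway sample file; on the cluster, loop over /home/hadoop/hadoop/etc/hadoop/*-site.xml instead:

```shell
# Create a small sample file so the check can be run anywhere.
f=/tmp/sample-site.xml
printf '<configuration>\n<property><name>x</name><value>1</value></property>\n</configuration>\n' > "$f"

# Count opening and closing property tags; they should match.
open=$(grep -o '<property>' "$f" | wc -l)
close=$(grep -o '</property>' "$f" | wc -l)
[ "$open" -eq "$close" ] && echo "$f: tags balanced ($open properties)"
```

This catches only missing tags, not misspelled property names; a full well-formedness check needs an XML parser such as xmllint.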
7. Copy hadoop to the other nodes
scp -r /home/hadoop/hadoop/ 192.168.0.90:/home/hadoop/
scp -r /home/hadoop/hadoop/ 192.168.0.91:/home/hadoop/
scp -r /home/hadoop/hadoop/ 192.168.0.92:/home/hadoop/
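The three scp commands can be folded into a loop. This sketch prints the commands as a dry run (remove the echo to actually execute; the IPs are the ones used in this tutorial):

```shell
# Dry run: print one scp command per DataNode host.
for node in 192.168.0.90 192.168.0.91 192.168.0.92; do
  echo scp -r /home/hadoop/hadoop/ "$node":/home/hadoop/
done > /tmp/scp-plan.txt
cat /tmp/scp-plan.txt
```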
8. Initialize the NameNode as the hadoop user on linux-node1
/home/hadoop/hadoop/bin/hdfs namenode -format
# echo $?
# sudo yum -y install tree
# tree /home/hadoop/dfs
9. Start hadoop
/home/hadoop/hadoop/sbin/start-dfs.sh
/home/hadoop/hadoop/sbin/stop-dfs.sh
# View the process on the namenode Node
ps aux | grep --color namenode
# View processes on DataNode
ps aux | grep --color datanode
10. Start the yarn distributed computing framework
[hadoop@linux-node1 .ssh]$ /home/hadoop/hadoop/sbin/start-yarn.sh
starting yarn daemons
# View processes on the NameNode Node
ps aux | grep --color resourcemanager
# View processes on DataNode nodes
ps aux | grep --color nodemanager
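With the configuration in this tutorial, each role should show a predictable set of Java daemons (jps on each node is the quickest way to list them). The helper below is only a mnemonic for that expectation, not a Hadoop command; note that JobHistoryServer joins the master's list after step 11:

```shell
# Hypothetical helper: expected daemons per node role in this deployment.
expected_daemons() {
  case "$1" in
    master) echo "NameNode SecondaryNameNode ResourceManager" ;;
    slave)  echo "DataNode NodeManager" ;;
  esac
}
expected_daemons master > /tmp/daemons-master.txt
expected_daemons slave  > /tmp/daemons-slave.txt
cat /tmp/daemons-master.txt /tmp/daemons-slave.txt
```

Compare this list against the jps output on each node; a missing daemon usually means a misconfigured *-site.xml on that host.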
Note: start-dfs.sh and start-yarn.sh can be replaced by start-all.sh
/home/hadoop/hadoop/sbin/stop-all.sh
/home/hadoop/hadoop/sbin/start-all.sh
11. Start the jobhistory service and check the mapreduce status.
# On the NameNode node
[hadoop@linux-node1 ~]$ /home/hadoop/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/hadoop/hadoop/logs/mapred-hadoop-historyserver-linux-node1.out
12. View the HDFS distributed file system status
/home/hadoop/hadoop/bin/hdfs dfsadmin -report
# View File blocks. A file consists of these blocks.
/home/hadoop/hadoop/bin/hdfs fsck / -files -blocks
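A simple end-to-end smoke test is to put a small file into HDFS and run fsck on it. Shown as a dry run that only prints the commands (drop the echo to run them on the cluster; /user/hadoop as the user's HDFS home directory is an assumption):

```shell
HDFS=/home/hadoop/hadoop/bin/hdfs
{
  echo "$HDFS dfs -mkdir -p /user/hadoop"
  echo "$HDFS dfs -put /etc/hosts /user/hadoop/"
  echo "$HDFS fsck /user/hadoop/hosts -files -blocks"
} > /tmp/hdfs-smoke.txt
cat /tmp/hdfs-smoke.txt
```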
13. View the hadoop cluster status on the web page
View HDFS status: http://192.168.0.89:50070/
View Hadoop cluster status: http://192.168.0.89:8088/
Original address: http://www.linuxprobe.com/centos-deploy-hadoop-cluster.html