Deploy Hadoop cluster service in CentOS

Guide: Hadoop is a distributed system infrastructure developed by the Apache Foundation. Hadoop implements a distributed file system, HDFS (Hadoop Distributed File System). HDFS features high fault tolerance and is designed to be deployed on low-cost hardware. It also provides high-throughput access to application data, making it suitable for applications with large datasets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream.


HDFS Architecture

I. Introduction to the Hadoop framework

The core design of the Hadoop framework is HDFS and MapReduce. HDFS provides storage for massive data, while MapReduce provides computing for massive data.

HDFS (Hadoop Distributed File System) has the following features:

  • HDFS stores data in blocks of at least 64 MB, much larger than the 4 KB to 32 KB blocks used by ordinary file systems (see the quick check after this list).
  • HDFS is optimized for throughput at the expense of latency. It handles streaming reads of large files efficiently, but is not good at seek-heavy access to many small files.
  • HDFS is optimized for the common "write once, read many" workload.
  • Each storage node runs a process called the DataNode, which manages all the data blocks on that host. The storage nodes are coordinated by a master process called the NameNode, which runs on a separate node.
  • Instead of handling disk failures with physical redundancy such as a disk array, HDFS uses replicas: each block of a file is stored on several nodes in the cluster, and the NameNode constantly monitors the block reports sent by the DataNodes.
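
A quick way to confirm the block size and replication factor actually in effect is the hdfs getconf command; this is only a sketch and can be run once the cluster built later in this guide is up:

hdfs getconf -confKey dfs.blocksize      # block size in bytes (134217728 = 128 MB is the Hadoop 2.x default)
hdfs getconf -confKey dfs.replication    # number of replicas kept for each block (set to 2 later in this guide)
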
1. MapReduce Working Principle

The client submits a MapReduce job, and the JobTracker coordinates its execution, assigning tasks to TaskTrackers. The JobTracker is a Java application whose main class is JobTracker; each TaskTracker is a Java application whose main class is TaskTracker.
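
In the Hadoop 2.x cluster built below, YARN takes over this coordination role, but job submission looks the same from the client's point of view. As a hedged sketch (the /input and /output paths are only placeholders, and the cluster from the later sections must already be running), a job can be submitted with the example jar that ships with Hadoop:

hdfs dfs -mkdir -p /input
hdfs dfs -put /etc/hosts /input/
hadoop jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output
hdfs dfs -cat /output/part-r-00000    # word counts produced by the reduce tasks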

2. Hadoop advantages

Hadoop is a distributed computing platform that is easy for users to set up and use. You can easily develop and run applications that process massive amounts of data on Hadoop. It has the following advantages:

High reliability: Hadoop's bit-by-bit approach to storing and processing data has proven trustworthy.

High scalability: Hadoop distributes data and computing tasks across the available computer clusters, and these clusters can easily be expanded to thousands of nodes.

Efficiency: Hadoop can dynamically move data between nodes and keeps each node dynamically balanced, so processing is very fast.

High fault tolerance: Hadoop automatically keeps multiple copies of the data and automatically re-assigns failed tasks.

Low cost: compared with all-in-one machines, commercial data warehouses, and data mart products such as QlikView and Yonghong Z-Suite, Hadoop is open source, so the software cost of a project is greatly reduced.

Hadoop's framework is written in Java, so it is ideal to run on Linux production platforms. Applications on Hadoop can also be written in other languages, such as C++.

Hadoop Website: http://hadoop.apache.org/

II. Prerequisites

Ensure that the environment of each node in the Hadoop cluster is configured consistently: install Java and set up SSH.

Lab environment:

Platform: xen vm

OS: CentOS 6.8

Software: hadoop-2.7.3.tar.gz, jdk-8u101-linux-x64.rpm

Hostname      IP Address      OS version    Hadoop role   Node role
linux-node1   192.168.0.89    CentOS 6.8    Master        namenode
linux-node2   192.168.0.90    CentOS 6.8    Slave         datanode
linux-node3   192.168.0.91    CentOS 6.8    Slave         datanode
linux-node4   192.168.0.92    CentOS 6.8    Slave         datanode

# Download the required software package and upload it to each node of the cluster.

III. Cluster architecture and installation

1. Hosts file settings

# Modify the hosts file of each node in the Hadoop Cluster

[root@linux-node1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain linux-node1
192.168.0.89 linux-node1
192.168.0.90 linux-node2
192.168.0.91 linux-node3
192.168.0.92 linux-node4
2. Install Java

# Upload the downloaded JDK (rpm package) to each server in advance, and then install it

rpm -ivh jdk-8u101-linux-x64.rpm
export JAVA_HOME=/usr/java/jdk1.8.0_101/
export PATH=$JAVA_HOME/bin:$PATH
# java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
3. Install hadoop

# Create a hadoop user and set sudo

[root@linux-node1 ~]# useradd hadoop && echo hadoop | passwd --stdin hadoop
[root@linux-node1 ~]# echo "hadoop ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
[root@linux-node1 ~]# su - hadoop
[hadoop@linux-node1 ~]$ cd /usr/local/src/
[hadoop@linux-node1 src]$ wget http://apache.fayea.com/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
[hadoop@linux-node1 src]$ sudo tar zxvf hadoop-2.7.3.tar.gz -C /home/hadoop/ && cd /home/hadoop
[hadoop@linux-node1 home/hadoop]$ sudo mv hadoop-2.7.3/ hadoop
[hadoop@linux-node1 home/hadoop]$ sudo chown -R hadoop:hadoop hadoop/

# Add the binary directory of hadoop to the path variable and set the HADOOP_HOME environment variable

[hadoop@linux-node1 home/hadoop]$ export HADOOP_HOME=/home/hadoop/hadoop/
[hadoop@linux-node1 home/hadoop]$ export PATH=$HADOOP_HOME/bin:$PATH
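
The exports above only apply to the current shell. One common option (not part of the original steps, just a sketch) is to append them to the hadoop user's ~/.bashrc so they persist across logins on every node:

cat >> ~/.bashrc <<'EOF'
export JAVA_HOME=/usr/java/jdk1.8.0_101/
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
EOF
source ~/.bashrc
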
4. Create a hadoop directory
[hadoop@linux-node1 ~]$ mkdir -p /home/hadoop/dfs/{name,data}
[hadoop@linux-node1 ~]$ mkdir -p /home/hadoop/tmp

# Data storage and backup directory on each node

sudo mkdir -p /data/hdfs/{name,data}
sudo chown -R hadoop:hadoop /data/

# The preceding operations must be performed on each node of the hadoop cluster.

5. SSH Configuration

# Set the cluster master node to log on to other nodes without a password

[hadoop@linux-node1 ~]$ ssh-keygen -t rsa
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.90
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.91
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.92

# Test ssh Logon
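
A minimal check (a sketch that assumes the hosts entries and keys configured above) is to run a command on every slave and make sure no password prompt appears:

for node in linux-node2 linux-node3 linux-node4; do
    ssh hadoop@$node hostname    # should print each slave's hostname without asking for a password
done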

6. Modify the hadoop configuration file

File location: /home/hadoop/hadoop/etc/hadoop; file names: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml

(1) configure the hadoop-env.sh File

# In the Hadoop installation path, go to the hadoop/etc/hadoop/ directory, edit hadoop-env.sh, and set JAVA_HOME to the Java installation path

[hadoop@linux-node1 home/hadoop]$ cd hadoop/etc/hadoop/
[hadoop@linux-node1 hadoop]$ egrep JAVA_HOME hadoop-env.sh
# The only required environment variable is JAVA_HOME. All others are
# set JAVA_HOME in this file, so that it is correctly defined on
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/java/jdk1.8.0_101/

(2) Configure the yarn-env.sh file

This file configures the Java runtime environment for the YARN framework. Modify JAVA_HOME to point to the JDK installation.

[hadoop@linux-node1 hadoop]$ grep JAVA_HOME yarn-env.sh
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
export JAVA_HOME=/usr/java/jdk1.8.0_101/

(3) configure the slaves File

This file specifies the DataNode storage servers: write the host names of all DataNode machines into it, one per line, as shown below:

[hadoop@linux-node1 hadoop]$ cat slaves
linux-node2
linux-node3
linux-node4

Hadoop's three operating modes

Local (standalone) mode: all Hadoop components, such as the NameNode, DataNode, JobTracker, and TaskTracker, run in a single Java process.

Pseudo-distributed mode: each Hadoop component runs in its own Java virtual machine on a single host, and the components communicate through network sockets.

Fully distributed mode: Hadoop is spread across multiple hosts, and the different components are installed on different hosts according to their roles.

# Configure the fully distributed mode

(4) Modify the core-site.xml file and add the following configuration:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://linux-node1:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
</configuration>

(5) Modify the hdfs-site.xml file

<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>linux-node1:9001</value>
        <description># View HDFS status on the web interface</description>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description># Each block has 2 replicas</description>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

(6) Modify mapred-site.xml

This file configures MapReduce jobs. Because Hadoop 2.x uses the YARN framework, the mapreduce.framework.name property must be set to yarn for a distributed deployment. mapred.map.tasks and mapred.reduce.tasks set the number of map and reduce tasks respectively (a per-job override is shown after the configuration below).

[hadoop@linux-node1 hadoop]$ cp mapred-site.xml.template mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>linux-node1:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>linux-node1:19888</value>
    </property>
</configuration>
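
The number of map tasks normally follows from the input splits, but the number of reduce tasks can be overridden per job. A hedged example using the generic -D option (mapreduce.job.reduces is the Hadoop 2.x name for mapred.reduce.tasks; the /input and /output paths are only placeholders):

hadoop jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount \
    -D mapreduce.job.reduces=4 /input /output    # run the job with 4 reduce tasks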

(7) Configure yarn-site.xml

# This file is related to the yarn architecture configuration

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx400m</value>
        <!-- Not marked as final so jobs can include JVM debugging options -->
    </property>
</configuration>

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>linux-node1:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>linux-node1:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>linux-node1:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>linux-node1:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>linux-node1:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
</configuration>
7. Copy hadoop to the other nodes
scp -r /home/hadoop/hadoop/ 192.168.0.90:/home/hadoop/
scp -r /home/hadoop/hadoop/ 192.168.0.91:/home/hadoop/
scp -r /home/hadoop/hadoop/ 192.168.0.92:/home/hadoop/
8. Initialize the NameNode as the hadoop user on linux-node1
/home/hadoop/hadoop/bin/hdfs namenode -format

# echo $?
# sudo yum -y install tree
# tree /home/hadoop/dfs

9. Start hadoop
/home/hadoop/hadoop/sbin/start-dfs.sh
/home/hadoop/hadoop/sbin/stop-dfs.sh

# View the process on the namenode Node

ps aux | grep --color namenode

# View processes on DataNode

ps aux | grep --color datanode
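
Another quick check, assuming the JDK's jps tool is on the PATH (it ships with the JDK installed above), is to list the running Java daemons directly:

# On the master (linux-node1) the NameNode and SecondaryNameNode processes should be listed
jps
# Log in to each slave and run jps there as well; the DataNode process should be listed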


10. Start the yarn distributed computing framework
[hadoop@linux-node1 .ssh]$ /home/hadoop/hadoop/sbin/start-yarn.sh
starting yarn daemons


# View processes on the NameNode Node

ps aux | grep --color resourcemanager

# View processes on DataNode nodes

ps aux | grep --color nodemanager
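
The registered NodeManagers can also be listed from the ResourceManager with the yarn command (a quick sketch; run as the hadoop user on linux-node1):

/home/hadoop/hadoop/bin/yarn node -list
# Each slave node should be reported with state RUNNING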

Note: start-dfs.sh and start-yarn.sh can be replaced by start-all.sh

/home/hadoop/hadoop/sbin/stop-all.sh
/home/hadoop/hadoop/sbin/start-all.sh

11. Start the jobhistory service and check the mapreduce status.

# On the NameNode Node

[hadoop@linux-node1 ~]$ /home/hadoop/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/hadoop/hadoop/logs/mapred-hadoop-historyserver-linux-node1.out
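
To confirm that the history server is listening on the web port set in mapred-site.xml above (19888), one simple check from the master is shown below (a sketch, not from the original steps):

curl -s -o /dev/null -w "%{http_code}\n" http://linux-node1:19888/
# A 200 response means the JobHistory web UI is reachable; finished MapReduce jobs are listed there
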
12. View the HDFS distributed file system status
/home/hadoop/hadoop/bin/hdfs dfsadmin -report

# View file blocks; every file in HDFS is made up of such blocks (see the example below)

/home/hadoop/hadoop/bin/hdfs fsck / -files -blocks
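
For a concrete example (the file and directory names below are only illustrative), upload a small file and then inspect the blocks and replica locations behind it:

/home/hadoop/hadoop/bin/hdfs dfs -mkdir -p /demo
/home/hadoop/hadoop/bin/hdfs dfs -put /etc/hosts /demo/
/home/hadoop/hadoop/bin/hdfs fsck /demo/hosts -files -blocks -locations
# With dfs.replication set to 2 above, each block of /demo/hosts should be reported on two DataNodes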

13. View hadoop cluster status on the web page

View HDFS status: http://192.168.0.89:50070/


View Hadoop cluster status: http://192.168.0.89:8088/

Original address: http://www.linuxprobe.com/centos-deploy-hadoop-cluster.html

