Building and Developing a Hadoop Distributed Environment on CentOS Linux


First of all, I should say that I do not want to reinvent the wheel. If you want to build a Hadoop environment, there are already plenty of detailed step-by-step guides and command listings on the web, and I don't want to repeat them here.

Secondly, I am also a novice and not very familiar with Hadoop. I just wanted to actually build a working environment and see it with my own eyes. When I ran the WordCount word-frequency example, I was struck by how well Hadoop handles distribution: even someone with no distributed-systems experience only needs to do some configuration to get a distributed cluster running.

All right, enough preamble.

Some of the things you need to know to build a Hadoop environment:

1. Hadoop runs on Linux, so you need a Linux operating system installed.

2. You need a cluster to run Hadoop on, for example several Linux machines that can reach each other on a LAN.

3. For the cluster nodes to access each other, you need to set up passwordless SSH login.

4. Hadoop runs on the JVM, which means you need to install a Java JDK and configure JAVA_HOME.

5. Hadoop components are configured through XML files. After downloading Hadoop from the official website, unzip it and modify the corresponding configuration files in its etc/hadoop directory.

As the saying goes, to do a good job one must first sharpen one's tools. Here are the software and tools used to build the Hadoop environment:

1. VirtualBox -- since we need to simulate several Linux machines and resources are limited, we create several virtual machines in VirtualBox.

2. CentOS -- download the CentOS 7 ISO image, load it into VirtualBox, and install it.

3. SecureCRT -- software for SSH remote access to Linux.

4. WinSCP -- for transferring files between Windows and Linux.

5. JDK for Linux -- download from the official Oracle website; just extract it and configure it.

6. hadoop-2.7.1 -- can be downloaded from the official Apache website.

Okay, the rest is explained in three parts: Linux environment preparation, Hadoop cluster installation, and Hadoop environment testing.

Linux Environment Preparation

Configure IP

To allow communication between the host and the virtual machines, and among the virtual machines themselves, the CentOS network adapters in VirtualBox use Host-only mode and the IP addresses are set manually. Note that the virtual machines' gateway is the IP address of the host-only adapter on the host machine. After configuring the IP, restart the network service for the configuration to take effect. Three Linux machines are set up here.
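As a rough sketch (not taken from the original article), the static IP configuration for the first node might look like the following; the interface name enp0s3 and the exact file path are assumptions that depend on your VirtualBox and CentOS setup:

# write a static IP config for the host-only adapter (interface name is an assumption)
cat > /etc/sysconfig/network-scripts/ifcfg-enp0s3 <<'EOF'
TYPE=Ethernet
BOOTPROTO=static
ONBOOT=yes
NAME=enp0s3
DEVICE=enp0s3
IPADDR=192.168.56.101
NETMASK=255.255.255.0
GATEWAY=192.168.56.1
EOF

# restart the network service so the new address takes effect
systemctl restart network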

Configuring Host Names

Set the hostname of 192.168.56.101 to hadoop01, and record the cluster's IP addresses and hostnames in the hosts file. Do the same on the remaining two hosts.

[root@hadoop01 ~]# cat /etc/sysconfig/network
# Created by anaconda
NETWORKING=yes
HOSTNAME=hadoop01
[root@hadoop01 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.56.101 hadoop01
192.168.56.102 hadoop02
192.168.56.103 hadoop03
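On CentOS 7 the hostname can also be set with hostnamectl instead of editing /etc/sysconfig/network directly; a minimal sketch, run on each node with its own name:

hostnamectl set-hostname hadoop01   # run this on 192.168.56.101
hostnamectl set-hostname hadoop02   # run this on 192.168.56.102
hostnamectl set-hostname hadoop03   # run this on 192.168.56.103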

Permanently shut down the firewall

service iptables stop only stops the firewall until the next reboot, so you need commands that disable the firewall permanently. Also, on CentOS 7 the firewall is firewalld, so the commands to turn it off are as follows:

systemctl stop firewalld.service    # stop firewalld now
systemctl disable firewalld.service # prevent firewalld from starting at boot

Turn off the SELinux protection system

Change SELINUX to disabled in /etc/sysconfig/selinux, then reboot the machine for the configuration to take effect.

[root@hadoop02 ~]# cat /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#   enforcing - SELinux security policy is enforced.
#   permissive - SELinux prints warnings instead of enforcing.
#   disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of three two values:
#   targeted - Targeted processes are protected,
#   minimum - Modification of targeted policy. Only selected processes are protected.
#   mls - Multi Level Security protection.
SELINUXTYPE=targeted

Cluster SSH password-free login

First generate an SSH key pair:

ssh-keygen -t rsa

Then copy the public key to the three machines:

ssh-copy-id 192.168.56.101
ssh-copy-id 192.168.56.102
ssh-copy-id 192.168.56.103

Now, if the hadoop01 machine wants to log in to hadoop02, it only needs to run ssh hadoop02.

 
 

Configuring JDK

Here, create three folders under /home:

tools -- stores installation packages

softwares -- stores installed software

data -- stores data

Upload the downloaded Linux JDK to /home/tools on hadoop01 via WinSCP.

Extract the JDK into /home/softwares:
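A minimal sketch of the extraction step; the tarball name is a placeholder for whatever file you actually downloaded:

cd /home/tools
# -C extracts into the target directory; replace the file name with your actual JDK archive
tar -zxvf jdk-x.x.x-linux-x64.tar.gz -C /home/softwares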

 
 

The JDK home directory is now visible under /home/softwares/jdk.x.x.x. Copy that directory path into /etc/profile and set JAVA_HOME there:

export JAVA_HOME=/home/softwares/jdk0_111
export PATH=$PATH:$JAVA_HOME/bin

Save the changes and run source /etc/profile to make the configuration take effect.

To see if the Java JDK was successfully installed:
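A minimal check, assuming /etc/profile has already been sourced; the exact version string in the output will differ:

java -version
# expected output looks roughly like:
# java version "1.8.0_xxx"
# Java(TM) SE Runtime Environment (build ...)
# Java HotSpot(TM) ... VM (build ..., mixed mode)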

 
 

You can then copy the configured files from the current node to the other nodes.
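A minimal sketch of the copy, assuming the hostnames from the hosts file above and that passwordless SSH is already working:

# copy the JDK directory and the profile settings to the other two nodes
scp -r /home/softwares/jdk.x.x.x root@hadoop02:/home/softwares/
scp -r /home/softwares/jdk.x.x.x root@hadoop03:/home/softwares/
scp /etc/profile root@hadoop02:/etc/profile
scp /etc/profile root@hadoop03:/etc/profile
# then run "source /etc/profile" on each of the other nodes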

 
 

Hadoop cluster installation

Cluster planning is as follows:

Node 101 serves as the HDFS NameNode and the rest as DataNodes; node 102 serves as the YARN ResourceManager and the rest as NodeManagers; node 103 serves as the SecondaryNameNode. The JobHistoryServer and the WebAppProxyServer are started on nodes 101 and 102, respectively.

Download hadoop-2.7.3, place it in the /home/softwares folder, and extract it there. Because Hadoop needs the JDK as its runtime environment, first configure JAVA_HOME in etc/hadoop/hadoop-env.sh.
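A minimal sketch of the change in etc/hadoop/hadoop-env.sh, reusing the JDK path configured earlier:

# etc/hadoop/hadoop-env.sh
export JAVA_HOME=/home/softwares/jdk0_111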

(PS: I feel the JDK version is too high)

Next, modify the XML configuration files of the corresponding Hadoop components one by one.

Modify core-site.xml:

Specify the NameNode address

Set Hadoop's temporary (cache) directory

Enable Hadoop's trash (garbage collection) mechanism

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.56.101:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/softwares/hadoop-2.7.3/data/tmp</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>10080</value>
  </property>
</configuration>

Modify hdfs-site.xml:

Set the number of replicas

Disable permission checking

Set the NameNode HTTP access address

Set the address of the SecondaryNameNode

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>192.168.56.101:50070</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>192.168.56.103:50090</value>
  </property>
</configuration>

Rename mapred-site.xml.template to mapred-site.xml and then modify it.
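One way to do the rename, assuming you are in the Hadoop home directory:

cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml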

Specify YARN as the MapReduce framework, so jobs are scheduled by YARN

Specify the JobHistory server address

Specify the JobHistory web UI address

Enable uber mode -- a MapReduce optimization for small jobs

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>192.168.56.101:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>192.168.56.101:19888</value>
  </property>
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
  </property>
</configuration>

Modify yarn-site.xml:

Specify mapreduce_shuffle as the NodeManager auxiliary service

Specify node 102 as the ResourceManager

Specify node 102 as the web application proxy

Enable YARN log aggregation

Specify how long YARN logs are retained before deletion

Specify the NodeManager memory: 8 GB

Specify the NodeManager CPU: 8 cores

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.56.102</value>
  </property>
  <property>
    <name>yarn.web-proxy.address</name>
    <value>192.168.56.102:8888</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
</configuration>

Configure the slaves file

Specify the compute nodes, that is, the nodes that run DataNode and NodeManager:

192.168.56.101
192.168.56.102
192.168.56.103

First format the NameNode, that is, execute on node 101:

Go to the Hadoop home directory: cd /home/softwares/hadoop-2.7.3

Execute the Hadoop script in the bin directory: bin/hadoop namenode -format

If the output contains "successfully formatted", the format succeeded (PS: the screenshot here was borrowed from someone else, don't mind).

Once the above configuration is complete, copy it to the other machines.
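A minimal sketch of the copy, assuming the directory layout and hostnames used above:

scp -r /home/softwares/hadoop-2.7.3 root@hadoop02:/home/softwares/
scp -r /home/softwares/hadoop-2.7.3 root@hadoop03:/home/softwares/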

Hadoop Environment Test

Enter the Hadoop home directory and execute the corresponding script files.

The jps command (Java Virtual Machine Process Status) shows the running Java processes.

Start HDFS on the NameNode machine, node 101:

[root@hadoop01 hadoop-2.7.3]# sbin/start-dfs.sh
Java HotSpot(TM) Client VM warning: You have loaded library /home/softwares/hadoop-2.7.3/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
16/11/07 16:49:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop01]
hadoop01: starting namenode, logging to /home/softwares/hadoop-2.7.3/logs/hadoop-root-namenode-hadoop01.out
102: starting datanode, logging to /home/softwares/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop02.out
103: starting datanode, logging to /home/softwares/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop03.out
101: starting datanode, logging to /home/softwares/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop01.out
Starting secondary namenodes [hadoop03]
hadoop03: starting secondarynamenode, logging to /home/softwares/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-hadoop03.out

At this point, running jps on node 101 shows that the NameNode and DataNode have started:

[root@hadoop01 hadoop-2.7.3]# jps
7826 Jps
7270 DataNode
7052 NameNode

Run jps on nodes 102 and 103 as well to see the processes that have started there:

[root@hadoop02 bin]# jps
4260 DataNode
4488 Jps

[root@hadoop03 ~]# jps
6436 SecondaryNameNode
6750 Jps

Start YARN

Execute on node 102:

[root@hadoop02 hadoop-2.7.3]# sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/softwares/hadoop-2.7.3/logs/yarn-root-resourcemanager-hadoop02.out
101: starting nodemanager, logging to /home/softwares/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop01.out
103: starting nodemanager, logging to /home/softwares/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop03.out

Check each node with jps:

[root@hadoop02 hadoop-2.7.3]# jps
4641 ResourceManager
4260 DataNode
4765 NodeManager
5165 Jps

[root@hadoop01 hadoop-2.7.3]# jps
7270 DataNode
8375 Jps
7976 NodeManager
7052 NameNode

[root@hadoop03 ~]# jps
6915 NodeManager
6436 SecondaryNameNode
7287 Jps
 

Separately start the JobHistoryServer and the proxy server on their corresponding nodes:

[root@hadoop01 hadoop-2.7.3]# sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/softwares/hadoop-2.7.3/logs/mapred-root-historyserver-hadoop01.out
[root@hadoop01 hadoop-2.7.3]# jps
8624 Jps
7270 DataNode
7976 NodeManager
8553 JobHistoryServer
7052 NameNode

[root@hadoop02 hadoop-2.7.3]# sbin/yarn-daemon.sh start proxyserver
starting proxyserver, logging to /home/softwares/hadoop-2.7.3/logs/yarn-root-proxyserver-hadoop02.out
[root@hadoop02 hadoop-2.7.3]# jps
4641 ResourceManager
4260 DataNode
5367 WebAppProxyServer
5402 Jps

View the node status through a browser on the hadoop01 node, that is, node 101.
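Assuming the dfs.namenode.http-address configured above, the NameNode web UI should be reachable at an address like:

http://192.168.56.101:50070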

Upload a file to HDFS
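The WordCount run below reads /profile from HDFS, so the file uploaded here was presumably /etc/profile; a minimal sketch of the upload:

cd /home/softwares/hadoop-2.7.3
bin/hdfs dfs -put /etc/profile /profile   # upload /etc/profile to HDFS as /profile
bin/hdfs dfs -ls /                        # verify the upload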

 
 

Running the WordCount program

[root@hadoop01 hadoop-2.7.3]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /profile /fll_out
Java HotSpot(TM) Client VM warning: You have loaded library /home/softwares/hadoop-2.7.3/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
16/11/07 17:17:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/07 17:17:12 INFO client.RMProxy: Connecting to ResourceManager at /192.168.56.102:8032
16/11/07 17:17:18 INFO input.FileInputFormat: Total input paths to process : 1
16/11/07 17:17:19 INFO mapreduce.JobSubmitter: number of splits:1
16/11/07 17:17:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1478509135878_0001
16/11/07 17:17:20 INFO impl.YarnClientImpl: Submitted application application_1478509135878_0001
16/11/07 17:17:20 INFO mapreduce.Job: The url to track the job: http://192.168.56.102:8888/proxy/application_1478509135878_0001/
16/11/07 17:17:20 INFO mapreduce.Job: Running job: job_1478509135878_0001
16/11/07 17:18:34 INFO mapreduce.Job: Job job_1478509135878_0001 running in uber mode : true
16/11/07 17:18:35 INFO mapreduce.Job:  map 0% reduce 0%
16/11/07 17:18:43 INFO mapreduce.Job:  map 100% reduce 0%
16/11/07 17:18:50 INFO mapreduce.Job:  map 100% reduce 100%
16/11/07 17:18:55 INFO mapreduce.Job: Job job_1478509135878_0001 completed successfully
16/11/07 17:18:59 INFO mapreduce.Job: Counters: 52
    File System Counters
        FILE: Number of bytes read=4264
        FILE: Number of bytes written=6412
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=3940
        HDFS: Number of bytes written=261673
        HDFS: Number of read operations=35
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=8246
        Total time spent by all reduces in occupied slots (ms)=7538
        TOTAL_LAUNCHED_UBERTASKS=2
        NUM_UBER_SUBMAPS=1
        NUM_UBER_SUBREDUCES=1
        Total time spent by all map tasks (ms)=8246
        Total time spent by all reduce tasks (ms)=7538
        Total vcore-milliseconds taken by all map tasks=8246
        Total vcore-milliseconds taken by all reduce tasks=7538
        Total megabyte-milliseconds taken by all map tasks=8443904
        Total megabyte-milliseconds taken by all reduce tasks=7718912
    Map-Reduce Framework
        Map input records=78
        Map output records=256
        Map output bytes=2605
        Map output materialized bytes=2116
        Input split bytes=99
        Combine input records=256
        Combine output records=156
        Reduce input groups=156
        Reduce shuffle bytes=2116
        Reduce input records=156
        Reduce output records=156
        Spilled Records=312
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=870
        CPU time spent (ms)=1970
        Physical memory (bytes) snapshot=243326976
        Virtual memory (bytes) snapshot=2666557440
        Total committed heap usage (bytes)=256876544
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1829
    File Output Format Counters
        Bytes Written=1487

View the job's running status through the YARN web UI in the browser.

Finally, look at the word-frequency results.

View the HDFs file system in the browser

[root@hadoop01 hadoop-2.7.3]# bin/hdfs dfs -cat /fll_out/part-r-00000
Java HotSpot(TM) Client VM warning: You have loaded library /home/softwares/hadoop-2.7.3/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
16/11/07 17:29:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
!=              1
"$-"            1
"$EUID"         2
"$HISTCONTROL"  1
"$i"            3
export          5
fi              8
for             5
if              8
then            8
(the output continues with one line for every remaining word in /etc/profile and its count)

This confirms that the Hadoop cluster is working correctly.

That is the entire content of this article. I hope it helps you in your learning, and I hope you will support the community.
