First of all, I should explain that I do not want to reinvent the wheel. If you want to build a Hadoop environment, there are plenty of detailed steps and commands on the web, and I don't want to repeat them.
Secondly, I am also a novice and not very familiar with Hadoop; I just wanted to actually build a working environment and see it with my own eyes. When I ran the WordCount word-frequency example, I was impressed by how well Hadoop handles distribution: even someone with no distributed-systems experience only needs to do some configuration to get a distributed cluster running.
All right, enough talk.
Some things you need to know before building a Hadoop environment:
1. Hadoop runs on Linux, so you have to install a Linux operating system.
2. You need a cluster to run Hadoop on, for example several Linux machines on the same LAN that can reach each other.
3. To let the cluster nodes reach each other without password prompts, you need to set up passwordless SSH login.
4. Hadoop runs on the JVM, which means you need to install the Java JDK and configure JAVA_HOME.
5. Hadoop's components are configured through XML. After downloading Hadoop from the official website, unzip it and modify the corresponding configuration files in its etc/hadoop directory.
As the saying goes, to do a good job you must first sharpen your tools. Here are the software and tools used to build this Hadoop environment:
1. VirtualBox: since several Linux machines have to be simulated and resources are limited, several virtual machines are created in VirtualBox.
2. CentOS: download the CentOS 7 ISO image, load it into VirtualBox, and install it.
3. SecureCRT: software for SSH remote access to Linux.
4. WinSCP: transfers files between Windows and Linux.
5. JDK for Linux: download it from the Oracle website; after extracting it, only configuration is needed.
6. Hadoop 2.7.3: can be downloaded from the Apache official website.
Okay, here is the explanation in three parts.
Linux Environment Preparation
Configure IP
To allow the host and the virtual machines, as well as the virtual machines themselves, to communicate with each other, the CentOS network adapters in VirtualBox use Host-only mode and the IP addresses are set manually. Note that each virtual machine's gateway must be the same as the IP address of the host-only network adapter on the host machine. After configuring the IP, restart the network service for the configuration to take effect. Three Linux machines are built here, as shown in the following figure.
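For reference, here is a minimal sketch of the static IP configuration on CentOS 7, assuming the host-only interface is named enp0s3 and the gateway is the VirtualBox host-only adapter's default address; adjust the file name and addresses to your own setup:

[root@hadoop01 ~]# cat /etc/sysconfig/network-scripts/ifcfg-enp0s3
TYPE=Ethernet
DEVICE=enp0s3
ONBOOT=yes
# static address instead of DHCP
BOOTPROTO=static
IPADDR=192.168.56.101
NETMASK=255.255.255.0
# gateway = the host-only adapter's IP on the host machine
GATEWAY=192.168.56.1
[root@hadoop01 ~]# systemctl restart network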
Configure host names
Set the host name of 192.168.56.101 to hadoop01, and add the cluster's IPs and host names to the hosts file. The remaining two hosts are configured the same way.
[root@hadoop01 ~]# cat /etc/sysconfig/network
# Created by anaconda
NETWORKING=yes
HOSTNAME=hadoop01
[root@hadoop01 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.56.101 hadoop01
192.168.56.102 hadoop02
192.168.56.103 hadoop03
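On CentOS 7 the host name can also be set with hostnamectl instead of editing the file by hand, for example:

[root@hadoop01 ~]# hostnamectl set-hostname hadoop01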
Permanently shut down the firewall
service iptables stop only stops the firewall until the next reboot, so you need a command that disables the firewall permanently. Also, this is CentOS 7, where the firewall is firewalld, so the commands to turn it off are:
systemctl stop firewalld.service      # stop firewalld now
systemctl disable firewalld.service   # prevent firewalld from starting at boot
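To double-check that the firewall is really off, you can query firewalld's state (firewall-cmd ships with firewalld on CentOS 7):

[root@hadoop01 ~]# firewall-cmd --state
not running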
Turn off the SELinux protection system
Set SELINUX to disabled in /etc/sysconfig/selinux, then reboot the machine for the configuration to take effect:
[root@hadoop02 ~]# cat /etc/sysconfig/selinux

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of three values:
#     targeted - Targeted processes are protected,
#     minimum - Modification of targeted policy. Only selected processes are protected.
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
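If you prefer to make this change from the command line, here is a minimal sketch (it assumes the file still contains the default SELINUX=enforcing line; the article itself simply edits the file and reboots):

sed -i 's/^SELINUX=enforcing$/SELINUX=disabled/' /etc/sysconfig/selinux
setenforce 0   # also switch SELinux to permissive mode for the current session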
Cluster SSH password-free login
First generate an SSH key pair.
Then copy the SSH public key to all three machines:
ssh-copy-id 192.168.56.101
ssh-copy-id 192.168.56.102
ssh-copy-id 192.168.56.103
After that, if the hadoop01 machine wants to log in to hadoop02, it only needs to run ssh hadoop02.
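Putting the whole step together, a minimal sketch (assuming the root user and the host names from the hosts file above; run it on each node so that every machine can reach every other one):

ssh-keygen -t rsa                # press Enter through the prompts to accept the defaults
ssh-copy-id root@hadoop01        # copy the public key to every node, including this one
ssh-copy-id root@hadoop02
ssh-copy-id root@hadoop03
ssh hadoop02                     # should now log in without asking for a password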
Configuring JDK
Create three folders under /home here:
tools -- stores installation packages
softwares -- stores installed software
data -- stores data
Upload the downloaded Linux JDK to /home/tools on hadoop01 via WinSCP.
Extract the JDK into softwares.
The JDK home directory is then visible as /home/softwares/jdk.x.x.x; copy that directory path into the /etc/profile file and set JAVA_HOME there:
export JAVA_HOME=/home/softwares/jdk1.8.0_111
Save the changes and run source /etc/profile to make the configuration take effect.
Check whether the Java JDK was installed successfully:
You can then copy the files configured on the current node to the other nodes.
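A minimal sketch of this JDK step (the jdk1.8.0_111 directory name and the use of scp are assumptions based on the layout described above):

# appended to /etc/profile on hadoop01
export JAVA_HOME=/home/softwares/jdk1.8.0_111
export PATH=$PATH:$JAVA_HOME/bin

# apply the change and verify the installation
source /etc/profile
java -version

# copy the JDK and the profile to the other nodes
scp -r /home/softwares/jdk1.8.0_111 root@hadoop02:/home/softwares/
scp /etc/profile root@hadoop02:/etc/profile
scp -r /home/softwares/jdk1.8.0_111 root@hadoop03:/home/softwares/
scp /etc/profile root@hadoop03:/etc/profile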
Hadoop cluster installation
Cluster planning is as follows:
Node 101 serves as the HDFS NameNode and the other nodes as DataNodes; node 102 serves as the YARN ResourceManager and the rest as NodeManagers; node 103 serves as the SecondaryNameNode. The JobHistoryServer and the WebAppProxyServer are started on nodes 101 and 102 respectively.
Download hadoop-2.7.3
and place it in the /home/softwares folder. Because Hadoop needs the JDK to run, first configure JAVA_HOME in etc/hadoop/hadoop-env.sh under the Hadoop directory.
(PS: I feel my JDK version is a bit too high.)
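A minimal sketch of that change (the JDK path is the one assumed earlier; use your own):

# /home/softwares/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/home/softwares/jdk1.8.0_111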
Next, modify the XML configuration of each Hadoop component in turn.
Modify core-site.xml:
specify the NameNode address;
set Hadoop's temporary (cache) directory;
enable Hadoop's trash (garbage collection) interval.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.56.101:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/softwares/hadoop-2.7.3/data/tmp</value>
    </property>
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>
</configuration>
Modify hdfs-site.xml:
set the number of replicas (backups);
turn off permission checking;
set the NameNode HTTP access address;
set the address of the SecondaryNameNode.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>192.168.56.101:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>192.168.56.103:50090</value>
    </property>
</configuration>
Rename mapred-site.xml.template to mapred-site.xml and modify it:
specify YARN as the MapReduce framework, so that jobs are scheduled by YARN;
specify the JobHistory server address;
specify the JobHistory web UI port;
enable uber mode, a MapReduce optimization for small jobs.
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>192.168.56.101:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>192.168.56.101:19888</value>
    </property>
    <property>
        <name>mapreduce.job.ubertask.enable</name>
        <value>true</value>
    </property>
</configuration>
Modify yarn-site.xml:
specify mapreduce_shuffle as the shuffle service;
designate node 102 as the ResourceManager;
designate node 102 as the web application proxy (security agent);
turn on YARN log aggregation;
specify how long YARN logs are kept before deletion;
specify the NodeManager memory: 8 GB;
specify the NodeManager CPUs: 8 cores.
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.56.102</value>
    </property>
    <property>
        <name>yarn.web-proxy.address</name>
        <value>192.168.56.102:8888</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>8</value>
    </property>
</configuration>
Configure slaves
Specify the compute nodes, that is, the nodes that run DataNode and NodeManager:
192.168.56.101
192.168.56.102
192.168.56.103
First format the NameNode, that is, execute on node 101:
Go to the Hadoop home directory: cd /home/softwares/hadoop-2.7.3
Run the Hadoop script in the bin directory: bin/hadoop namenode -format
If "successfully formatted" appears in the output, the format succeeded. (PS: the screenshot here is someone else's image, please don't mind.)
Once the above configuration is complete, copy the Hadoop directory to the other machines.
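A minimal sketch of distributing the configured Hadoop directory with scp (any copy method works; the host names are the ones set up earlier):

scp -r /home/softwares/hadoop-2.7.3 root@hadoop02:/home/softwares/
scp -r /home/softwares/hadoop-2.7.3 root@hadoop03:/home/softwares/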
Hadoop Environment Test
Enter the Hadoop home directory and execute the corresponding script files.
The jps command (Java Virtual Machine Process Status tool) shows the running Java processes.
Start HDFS on the NameNode machine, node 101:
[root@hadoop01 hadoop-2.7.3]# sbin/start-dfs.sh
Java HotSpot(TM) Client VM warning: You have loaded library /home/softwares/hadoop-2.7.3/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
16/11/07 16:49:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop01]
hadoop01: starting namenode, logging to /home/softwares/hadoop-2.7.3/logs/hadoop-root-namenode-hadoop01.out
102: starting datanode, logging to /home/softwares/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop02.out
103: starting datanode, logging to /home/softwares/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop03.out
101: starting datanode, logging to /home/softwares/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop01.out
Starting secondary namenodes [hadoop03]
hadoop03: starting secondarynamenode, logging to /home/softwares/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-hadoop03.out
At this point, running jps on node 101 shows that the NameNode and DataNode have started:
[root@hadoop01 hadoop-2.7.3]# jps
7826 Jps
7270 DataNode
7052 NameNode
Running jps on nodes 102 and 103 shows that the DataNode has started:
[root@hadoop02 bin]# jps
4260 DataNode
4488 Jps
[root@hadoop03 ~]# jps
6436 SecondaryNameNode
6750 Jps
Start YARN
Execute on node 102:
[root@hadoop02 hadoop-2.7.3]# sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/softwares/hadoop-2.7.3/logs/yarn-root-resourcemanager-hadoop02.out
101: starting nodemanager, logging to /home/softwares/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop01.out
103: starting nodemanager, logging to /home/softwares/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop03.out
Check each node with jps:
[root@hadoop02 hadoop-2.7.3]# jps
4641 ResourceManager
4260 DataNode
4765 NodeManager
5165 Jps
[root@hadoop01 hadoop-2.7.3]# jps
7270 DataNode
8375 Jps
7976 NodeManager
7052 NameNode
[root@hadoop03 ~]# jps
6915 NodeManager
6436 SecondaryNameNode
7287 Jps
Start the JobHistoryServer and the web proxy server on their respective nodes.
[root@hadoop01 hadoop-2.7.3]# sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/softwares/hadoop-2.7.3/logs/mapred-root-historyserver-hadoop01.out
[root@hadoop01 hadoop-2.7.3]# jps
8624 Jps
7270 DataNode
7976 NodeManager
8553 JobHistoryServer
7052 NameNode
[root@hadoop02 hadoop-2.7.3]# sbin/yarn-daemon.sh start proxyserver
starting proxyserver, logging to /home/softwares/hadoop-2.7.3/logs/yarn-root-proxyserver-hadoop02.out
[root@hadoop02 hadoop-2.7.3]# jps
4641 ResourceManager
4260 DataNode
5367 WebAppProxyServer
5402 Jps
View the node status in a browser on the hadoop01 node, that is, node 101.
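Based on the addresses configured earlier, the pages to check should be the following (inferred from the configuration values above, not from the original screenshots):

http://192.168.56.101:50070    # HDFS NameNode status page (dfs.namenode.http-address)
http://192.168.56.101:19888    # JobHistory web UI (mapreduce.jobhistory.webapp.address)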
Upload a file to HDFS.
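A minimal sketch of the upload, assuming the input is the local /etc/profile file and the HDFS path /profile used by the WordCount run below:

bin/hdfs dfs -put /etc/profile /profile
bin/hdfs dfs -ls /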
Running the WordCount program
[root@hadoop01 hadoop-2.7.3]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /profile /fll_out
Java HotSpot(TM) Client VM warning: You have loaded library /home/softwares/hadoop-2.7.3/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
16/11/07 17:17:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/07 17:17:12 INFO client.RMProxy: Connecting to ResourceManager at /192.168.56.102:8032
16/11/07 17:17:18 INFO input.FileInputFormat: Total input paths to process : 1
16/11/07 17:17:19 INFO mapreduce.JobSubmitter: number of splits:1
16/11/07 17:17:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1478509135878_0001
16/11/07 17:17:20 INFO impl.YarnClientImpl: Submitted application application_1478509135878_0001
16/11/07 17:17:20 INFO mapreduce.Job: The url to track the job: http://192.168.56.102:8888/proxy/application_1478509135878_0001/
16/11/07 17:17:20 INFO mapreduce.Job: Running job: job_1478509135878_0001
16/11/07 17:18:34 INFO mapreduce.Job: Job job_1478509135878_0001 running in uber mode : true
16/11/07 17:18:35 INFO mapreduce.Job:  map 0% reduce 0%
16/11/07 17:18:43 INFO mapreduce.Job:  map 100% reduce 0%
16/11/07 17:18:50 INFO mapreduce.Job:  map 100% reduce 100%
16/11/07 17:18:55 INFO mapreduce.Job: Job job_1478509135878_0001 completed successfully
16/11/07 17:18:59 INFO mapreduce.Job: Counters: 52
    File System Counters
        FILE: Number of bytes read=4264
        FILE: Number of bytes written=6412
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=3940
        HDFS: Number of bytes written=261673
        HDFS: Number of read operations=35
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=8246
        Total time spent by all reduces in occupied slots (ms)=7538
        TOTAL_LAUNCHED_UBERTASKS=2
        NUM_UBER_SUBMAPS=1
        NUM_UBER_SUBREDUCES=1
        Total time spent by all map tasks (ms)=8246
        Total time spent by all reduce tasks (ms)=7538
        Total vcore-milliseconds taken by all map tasks=8246
        Total vcore-milliseconds taken by all reduce tasks=7538
        Total megabyte-milliseconds taken by all map tasks=8443904
        Total megabyte-milliseconds taken by all reduce tasks=7718912
    Map-Reduce Framework
        Map input records=78
        Map output records=256
        Map output bytes=2605
        Map output materialized bytes=2116
        Input split bytes=99
        Combine input records=256
        Combine output records=156
        Reduce input groups=156
        Reduce shuffle bytes=2116
        Reduce input records=156
        Reduce output records=156
        Spilled Records=312
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=870
        CPU time spent (ms)=1970
        Physical memory (bytes) snapshot=243326976
        Virtual memory (bytes) snapshot=2666557440
        Total committed heap usage (bytes)=256876544
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1829
    File Output Format Counters
        Bytes Written=1487
View the job's running status through the YARN web UI in the browser.
Look at the final word-frequency results.
View the HDFS file system in the browser:
[root@hadoop01 hadoop-2.7.3]# bin/hdfs dfs -cat /fll_out/part-r-00000
Java HotSpot(TM) Client VM warning: You have loaded library /home/softwares/hadoop-2.7.3/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
16/11/07 17:29:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(The rest of the output lists every word found in /profile together with its count, one pair per line, for example: export 5, fi 8, for 5, if 8, in 6, then 8, unset 2.)
This shows that the Hadoop cluster is working correctly.
That is the entire content of this article. I hope it helps you with your study, and I also hope you will support the Yunqi community.