Hadoop 2.x Pseudo-Distributed Environment Setup and Test
tags (space delimited): Hadoop
1. Building the environment required for Hadoop
Uninstall the OpenJDK that ships with the OS:
$ rpm -qa | grep java
$ rpm -e --nodeps [java-package-name]
1.1 Create four directories under the /opt/ directory:
modules/  software/  datas/  tools/
Unzip hadoop-2.5.0.tar.gz and jdk-7u67-linux-x64.tar.gz into the modules directory:
$ tar -zxvf hadoop-2.5.0.tar.gz -C /opt/modules/
$ tar -zxvf jdk-7u67-linux-x64.tar.gz -C /opt/modules/
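Note that tar extracts into the directory named by `-C`, not into a path passed as a trailing argument. A quick sanity check of this behavior, using a throwaway temp directory (all paths below are hypothetical):

```shell
# Minimal demo of tar's -C flag: the archive is created from $tmp/src
# and extracted into $tmp/dest, which then contains the src directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/src" "$tmp/dest"
echo hadoop > "$tmp/src/file.txt"
tar -czf "$tmp/pkg.tar.gz" -C "$tmp" src
tar -zxf "$tmp/pkg.tar.gz" -C "$tmp/dest"
ls "$tmp/dest"    # shows the extracted src directory
```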
1.2 Add the Java environment variables.
$ sudo vi /etc/profile
Add the environment variables:
export JAVA_HOME=/opt/modules/jdk1.7.0_67
export PATH=$PATH:$JAVA_HOME/bin
Reload the configuration:
$ source /etc/profile
Connect Notepad++ to the host where Hadoop resides so the configuration files can be edited.
2. Hadoop pseudo-distributed settings. 1. Add the Java installation directory to hadoop-env.sh, yarn-env.sh, and mapred-env.sh.
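In each of those three files (under etc/hadoop/), the line to set is JAVA_HOME; a sketch, assuming the JDK path from step 1.2:

```shell
# Add to hadoop-env.sh, yarn-env.sh and mapred-env.sh:
export JAVA_HOME=/opt/modules/jdk1.7.0_67
```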
2. core-site.xml configuration:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/modules/hadoop-2.5.0/data/tmp</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://adddeimac.local:8020</value>
</property>
3. hdfs-site.xml configuration:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
4. mapred-site.xml configuration:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
5. yarn-site.xml configuration:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>adddeimac.local</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
6. slaves configuration:
miaodonghua.host    # hostname: the address of the NodeManager and DataNode
3. Start Hadoop. 1. Format the file system:
$bin/hdfs namenode -format
2. Start HDFS:
$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
3. Create a directory, upload a file, and view the file's contents:
$ bin/hdfs dfs -mkdir -p /user/hadoop/tmp
$ bin/hdfs dfs -put etc/hadoop/slaves /user/hadoop/tmp
After the upload succeeds, the file is visible in the NameNode web UI:
View the contents of slaves:
$ bin/hdfs dfs -cat /user/hadoop/tmp/slaves
Start YARN:
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
After HDFS and YARN have started successfully, check the running daemons with jps.
Then view the web UIs of the NameNode and ResourceManager.
4. Run WordCount on YARN. 1. Create the WordCount input file:
$ vi /opt/datas/wc.input
Contents:
yarn spark
hadoop mapreduce
mapreduce spark
hdfs yarn
yarn mapreduce
hadoop hdfs
spark spark
2. Create the input directory and upload wc.input:
$ bin/hdfs dfs -mkdir -p /user/hadoop/mapreduce/wordcount/input
$ bin/hdfs dfs -put /opt/datas/wc.input /user/hadoop/mapreduce/wordcount/input
Upload success:
3. Run WordCount, using the examples jar that ships with Hadoop:
$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/hadoop/mapreduce/wordcount/input /user/hadoop/mapreduce/wordcount/output
View the run status in the web UI:
Running
Finished
View Results
The result files are generated under the /user/hadoop/mapreduce/wordcount/output directory.
Viewing results in the terminal
$ bin/hdfs dfs -text /user/hadoop/mapreduce/wordcount/output/part*
5. My current understanding of the Hadoop components. 1. Understanding HDFS
HDFS is a distributed file system suited to running on ordinary low-cost servers. It is good at storing large files but not large numbers of small files. Files in Hadoop are split into 64 MB blocks; a block whose content is smaller than 64 MB occupies only the actual size on disk, not the full 64 MB. HDFS has one NameNode, one SecondaryNameNode, and several DataNodes. The NameNode stores metadata such as file names, the directory structure, file attributes, and, for each file, the list of blocks and the DataNodes holding each block. A DataNode stores the file data blocks in its local file system, along with checksums of the block data. The SecondaryNameNode is an auxiliary daemon that monitors HDFS state and periodically takes a snapshot of the HDFS metadata.
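As a back-of-the-envelope illustration of the block layout described above, consider a hypothetical 200 MB file with 64 MB blocks:

```shell
# How a hypothetical 200 MB file maps onto 64 MB HDFS blocks:
# three full blocks plus one partial block occupying only 8 MB on disk.
FILE_MB=200
BLOCK_MB=64
FULL_BLOCKS=$((FILE_MB / BLOCK_MB))   # full 64 MB blocks
REMAINDER=$((FILE_MB % BLOCK_MB))     # size of the final partial block
TOTAL_BLOCKS=$FULL_BLOCKS
if [ "$REMAINDER" -gt 0 ]; then
  TOTAL_BLOCKS=$((TOTAL_BLOCKS + 1))
fi
echo "blocks=$TOTAL_BLOCKS last_block_mb=$REMAINDER"
```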
2. Understanding YARN
YARN is Hadoop's new resource manager: a general-purpose resource management system that provides unified resource management and scheduling for the applications running on top of it, which benefits the cluster in utilization, unified resource management, and data sharing. The YARN service consists of four main parts: ResourceManager, NodeManager, ApplicationMaster, and Container. The ResourceManager is mainly responsible for handling client requests, starting and monitoring ApplicationMasters, monitoring NodeManagers, and allocating and scheduling resources. The NodeManager handles commands from the ResourceManager and from ApplicationMasters, and manages the resources on a single node. The ApplicationMaster is responsible for splitting the input data, requesting resources for the application and assigning them to its internal tasks, and for task monitoring and fault tolerance. A Container is an abstraction of the task runtime environment: it encapsulates CPU, memory, and other resources, as well as environment variables, the startup command, and other information the task needs to run.
3. Understanding MapReduce
MapReduce is an offline computation framework suited to large-scale data processing. It divides the computation into two phases, map and reduce. The map phase processes the input data in parallel and writes intermediate results to local disk. The reduce phase reads the data written in the map phase from disk and aggregates it. MapReduce is only suitable for offline processing; it has good fault tolerance and scalability and fits simple batch jobs. Its drawbacks are high startup overhead and heavy use of disk, which makes it inefficient for other workloads.
The specific steps to implement MapReduce are as follows:
1. First, split the input data source into slices.
2. The master schedules workers to execute map tasks.
3. The workers read their input slices.
4. The workers execute the map tasks and save the task output locally.
5. The master schedules workers to execute reduce tasks; the reduce workers read the output files of the map tasks.
6. The reduce tasks execute and save their output to HDFS.
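The map/shuffle/reduce flow above can be mimicked with ordinary Unix tools: splitting lines into words plays the map role, `sort` plays the shuffle (grouping identical keys together), and `uniq -c` plays the reduce (counting per key), just as WordCount does. A sketch over a few of the wc.input lines:

```shell
# A shell analogue of WordCount:
# map (split each line into words), shuffle (sort), reduce (count per key).
printf 'yarn spark\nhadoop mapreduce\nspark yarn\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c
```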
(Figure: the data flow graph for MapReduce.)