Hadoop 2.x Pseudo-Distributed Environment Setup and Test
tags (space delimited): Hadoop
1. Building the environment required for Hadoop
Uninstall the OpenJDK that ships with the OS:
$ rpm -qa | grep java
$ rpm -e --nodeps [java-package-name]
1.1 Create four directories under the /opt/ directory:
modules/  software/  datas/  tools/
Unzip hadoop-2.5.0.tar.gz and jdk-7u67-linux-x64.tar.gz into the modules directory:
$ tar -zxvf hadoop-2.5.0.tar.gz -C /opt/modules/
$ tar -zxvf jdk-7u67-linux-x64.tar.gz -C /opt/modules/
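Note that tar extracts into the directory named by `-C`, not into a path passed as a trailing argument. A quick sanity check of this behavior, using a throwaway temp directory (all paths below are hypothetical):

```shell
# Minimal demo of tar's -C flag: the archive is created from $tmp/src
# and extracted into $tmp/dest, which then contains the src directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/src" "$tmp/dest"
echo hadoop > "$tmp/src/file.txt"
tar -czf "$tmp/pkg.tar.gz" -C "$tmp" src
tar -zxf "$tmp/pkg.tar.gz" -C "$tmp/dest"
ls "$tmp/dest"    # shows the extracted src directory
```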
1.2 Add the Java environment variables.
$ sudo vi /etc/profile
Add the environment variables:
export JAVA_HOME=/opt/modules/jdk1.7.0_67
export PATH=$PATH:$JAVA_HOME/bin
Reload the configuration:
$ source /etc/profile
Connect Notepad++ to the host where Hadoop resides so the configuration files can be edited.
2. Hadoop pseudo-distributed settings. 1. Add the Java installation directory to hadoop-env.sh, yarn-env.sh, and mapred-env.sh.
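In each of those three files (under etc/hadoop/), the line to set is JAVA_HOME; a sketch, assuming the JDK path from step 1.2:

```shell
# Add to hadoop-env.sh, yarn-env.sh and mapred-env.sh:
export JAVA_HOME=/opt/modules/jdk1.7.0_67
```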
2. core-site.xml configuration:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/modules/hadoop-2.5.0/data/tmp</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://adddeimac.local:8020</value>
</property>
3. hdfs-site.xml configuration:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
4. mapred-site.xml configuration:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
5. yarn-site.xml configuration:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>adddeimac.local</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
6. slaves configuration:
miaodonghua.host    # hostname: the address of the NodeManager and DataNode
3. Start Hadoop. 1. Format the file system:
$bin/hdfs namenode -format
2. Start HDFS:
$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
3. Create a directory, upload a file, and view the file's contents:
$ bin/hdfs dfs -mkdir -p /user/hadoop/tmp
$ bin/hdfs dfs -put etc/hadoop/slaves /user/hadoop/tmp
After the upload succeeds, the file is visible in the NameNode web UI:
View the contents of slaves:
$ bin/hdfs dfs -cat /user/hadoop/tmp/slaves
Start YARN:
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
After HDFS and YARN have started successfully, check the running daemons with jps.
Then view the web UIs of the NameNode and ResourceManager.
4. Run WordCount on YARN. 1. Create the WordCount input file:
$ vi /opt/datas/wc.input
Contents:
yarn spark
hadoop mapreduce
mapreduce spark
hdfs yarn
yarn mapreduce
hadoop hdfs
spark spark
2. Create the input directory and upload wc.input:
$ bin/hdfs dfs -mkdir -p /user/hadoop/mapreduce/wordcount/input
$ bin/hdfs dfs -put /opt/datas/wc.input /user/hadoop/mapreduce/wordcount/input
Upload success:
3. Run WordCount, using the examples jar that ships with Hadoop:
$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/hadoop/mapreduce/wordcount/input /user/hadoop/mapreduce/wordcount/output
View the run status in the web UI:
Running
Finished
View Results
The result files are generated under the /user/hadoop/mapreduce/wordcount/output directory.
Viewing results in the terminal
$ bin/hdfs dfs -text /user/hadoop/mapreduce/wordcount/output/part*
5. My current understanding of the Hadoop components. 1. Understanding HDFS
HDFS is a distributed file system suited to running on ordinary low-cost servers. It is good at storing large files but not large numbers of small files. Files in Hadoop are split into 64 MB blocks; a block whose content is smaller than 64 MB occupies only the actual size on disk, not the full 64 MB. HDFS has one NameNode, one SecondaryNameNode, and several DataNodes. The NameNode stores metadata such as file names, the directory structure, file attributes, and, for each file, the list of blocks and the DataNodes holding each block. A DataNode stores the file data blocks in its local file system, along with checksums of the block data. The SecondaryNameNode is an auxiliary daemon that monitors HDFS state and periodically takes a snapshot of the HDFS metadata.
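As a back-of-the-envelope illustration of the block layout described above, consider a hypothetical 200 MB file with 64 MB blocks:

```shell
# How a hypothetical 200 MB file maps onto 64 MB HDFS blocks:
# three full blocks plus one partial block occupying only 8 MB on disk.
FILE_MB=200
BLOCK_MB=64
FULL_BLOCKS=$((FILE_MB / BLOCK_MB))   # full 64 MB blocks
REMAINDER=$((FILE_MB % BLOCK_MB))     # size of the final partial block
TOTAL_BLOCKS=$FULL_BLOCKS
if [ "$REMAINDER" -gt 0 ]; then
  TOTAL_BLOCKS=$((TOTAL_BLOCKS + 1))
fi
echo "blocks=$TOTAL_BLOCKS last_block_mb=$REMAINDER"
```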
2. Understanding YARN
YARN is Hadoop's new resource manager: a general-purpose resource management system that provides unified resource management and scheduling for the applications running on top of it, which benefits the cluster in utilization, unified resource management, and data sharing. The YARN service consists of four main parts: ResourceManager, NodeManager, ApplicationMaster, and Container. The ResourceManager is mainly responsible for handling client requests, starting and monitoring ApplicationMasters, monitoring NodeManagers, and allocating and scheduling resources. The NodeManager handles commands from the ResourceManager and from ApplicationMasters, and manages the resources on a single node. The ApplicationMaster is responsible for splitting the input data, requesting resources for the application and assigning them to its internal tasks, and for task monitoring and fault tolerance. A Container is an abstraction of the task runtime environment: it encapsulates CPU, memory, and other resources, as well as environment variables, the startup command, and other information the task needs to run.
3. Understanding MapReduce
MapReduce is an offline computation framework suited to large-scale data processing. It divides the computation into two phases, map and reduce. The map phase processes the input data in parallel and writes intermediate results to local disk. The reduce phase reads the data written in the map phase from disk and aggregates it. MapReduce is only suitable for offline processing; it has good fault tolerance and scalability and fits simple batch jobs. Its drawbacks are high startup overhead and heavy use of disk, which makes it inefficient for other workloads.
The specific steps to implement MapReduce are as follows:
1. First, split the input data source into slices.
2. The master schedules workers to execute map tasks.
3. The workers read their input slices.
4. The workers execute the map tasks and save the task output locally.
5. The master schedules workers to execute reduce tasks; the reduce workers read the output files of the map tasks.
6. The reduce tasks execute and save their output to HDFS.
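The map/shuffle/reduce flow above can be mimicked with ordinary Unix tools: splitting lines into words plays the map role, `sort` plays the shuffle (grouping identical keys together), and `uniq -c` plays the reduce (counting per key), just as WordCount does. A sketch over a few of the wc.input lines:

```shell
# A shell analogue of WordCount:
# map (split each line into words), shuffle (sort), reduce (count per key).
printf 'yarn spark\nhadoop mapreduce\nspark yarn\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c
```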
(Figure: the data flow graph for MapReduce.)