Hadoop Single-Node Environment Setup

Source: Internet
Author: User
Tags: hdfs, dfs, hadoop, mapreduce


The following describes how to set up and configure a single-node Hadoop installation on Linux, so that you can use Hadoop MapReduce and HDFS (Hadoop Distributed File System) for some simple operations.

Preparation

1) Download Hadoop;
2) Install a JDK for your Linux system; the recommended JDK versions are listed here: http://wiki.apache.org/hadoop/HadoopJavaVersions;
3) Install ssh for your system.

Set environment variables

1) Set the JDK information for Hadoop:

export JAVA_HOME=/usr/java/latest

2) Decompress Hadoop to a directory, such as /usr/test, and then edit the file /etc/profile to add:

export HADOOP_INSTALL=/usr/test/hadoop-2.7.1
export PATH=$PATH:$HADOOP_INSTALL/bin

Save the file, and then run the command source /etc/profile to reload the profile and make the configuration take effect.
Run the following command. If the configuration is correct, the Hadoop version information is printed:

hadoop version

Single-node mode

By default, Hadoop is already configured for single-node (standalone) mode, so no additional configuration is required.
The following example shows how to create an input directory, put some files, and run Hadoop:

$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
$ cat output/*

Pseudo-distributed mode

Hadoop can also be run in pseudo-distributed mode, in which each Hadoop daemon runs as a separate Java process on the same node. The configuration files to edit are:

etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Set up passphrase-less ssh login. Use the following command to check whether you can ssh to localhost without a passphrase:

$ ssh localhost

If you cannot log in without a passphrase, run the following commands:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ export HADOOP_PREFIX=/usr/local/hadoop

Run a local MapReduce task.

1) Format the file system:

$ bin/hdfs namenode -format

2) Start the NameNode and DataNode daemon processes.
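In the standard Hadoop 2.7.1 layout, both daemons can be started with the bundled script, run from the Hadoop installation directory:

$ sbin/start-dfs.sh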

If the error "localhost: Error: JAVA_HOME is not set and could not be found." appears, you can set export JAVA_HOME=/usr/java/latest directly in libexec/hadoop-config.sh.
The Hadoop daemon logs are written to the $HADOOP_LOG_DIR directory, which defaults to $HADOOP_HOME/logs.
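For example, to confirm that the NameNode came up cleanly, you can tail its log from the installation directory; the exact file name contains your user name and host name, so a wildcard is used here for illustration:

$ tail -n 20 logs/hadoop-*-namenode-*.log
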
3) View the NameNode web interface. The default address is:

- NameNode - http://localhost:50070/

4) Create the HDFS directories required to execute MapReduce jobs:

$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>

5) Copy the input files to the distributed file system:

$ bin/hdfs dfs -put etc/hadoop input

Note that this input directory is created on the HDFS file system, not on the local disk.
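
To verify, you can list the directory on HDFS:

$ bin/hdfs dfs -ls input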

6) Run the example:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'

Note that input and output here refer to directories in HDFS.

7) Check the output files: copy them from the distributed file system to the local file system and examine them:

$ bin/hdfs dfs -get output output
$ cat output/*

You can also view the output files directly on the distributed file system:

$ bin/hdfs dfs -cat output/*

8) When you are done, stop all the daemon processes:

$ sbin/stop-dfs.sh

YARN on a single node

You can also use YARN to run a MapReduce job in pseudo-distributed mode. To do so, set a few parameters and run the ResourceManager and NodeManager daemon processes in addition to the HDFS daemons.
Assume that you have already completed steps 1 to 4 in the previous section, then do the following:
1) Configure etc/hadoop/mapred-site.xml as follows:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Configure the etc/hadoop/yarn-site.xml parameters as follows:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

2) Start the ResourceManager and NodeManager daemon processes:

$ sbin/start-yarn.sh

3) View the ResourceManager web interface. The default address is:

- ResourceManager - http://localhost:8088/

4) Run a MapReduce job.
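For example, you can resubmit the same example job used in the previous section (assuming the input directory is still in HDFS; remove the earlier output directory first, because the job fails if it already exists):

$ bin/hdfs dfs -rm -r output
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'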

5) When you are done, stop all the daemon processes:

$ sbin/stop-yarn.sh

