Hadoop Single-Node Environment Setup
The following describes how to set up and configure a single-node Hadoop installation on Linux, so that you can use Hadoop MapReduce and HDFS (Hadoop Distributed File System) for some simple operations.
Preparation
1) Download Hadoop.
2) Install a JDK for your Linux system. The recommended JDK versions can be viewed at http://wiki.apache.org/hadoop/HadoopJavaVersions.
3) Install ssh on your system.
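As a rough sketch, on a Debian/Ubuntu-style system the preparation steps might look like the following (the package names, download URL, and version 2.7.1 are assumptions; adjust them to your distribution and the release you actually use):
# Install a JDK and ssh (package names assumed for Debian/Ubuntu)
$ sudo apt-get install openjdk-7-jdk ssh rsync
# Download a Hadoop release (mirror URL and version assumed)
$ wget http://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
# Unpack it into the directory used later in this article
$ sudo mkdir -p /usr/test
$ sudo tar -xzf hadoop-2.7.1.tar.gz -C /usr/test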
Set environment variables
1) Set JDK information for Hadoop:
export JAVA_HOME=/usr/java/latest
2) Unpack Hadoop into a directory, for example /usr/test.
Then edit the file /etc/profile and add:
export HADOOP_INSTALL=/usr/test/hadoop-2.7.1
export PATH=$PATH:$HADOOP_INSTALL/bin
Save the file, then run the command source /etc/profile to reload the profile and make the configuration take effect.
Run the following command. If the configuration is correct, Hadoop's version information is printed:
$ hadoop version
Single-node mode
By default, Hadoop is already configured for single-node (standalone) mode, so no additional configuration is required.
The following example creates an input directory, copies some files into it, and runs Hadoop:
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
$ cat output/*
Pseudo-distributed mode
Hadoop can also be run in a pseudo-distributed environment, where each Hadoop daemon runs as a separate Java process. The configuration files that need to be edited are:
etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Set up ssh so that you can log in without a passphrase. Use the following command to check whether passphraseless ssh already works:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, run the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ export HADOOP_PREFIX=/usr/local/hadoop
Run a local MapReduce task.
1) Format the file system:
$ bin/hdfs namenode -format
2) Start the NameNode and DataNode daemon processes.
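The daemons are normally started with the start-dfs.sh script shipped in the sbin directory of the Hadoop distribution:
$ sbin/start-dfs.sh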
If the error "localhost: Error: JAVA_HOME is not set and could not be found." appears, you can set export JAVA_HOME=/usr/java/latest directly in libexec/hadoop-config.sh.
The Hadoop daemon logs are written to the $HADOOP_LOG_DIR directory, which defaults to $HADOOP_HOME/logs.
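For example, the NameNode log can be inspected with a command along these lines (the exact log file name depends on your user name and host name, so the name below is only a placeholder):
$ tail -n 50 $HADOOP_HOME/logs/hadoop-<username>-namenode-<hostname>.log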
3) View the NameNode web interface. By default it is available at:
- NameNode - http://localhost:50070/
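As a quick command-line check (assuming curl is installed), you can verify that the web interface is responding:
$ curl -s http://localhost:50070/ | head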
4) Create the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
5) Copy the input files into the distributed file system:
$ bin/hdfs dfs -put etc/hadoop input
Note that the input directory here is created in the HDFS file system.
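To confirm that the files have landed in HDFS, you can list the directory:
$ bin/hdfs dfs -ls input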
6) Run the example:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
Note that input and output here refer to directories in HDFS.
7) Check the output files: copy them from the distributed file system to the local file system and examine them:
$ bin/hdfs dfs -get output
$ cat output/*
You can also view the output files directly on the distributed file system:
$ bin/hdfs dfs -cat output/*
8) When you are finished, stop the daemon processes:
$ sbin/stop-dfs.sh
Single-node YARN
You can also run a MapReduce job on YARN in pseudo-distributed mode. To do so, you need to set a few parameters and run the ResourceManager and NodeManager daemon processes.
Assuming you have completed steps 1 through 4 in the previous section, proceed as follows:
1) Configure etc/hadoop/mapred-site.xml as follows:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Configure the etc/hadoop/yarn-site.xml parameters as follows:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
2) Start the ResourceManager and NodeManager daemon processes:
$ sbin/start-yarn.sh
3) View the ResourceManager web interface. By default it is available at:
- ResourceManager - http://localhost:8088/
4) Run a MapReduce job.
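For example, the same grep job used earlier can be submitted again and will now run on YARN (the output directory must not already exist in HDFS, so a new name such as output2 is used here purely for illustration):
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output2 'dfs[a-z.]+'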
5) When you are finished, stop the daemon processes:
$ sbin/stop-yarn.sh