Notes on Hadoop Single-Node & Pseudo-Distributed Installation
Lab Environment
CentOS 6.X
Hadoop 2.6.0
JDK 1.8.0_65
Purpose
The purpose of this document is to help you quickly install and use Hadoop on a single machine so that you can get a feel for the Hadoop Distributed File System (HDFS) and the MapReduce framework, for example by running the sample programs or a simple job on HDFS.
Prerequisites
Supported platforms
GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters of up to 2000 nodes.
Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.
Install software
If the required software is not already installed on your node, install it first.
Take CentOS as an example:
# yum install -y openssh-server openssh-clients rsync
# ssh must be installed and sshd must be kept running so that the Hadoop scripts can manage the remote Hadoop daemons.
Create user
# useradd -m hadoop -s /bin/bash    # create a new user named hadoop
Hosts Parsing
# cat /etc/hosts | grep ocean-lab
192.168.9.70 ocean-lab.ocean.org ocean-lab
Install jdk
JDK: http://www.oracle.com/technetwork/java/javase/downloads/index.html
First install the JAVA environment
# wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u65-b17/jdk-8u65-linux-x64.rpm"
# rpm -Uvh jdk-8u65-linux-x64.rpm
Configure Java
# Echo "export JAVA_HOME =/usr/java/jdk1.8.0 _ 65">/home/hadoop/. bashrc
# Source/home/hadoop/. bashrc
# Echo $ JAVA_HOME
/Usr/java/jdk1.8.0 _ 65
Download and install hadoop
To obtain a Hadoop release, download the latest stable version from an Apache mirror.
Preparations for running the Hadoop Cluster
# wget http://apache.fayea.com/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Unpack the downloaded Hadoop release. Edit the file etc/hadoop/hadoop-env.sh and, at a minimum, set JAVA_HOME to the root of your Java installation.
# tar xf hadoop-2.6.0.tar.gz -C /usr/local
# mv /usr/local/hadoop-2.6.0 /usr/local/hadoop    # optional; the rest of these notes keep the versioned path /usr/local/hadoop-2.6.0
Run the following command:
# cd /usr/local/hadoop-2.6.0
# bin/hadoop
This displays the usage documentation for the hadoop script.
You can start a Hadoop cluster in one of the following three modes:
Standalone Mode
Pseudo-Distributed Mode
Fully Distributed Mode
Standalone Mode
By default, Hadoop is configured to run in non-distributed mode as a single Java process. This is very helpful for debugging.
Now we can run an example to get a feel for how Hadoop works. Hadoop ships with a rich set of examples, including wordcount, terasort, join, and grep.
Here we choose grep: all files in the input folder are used as input, every string matching the regular expression dfs[a-z.]+ is extracted and its occurrences counted, and the results are written to the output folder (a minimal local sketch of this filter-and-count step follows the sample output below).
# mkdir input
# cp etc/hadoop/*.xml input
# ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep ./input ./output 'dfs[a-z.]+'
# cat ./output/*
If execution succeeds, a large amount of job-related information is printed. The job result is written to the specified output folder; run cat ./output/* to view it. The matched word dfsadmin appears once:
[10:57:58][hadoop@ocean-lab hadoop-2.6.0]$ cat ./output/*
1 dfsadmin
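For intuition, here is the minimal local sketch referenced above, in plain Java. It is not the MapReduce job itself: it simply scans the files in the local input directory, extracts every match of the same regular expression, and counts occurrences per matched string (without the by-count sorting the real example performs). The class name and directory are illustrative assumptions.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocalGrepCount {
    public static void main(String[] args) throws IOException {
        // Same pattern as the Hadoop grep example above
        Pattern pattern = Pattern.compile("dfs[a-z.]+");
        Map<String, Integer> counts = new TreeMap<>();
        // Scan every file in the local ./input directory
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("input"))) {
            for (Path file : files) {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    Matcher m = pattern.matcher(line);
                    while (m.find()) {
                        counts.merge(m.group(), 1, Integer::sum);
                    }
                }
            }
        }
        // Print "count<TAB>matched string", e.g. "1    dfsadmin"
        counts.forEach((word, count) -> System.out.println(count + "\t" + word));
    }
}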
Note: Hadoop does not overwrite output files by default, so running the example again will fail; delete ./output first. Otherwise, the following error is reported:
INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/usr/local/hadoop-2.6.0/output already exists
If the message "INFO metrics.MetricsUtil: Unable to obtain hostName java.net.UnknownHostException" appears, check /etc/hosts and add an IP mapping for your host name:
# cat /etc/hosts | grep ocean-lab
192.168.9.70 ocean-lab.ocean.org ocean-lab
Running in Pseudo-Distributed Mode
Hadoop can run in pseudo-distributed mode on a single node, where each Hadoop daemon runs as a separate Java process.
The node acts as both NameNode and DataNode, and jobs read their files from HDFS.
Before writing the pseudo-distributed configuration, we also need to set the Hadoop environment variables by adding the following to ~/.bashrc:
# Hadoop environment variables
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
source ~/.bashrc
Configuration
Use the following etc/hadoop/core-site.xml:
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop-2.6.0/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Similarly, modify the configuration file etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop-2.6.0/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop-2.6.0/tmp/dfs/data</value>
    </property>
</configuration>
A description of Hadoop configuration items
Strictly speaking, only fs.defaultFS and dfs.replication need to be configured for Hadoop to run (this is what the official tutorial does). However, if the hadoop.tmp.dir parameter is not set, the default temporary directory /tmp/hadoop-hadoop is used, and this directory may be cleared by the system on reboot, forcing you to format the NameNode again. We therefore set it explicitly and also specify dfs.namenode.name.dir and dfs.datanode.data.dir; otherwise errors may occur in later steps.
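To illustrate how these properties are consumed, here is a minimal sketch using Hadoop's Configuration API. The explicit addResource() calls and the fallback value 3 for dfs.replication are assumptions made for the example; normally the site files are picked up automatically from the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ShowConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Load the site files edited above (normally found automatically on the classpath)
        conf.addResource(new Path("/usr/local/hadoop-2.6.0/etc/hadoop/core-site.xml"));
        conf.addResource(new Path("/usr/local/hadoop-2.6.0/etc/hadoop/hdfs-site.xml"));

        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
        System.out.println("hadoop.tmp.dir  = " + conf.get("hadoop.tmp.dir"));
        // The second argument is only a fallback used if the property is missing
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
    }
}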
Password-free ssh settings
Now check whether you can ssh to localhost without entering a password:
# ssh localhost date
If you cannot, generate a key pair and authorize it:
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# chmod 600 ~/.ssh/authorized_keys
Format a new distributed file system:
$ bin/hadoop namenode -format
15/12/23 11:30:20 INFO util.GSet: VM type       = 64-bit
15/12/23 11:30:20 INFO util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
15/12/23 11:30:20 INFO util.GSet: capacity      = 2^15 = 32768 entries
15/12/23 11:30:20 INFO namenode.NNConf: ACLs enabled? false
15/12/23 11:30:20 INFO namenode.NNConf: XAttrs enabled? true
15/12/23 11:30:20 INFO namenode.NNConf: Maximum size of an xattr: 16384
15/12/23 11:30:20 INFO namenode.FSImage: Allocated new BlockPoolId: BP-823870322-192.168.9.70-1450841420347
15/12/23 11:30:20 INFO common.Storage: Storage directory /usr/local/hadoop-2.6.0/tmp/dfs/name has been successfully formatted.
15/12/23 11:30:20 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/12/23 11:30:20 INFO util.ExitUtil: Exiting with status 0
15/12/23 11:30:20 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ocean-lab.ocean.org/192.168.9.70
************************************************************/
If it succeeds, you will see the messages "successfully formatted" and "Exiting with status 0".
Note:
The next time you start Hadoop, you do not need to format the NameNode again; just run ./sbin/start-dfs.sh.
Start NameNode and DataNode
$ ./sbin/start-dfs.sh
15/12/23 11:37:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-hadoop-namenode-ocean-lab.ocean.org.out
localhost: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-hadoop-datanode-ocean-lab.ocean.org.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is a5:26:42:a0:5f:da:a2:88:52:04:9c:7f:8d:6a:98:9b.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-hadoop-secondarynamenode-ocean-lab.ocean.org.out
15/12/23 11:37:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[13:57:08][hadoop@ocean-lab hadoop-2.6.0]$ jps
27686 SecondaryNameNode
28455 Jps
27501 DataNode
27405 NameNode
27006 GetConf
If a process is missing, the startup failed; check the logs under the logs/ directory to debug.
After a successful startup, you can open the web interface at http://[ip or fqdn]:50070 to view NameNode and DataNode information and to browse files in HDFS.
Run a Hadoop pseudo-distributed instance
In the standalone mode above, grep reads local data; in pseudo-distributed mode it reads data from HDFS.
To use HDFS, you must first create a user directory in HDFS:
# ./bin/hdfs dfs -mkdir -p /user/hadoop
# ./bin/hadoop fs -ls /user/hadoop
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 /user/hadoop/input
Next, the XML files in etc/hadoop are copied to the distributed file system as input; that is, /usr/local/hadoop-2.6.0/etc/hadoop is copied to /user/hadoop/input in HDFS. Since we work as the hadoop user and have already created the corresponding user directory /user/hadoop, we can use a relative path such as input in the commands; its absolute path is /user/hadoop/input (a programmatic equivalent using the FileSystem API is sketched after the commands):
# ./bin/hdfs dfs -mkdir input
# ./bin/hdfs dfs -put ./etc/hadoop/*.xml input
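As mentioned above, the same two steps can also be done programmatically with the HDFS FileSystem API. This is a minimal sketch, assuming fs.defaultFS is configured as above and the code runs as the hadoop user; copying a single file is shown for brevity (a loop would be needed to copy every *.xml file), and the class name is an illustrative assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutInput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // connects to hdfs://localhost:9000

        // Equivalent of: ./bin/hdfs dfs -mkdir input   (relative to /user/hadoop)
        fs.mkdirs(new Path("input"));

        // Equivalent of: ./bin/hdfs dfs -put ./etc/hadoop/core-site.xml input
        fs.copyFromLocalFile(new Path("/usr/local/hadoop-2.6.0/etc/hadoop/core-site.xml"),
                             new Path("input"));

        fs.close();
    }
}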
After the copy is complete, run the following command to view the file list in HDFS:
# ./bin/hdfs dfs -ls input
-rw-r--r--   1 hadoop supergroup       4436 input/capacity-scheduler.xml
-rw-r--r--   1 hadoop supergroup       1180 input/core-site.xml
-rw-r--r--   1 hadoop supergroup       9683 input/hadoop-policy.xml
-rw-r--r--   1 hadoop supergroup       1136 input/hdfs-site.xml
-rw-r--r--   1 hadoop supergroup        620 input/httpfs-site.xml
-rw-r--r--   1 hadoop supergroup       3523 input/kms-acls.xml
-rw-r--r--   1 hadoop supergroup       5511 input/kms-site.xml
-rw-r--r--   1 hadoop supergroup        858 input/mapred-site.xml
-rw-r--r--   1 hadoop supergroup        690 input/yarn-site.xml
Running a MapReduce job in pseudo-distributed mode works the same way as in standalone mode; the difference is that the input is read from HDFS (you can delete the local input and output folders created in the standalone step to verify this).
# ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
View the result (the output is now in HDFS):
$ ./bin/hdfs dfs -cat output/*
1 dfsadmin
1 dfs.replication
1 dfs.namenode.name.dir
1 dfs.datanode.data.dir
Since we changed the configuration files earlier, the result differs from the standalone run.
We can also retrieve the result to the local machine:
# rm -r ./output                        # delete the local output folder first (if it exists)
# ./bin/hdfs dfs -get output ./output   # copy the output folder from HDFS to the local machine
# cat ./output/*
1 dfsadmin
1 dfs.replication
1 dfs.namenode.name.dir
1 dfs.datanode.data.dir
When Hadoop runs a program, the output directory must not already exist; otherwise the job fails with the error "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists". To run the job again, first delete the output folder:
# delete the output folder in HDFS
$ ./bin/hdfs dfs -rm -r output
Deleted output
The output directory cannot exist when running the program.
When running a Hadoop program, the output directory specified by the program (such as output) must not exist; this prevents results from being overwritten, and an error is raised otherwise. Therefore you must delete the output directory before each run. When developing an application, you can add the following code to the driver to delete the output directory automatically on each run and avoid the tedious command-line step:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
/* Delete the output directory if it already exists */
Path outputPath = new Path(args[1]);
outputPath.getFileSystem(conf).delete(outputPath, true);
To stop Hadoop, run:
# ./sbin/stop-dfs.sh
Start YARN
(Pseudo-distributed mode works without YARN enabled; generally this does not affect program execution.)
Some readers may wonder why, after starting Hadoop, they cannot see the JobTracker and TaskTracker processes mentioned in older books. This is because newer versions of Hadoop use the new MapReduce framework (MapReduce V2, also known as YARN, Yet Another Resource Negotiator).
YARN was split out of MapReduce and is responsible for resource management and task scheduling; MapReduce now runs on top of YARN, which improves availability and scalability. YARN is not covered in detail here; refer to other materials if you are interested.
Starting Hadoop with ./sbin/start-dfs.sh as above only brings up HDFS, and jobs are executed by the local MapReduce runner. We can additionally start YARN and let it take over resource management and task scheduling.
First modify the configuration file etc/hadoop/mapred-site.xml (rename mapred-site.xml.template to mapred-site.xml first):
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Then modify the configuration file yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Then you can start YARN (./sbin/start-dfs.sh must already have been run):
# ./sbin/start-yarn.sh                                 # start YARN
# ./sbin/mr-jobhistory-daemon.sh start historyserver   # start the history server so job status can be viewed on the web
After they start, jps shows two additional daemons, ResourceManager and NodeManager:
[09:18:34][hadoop@ocean-lab ~]$ jps
27686 SecondaryNameNode
6968 ResourceManager
7305 Jps
7066 NodeManager
27501 DataNode
27405 NameNode
After YARN is started, the examples are run in exactly the same way; only the resource management and task scheduling change. Watch the log output: without YARN the job runs in "mapred.LocalJobRunner", while with YARN enabled it runs through "mapred.YARNRunner". A benefit of starting YARN is that the running status of tasks can be viewed through the web interface at http://[ip or fqdn]:8088/cluster.
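Besides reading the logs, a driver program can check which framework its jobs will use by inspecting mapreduce.framework.name. The following is a minimal sketch, assuming the Hadoop configuration directory is on the classpath; the "local" fallback mirrors the usual default when no mapred-site.xml is present, and the class name is an illustrative assumption.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShowFramework {
    public static void main(String[] args) throws IOException {
        // Creating a Job pulls in the MapReduce configuration files (mapred-site.xml) as well
        Job job = Job.getInstance(new Configuration());
        // Prints "yarn" if the mapred-site.xml above is in effect; otherwise the local runner is used
        String framework = job.getConfiguration().get("mapreduce.framework.name", "local");
        System.out.println("MapReduce jobs will run via: " + framework);
    }
}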
Note, however, that YARN mainly provides better resource management and task scheduling for clusters; on a single machine it brings little benefit and can even slow programs down. Whether to enable YARN on a single machine therefore depends on your needs.
If you do not start YARN, delete or rename mapred-site.xml.
Otherwise, if the configuration file is present but YARN is not running, programs will keep reporting the error "Retrying connect to server: 0.0.0.0/0.0.0.0:8032".
Similarly, YARN is stopped with:
# ./sbin/stop-yarn.sh
# ./sbin/mr-jobhistory-daemon.sh stop historyserver