Objective
The purpose of this document is to help you quickly install and run Hadoop on a single machine so you can experience the Hadoop Distributed File System (HDFS) and the MapReduce framework, for example by running sample programs or simple jobs on HDFS.
Prerequisites
Supported Platforms
- GNU/Linux is supported as a development and production platform. Hadoop has been validated on GNU/Linux clusters of 2,000 nodes.
Ubuntu Linux: http://mirrors.aliyun.com/ubuntu-releases/14.10/
- Win32 is supported as a development platform. Because distributed operation has not been fully tested on Win32, it is not supported as a production platform.
Required Software
The required software for both Linux and Windows includes:
- Java 1.5.x must be installed; the Java release from Sun is recommended: http://www.java.com/zh_CN/download/manual.jsp (choose the Linux x64 or Linux x86 version).
- ssh must be installed, and sshd must be running, so that the Hadoop scripts can manage the remote Hadoop daemons.
Additional Software Requirements on Windows
- Cygwin, which provides shell support in addition to the software listed above.
Installing the Software
If your cluster does not yet have the required software, you will have to install it first. For example, on Ubuntu Linux:
$ sudo apt-get install ssh
$ sudo apt-get install rsync
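To confirm the required software is in place, you can check that each tool responds (a minimal sanity check; the exact version strings will vary by system):
$ java -version
$ ssh -V
$ rsync --version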
On the Windows platform, if the required software was not installed along with Cygwin, start the Cygwin Setup Manager and install the following package:
- openssh (in the Net category)
Download
To get a Hadoop distribution, download a recent stable release from one of the Apache download mirrors.
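For example, using wget from the command line (the release version and mirror path below are placeholders; pick an actual release and mirror from the Apache download page):
$ wget http://archive.apache.org/dist/hadoop/core/hadoop-0.18.3/hadoop-0.18.3.tar.gz  # example release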
Preparing to Run the Hadoop Cluster
Unpack the downloaded Hadoop release. In the distribution, edit the file conf/hadoop-env.sh; at a minimum, set JAVA_HOME to the root of your Java installation.
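For example, assuming the release unpacks to a hadoop-0.18.3 directory and Java is installed under /usr/lib/jvm/java-6-sun (both names are examples; substitute your own):
$ tar -xzf hadoop-0.18.3.tar.gz  # example archive name
$ cd hadoop-0.18.3
Then add a line such as the following to conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-sun  # example path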
Try the following command:
$ bin/hadoop
The usage documentation for the Hadoop script will be displayed.
Now you can start the Hadoop cluster in one of the following three supported modes:
- Standalone mode
- Pseudo-distributed mode
- Fully distributed mode
Standalone Operation
By default, Hadoop is configured to run in non-distributed mode, as a single Java process. This is very helpful for debugging.
The following example copies the unpacked conf directory for use as input, then finds and displays every match of the given regular expression. The output is written to the specified output directory.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*
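Note that Hadoop refuses to write into an existing output directory, so if you want to re-run the example, remove the previous output first:
$ rm -rf output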
Pseudo-Distributed Operation
Hadoop can also run on a single node in so-called pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.
Configuration
Use the following conf/hadoop-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
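With fs.default.name set to localhost:9000, the bin/hadoop fs commands used below will address the local HDFS instance rather than the local disk. For example, once the daemons are running (see below), you can list the root of the new filesystem with:
$ bin/hadoop fs -ls /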
Passphrase-Free SSH Setup
Now verify that you can log in to localhost with ssh without entering your password:
$ ssh localhost
If you cannot log in to localhost with ssh without entering a password, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
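If ssh still asks for a password after this, overly permissive permissions on ~/.ssh are a common cause; tightening them usually fixes it (a general SSH fix, not specific to Hadoop):
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys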
Execution
Format a new distributed filesystem:
$ bin/hadoop namenode -format
Start the Hadoop daemons:
$ bin/start-all.sh
The logs of the Hadoop daemons are written to the ${HADOOP_LOG_DIR} directory (which defaults to ${HADOOP_HOME}/logs).
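To confirm that the daemons started, you can list the running Java processes with jps, which ships with the Sun JDK; on a pseudo-distributed node you should see NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker:
$ jps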
Browse the web interfaces for the NameNode and the JobTracker; by default they are available at:
- NameNode: http://localhost:50070/
- JobTracker: http://localhost:50030/
Copy the input files to the Distributed File system:
$ bin/hadoop fs -put conf input
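To verify that the files arrived, list the input directory on the distributed filesystem:
$ bin/hadoop fs -ls input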
Run one of the sample programs provided with the release:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
Or
View the output files directly on the distributed filesystem:
$ bin/hadoop fs -cat output/*
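If you want to re-run the job, first remove the previous output directory on the distributed filesystem (fs -rmr is the recursive remove in this generation of Hadoop):
$ bin/hadoop fs -rmr output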
When you are done, stop the daemons:
$ bin/stop-all.sh