Objective
This document helps you quickly install and use Hadoop on a single machine so that you can get a feel for the Hadoop Distributed File System (HDFS) and the Map-Reduce framework, for example by running sample programs or simple jobs on HDFS.
Prerequisites
Supported Platforms
- GNU/Linux is supported as a development and production platform. Hadoop has been validated on GNU/Linux clusters of 2000 nodes.
- Win32 is supported as a development platform. Distributed operation has not been fully tested on Win32, so it is not supported as a production platform.
Required Software
Required software for both Linux and Windows:
- Java 1.5.x must be installed; the Java release from Sun is recommended.
- ssh must be installed and sshd must be running, so that the Hadoop scripts can manage remote Hadoop daemons.
Additional software requirements for Windows
- Cygwin: required for shell support, in addition to the software above.
Installing the Software
If your cluster does not have the required software installed, you will need to install it first.
For example, on Ubuntu Linux:
$ sudo apt-get install ssh
$ sudo apt-get install rsync
On Windows, if the required software was not installed along with Cygwin, start the Cygwin setup manager and install the missing packages.
Download
To get a Hadoop distribution, download a recent stable release from one of the Apache download mirrors.
Preparing to Start the Hadoop Cluster
Unpack the downloaded Hadoop distribution. Then edit the file conf/hadoop-env.sh and, at a minimum, set JAVA_HOME to the root of your Java installation.
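Before editing conf/hadoop-env.sh, it can help to confirm that the path you intend to use really is a Java installation root. The following is a minimal sketch, assuming that such a root contains an executable bin/java; check_java_home is a hypothetical helper, not part of Hadoop.

```shell
# check_java_home is a hypothetical helper, not part of Hadoop.
# It succeeds if the given directory looks like a Java installation root,
# which is assumed here to mean that it contains an executable bin/java.
check_java_home() {
    if [ -x "$1/bin/java" ]; then
        echo "ok: $1"
    else
        echo "not a Java root: $1" >&2
        return 1
    fi
}

# The line to set in conf/hadoop-env.sh then has this form
# (the path is a placeholder for your own installation):
#   export JAVA_HOME=/path/to/java
```

For example, check_java_home "$JAVA_HOME" prints an "ok" line when the variable already points at a usable installation.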
Try the following command:
$ bin/hadoop
The usage documentation for the Hadoop script will be displayed.
Now you can start the Hadoop cluster in one of the following three supported modes:
- Standalone mode
- Pseudo-distributed mode
- Fully distributed mode
Standalone Operation
By default, Hadoop is configured to run in non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input, then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*
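The grep example counts how often each matched string occurs in the input and lists the counts in decreasing order. To illustrate what it computes, the sketch below is a rough local equivalent built from standard tools; it is not how Hadoop implements the job. count_matches is a hypothetical helper, and the -o and -h options of grep are assumed to be available (GNU and BSD grep provide them).

```shell
# count_matches is a hypothetical helper, not part of Hadoop.
# It prints each distinct match of an extended regular expression found in
# the given files, preceded by its number of occurrences, most frequent first.
count_matches() {
    pattern=$1
    shift
    # -o prints only the matching text, -h omits file names,
    # -E enables extended syntax such as '+'.
    grep -ohE "$pattern" "$@" | sort | uniq -c | sort -rn
}
```

Running count_matches 'dfs[a-z.]+' input/*.xml after the mkdir and cp steps above gives a quick way to sanity-check the output of the Hadoop job.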
Pseudo-Distributed Operation
Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
Configuration
Use the following conf/hadoop-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Setting Up Passphraseless ssh
Now check that you can ssh to localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
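If the login still prompts for a password after these commands, file permissions are a common cause: with its default StrictModes setting, sshd can reject keys when ~/.ssh or authorized_keys is writable by other users. The sketch below tightens the permissions and then tests the login non-interactively. Both function names are hypothetical helpers introduced for illustration.

```shell
# Both helpers are hypothetical names used for illustration.

# Restrict an SSH directory to its owner (mode 700) and its
# authorized_keys file to owner read/write (mode 600).
restrict_ssh_permissions() {
    chmod 700 "$1" && chmod 600 "$1/authorized_keys"
}

# Succeed only if a login to the given host works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
can_login_without_password() {
    ssh -o BatchMode=yes "$1" true
}
```

A typical use would be restrict_ssh_permissions ~/.ssh followed by can_login_without_password localhost, whose exit status tells you whether the Hadoop scripts will be able to log in unattended.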
Execution
Format a new distributed filesystem:
$ bin/hadoop namenode -format
Start the Hadoop daemon:
$ bin/start-all.sh
The Hadoop daemon logs are written to the ${HADOOP_LOG_DIR} directory (default: ${HADOOP_HOME}/logs).
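The default rule for the log location can be expressed as a one-line shell function, which is convenient when you want to list or inspect the logs from a script. This is only a sketch: hadoop_log_dir is a hypothetical helper, and it treats an empty HADOOP_LOG_DIR the same as an unset one.

```shell
# hadoop_log_dir is a hypothetical helper that mirrors the rule above:
# use HADOOP_LOG_DIR when it is set and non-empty,
# otherwise fall back to HADOOP_HOME/logs.
hadoop_log_dir() {
    echo "${HADOOP_LOG_DIR:-${HADOOP_HOME}/logs}"
}
```

For example, ls "$(hadoop_log_dir)" lists the log files once the daemons have been started.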
Browse the web interfaces for the NameNode and the JobTracker. By default they are available at:
- NameNode - http://localhost:50070/
- JobTracker - http://localhost:50030/
Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
Run one of the examples provided with the release:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
When you are done, stop the daemons:
$ bin/stop-all.sh
Fully-Distributed Operation
Information on setting up fully distributed, non-trivial clusters can be found in the Hadoop cluster setup documentation.
For more information, see the official documentation: Hadoop QuickStart