What we want to do
In this short tutorial, I'll describe the required steps for setting up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux.
Are you looking for the multi-node cluster tutorial? Just head over there.
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
Figure 1: Cluster of machines running Hadoop at Yahoo! (Source: Yahoo!)
The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.
This tutorial has been tested with the following software:
Ubuntu Linux 8.04, 7.10, 7.04
Hadoop 0.18.0, released August 2008 (also works with 0.13.x-0.17.x)
You can find the time of the last document update at the very bottom of this page.
Prerequisites
Sun Java 6
Hadoop requires a working Java 1.5.x (aka 5.0.x) installation. However, using Java 1.6.x (aka 6.0.x aka 6) is recommended for running Hadoop. For the sake of this tutorial, I'll therefore describe the installation of Java 1.6. But if you want Java 1.5 for whatever reason, simply use the package sun-java5-jdk and adjust the paths described below as needed.
Install Sun's Java Development Kit v1.6.0, aka "Sun Java(TM) Development Kit (JDK) 6" as it is named on Ubuntu, via Synaptic (System > Administration > Synaptic Package Manager) or via apt-get: install the package sun-java6-jdk.
The full JDK will be placed in /usr/lib/jvm/java-6-sun (well, this directory is actually a symlink on Ubuntu).
After installation, check whether Sun's JDK is at the top of /etc/jvm. For example, mine looks like this:
# /etc/jvm
#
# This file defines the default system JVM search order. Each
# JVM should list their JAVA_HOME compatible directory in this file.
# The default system JVM is the first one available from top to
# bottom.
/usr/lib/jvm/java-6-sun
/usr/lib/jvm/java-gcj
/usr/lib/jvm/ia32-java-1.5.0-sun
/usr/lib/jvm/java-1.5.0-sun
/usr
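If you would rather check the search order with a script than by eye, the idea can be sketched in a few lines of Python. This is only an illustration: the function name first_jvm_dir is made up for this example, and it assumes the file format shown above (comment lines starting with #, one directory per line).

```python
def first_jvm_dir(jvm_file_text):
    """Return the first entry that is neither blank nor a comment, or None."""
    for line in jvm_file_text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            return entry
    return None


# Hypothetical sample content in the format shown above.
sample = """\
# /etc/jvm
# comment lines are skipped

/usr/lib/jvm/java-6-sun
/usr/lib/jvm/java-gcj
"""

print(first_jvm_dir(sample))
```

On a real machine you would pass in the contents of /etc/jvm and compare the result with /usr/lib/jvm/java-6-sun.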
Adding a dedicated Hadoop system user
We'll use a dedicated Hadoop user account for running Hadoop. While that's not required, it's recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop
This will add the user hadoop and the group hadoop to your local machine.
Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hadoop user we created in the previous section.
I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are several guides available.
First, we have to generate an SSH key for the hadoop user.
noll@ubuntu:~$ su - hadoop
hadoop@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
9d:47:ab:d7:22:54:f0:f9:b9:3b:64:93:12:75:81:27 hadoop@ubuntu
hadoop@ubuntu:~$
The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).
Second, you have to enable SSH access to your local machine with this newly created key.
hadoop@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine with the hadoop user. The step is also needed to save your local machine's host key fingerprint to the hadoop user's known_hosts file. If you have any special SSH configuration for your local machine, like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
hadoop@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 76:d7:61:86:ea:86:8f:31:89:9f:68:b0:75:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Ubuntu 8.04
...
hadoop@ubuntu:~$
If the SSH connection should fail, these general tips might help:
Enable debugging with ssh -vvv localhost and investigate the error in detail.
Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hadoop user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
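The two sshd_config checks from the last tip can be sketched in Python as follows. Treat this as a rough illustration under simplifying assumptions: the function name sshd_config_problems is made up, only whitespace-separated "Keyword value" lines are handled, Match blocks are ignored, and AllowUsers entries are compared as plain user names (no patterns or user@host forms).

```python
def sshd_config_problems(config_text, user="hadoop"):
    """Return a list of problems found for the two options discussed above."""
    pubkey = None   # first PubkeyAuthentication value seen, if any
    allowed = []    # all names listed on AllowUsers lines
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        keyword, args = parts[0].lower(), parts[1:]  # keywords are case-insensitive
        if keyword == "pubkeyauthentication" and pubkey is None and args:
            pubkey = args[0].lower()
        elif keyword == "allowusers":
            allowed.extend(args)

    problems = []
    if pubkey == "no":
        problems.append("PubkeyAuthentication is set to no")
    if allowed and user not in allowed:
        problems.append("AllowUsers is active but does not list " + user)
    return problems


# Hypothetical config snippet for illustration.
print(sshd_config_problems("PubkeyAuthentication no\nAllowUsers noll\n"))
```

An empty list means neither of the two problems was spotted; it does not prove that the SSH setup as a whole is correct.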
Disabling IPv6
I have not found out how to configure Hadoop to listen on all IPv4 (again: IPv4) network interfaces. Using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses on my Ubuntu box.
As a workaround (and realizing that there's no practical point in enabling IPv6 on a box that is not connected to any IPv6 network), I simply disabled IPv6 on my Ubuntu machine.
To disable IPv6 on Ubuntu Linux, open /etc/modprobe.d/blacklist in the editor of your choice and add the following lines to the end of the file:
# disable IPv6
blacklist ipv6
You have to reboot your machine for the changes to take effect.
Hadoop Installation
You have to download Hadoop from the Apache download mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hadoop user and group, for example:
$ cd /usr/local
$ sudo tar xzf hadoop-0.18.0.tar.gz
$ sudo mv hadoop-0.18.0 hadoop
$ sudo chown -R hadoop:hadoop hadoop
(Just to give you the idea, YMMV. Personally, I create a symlink from hadoop-0.18.0 to hadoop.)
Excursus: Hadoop Distributed File System (HDFS)
From The Hadoop Distributed File System: Architecture and Design:
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which is part of the Apache Lucene project.
The following picture gives an overview of the most important HDFS components.
HDFS Architecture (source: http://hadoop.apache.org/core/docs/current/hdfs_design.html)
Configuration
Our goal in this tutorial is a single-node setup of Hadoop. More information on what we do in this section is available on the Hadoop Wiki.
hadoop-env.sh
The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open <hadoop_install>/conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.
Change
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to
# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
If you chose to use Java 1.5, remember to put the correct paths in here!
hadoop-site.xml
Any site-specific configuration of Hadoop is configured in <hadoop_install>/conf/hadoop-site.xml. Here we'll configure the directory where Hadoop will store its data files, the ports it listens to, etc. Our setup will use Hadoop's Distributed File System, HDFS, even though our little "cluster" only contains our single local machine.
You can leave the settings below as they are, with the exception of the hadoop.tmp.dir variable, which you have to change to the directory of your choice, for example /usr/local/hadoop-datastore/hadoop-${user.name}. Hadoop will expand ${user.name} to the system user which is running Hadoop, so in our case this will be hadoop and thus the final path will be /usr/local/hadoop-datastore/hadoop-hadoop.
Note: Depending on your choice of location, you might have to create the directory manually with sudo mkdir /your/path; sudo chown hadoop:hadoop /your/path in case the hadoop user does not have the required permissions to do so (otherwise, you'll see a java.io.IOException when you try to format the name node in the next section).
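To make the ${user.name} expansion concrete, here is a small Python sketch that mimics the substitution on a plain string. It is not Hadoop's own code: the helper expand_variables and the dictionary passed to it are assumptions made for illustration only.

```python
import re


def expand_variables(value, properties):
    """Replace each ${name} in value with properties[name]; leave unknown names as-is."""
    def lookup(match):
        name = match.group(1)
        return properties.get(name, match.group(0))
    return re.sub(r"\$\{([^}]+)\}", lookup, value)


# The example path from the text, expanded for the hadoop system user.
path = expand_variables(
    "/usr/local/hadoop-datastore/hadoop-${user.name}",
    {"user.name": "hadoop"},
)
print(path)
```

With the hadoop user from this tutorial, the result is the final path named above.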
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  URI's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The URI's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

</configuration>
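A quick way to double-check what ended up in a hadoop-site.xml file is to read the name/value pairs back. The following Python sketch does that with the standard library's XML parser; the function name read_site_properties and the shortened sample document are assumptions for this example, and the sketch is not how Hadoop itself loads its configuration.

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Return a dict mapping each <name> to its <value> in a site configuration."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Shortened copy of the settings above (descriptions omitted).
SAMPLE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>"""

props = read_site_properties(SAMPLE_SITE_XML)
for name in sorted(props):
    print(name, "=", props[name])
```

Note that all values come back as strings, so dfs.replication is "1" rather than the number 1.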
See Getting Started with Hadoop and the documentation in Hadoop's API Overview if you have any questions about Hadoop's configuration options.
Formatting the name node
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster" (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem; this will cause all your data to be erased.
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
hadoop@ubuntu:~$ <hadoop_install>/bin/hadoop namenode -format
The output will look like this:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
07/09/21 12:00:25 INFO dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.0.1
STARTUP_MSG:   args = [-format]
************************************************************/
07/09/21 12:00:25 INFO dfs.Storage: Storage directory [...] has been successfully formatted.
07/09/21 12:00:25 INFO dfs.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.0.1
************************************************************/
hadoop@ubuntu:/usr/local/hadoop$
Starting your single-node cluster
Run the command:
hadoop@ubuntu:~$ <hadoop_install>/bin/start-all.sh
This will start up a NameNode, DataNode, JobTracker and a TaskTracker on your machine.
The output will look like this:
hadoop@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-ubuntu.out
hadoop@ubuntu:/usr/local/hadoop$
A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun's Java since v1.5.0). See also How to debug MapReduce programs.
hadoop@sea:/usr/local/hadoop/$ jps
19811 TaskTracker
19674 SecondaryNameNode
19735 JobTracker
19497 NameNode
20879 TaskTracker$Child
21810 Jps
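If you want to automate the jps check, a minimal Python sketch could look like this. It assumes each jps output line has the form "<pid> <name>" and that the four daemons named earlier are the ones you expect; the function name missing_daemons and the sample text are made up for illustration.

```python
def missing_daemons(jps_output,
                    expected=("NameNode", "DataNode", "JobTracker", "TaskTracker")):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            running.add(parts[1])  # second column is the process name
    return [name for name in expected if name not in running]


# Hypothetical jps output in which only the name node is running.
sample_output = "101 NameNode\n202 Jps\n"
print(missing_daemons(sample_output))
```

In practice you would feed it the real output of jps, for example captured with Python's subprocess module.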
You can also check with netstat if Hadoop is listening on the configured ports.
hadoop@ubuntu:~$ sudo netstat -plten | grep java
tcp   0   0 0.0.0.0:50050     0.0.0.0:*   LISTEN   1001   86234   23634/java
tcp   0   0 127.0.0.1:54310   0.0.0.0:*   LISTEN   1001   85800   23317/java
tcp   0   0 127.0.0.1:54311   0.0.0.0:*   LISTEN   1001   86383   23543/java
tcp   0   0 0.0.0.0:50090     0.0.0.0:*   LISTEN   1001   86119   23478/java
tcp   0   0 0.0.0.0:50060     0.0.0.0:*   LISTEN   1001   86233   23634/java
tcp   0   0 0.0.0.0:50030     0.0.0.0:*   LISTEN   1001   86393   23543/java
tcp   0   0 0.0.0.0:50070     0.0.0.0:*   LISTEN   1001   85964   23317/java
tcp   0   0 0.0.0.0:50010     0.0.0.0:*   LISTEN   1001   86045   23389/java
tcp   0   0 0.0.0.0:50075     0.0.0.0:*   LISTEN   1001   86102   23389/java
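The same kind of check can be done without netstat by simply trying to connect to a port. The Python sketch below is one possible way to do it; the helper name is_listening is an assumption, and a successful TCP connect only tells you that something is listening there, not that it is Hadoop.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, host not resolvable, etc.
        return False
```

For the setup in this tutorial you might call is_listening("localhost", 54310) and is_listening("localhost", 54311) to probe the ports configured in hadoop-site.xml.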
If there are any errors, examine the log files in the <hadoop_install>/logs/ directory.
Stopping your single-node cluster
Run the command
hadoop@ubuntu:~$ <hadoop_install>/bin/stop-all.sh
to stop the daemons running on your machine.
Exemplary output:
hadoop@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: Ubuntu 8.04
localhost: stopping tasktracker
stopping namenode
localhost: Ubuntu 8.04
localhost: stopping datanode
localhost: Ubuntu 8.04
localhost: stopping secondarynamenode
hadoop@ubuntu:/usr/local/hadoop$
Running a MapReduce job
We'll now run your first Hadoop MapReduce job. We'll use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information on what happens behind the scenes is available at the Hadoop Wiki.
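To give a feeling for what such a job computes, here is a tiny single-process Python sketch of the map and reduce steps. It is not the Hadoop example's code: the function names are made up, words are simply split on whitespace, and the output lines are sorted by word, which is an assumption about presentation rather than a statement about Hadoop internals.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum up the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run map and reduce, then format one 'word<TAB>count' line per word."""
    pairs = (pair for line in lines for pair in map_words(line))
    totals = reduce_counts(pairs)
    return ["%s\t%d" % (word, totals[word]) for word in sorted(totals)]


result = word_count(['"A quote', "A word and a word"])
for line in result:
    print(line)
```

Because tokens are split on whitespace only, punctuation stays attached to the words, so "A (with a leading quote sign) and A are counted separately.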
Download example input data
We'll use three ebooks from Project Gutenberg for this example:
The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
The Notebooks of Leonardo Da Vinci
Ulysses by James Joyce
Download each ebook as a plain text file in ASCII encoding and store the uncompressed files in a temporary directory of your choice, for example /tmp/gutenberg.
hadoop@ubuntu:~$ ls -l /tmp/gutenberg/
total 3592
-rw-r--r-- 1 hadoop hadoop  674425 2007-01-22 12:56 20417-8.txt
-rw-r--r-- 1 hadoop hadoop 1423808 2006-08-03 16:36 7ldvc10.txt
-rw-r--r-- 1 hadoop hadoop 1561677 2004-11-26 09:48 ulyss12.txt
hadoop@ubuntu:~$
Restart the Hadoop cluster
Restart your Hadoop cluster if it's not running already.
hadoop@ubuntu:~$ <hadoop_install>/bin/start-all.sh
Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop's HDFS.
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
/user/hadoop/gutenberg  <dir>
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg
Found 3 items
/user/hadoop/gutenberg/20417-8.txt  <r 1>  674425
/user/hadoop/gutenberg/7ldvc10.txt  <r 1>  1423808
/user/hadoop/gutenberg/ulyss12.txt  <r 1>  1561677
Run the MapReduce job
Now, we actually run the WordCount example job.
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.18.0-examples.jar wordcount gutenberg gutenberg-output
This command will read all the files in the HDFS directory gutenberg, process them, and store the result in the HDFS directory gutenberg-output.
Exemplary output of the previous command in the console:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.18.0-examples.jar wordcount gutenberg gutenberg-output
07/09/21 13:00:30 INFO mapred.FileInputFormat: Total input paths to process : 3
07/09/21 13:00:31 INFO mapred.JobClient: Running job: job_200709211255_0001
07/09/21 13:00:32 INFO mapred.JobClient:  map 0% reduce 0%
07/09/21 13:00:42 INFO mapred.JobClient:  map 66% reduce 0%
07/09/21 13:00:47 INFO mapred.JobClient:  map 100% reduce 22%
07/09/21 13:00:54 INFO mapred.JobClient:  map 100% reduce 100%
07/09/21 13:00:55 INFO mapred.JobClient: Job complete: job_200709211255_0001
07/09/21 13:00:55 INFO mapred.JobClient: Counters: 12
07/09/21 13:00:55 INFO mapred.JobClient:   Job Counters
07/09/21 13:00:55 INFO mapred.JobClient:     Launched map tasks=3
07/09/21 13:00:55 INFO mapred.JobClient:     Launched reduce tasks=1
07/09/21 13:00:55 INFO mapred.JobClient:     Data-local map tasks=3
07/09/21 13:00:55 INFO mapred.JobClient:   Map-Reduce Framework
07/09/21 13:00:55 INFO mapred.JobClient:     Map input records=77637
07/09/21 13:00:55 INFO mapred.JobClient:     Map output records=628439
07/09/21 13:00:55 INFO mapred.JobClient:     Map input bytes=3659910
07/09/21 13:00:55 INFO mapred.JobClient:     Map output bytes=6061344
07/09/21 13:00:55 INFO mapred.JobClient:     Combine input records=628439
07/09/21 13:00:55 INFO mapred.JobClient:     Combine output records=103910
07/09/21 13:00:55 INFO mapred.JobClient:     Reduce input groups=85096
07/09/21 13:00:55 INFO mapred.JobClient:     Reduce input records=103910
07/09/21 13:00:55 INFO mapred.JobClient:     Reduce output records=85096
hadoop@ubuntu:/usr/local/hadoop$
Check if the result is successfully stored in the HDFS directory gutenberg-output:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 2 items
/user/hadoop/gutenberg  <dir>
/user/hadoop/gutenberg-output  <dir>
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg-output
Found 1 items
/user/hadoop/gutenberg-output/part-00000  <r 1>  903193
hadoop@ubuntu:/usr/local/hadoop$
If you want to modify some Hadoop settings on the fly, like increasing the number of reduce tasks, you can use the "-D" option:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.18.0-examples.jar wordcount -D mapred.reduce.tasks=16 gutenberg gutenberg-output
An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint. But it accepts the user-specified mapred.reduce.tasks and doesn't manipulate that. You cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.
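If you find yourself trying out several settings, it can help to assemble the command line in a small script. The Python sketch below only builds the argument list shown above; the function name wordcount_command is made up, and the default jar name is the one used in this tutorial.

```python
def wordcount_command(input_dir, output_dir, settings=None,
                      jar="hadoop-0.18.0-examples.jar"):
    """Build the argument list for the WordCount example with optional -D settings."""
    command = ["bin/hadoop", "jar", jar, "wordcount"]
    for key, value in sorted((settings or {}).items()):
        command += ["-D", "%s=%s" % (key, value)]
    return command + [input_dir, output_dir]


command = wordcount_command("gutenberg", "gutenberg-output",
                            {"mapred.reduce.tasks": 16})
print(" ".join(command))
```

The resulting list could be handed to Python's subprocess.run from within /usr/local/hadoop, which is left out here to keep the sketch free of side effects.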
Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat gutenberg-output/part-00000
to read the file directly from HDFS without copying it to the local file system. In this tutorial, we'll copy the results to the local file system though.
hadoop@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyToLocal gutenberg-output/part-00000 /tmp/gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/part-00000
"(Lo)cra"	1
"1490	1
"1498,"	1
[...]
"A	2
"AS-IS".	2
"A_	1
"Absoluti	1
"Alack!	1
hadoop@ubuntu:/usr/local/hadoop$
Note that in this specific output the quote signs (") enclosing the words in the head output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the part-00000 file further to see it for yourself.
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
http://localhost:50030/ - web UI for MapReduce job tracker(s)
http://localhost:50060/ - web UI for task tracker(s)
http://localhost:50070/ - web UI for HDFS name node(s)
These web interfaces provide concise information about what's happening in your Hadoop cluster. You might want to give them a try.
MapReduce Job Tracker Web Interface
The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web UI is running).
By default, it's available at http://localhost:50030/.
Figure 2: A screenshot of Hadoop's Job Tracker web interface.
Task Tracker Web Interface
The task tracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50060/.
Figure 3: A screenshot of Hadoop's Task Tracker web interface.
HDFS Name Node Web Interface
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50070/.
Figure 4: A screenshot of Hadoop's Name Node web interface.
What's next?
If you're feeling comfortable, you can continue your Hadoop experience with my follow-up tutorial Running Hadoop on Ubuntu Linux (Multi-Node Cluster), where I describe how to build a Hadoop multi-node cluster with two Ubuntu boxes (this will increase your current cluster size by 100% :-P).
In addition, I wrote a tutorial on how to code a simple MapReduce job in the Python programming language, which can serve as the basis for writing your own MapReduce programs.