Although Hadoop is at the core of the data reduction capabilities of some large search engines, it is more accurately described as a distributed data-processing framework. Search engines must collect data, and the amount of data is enormous. As a distributed framework, Hadoop enables many applications to benefit from parallel data processing.
Rather than introducing Hadoop and its architecture, this article demonstrates a simple Hadoop setup. Let's start with installing and configuring Hadoop.
Initial setup
For the example in this article, we use the Cloudera Hadoop distribution. Cloudera supports a variety of Linux® distributions, so it's ideal for beginners.
This article assumes that Java™ (at least version 1.6) and curl are already installed on your system. If they are not, install them first.
Because I run Ubuntu (the Intrepid release), I use the apt utility to fetch the Hadoop distribution. This process is simple: I can get the binary packages without needing to download and build the source. First, tell apt about the Cloudera site. Create a new file, /etc/apt/sources.list.d/cloudera.list, and add the following text:
deb http://archive.cloudera.com/debian intrepid-cdh3 contrib
deb-src http://archive.cloudera.com/debian intrepid-cdh3 contrib
If you are running Jaunty or another release, simply replace intrepid with your release name (Hardy, Intrepid, Jaunty, Karmic, and Lenny are currently supported).
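As a sketch, the two entries can be generated for any supported release with a small shell snippet; the RELEASE variable here is purely illustrative:

```shell
# Print the two cloudera.list entries for a chosen release.
# Set RELEASE to hardy, intrepid, jaunty, karmic, or lenny.
RELEASE=jaunty
printf 'deb http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$RELEASE"
printf 'deb-src http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$RELEASE"
```

Redirecting this output (with sudo) to /etc/apt/sources.list.d/cloudera.list produces the same file described above.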
Next, get the apt key from Cloudera so the downloaded packages can be verified:
$ curl -s http://archive.cloudera.com/debian/archive.key | \
  sudo apt-key add -
$ sudo apt-get update
Then, install Hadoop with a pseudo-distributed configuration (all Hadoop daemons run on a single host):
$ sudo apt-get install hadoop-0.20-conf-pseudo
$
Note that this configuration is approximately 23MB (excluding any other packages apt may download). It is ideal for getting hands-on experience with Hadoop and learning its elements and interfaces.
Finally, I set up passwordless SSH. If you try ssh localhost and are prompted for a password, you need to perform the following steps. I assume this is a dedicated Hadoop machine, because this step has security implications (see Listing 1).
Listing 1. Setting up passwordless SSH
$ sudo su -
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
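If you want to try these commands without touching the root account's real ~/.ssh directory, a minimal sketch is to run the same sequence against a throwaway directory (the paths here are illustrative, and an RSA key is used since some newer OpenSSH builds reject DSA):

```shell
# Generate a passphrase-less key pair in a temporary directory and
# append the public key to an authorized_keys file, mirroring Listing 1.
KEYDIR=$(mktemp -d)
ssh-keygen -t rsa -N '' -f "$KEYDIR/id_rsa" -q
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys"
```

Once you are satisfied with how it works, apply the Listing 1 commands to the real ~/.ssh directory, then confirm that ssh localhost no longer prompts for a password.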
Finally, you need to ensure that the host has sufficient storage space for the DataNode to use. A lack of storage space can show up as odd system behavior (such as errors indicating that data could not be replicated to a node).
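A quick way to check is df. The path below is an assumption based on the default data directory of the CDH3 pseudo-distributed configuration; if it does not exist yet, fall back to checking the root filesystem:

```shell
# Show free space on the filesystem that will hold the DataNode's blocks.
# /var/lib/hadoop-0.20/cache is assumed from the CDH3 pseudo config defaults.
df -h /var/lib/hadoop-0.20/cache 2>/dev/null || df -h /
```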