Distributed data processing with Hadoop, Part 1

Although Hadoop sits at the core of the data reduction capabilities of some large search engines, it is more accurately described as a distributed data-processing framework. Search engines must collect data, and the amount of data is enormous. As a distributed framework, Hadoop lets many applications benefit from parallel data processing.

Rather than introducing Hadoop and its architecture, this article demonstrates a simple Hadoop setup. Let's walk through the installation and configuration of Hadoop.

Initial setup

For the examples in this article, we use the Cloudera Hadoop distribution. Cloudera offers support for a variety of Linux® distributions, so it's ideal for beginners.

This article assumes that Java™ (at least version 1.6) and curl are already installed on your system. If they are not, install them first.
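A quick way to check is to ask each tool for its version (a simple sanity check; the exact output varies by system):

$ java -version
$ curl --version

If either command is not found, install the tool through your distribution's package manager before continuing.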

Because I run Ubuntu (the Intrepid release), I use the apt utility to fetch the Hadoop distribution. The process is simple: I can get the binary packages without having to download and build the source code. First, tell apt about the Cloudera site: create a new file, /etc/apt/sources.list.d/cloudera.list, and add the following text:

deb http://archive.cloudera.com/debian intrepid-cdh3 contrib
deb-src http://archive.cloudera.com/debian intrepid-cdh3 contrib

If you are running Jaunty or another release, simply replace intrepid with your release's codename (Hardy, Intrepid, Jaunty, Karmic, and Lenny are currently supported), as in the one-liner below.
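If you'd rather not edit the file by hand, a substitution like the following fills in your release's codename automatically (a convenience sketch; it assumes the lsb_release utility is present and that the file contains the intrepid entries shown above):

$ sudo sed -i "s/intrepid/$(lsb_release -cs)/g" /etc/apt/sources.list.d/cloudera.list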

Next, retrieve the apt key from Cloudera so the downloaded packages can be validated, then update the package index:

$ curl -s http://archive.cloudera.com/debian/archive.key | \
    sudo apt-key add -
$ sudo apt-get update

Then, install Hadoop in a pseudo-distributed configuration (all of the Hadoop daemons run on a single host):

$ sudo apt-get install hadoop-0.20-conf-pseudo
$

Note that this download is approximately 23 MB (excluding any other packages apt may pull in). This configuration is ideal for experimenting with Hadoop and learning about its elements and interfaces.
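You can confirm that the installation succeeded by asking Hadoop for its version (a quick check; depending on how the package names its wrapper script, the command may be hadoop-0.20 or simply hadoop):

$ hadoop-0.20 version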

Finally, set up SSH so that it doesn't require a password. If you try ssh localhost and are prompted for a password, you need to perform the steps in Listing 1. I assume this is a dedicated Hadoop machine, because this step has security implications.

Listing 1. Setting up passwordless SSH

$ sudo su -
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys 
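To verify, try to SSH to localhost again (as the same root user the keys were generated for); the first connection may ask you to confirm the host key, but you should no longer be prompted for a password:

# ssh localhost 'echo passwordless SSH works'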

Finally, ensure that the host has sufficient storage available for the datanode (the cache). Insufficient storage manifests itself in odd ways (such as errors complaining that data cannot be replicated to a node).
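Checking is straightforward: df shows free space on the local file system, and once the Hadoop daemons are running, dfsadmin reports HDFS capacity (a sketch; dfsadmin requires the namenode to be up, and as above the command may be hadoop rather than hadoop-0.20 on some installs):

$ df -h
$ hadoop-0.20 dfsadmin -report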
