Although Hadoop is at the core of the data reduction capabilities of some large search engines, it is more accurately described as a distributed data-processing framework. Search engines must collect data, and the amount of data is enormous. As a distributed framework, Hadoop enables many applications to benefit from parallel data processing.
Rather than introducing Hadoop and its architecture, this article demonstrates a simple Hadoop setup. Let's start with installing and configuring Hadoop.
Initial setup
For the example in this article, we use the Cloudera Hadoop distribution. Cloudera supports a variety of Linux® distributions, so it's ideal for beginners.
This article assumes that Java™ (at least version 1.6) and curl are already installed on your system. If they are not, install them first.
Because I run Ubuntu (the Intrepid release), I use the apt utility to fetch the Hadoop distribution. This process is simple: I can get the binary packages without needing to download and build the source. First, tell apt about the Cloudera site. Create a new file, /etc/apt/sources.list.d/cloudera.list, and add the following text:
deb http://archive.cloudera.com/debian intrepid-cdh3 contrib
deb-src http://archive.cloudera.com/debian intrepid-cdh3 contrib
If you are running Jaunty or another release, simply replace intrepid with your release name (Hardy, Intrepid, Jaunty, Karmic, and Lenny are currently supported).
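As a sketch, the two entries can be generated for any supported release with a small shell snippet; the RELEASE variable here is purely illustrative:

```shell
# Print the two cloudera.list entries for a chosen release.
# Set RELEASE to hardy, intrepid, jaunty, karmic, or lenny.
RELEASE=jaunty
printf 'deb http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$RELEASE"
printf 'deb-src http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$RELEASE"
```

Redirecting this output (with sudo) to /etc/apt/sources.list.d/cloudera.list produces the same file described above.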
Next, get the apt key from Cloudera so the downloaded packages can be verified:
$ curl -s http://archive.cloudera.com/debian/archive.key | \
  sudo apt-key add -
$ sudo apt-get update
Then, install Hadoop with a pseudo-distributed configuration (all Hadoop daemons run on a single host):
$ sudo apt-get install hadoop-0.20-conf-pseudo
$
Note that this configuration is approximately 23MB (excluding any other packages apt may download). It is ideal for getting hands-on experience with Hadoop and learning its elements and interfaces.
Finally, I set up passwordless SSH. If you try ssh localhost and are prompted for a password, you need to perform the following steps. I assume this is a dedicated Hadoop machine, because this step has security implications (see Listing 1).
Listing 1. Setting up passwordless SSH
$ sudo su -
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
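If you want to try these commands without touching the root account's real ~/.ssh directory, a minimal sketch is to run the same sequence against a throwaway directory (the paths here are illustrative, and an RSA key is used since some newer OpenSSH builds reject DSA):

```shell
# Generate a passphrase-less key pair in a temporary directory and
# append the public key to an authorized_keys file, mirroring Listing 1.
KEYDIR=$(mktemp -d)
ssh-keygen -t rsa -N '' -f "$KEYDIR/id_rsa" -q
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys"
```

Once you are satisfied with how it works, apply the Listing 1 commands to the real ~/.ssh directory, then confirm that ssh localhost no longer prompts for a password.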
Finally, you need to ensure that the host has sufficient storage space for the DataNode to use. A lack of storage space can show up as odd system behavior (such as errors indicating that data could not be replicated to a node).
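A quick way to check is df. The path below is an assumption based on the default data directory of the CDH3 pseudo-distributed configuration; if it does not exist yet, fall back to checking the root filesystem:

```shell
# Show free space on the filesystem that will hold the DataNode's blocks.
# /var/lib/hadoop-0.20/cache is assumed from the CDH3 pseudo config defaults.
df -h /var/lib/hadoop-0.20/cache 2>/dev/null || df -h /
```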