Quick start to Hadoop

Last Update:2015-03-17 Source: Internet

Author: User

Objective

The purpose of this document is to help you quickly install and use Hadoop on a single computer so that you can get a feel for the Hadoop Distributed File System (HDFS) and the Map-Reduce framework, for example by running the sample programs or simple jobs on HDFS.

Prerequisites

Supported platforms

GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.

Required software

Required software for both Linux and Windows:

Java 1.5.x must be installed; the version released by Sun is recommended.
ssh must be installed and sshd must be running so that the Hadoop scripts can manage remote Hadoop daemons.

Additional software requirements under Windows:

Cygwin, which is required for shell support in addition to the software above.

Install software

If your cluster does not have the required software installed, you must first install them.

Take Ubuntu Linux for example:

$ sudo apt install ssh
$ sudo apt install rsync

On the Windows platform, if you did not install all the required software when you installed Cygwin, start the Cygwin Setup Manager and install the following package:

openssh, in the Net category

Download

To get a Hadoop release, download the most recent stable release from one of the Apache mirror servers.

Preparing to run the Hadoop cluster

Unzip the downloaded Hadoop release. Edit the conf/hadoop-env.sh file and, at a minimum, set JAVA_HOME to the root of your Java installation.
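As an illustration, the relevant line in conf/hadoop-env.sh might look like the sketch below. The path is hypothetical and only stands in for wherever Java is installed on your machine.

```shell
# conf/hadoop-env.sh (excerpt)
# Hypothetical path: replace it with the root of your own Java installation.
export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun
```

The directory should be the one that contains bin/java, not the bin directory itself.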

Try the following command:
$ bin/hadoop
The usage documentation for the hadoop script will be displayed.

Now you can start the Hadoop cluster in one of the following three supported modes:

Standalone mode
Pseudo-distributed mode
Fully distributed mode

Standalone operation

By default, Hadoop is configured as a stand-alone Java process that runs in a non-distributed mode. This is very helpful for debugging.

The following example copies the unpacked conf directory to use as input, then finds and displays every match of the given regular expression. The output is written to the given output directory.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*
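To get a sense of what the output of the grep example should contain, the same counting can be approximated locally with standard tools. This is only a rough analogue for checking results, not how the Hadoop job is implemented; the function name local_grep_count is made up for this sketch, and it assumes a grep that supports the -o and -h options (GNU and BSD grep both do).

```shell
# local_grep_count DIR REGEX
# Prints "<count><TAB><match>" for each distinct match of REGEX
# found in the files under DIR, most frequent match first.
local_grep_count() {
    dir=$1
    regex=$2
    tab=$(printf '\t')
    grep -ohE "$regex" "$dir"/* \
        | sort \
        | uniq -c \
        | awk '{ printf "%s\t%s\n", $1, $2 }' \
        | sort -t "$tab" -k1,1nr -k2,2
}
```

For example, local_grep_count input 'dfs[a-z.]+' lists each matched string next to the number of times it occurs in the copied configuration files.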

Pseudo-distributed operation

Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.

Configuration

Use the following conf/hadoop-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Password-free SSH settings

Now confirm that you can log in to localhost with ssh without entering a password:
$ ssh localhost

If you cannot log in to localhost with ssh without a password, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
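Running the cat command more than once appends the same key again each time. A minimal sketch of a repeatable variant follows; the function name authorize_key is hypothetical, and the chmod step assumes sshd is configured to reject an authorized_keys file that other users can write to, which is a common default.

```shell
# authorize_key PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Appends the public key only if that exact line is not already present.
authorize_key() {
    pub=$1
    auth=$2
    touch "$auth"
    if ! grep -qxF -- "$(cat "$pub")" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    # Keep the file private to its owner.
    chmod 600 "$auth"
}
```

It would be called as authorize_key ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys after the key pair has been generated.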

Execution

Format a new distributed file system:
$ bin/hadoop namenode -format

Start the Hadoop daemons:
$ bin/start-all.sh

The Hadoop daemon logs are written to the ${HADOOP_LOG_DIR} directory (the default is ${HADOOP_HOME}/logs).
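The default rule for the log directory can be expressed with ordinary shell parameter expansion. The helper below is a sketch for locating the logs from your own shell; the function name hadoop_log_dir is made up here, and it assumes HADOOP_HOME is set in your environment.

```shell
# Prints the directory where the daemon logs are expected:
# HADOOP_LOG_DIR if it is set and non-empty, otherwise HADOOP_HOME/logs.
hadoop_log_dir() {
    printf '%s\n' "${HADOOP_LOG_DIR:-${HADOOP_HOME}/logs}"
}
```

Running ls "$(hadoop_log_dir)" after starting the daemons should then list their log files.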

Browse the web interfaces for the NameNode and the JobTracker; by default they are available at:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

Copy the input files into the distributed file system:
$ bin/hadoop fs -put conf input

Run the sample program provided with the release:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

Examine the output files:

Copy the output files from the distributed file system to the local file system and examine them:
$ bin/hadoop fs -get output output
$ cat output/*

Or view the output files directly on the distributed file system:
$ bin/hadoop fs -cat output/*

When you are done, stop the daemons:
$ bin/stop-all.sh

Fully distributed operation

Information on setting up fully distributed, non-trivial clusters can be found in the Hadoop cluster setup documentation.

Java and JNI are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
