Hadoop Quick Start

Source: Internet
Author: User
Objective

The purpose of this document is to help you quickly install and use Hadoop on a single computer so that you can try out the Hadoop Distributed File System (HDFS) and the Map-Reduce framework, for example by running the sample programs or simple jobs on HDFS.

Prerequisites

Supported platforms

GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.

Required software

The required software for both Linux and Windows includes:

Java 1.5.x must be installed; the Sun release is recommended.
ssh must be installed and sshd must be running, so that the Hadoop scripts can manage remote Hadoop daemons.

Additional software requirements under Windows:

Cygwin, which is required for shell support in addition to the software above.

Installing software

If your cluster does not have the required software installed, you must install it first.

For example, on Ubuntu Linux:

$ sudo apt install ssh
$ sudo apt install rsync
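
To confirm that the required tools are available after installation, a small check along the following lines may help. This is only a sketch: the function name `missing_tools` is made up for this guide, and it only tests whether a command is on the PATH, not whether sshd is actually running.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
# (missing_tools is a hypothetical helper, not part of Hadoop.)
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example: the client-side tools this guide relies on.
# missing_tools java ssh rsync
```

A silent run with exit status 0 means every listed tool was found.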

On Windows, if you did not install all the required software when you installed Cygwin, start the Cygwin setup manager and install the following package:

openssh (in the Net category)

Download

To get a Hadoop release, download the most recent stable release from one of the Apache mirror servers.

Preparing to run the Hadoop cluster

Unpack the downloaded Hadoop release. Then edit the conf/hadoop-env.sh file and set at least JAVA_HOME to the root of your Java installation.
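
One way to make that edit from the command line is to append an export line to the file, as in the sketch below. The helper name `set_java_home` and the example Java path are assumptions for illustration; substitute the actual location of your Java installation.

```shell
# Append a JAVA_HOME definition to a hadoop-env.sh style file.
# $1: path to the env file, $2: Java installation root (must exist).
# (set_java_home is a hypothetical helper, not part of Hadoop.)
set_java_home() {
    env_file=$1
    java_root=$2
    if [ ! -d "$java_root" ]; then
        echo "not a directory: $java_root" >&2
        return 1
    fi
    printf 'export JAVA_HOME="%s"\n' "$java_root" >> "$env_file"
}

# Example (the Java path is a placeholder; use your own):
# set_java_home conf/hadoop-env.sh /usr/lib/jvm/java
```

Because the line is appended, it takes effect even if an earlier JAVA_HOME line in the file is commented out or set to something else.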

Try the following command:
$ bin/hadoop
The usage documentation for the hadoop script will be displayed.

Now you can start the Hadoop cluster in one of the following three supported modes:

Standalone (local) mode
Pseudo-distributed mode
Fully distributed mode

Standalone operation

By default, Hadoop is configured as a stand-alone Java process that runs in a non-distributed mode. This is very helpful for debugging.

The following example copies the unpacked conf directory to use as input, then finds and displays every match of the given regular expression. The output is written to the specified output directory.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*
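
To get a feel for what the grep example computes, the same counting can be approximated with ordinary shell tools on the local input directory. This is a rough local equivalent, not how Hadoop implements the job; the function name `count_matches` is invented for this sketch.

```shell
# Count regex matches across all files in a directory and print
# "count<TAB>match", most frequent first.
# (count_matches is a hypothetical helper, not part of Hadoop.)
count_matches() {
    dir=$1
    pattern=$2
    grep -ohE -e "$pattern" "$dir"/* | sort | uniq -c | sort -rn |
        awk '{ print $1 "\t" $2 }'
}

# Example, using the input directory created above:
# count_matches input 'dfs[a-z.]+'
```

Comparing this output with the contents of the output directory is a quick sanity check that the Hadoop job ran over the files you expected.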

Pseudo-distributed operation

Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.

Configuration

Use the following conf/hadoop-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Passphraseless SSH setup

Now check that you can log in to localhost with ssh without entering a password:
$ ssh localhost

If you cannot log in to localhost with ssh without entering a password, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
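
If ssh still asks for a password after the key has been added, one common cause is file permissions: with its default StrictModes setting, sshd ignores an authorized_keys file that other users could modify. The sketch below tightens the permissions and then tests the login non-interactively. Both function names are made up for this guide, and the login check can also fail for unrelated reasons, for example if the host key for localhost has not been accepted yet.

```shell
# Restrict an SSH directory and its authorized_keys file to the owner.
# $1: SSH directory (normally ~/.ssh).
# (Both helpers here are hypothetical, not part of Hadoop or OpenSSH.)
tighten_ssh_perms() {
    ssh_dir=$1
    chmod 700 "$ssh_dir"
    if [ -f "$ssh_dir/authorized_keys" ]; then
        chmod 600 "$ssh_dir/authorized_keys"
    fi
}

# Return 0 if ssh to the host succeeds without any password prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
can_login_without_password() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example:
# tighten_ssh_perms ~/.ssh
# can_login_without_password localhost && echo "passwordless login works"
```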

Execution

Format a new distributed file system:
$ bin/hadoop namenode -format

Start the Hadoop daemons:
$ bin/start-all.sh

The Hadoop daemon logs are written to the ${HADOOP_LOG_DIR} directory (the default is ${HADOOP_HOME}/logs).
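
The default described above can be expressed directly with shell parameter expansion. The following sketch resolves the effective log directory; the function name `hadoop_log_dir` is an assumption for this example, and it presumes HADOOP_HOME is set in your shell.

```shell
# Print the directory Hadoop daemons log to: HADOOP_LOG_DIR if set,
# otherwise the logs directory under HADOOP_HOME.
# (hadoop_log_dir is a hypothetical helper, not part of Hadoop.)
hadoop_log_dir() {
    echo "${HADOOP_LOG_DIR:-${HADOOP_HOME}/logs}"
}

# Example: list the log files, newest first.
# ls -t "$(hadoop_log_dir)"
```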

Browse the web interfaces for the NameNode and the JobTracker. By default they are available at:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

Copy the input files into the distributed file system:
$ bin/hadoop fs -put conf input

Run one of the sample programs provided with the release:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

Examine the output files:

Copy the output files from the distributed file system to the local file system and examine them:
$ bin/hadoop fs -get output output
$ cat output/*

Or

View the output files on the distributed file system:
$ bin/hadoop fs -cat output/*

When you are done, stop the daemons:
$ bin/stop-all.sh
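
The pseudo-distributed steps above can be collected into one function, sketched below. The names `run` and `pseudo_distributed_demo` and the DRY_RUN variable are inventions for this guide. The sketch assumes it is run from the Hadoop installation directory, that the one-time namenode format has already been done, and that no input or output directory exists yet in HDFS. A real script would probably also need to wait until HDFS accepts writes after the daemons start.

```shell
# Run a command, or only print it when DRY_RUN=1.
# (run, pseudo_distributed_demo and DRY_RUN are hypothetical names.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# One pass through the pseudo-distributed example: start the daemons,
# load the input, run the grep job, show the result, stop the daemons.
pseudo_distributed_demo() {
    run bin/start-all.sh &&
    run bin/hadoop fs -put conf input &&
    run bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' &&
    run bin/hadoop fs -cat 'output/*'
    rc=$?
    run bin/stop-all.sh   # stop the daemons even if a step failed
    return "$rc"
}

# Example: preview the commands without executing them.
# DRY_RUN=1 pseudo_distributed_demo
```

Previewing with DRY_RUN=1 first is a cheap way to confirm the command sequence before anything touches the cluster.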

Fully distributed operation

Information on setting up a fully distributed, non-trivial cluster can be found here.

Java and JNI are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
