Hadoop Single Node Cluster Setup

This article describes how to quickly set up a single-node Hadoop environment so that you can run some simple jobs with MapReduce and the Hadoop Distributed File System (HDFS).

Prerequisites
Platform support
GNU/Linux is the supported platform for development and production. Hadoop has been demonstrated on GNU/Linux clusters of up to 2000 nodes.
Hadoop is also supported on Windows, but the steps below apply only to Linux. For how to configure Hadoop on Windows, please see here.
Required software
For the Linux platform, the required software is as follows:

Java™. Refer to HadoopJavaVersions for the recommended Java versions.
ssh must be installed, and the sshd service must be running, so that the Hadoop scripts can manage the remote Hadoop daemons.
Install software
If the required software is not installed on your cluster, install it first.

Take Ubuntu Linux as an example:

Install Java
Refer to a guide on installing the JDK correctly.
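If you just need a working JDK on Ubuntu, the OpenJDK packages from the standard repositories are usually sufficient (a minimal sketch; the exact package name, such as openjdk-8-jdk, depends on your Ubuntu release):

$ sudo apt-get update
$ sudo apt-get install openjdk-8-jdk
$ java -version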

Install ssh

$ sudo apt-get install ssh
$ sudo apt-get install rsync
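To confirm that the sshd service is actually running (on a systemd-based Ubuntu the service is usually named ssh), you can check its status:

$ sudo systemctl status ssh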
Download
Download the distribution of Hadoop here.
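As an example, the release used below can also be fetched from the command line; the Apache archive URL shown here is one common location for it, but any Apache mirror works:

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz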

Preparation for running Hadoop cluster
Unzip the downloaded installation file (I downloaded hadoop-2.7.3.tar.gz here):

$ tar zxvf hadoop-2.7.3.tar.gz
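All later commands are run from inside the unpacked directory, so change into it (the directory name matches the tarball):

$ cd hadoop-2.7.3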
Edit the etc/hadoop/hadoop-env.sh file and change JAVA_HOME to your Java installation directory:

export JAVA_HOME="Your Java installation directory"
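For example, with the Ubuntu OpenJDK 8 package the line would typically look like the following; the exact path is an assumption, so verify it on your machine (e.g. with readlink -f $(which java)):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64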
Try the following command:

$ bin/hadoop
This will show the usage of the hadoop script.
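As an additional sanity check, you can print the version of the release you unpacked:

$ bin/hadoop version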

Now the preparations are done. You can set up the cluster in any of the following modes:

Local (stand-alone) mode
Pseudo-distributed mode
Fully distributed mode
Stand-alone mode
By default, Hadoop runs in non-distributed mode as a single Java process, which is very useful for debugging.

The following example demonstrates how to run Hadoop in stand-alone mode (all operation paths are relative to the Hadoop installation directory):

# Create an input directory
$ mkdir input
# Copy some files into the input directory as input
$ cp etc/hadoop/*.xml input
# Run a MapReduce job that takes every file in the input directory as input and writes each string matching the pattern 'dfs[a-z.]+' to the output directory
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
# View the output
$ cat output/*
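Note that MapReduce refuses to run if the output directory already exists, so remove it before rerunning the example:

$ rm -rf output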
Pseudo-distributed mode
Hadoop can also run in pseudo-distributed mode on a single node, in which case each Hadoop daemon runs as a separate Java process.

Configuration
Make the following configuration:
etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Set up passwordless ssh login
Try the following command:

$ ssh localhost
If you cannot connect to localhost via ssh without entering a password, execute the following commands:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
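You can then repeat the test to confirm that the login no longer prompts for a password, and log out again before continuing:

$ ssh localhost
$ exit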
Execution
The following steps show how to execute a MapReduce job locally.

Format the file system:
$ bin/hdfs namenode -format
Start the NameNode daemon and DataNode daemon:
$ sbin/start-dfs.sh
The Hadoop daemons write their logs to the $HADOOP_LOG_DIR directory ($HADOOP_HOME/logs by default).
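You can also verify that the daemons came up using the JDK's jps tool; on a healthy pseudo-distributed setup it should list at least a NameNode, a DataNode and a SecondaryNameNode process:

$ jps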

View the running status of Hadoop through the web interface in a browser; by default it is available at the following address:
NameNode - http://localhost:50070/
Create the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
Upload files from local to HDFS:
$ bin/hdfs dfs -put etc/hadoop input
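A quick listing confirms that the files arrived in HDFS (the relative input path resolves to /user/<username>/input):

$ bin/hdfs dfs -ls input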
Run the example provided by Hadoop:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
Download the output files from HDFS to the local filesystem and verify the results:
$ bin/hdfs dfs -get output output
$ cat output/*
Or view the output file directly on HDFS:

$ bin/hdfs dfs -cat output/*
After completion, the daemons can be stopped with the following command:
$ sbin/stop-dfs.sh
Run on YARN architecture
You can run MapReduce jobs on YARN in pseudo-distributed mode with very little extra configuration. In this mode, the ResourceManager and NodeManager daemons are started in addition to the HDFS daemons.

The following steps assume that the first four steps of the previous section (formatting HDFS, starting the HDFS daemons, creating the user directory, and uploading the input files) have already been executed.

Make the following configuration:
etc/hadoop/mapred-site.xml:
<configuration>
       <property>
           <name>mapreduce.framework.name</name>
           <value>yarn</value>
       </property>
</configuration>
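In the 2.7.3 distribution this file is usually not present out of the box; if it is missing, create it from the template that ships with Hadoop before adding the property above:

$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml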
etc/hadoop/yarn-site.xml:

<configuration>
       <property>
           <name>yarn.nodemanager.aux-services</name>
           <value>mapreduce_shuffle</value>
       </property>
</configuration>
Start the ResourceManager daemon and NodeManager daemon:
$ sbin/start-yarn.sh
Check in the browser whether the ResourceManager is running normally; by default it is on port 8088:
ResourceManager - http://localhost:8088/
Execute a MapReduce job, for example as sketched below.
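The grep example from the previous section works here as well; since its output directory still exists in HDFS from the earlier run, remove it first, otherwise the job fails:

$ bin/hdfs dfs -rm -r output
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'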
After execution, the daemons can be stopped with the following command:

$ sbin/stop-yarn.sh
