Build a Hadoop environment on Ubuntu (standalone mode + pseudo-distributed mode)

I have been teaching myself Hadoop recently. Today I spent some time setting up a development environment and writing up my notes.

First, you need to understand Hadoop's running modes:

Standalone mode
Standalone mode is Hadoop's default mode. When the Hadoop package is unpacked for the first time, Hadoop knows nothing about the hardware environment and conservatively chooses the minimal configuration: all three XML configuration files are empty. With empty configuration files, Hadoop runs entirely on the local machine. Because it does not need to interact with other nodes, standalone mode does not use HDFS and does not load any Hadoop daemons. This mode is mainly used for developing and debugging the application logic of MapReduce programs.
Pseudo-distributed mode
In pseudo-distributed mode, Hadoop runs on a single-node "cluster" where all daemons run on the same machine. This mode adds debugging capability on top of standalone mode, allowing you to inspect memory usage, HDFS input and output, and interaction with the other daemons.
Fully distributed mode
The Hadoop daemons run on a cluster of machines.

Version: Ubuntu 10.04.4, hadoop 1.0.2

1. Add a hadoop user to the System user

Before installation, add a user named hadoop to the system for hadoop testing.

~$ sudo addgroup hadoop
~$ sudo adduser --ingroup hadoop hadoop

The hadoop user we just created does not have administrator privileges, so we need to add it to the administrator group:

 ~$ sudo usermod -aG admin hadoop
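
To double-check and then continue the installation as the hadoop user, the standard groups and su commands can be used (a quick check; on Ubuntu 10.04 the admin group is what grants sudo rights):

 ~$ groups hadoop        # should list both hadoop and admin
 ~$ su - hadoop          # switch to the hadoop user for the remaining steps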


2. Install SSH

Because Hadoop uses SSH for communication between nodes, install SSH first:

 ~$ sudo apt-get install openssh-server

After the SSH installation is complete, start the service:

 ~$ sudo /etc/init.d/ssh start 

After the service is started, run the following command to check whether the service is correctly started:

  ~$ ps -e | grep ssh
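
If the service started correctly, the output contains an sshd entry. The PID below is only illustrative:

  1224 ?        00:00:00 sshd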

SSH is a secure protocol and normally requires a password on every login. To allow Hadoop to log in without a password, generate an RSA key pair (a private key and a public key):

hadoop@scgm-ProBook:~$ ssh-keygen -t rsa -P ""

Because I already have a private key, I am prompted to overwrite it. On the first run you are simply prompted for a passphrase; just press Enter. Two files are then generated under /home/{username}/.ssh: id_rsa and id_rsa.pub. The former is the private key and the latter is the public key. Now append the public key to authorized_keys (authorized_keys stores the public keys of all clients that are allowed to log in over SSH as the current user):

 ~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
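
If passwordless login still prompts for a password later on, sshd may be rejecting the key because the permissions on ~/.ssh are too open. Tightening them as follows usually fixes it (a standard OpenSSH requirement, not specific to Hadoop):

 ~$ chmod 700 ~/.ssh
 ~$ chmod 600 ~/.ssh/authorized_keys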

Now log in over SSH to confirm that no password is required:

~$ ssh localhost

Logout:

 ~$ exit

Log in a second time:

 ~$ ssh localhost

Logout:

~$ exit

In this way, you do not need to enter the password for logon.

3. Install Java

 ~$ sudo apt-get install openjdk-6-jdk
 ~$ java -version

4. Install hadoop 1.0.2

Download the hadoop source file from the official website. Select hadoop 1.0.2.

Decompress the package and move it to the desired directory. I put it in /usr/local/hadoop:

~$ sudo tar xzf hadoop-1.0.2.tar.gz
~$ sudo mv hadoop-1.0.2 /usr/local/hadoop

Make sure the directory is owned by the hadoop user, so that all subsequent operations can be carried out as hadoop:

~$ sudo chown -R hadoop:hadoop /usr/local/hadoop


5. Set the hadoop-env.sh (Java installation path)

Go to the Hadoop directory, open the conf directory, edit hadoop-env.sh, and add the following lines (the JAVA_HOME value depends on your machine's Java installation path):
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin

Source the file to make the environment variable configuration take effect:

~$ source /usr/local/hadoop/conf/hadoop-env.sh

Now, the standalone mode of hadoop has been installed successfully.
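
As a quick sanity check (assuming the PATH export above has taken effect in the current shell), the hadoop command should now be available and report its version:

 ~$ hadoop version        # should print something like: Hadoop 1.0.2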


Run the wordcount example that ships with Hadoop to get a feel for the MapReduce process:

Create an input folder in the hadoop directory

~$ mkdir input

Copy all files in conf to the input folder:

~$ cp conf/* input 

Run the wordcount program and save the result to output.

~$ bin/hadoop jar hadoop-examples-1.0.2.jar wordcount input output

View the output:

 ~$ cat output/*

You will see each word from the files in conf along with its frequency.
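
Each output line is a word followed by a tab and its count. The exact words and numbers depend on the contents of your conf directory, so the lines below are purely illustrative:

 configuration   12
 property        9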

The following configuration is required for pseudo-distributed mode. Let's continue.

6. Set the *-site.xml files
Three files need to be configured here: core-site.xml, hdfs-site.xml, and mapred-site.xml, all under the /usr/local/hadoop/conf directory.
core-site.xml: configuration for the Hadoop core, such as the I/O settings shared by HDFS and MapReduce.
hdfs-site.xml: configuration for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml: configuration for the MapReduce daemons: the jobtracker and the tasktrackers.

First, create several folders in the Hadoop directory.

~/hadoop$ mkdir tmp
~/hadoop$ mkdir hdfs
~/hadoop$ mkdir hdfs/name
~/hadoop$ mkdir hdfs/data
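
Equivalently, the same directories can be created in a single command, since mkdir -p also creates the intermediate hdfs directory:

~/hadoop$ mkdir -p tmp hdfs/name hdfs/data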

Edit the three files:

core-site.xml:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/tmp</value>
    </property>
</configuration>

hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/usr/local/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/usr/local/hadoop/hdfs/data</value>
    </property>
</configuration>

mapred-site.xml:

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>

7. Format HDFS

With the steps above, the pseudo-distributed environment is configured. Before starting the Hadoop services, format the HDFS namenode:

~$ source /usr/local/hadoop/conf/hadoop-env.sh
~$ hadoop namenode -format
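
If formatting succeeds, the output ends with a message along these lines (the exact wording, path, and timestamps may differ):

 Storage directory /usr/local/hadoop/hdfs/name has been successfully formatted.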


8. Start hadoop

Then run the start-all.sh script to start all the Hadoop daemons, including the namenode, datanode, secondary namenode, jobtracker, and tasktracker.

hadoop@ubuntu:/usr/local/hadoop$ cd bin
hadoop@ubuntu:/usr/local/hadoop/bin$ start-all.sh

Use Java's jps command to list all running daemons and verify that the installation succeeded:

hadoop@ubuntu:/usr/local/hadoop$ jps

A list of running daemons should be displayed, indicating that startup succeeded.
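
On this pseudo-distributed setup the list should include the five Hadoop daemons plus jps itself; the process IDs below are only examples:

 2287 NameNode
 2398 DataNode
 2512 SecondaryNameNode
 2589 JobTracker
 2703 TaskTracker
 2750 Jps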

9. Check the running status
All the settings are complete and Hadoop has been started. You can now check whether the services are running normally. Hadoop provides web interfaces for monitoring the health of the cluster:
http://localhost:50030/ - Hadoop management interface (JobTracker)
http://localhost:50060/ - Hadoop TaskTracker status
http://localhost:50070/ - Hadoop DFS status
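
The pages can also be checked from the terminal. For example, if wget is available (it is on a default Ubuntu install), a reachable DFS web interface returns some HTML; this is only a quick reachability test, not part of the official setup:

 ~$ wget -qO- http://localhost:50070/ | head -n 5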

So far, Hadoop's pseudo-distributed mode has been installed successfully. Now run the Hadoop wordcount example again, this time in pseudo-distributed mode, to see the MapReduce process:

Note that this time the program reads from and writes to HDFS, and the files it creates live in HDFS as well:

First, create the input directory in DFS.

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -mkdir input

Copy the files in conf to the input directory in HDFS:

hadoop@ubuntu:/usr/local/hadoop$ hadoop dfs -copyFromLocal conf/* input
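
To confirm that the files actually landed in HDFS, list the input directory (dfs -ls is the standard listing command in Hadoop 1.x):

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls input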

Run wordcount in pseudo-distributed mode

hadoop@ubuntu:/usr/local/hadoop$ hadoop jar hadoop-examples-1.0.2.jar wordcount input output

You can watch the MapReduce job's progress printed to the console.

Display the output:

hadoop@ubuntu:/usr/local/hadoop$ hadoop dfs -cat output/*
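
If you want a local copy of the result, the output directory can also be pulled out of HDFS; the local path used here is just an example:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -get output ./wordcount-output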

When you are done with Hadoop, you can shut down the Hadoop daemons with the stop-all.sh script:

hadoop@ubuntu:/usr/local/hadoop$ bin/stop-all.sh 

10. Conclusion

Hadoop has been successfully set up on Ubuntu! I'm a little excited and can't wait to start some development work and dig deeper into how the Hadoop core is implemented. Onward!

PS: Both standalone and pseudo-distributed modes are used for development and debugging. A real Hadoop cluster runs in the third mode, fully distributed mode. To be continued.

This article references the posts of the following two authors. Thank you for sharing!

http://blog.sina.com.cn/s/blog_61ef49250100uvab.html

http://www.cnblogs.com/welbeckxu/category/346329.html
