Setting up a Hadoop environment on Ubuntu (standalone mode + pseudo-distributed mode)


I've been learning about Hadoop recently, and today I spent some time building a development environment and documenting the process.

First, a look at Hadoop's running modes:

Standalone mode
Standalone mode is Hadoop's default mode. When the Hadoop package is first unpacked, it knows nothing about the hardware environment, so it conservatively falls back to a minimal configuration: all 3 XML configuration files are empty. With empty configuration files, Hadoop runs entirely locally. Because there is no need to interact with other nodes, standalone mode uses neither HDFS nor any of the Hadoop daemons. This mode is mainly used for developing and debugging the application logic of MapReduce programs.
Pseudo-distributed mode
Pseudo-distributed mode runs Hadoop on a "single-node cluster" where all the daemons run on the same machine. This mode adds debugging capability on top of standalone mode, allowing you to examine memory usage, HDFS input/output, and the interactions between daemons.
Fully distributed mode
The Hadoop daemons run on a cluster of machines.


Versions: Ubuntu 10.04.4, Hadoop 1.0.2

1. Add a hadoop user to the system

One thing to do before installing: add a user named hadoop to the system for running the Hadoop tests.

~$ sudo addgroup hadoop
~$ sudo adduser --ingroup hadoop hadoop

The newly added hadoop user does not have administrator rights, so we add it to the admin group:

~$ sudo usermod -aG admin hadoop
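To double-check (optional), you can list the groups the hadoop user now belongs to:

~$ groups hadoop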


2. Install SSH

Because Hadoop communicates between nodes over SSH, install SSH first:

~$ sudo apt-get install openssh-server

After the SSH installation completes, start the service:
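On Ubuntu 10.04 the service can typically be started through the openssh init script (assuming the default package layout):

~$ sudo /etc/init.d/ssh start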

After it starts, you can check that the service is running with the following command:
  ~$ ps -e | grep ssh


SSH, as a secure communication protocol, normally requires a password on every login, so we set up password-free login by generating a private/public key pair:

hadoop@scgm-probook:~$ ssh-keygen -t rsa -P ""


Because I already have a key pair, I am asked whether to overwrite the existing private key. The first time you run it you are simply prompted for the file location; press Enter to accept the default. Two files are then generated under /home/{username}/.ssh: id_rsa and id_rsa.pub, the former being the private key and the latter the public key. Now append the public key to authorized_keys (authorized_keys stores the public keys of all client users allowed to log in over SSH as the current user):

~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now log in over SSH to confirm that no password is needed:
~$ ssh localhost

Log out:

~$ exit
Log in a second time:
~$ ssh localhost


Log out:

~$ exit
From now on, logging in no longer requires a password.
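If you are still prompted for a password, a common cause is overly loose permissions on the key files; sshd is strict about this, so you may need to tighten them:

~$ chmod 700 ~/.ssh
~$ chmod 600 ~/.ssh/authorized_keys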


3. Install Java

~$ sudo apt-get install openjdk-6-jdk
 ~$ java -version



4. Install Hadoop 1.0.2

Download the Hadoop release from the official website; here I chose Hadoop 1.0.2.

Unpack it and move it to whichever directory you want; I put it in /usr/local/hadoop.

~$ sudo tar xzf hadoop-1.0.2.tar.gz
~$ sudo mv hadoop-1.0.2 /usr/local/hadoop
Make sure the directory is owned by the hadoop user, so that all later steps can be done as that user:
~$ sudo chown -R hadoop:hadoop /usr/local/hadoop


5. Set hadoop-env.sh (Java installation path)

Enter the Hadoop directory, open conf/hadoop-env.sh, and add the following lines:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk (adjust to your machine's Java installation path)
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin



Then source the file so that the environment variables take effect:

~$ source /usr/local/hadoop/conf/hadoop-env.sh

At this point, Hadoop's standalone mode has been installed successfully.
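As a quick sanity check (optional), you can ask Hadoop to print its version; with the PATH set above in effect, this should report 1.0.2:

~$ hadoop version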



Now run Hadoop's wordcount example to get a feel for the MapReduce process:

Create a new input folder in the Hadoop directory:

~$ mkdir input
Copy all the files in conf to the input folder:
~$ cp conf/* input
Run the wordcount program and save the results to output:
~$ bin/hadoop jar hadoop-examples-1.0.2.jar wordcount input output


View the results:

~$ cat output/*
You will see the words from all the files in conf, each counted with its frequency.


The following configuration is needed to continue on to pseudo-distributed mode.

6. Set *-site.xml
Here you need to edit 3 files: core-site.xml, hdfs-site.xml, and mapred-site.xml, all in the /usr/local/hadoop/conf directory.
core-site.xml: Hadoop core configuration items, such as I/O settings common to HDFS and MapReduce.
hdfs-site.xml: configuration items for the HDFS daemons, i.e. the NameNode, secondary NameNode, and DataNodes.
mapred-site.xml: configuration items for the MapReduce daemons, i.e. the JobTracker and TaskTrackers.

First create a few new folders in the Hadoop directory:

~/hadoop$ mkdir tmp
~/hadoop$ mkdir hdfs
~/hadoop$ mkdir hdfs/name
~/hadoop$ mkdir hdfs/data


Next, edit the three files:

core-site.xml:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/tmp</value>
    </property>
</configuration>
hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/usr/local/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/usr/local/hadoop/hdfs/data</value>
    </property>
</configuration>
mapred-site.xml:

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>


7. Format HDFS
With the above steps, the environment is configured. Next, before launching the related Hadoop services (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker), format the NameNode:

~$ source /usr/local/hadoop/conf/hadoop-env.sh
~$ hadoop namenode -format



8. Start Hadoop

Then execute start-all.sh to start all the services; the script loads all the daemons, including the NameNode and DataNode.

hadoop@ubuntu:/usr/local/hadoop$ cd bin
hadoop@ubuntu:/usr/local/hadoop/bin$ ./start-all.sh


List all the running daemons with the Java jps command to verify that the installation succeeded:

hadoop@ubuntu:/usr/local/hadoop$ jps
If a list like the following appears, everything started successfully.
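(The process IDs below are only illustrative; what matters is that all five daemons, plus Jps itself, show up.)

2387 NameNode
2524 DataNode
2663 SecondaryNameNode
2750 JobTracker
2891 TaskTracker
3012 Jps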


9. Check running status
All the settings are complete and Hadoop is running; you can now check whether the services are working properly through the web interfaces Hadoop provides for monitoring the health of the cluster:
http://localhost:50030/ - Hadoop management interface (JobTracker)
http://localhost:50060/ - Hadoop TaskTracker status
http://localhost:50070/ - Hadoop DFS status


At this point, Hadoop's pseudo-distributed mode has been installed successfully. Let's run the wordcount example once again, this time in pseudo-distributed mode, to see the MapReduce process:

Note that the program now runs against the distributed file system (HDFS), so the directories and files it uses are created in HDFS rather than on the local filesystem:

First create the input directory in HDFS:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -mkdir input
Copy the files from conf into input in HDFS:
hadoop@ubuntu:/usr/local/hadoop$ hadoop dfs -copyFromLocal conf/* input
Run wordcount in pseudo-distributed mode:
hadoop@ubuntu:/usr/local/hadoop$ hadoop jar hadoop-examples-1.0.2.jar wordcount input output
The map and reduce progress of the job is printed to the console as it runs.


Show the output results:

hadoop@ubuntu:/usr/local/hadoop$ hadoop dfs -cat output/*
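If you want a local copy of the results, you can also pull them out of HDFS; the local destination path below is just an example:

hadoop@ubuntu:/usr/local/hadoop$ hadoop dfs -get output /tmp/wordcount-output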


When you are done with Hadoop, you can shut down all the Hadoop daemons with the stop-all.sh script.
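For example, from the Hadoop directory:

hadoop@ubuntu:/usr/local/hadoop$ bin/stop-all.sh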


10. Conclusion

Hadoop is now set up successfully on Ubuntu. I'm a little excited and can't wait to start doing some related development and dig deeper into Hadoop's core implementation. Keep it up!

PS: Standalone mode and pseudo-distributed mode are both meant for development and debugging. A real Hadoop cluster runs in the third mode, fully distributed mode. To be continued.


This article drew on the posts of the following two authors; thanks for sharing.

http://blog.sina.com.cn/s/blog_61ef49250100uvab.html

http://www.cnblogs.com/welbeckxu/category/346329.html
