Setting up a Hadoop environment on Ubuntu (standalone mode + pseudo-distributed mode)


I've been learning about Hadoop recently, and today I spent some time building a development environment and documenting the process.

First, a look at Hadoop's running modes:

Standalone mode
Standalone mode is Hadoop's default mode. When the Hadoop package is first unpacked, it knows nothing about the hardware environment, so it conservatively falls back to a minimal configuration: all 3 XML configuration files are empty. With empty configuration files, Hadoop runs entirely locally. Because there is no need to interact with other nodes, standalone mode uses neither HDFS nor any of the Hadoop daemons. This mode is mainly used for developing and debugging the application logic of MapReduce programs.
Pseudo-distributed mode
Pseudo-distributed mode runs Hadoop on a "single-node cluster" where all the daemons run on the same machine. This mode adds debugging capability on top of standalone mode, allowing you to examine memory usage, HDFS input/output, and the interactions between daemons.
Fully distributed mode
The Hadoop daemons run on a cluster of machines.


Versions: Ubuntu 10.04.4, Hadoop 1.0.2

1. Add a hadoop user to the system

One thing to do before installing: add a user named hadoop to the system for running the Hadoop tests.

~$ sudo addgroup hadoop
~$ sudo adduser --ingroup hadoop hadoop

The newly added hadoop user does not have administrator rights, so we add it to the admin group:

~$ sudo usermod -aG admin hadoop
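To double-check (optional), you can list the groups the hadoop user now belongs to:

~$ groups hadoop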


2. Install SSH

Because Hadoop communicates between nodes over SSH, install SSH first:

~$ sudo apt-get install openssh-server

After the SSH installation completes, start the service:
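On Ubuntu 10.04 the service can typically be started through the openssh init script (assuming the default package layout):

~$ sudo /etc/init.d/ssh start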

After it starts, you can check that the service is running with the following command:
  ~$ ps -e | grep ssh


SSH, as a secure communication protocol, normally requires a password on every login, so we set up password-free login by generating a private/public key pair:

hadoop@scgm-probook:~$ ssh-keygen -t rsa -P ""


Because I already have a key pair, I am asked whether to overwrite the existing private key. The first time you run it you are simply prompted for the file location; press Enter to accept the default. Two files are then generated under /home/{username}/.ssh: id_rsa and id_rsa.pub, the former being the private key and the latter the public key. Now append the public key to authorized_keys (authorized_keys stores the public keys of all client users allowed to log in over SSH as the current user):

~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now log in over SSH to confirm that no password is needed:
~$ ssh localhost

Log out:

~$ exit
Log in a second time:
~$ ssh localhost


Log out:

~$ exit
From now on, logging in no longer requires a password.
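If you are still prompted for a password, a common cause is overly loose permissions on the key files; sshd is strict about this, so you may need to tighten them:

~$ chmod 700 ~/.ssh
~$ chmod 600 ~/.ssh/authorized_keys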


3. Install Java

~$ sudo apt-get install openjdk-6-jdk
 ~$ java -version



4. Install Hadoop 1.0.2

Download the Hadoop release from the official website; here I chose Hadoop 1.0.2.

Unpack it and move it to whichever directory you want; I put it in /usr/local/hadoop.

~$ sudo tar xzf hadoop-1.0.2.tar.gz
~$ sudo mv hadoop-1.0.2 /usr/local/hadoop
Make sure the directory is owned by the hadoop user, so that all later steps can be done as that user:
~$ sudo chown -R hadoop:hadoop /usr/local/hadoop


5. Set hadoop-env.sh (Java installation path)

Enter the Hadoop directory, open conf/hadoop-env.sh, and add the following lines:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk (adjust to your machine's Java installation path)
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin



Then source the file so that the environment variables take effect:

~$ source /usr/local/hadoop/conf/hadoop-env.sh

At this point, Hadoop's standalone mode has been installed successfully.
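As a quick sanity check (optional), you can ask Hadoop to print its version; with the PATH set above in effect, this should report 1.0.2:

~$ hadoop version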



Now run Hadoop's wordcount example to get a feel for the MapReduce process:

Create a new input folder in the Hadoop directory:

~$ mkdir input
Copy all the files in conf to the input folder:
~$ cp conf/* input
Run the wordcount program and save the results to output:
~$ bin/hadoop jar hadoop-examples-1.0.2.jar wordcount input output


View the results:

~$ cat output/*
You will see the words from all the files in conf, each counted with its frequency.


The following configuration is needed to continue on to pseudo-distributed mode.

6. Set *-site.xml
Here you need to edit 3 files: core-site.xml, hdfs-site.xml, and mapred-site.xml, all in the /usr/local/hadoop/conf directory.
core-site.xml: Hadoop core configuration items, such as I/O settings common to HDFS and MapReduce.
hdfs-site.xml: configuration items for the HDFS daemons, i.e. the NameNode, secondary NameNode, and DataNodes.
mapred-site.xml: configuration items for the MapReduce daemons, i.e. the JobTracker and TaskTrackers.

First create a few new folders in the Hadoop directory:

~/hadoop$ mkdir tmp
~/hadoop$ mkdir hdfs
~/hadoop$ mkdir hdfs/name
~/hadoop$ mkdir hdfs/data


Next, edit the three files:

core-site.xml:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/tmp</value>
    </property>
</configuration>
hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/usr/local/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/usr/local/hadoop/hdfs/data</value>
    </property>
</configuration>
mapred-site.xml:

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>


7. Format HDFS
With the above steps, the environment is configured. Next, before launching the related Hadoop services (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker), format the NameNode:

~$ source /usr/local/hadoop/conf/hadoop-env.sh
~$ hadoop namenode -format



8. Start Hadoop

Then execute start-all.sh to start all the services; the script loads all the daemons, including the NameNode and DataNode.

hadoop@ubuntu:/usr/local/hadoop$ cd bin
hadoop@ubuntu:/usr/local/hadoop/bin$ ./start-all.sh


List all the running daemons with the Java jps command to verify that the installation succeeded:

hadoop@ubuntu:/usr/local/hadoop$ jps
If a list like the following appears, everything started successfully.
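(The process IDs below are only illustrative; what matters is that all five daemons, plus Jps itself, show up.)

2387 NameNode
2524 DataNode
2663 SecondaryNameNode
2750 JobTracker
2891 TaskTracker
3012 Jps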


9. Check running status
All the settings are complete and Hadoop is running; you can now check whether the services are working properly through the web interfaces Hadoop provides for monitoring the health of the cluster:
http://localhost:50030/ - Hadoop management interface (JobTracker)
http://localhost:50060/ - Hadoop TaskTracker status
http://localhost:50070/ - Hadoop DFS status


At this point, Hadoop's pseudo-distributed mode has been installed successfully. Let's run the wordcount example once again, this time in pseudo-distributed mode, to see the MapReduce process:

Note that the program now runs against the distributed file system (HDFS), so the directories and files it uses are created in HDFS rather than on the local filesystem:

First create the input directory in HDFS:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -mkdir input
Copy the files from conf into input in HDFS:
hadoop@ubuntu:/usr/local/hadoop$ hadoop dfs -copyFromLocal conf/* input
Run wordcount in pseudo-distributed mode:
hadoop@ubuntu:/usr/local/hadoop$ hadoop jar hadoop-examples-1.0.2.jar wordcount input output
The map and reduce progress of the job is printed to the console as it runs.


Show the output results:

hadoop@ubuntu:/usr/local/hadoop$ hadoop dfs -cat output/*
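If you want a local copy of the results, you can also pull them out of HDFS; the local destination path below is just an example:

hadoop@ubuntu:/usr/local/hadoop$ hadoop dfs -get output /tmp/wordcount-output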


When you are done with Hadoop, you can shut down all the Hadoop daemons with the stop-all.sh script.
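For example, from the Hadoop directory:

hadoop@ubuntu:/usr/local/hadoop$ bin/stop-all.sh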


10. Conclusion

Hadoop is now set up successfully on Ubuntu. I'm a little excited and can't wait to start doing some related development and dig deeper into Hadoop's core implementation. Keep it up!

PS: Standalone mode and pseudo-distributed mode are both meant for development and debugging. A real Hadoop cluster runs in the third mode, fully distributed mode. To be continued.


This article drew on the posts of the following two authors; thanks for sharing.

http://blog.sina.com.cn/s/blog_61ef49250100uvab.html

http://www.cnblogs.com/welbeckxu/category/346329.html
