Install and configure Hadoop on Linux
Before installing Hadoop on Linux, you need to install two programs:
- JDK 1.6 or later;
- SSH (Secure Shell); we recommend that you install OpenSSH.
The following describes the reasons for installing these two programs:
Hadoop is developed using Java. JDK is required for compiling Hadoop and running MapReduce.
Hadoop uses SSH to start the daemon processes on each host listed in the slaves file, so SSH must be installed even for a pseudo-distributed installation (Hadoop does not distinguish between cluster and pseudo-distributed operation). In pseudo-distributed mode, Hadoop handles startup exactly as it does for a cluster, starting the processes on the hosts recorded in the conf/slaves file in order; the only difference is that in pseudo-distributed mode the slave is localhost (the machine itself). Therefore, SSH is required for pseudo-distributed Hadoop as well.
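For reference, in a Hadoop 1.x installation the conf/slaves file of a pseudo-distributed setup usually contains only the local host, for example (a minimal illustration; your file may differ):
$ cat conf/slaves
localhost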
1. Install JDK 1.7
A JDK may already be provided by the Linux distribution. If you do not intend to use the built-in version, uninstall it first.
(1) Uninstall the built-in JDK
View the built-in JDK:
# rpm -qa | grep gcj
The following information is displayed:
libgcj-4.1.2-44.el5
java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
Run the rpm -e --nodeps command to delete the packages found above:
# rpm -e --nodeps java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
(2) Uninstall the JDK version installed by rpm
View the installed JDK:
# rpm -qa | grep jdk
The following information is displayed:
java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
Uninstall:
# rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
(3) Install the JDK
First download the installation package from the Sun official website; the latest packages are at http://java.sun.com/javase/downloads/index.jsp
If you need an earlier version, you can find it at http://java.sun.com/products/archive/
Two package types are offered, for example jdk-6u7-linux-i586-rpm.bin and jdk-6u7-linux-i586.bin. The .bin file is a plain binary package, while the rpm is a Red Hat package, the standard installation format on Red Hat systems. The difference is that the rpm package is configured automatically during installation: generally lib is installed to /usr/lib and bin to /usr/bin, and even where it is not, soft links must be established under /usr/bin.
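If the links are not created automatically, for example when installing from the .bin package into a custom directory, you can create them yourself. The following is a sketch only, assuming the JDK was unpacked to /usr/java/jdk1.7.0_03:
# ln -s /usr/java/jdk1.7.0_03/bin/java /usr/bin/java
# ln -s /usr/java/jdk1.7.0_03/bin/javac /usr/bin/javac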
The following uses the newer jdk-7u3-linux-i586.rpm as an example:
Place the installation file in the /usr/java directory and modify its permissions. The commands are as follows (use the cd command to switch to that directory first):
# chmod +x jdk-7u3-linux-i586.rpm
Install the package:
# rpm -ivh jdk-7u3-linux-i586.rpm
(4) Configure Environment Variables
Modify the /etc/profile file and add:
export JAVA_HOME=/usr/java/jdk1.7.0_03
export PATH=$PATH:/usr/java/jdk1.7.0_03/bin
Save the file.
(5) Make the settings take effect
cd /etc
source profile
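Optionally, you can confirm that the variables are visible in the current shell before checking the Java version (a quick sanity check, assuming the paths from the example above):
echo $JAVA_HOME
which java
Both should point to the JDK directory configured in /etc/profile.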
(6) Verify that JDK is successfully installed
[root@localhost ~]# java -version
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) Server VM (build 21.0-b17, mixed mode)
2. Configure SSH password-free Login
First, make sure the network connection works. As root, modify the /etc/ssh/sshd_config file (this change is needed on both the client and the server):
#AuthorizedKeysFile .ssh/authorized_keys
Remove the # sign to enable it:
AuthorizedKeysFile .ssh/authorized_keys
If you need to log on as the root user through ssh, also remove the # sign before "#PermitRootLogin yes".
Still as root, restart the sshd service so the change takes effect:
/etc/rc.d/init.d/sshd restart
On the client, switch to the account that needs the ssh login and run:
ssh-keygen -t dsa
This generates a public/private key pair. Press Enter to save the key to the default file.
At the passphrase prompt, either enter a passphrase or press Enter to create none. A passphrase must contain at least five characters.
At the next prompt, repeat the passphrase, or press Enter again for none.
You can also run:
ssh-keygen -t dsa -P '' -f /home/account name/.ssh/id_dsa
Here ssh-keygen generates the key; -t (case sensitive) specifies the type of key to generate, and dsa means DSA key authentication, i.e. the key type; -P supplies the passphrase; -f specifies the file in which to save the generated key.
The public and private keys are generated in the .ssh directory of this account: id_dsa is the private key and id_dsa.pub is the public key.
On the server, switch to the account that will accept the ssh login.
If no public/private key pair has been set up on the server yet, first create one in the same way as on the client.
scp account name@client host name or IP address:/home/account name/.ssh/id_dsa.pub /home/account name/
This copies the public key of the client account to the server. Either the client's host name or its IP address can be used, but it must be consistent with the ssh login command. We recommend writing the host names and IP addresses into /etc/hosts and using host names (the same applies below).
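A sample /etc/hosts entry could look like the following (the IP addresses and host names are placeholders; substitute your own):
192.168.1.10    server1
192.168.1.11    client1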
Enter yes to add the client to known_hosts.
Then enter the password of the account on the client to complete the copy.
Append the content of id_dsa.pub to the authorized_keys file. The authorized_keys file name must match the setting in sshd_config:
cat /home/account name/id_dsa.pub >> /home/account name/.ssh/authorized_keys
Modify the permissions of the authorized_keys file; they must be 644 or stricter (for example 600), otherwise ssh password-less login will not work:
chmod 644 /home/account name/.ssh/authorized_keys
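In addition, sshd is usually strict about the permissions of the .ssh directory itself. If password-less login still fails, it may help to tighten it as well (a common setting, shown here for the same account):
chmod 700 /home/account name/.ssh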
On the client, switch to the account that logs in over ssh:
ssh server host name    (logs on to the server; if the account name on the server differs from that on the client, use:)
ssh account name@server host name
Enter yes to add the server to known_hosts.
If a passphrase was set when the public/private key pair was generated, you also need to enter the passphrase when logging in. For convenience, you can run ssh-add on the client and enter the passphrase once; after that, ssh logins no longer ask for the passphrase.
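A minimal sketch of using the agent, assuming ssh-agent is not already running in your session:
eval `ssh-agent`
ssh-add
After the passphrase has been entered once, subsequent ssh logins in this session do not prompt for it again.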
Verification:
[test@localhost ~]$ ssh -version
OpenSSH_4.3p2, OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008
Bad escape character 'rsion'.
It indicates that SSH has been installed successfully. Enter the following command:
ssh localhost
The following information is displayed:
[test@localhost .ssh]$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 2b:36:6b:23:4d:e4:71:2b:b0:79:69:36:0e:65:3b:0f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Last login: Tue Oct 30 13:08:33 2012 from localhost.localdomain
This indicates that the installation is successful. When you log on for the first time, you will be asked whether to continue connecting; enter yes to proceed.
In fact, whether password-less login is configured does not matter for the Hadoop installation itself. However, without it, every time you start Hadoop you have to enter a password to log on to the DataNode of each machine. Since Hadoop clusters usually consist of hundreds or thousands of machines, SSH password-less login is normally configured.
A way to understand public/private key pairs: a key pair is like a lock and a key; the public key is the lock and the private key is the key. The private key stays on the client and the public key is sent to the server. A server can hold many locks, while the client holds only one key. When the client logs on to the server over ssh, the server finds the lock issued by that client and asks the client for the matching key; if the key matches, the login succeeds, otherwise it fails. All of this applies to the same user; different users have different public/private key pairs, and each run of ssh-keygen generates a different pair.
Note:
- Enter superuser mode: type "su -"; the system asks for the superuser password, and after entering it you are in superuser mode.
- Add write permission to the file: enter the command "chmod u+w /etc/sudoers".
- Edit the /etc/sudoers file: enter the command "vim /etc/sudoers", press "i" to enter edit mode, find the line "root ALL=(ALL) ALL", add "xxx ALL=(ALL) ALL" below it (where xxx is your user name), then save and quit (press Esc and enter ":wq").
- Remove the write permission: enter the command "chmod u-w /etc/sudoers".
3. Install and run Hadoop
The roles Hadoop assigns to each node are as follows:
Hadoop divides hosts into roles from three perspectives. First, hosts are divided into master and slave. Second, from the HDFS perspective, hosts are divided into NameNode and DataNode (in a distributed file system, directory management is crucial, and the directory manager plays the role of the master; the NameNode is that directory manager). Third, from the MapReduce perspective, hosts are divided into JobTracker and TaskTracker (a job is usually split into multiple tasks, and from this perspective the relationship between the two is easy to understand).
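To make these roles concrete, in a small cluster the conf/masters and conf/slaves files simply list host names, one per line. The host names below are hypothetical; note that in Hadoop 1.x conf/masters designates the host that runs the secondary NameNode, while conf/slaves lists the DataNode/TaskTracker hosts:
$ cat conf/masters
master
$ cat conf/slaves
slave1
slave2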
Hadoop is available as the official Apache release and as the Cloudera distribution; Cloudera is a commercially supported version of Hadoop. The following describes how to install the official release.
Hadoop has three running modes: single-node (standalone) mode, pseudo-distributed mode, and cluster mode. At first glance, the first two do not reflect the advantages of cloud computing and have little significance in real applications, but they are still very useful when testing and debugging programs.
You can download the official release of Hadoop from the address below:
http://www.apache.org/dist/hadoop/core/
Download hadoop-1.0.4.tar.gz and decompress it into the user directory /home/[user]/:
tar -xzvf hadoop-1.0.4.tar.gz
- Single-node configuration
A single-node Hadoop installation requires no configuration. In this mode Hadoop is treated as a single Java process, which is often used for testing.
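As a quick test of standalone mode, you can run one of the bundled example jobs against local files (a sketch; the examples jar name below matches the 1.0.4 release layout and may differ in other versions):
$ cd /home/test/hadoop-1.0.4
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-examples-1.0.4.jar grep input output 'dfs[a-z.]+'
$ cat output/*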
- Pseudo-distributed configuration
Pseudo-distributed Hadoop can be regarded as a cluster with only one node; that node is both master and slave, both NameNode and DataNode, and both JobTracker and TaskTracker.
The pseudo-distributed configuration process is also very simple. You only need to modify several files, as shown below.
Go to the conf folder (under the decompressed directory) and modify the configuration file.
[test@localhost conf]$ pwd
/home/test/hadoop-1.0.4/conf
[test@localhost conf]$ ls hadoop-env.sh
hadoop-env.sh
[test@localhost conf]$ vim hadoop-env.sh
Add content:
export JAVA_HOME=/usr/java/jdk1.7.0
Specifies the JDK installation location
[test@localhost conf]$ pwd
/home/test/hadoop-1.0.4/conf
[test@localhost conf]$ ls core-site.xml
core-site.xml
Modify the file:
[test@localhost conf]$ vim core-site.xml
Add content:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
This is the core configuration file of hadoop. The address and port number of HDFS are configured here.
[test@localhost conf]$ ls hdfs-site.xml
hdfs-site.xml
Modify the file:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
This is the HDFS configuration in Hadoop. The default replication factor is 3; in the single-node version of Hadoop you need to change it to 1.
[test@localhost conf]$ ls mapred-site.xml
mapred-site.xml
[test@localhost conf]$ vim mapred-site.xml
Modify the file:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
This is the MapReduce configuration file in Hadoop, which configures the address and port of the JobTracker.
Note that if you are installing a version earlier than 0.20, there is only one configuration file, hadoop-site.xml.
Next, you need to format Hadoop's HDFS file system before starting Hadoop (just as in Windows, a newly partitioned volume always needs to be formatted). Enter the Hadoop folder and enter the following command:
[test@localhost hadoop-1.0.4]$ bin/hadoop namenode -format
12/11/01 00:20:50 INFO namenode.NameNode: STARTUP_MSG:
Re-format filesystem in /tmp/hadoop-test/dfs/name ? (Y or N) Y
12/11/01 00:20:55 INFO util.GSet: VM type = 32-bit
12/11/01 00:20:55 INFO util.GSet: 2% max memory = 17.77875 MB
12/11/01 00:20:55 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/11/01 00:20:55 INFO util.GSet: recommended=4194304, actual=4194304
12/11/01 00:20:55 INFO namenode.FSNamesystem: fsOwner=test
12/11/01 00:20:55 INFO namenode.FSNamesystem: supergroup=supergroup
12/11/01 00:20:55 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/11/01 00:20:55 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/11/01 00:20:55 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/11/01 00:20:55 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/11/01 00:20:56 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/11/01 00:20:56 INFO common.Storage: Storage directory /tmp/hadoop-test/dfs/name has been successfully formatted.
12/11/01 00:20:56 INFO namenode.NameNode: SHUTDOWN_MSG:
After the file system has been formatted, start Hadoop.
First, grant the user test the permission to use the hadoop folder:
[test@localhost ~]$ chown -hR test /home/test/hadoop-1.0.4
Enter the following command:
[test@localhost hadoop-1.0.4]$ bin/start-all.sh
starting namenode, logging to /home/test/hadoop-1.0.4/libexec/../logs/hadoop-test-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/test/hadoop-1.0.4/libexec/../logs/hadoop-test-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /home/test/hadoop-1.0.4/libexec/../logs/hadoop-test-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to /home/test/hadoop-1.0.4/libexec/../logs/hadoop-test-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to /home/test/hadoop-1.0.4/libexec/../logs/hadoop-test-tasktracker-localhost.localdomain.out
Use jps to view the started services:
[test@localhost ~]$ cd /home/test/hadoop-1.0.4
[test@localhost hadoop-1.0.4]$ jps
12657 SecondaryNameNode
12366 NameNode
12995 Jps
12877 TaskTracker
12739 JobTracker
12496 DataNode
Finally, verify that Hadoop has been installed successfully. Open a browser and enter the URLs:
http://localhost:50070/ (the HDFS web page)
http://localhost:50030/ (the MapReduce web page)
If both pages open, Hadoop has been installed successfully. For Hadoop, installing both MapReduce and HDFS is required; however, if necessary, you can start only HDFS or only MapReduce:
[test@localhost hadoop-1.0.4]$ bin/start-dfs.sh
[test@localhost hadoop-1.0.4]$ bin/start-mapred.sh
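Besides the web pages, you can also check the cluster from the command line once the daemons are running (a small sanity check; the directory name is arbitrary):
[test@localhost hadoop-1.0.4]$ bin/hadoop dfsadmin -report
[test@localhost hadoop-1.0.4]$ bin/hadoop fs -mkdir /test
[test@localhost hadoop-1.0.4]$ bin/hadoop fs -ls /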