Install and configure Hadoop on Linux
Before installing Hadoop on Linux, you need to install two programs:
- JDK 1.6 or later;
- SSH (Secure Shell); we recommend that you install OpenSSH.
The following describes the reasons for installing these two programs:
Hadoop is developed using Java. JDK is required for compiling Hadoop and running MapReduce.
Hadoop uses SSH to start the daemon processes on each host listed in the slaves file, so SSH must be installed even for a pseudo-distributed installation (Hadoop does not distinguish between cluster and pseudo-distributed operation). In pseudo-distributed mode, Hadoop handles startup exactly as it does for a cluster, starting the processes on the hosts recorded in the conf/slaves file in order; the only difference is that in pseudo-distributed mode the slave is localhost (the machine itself). Therefore, SSH is required for pseudo-distributed Hadoop as well.
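For reference, in a Hadoop 1.x installation the conf/slaves file of a pseudo-distributed setup usually contains only the local host, for example (a minimal illustration; your file may differ):
$ cat conf/slaves
localhost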
1. Install JDK 1.7
A JDK may already be provided by the Linux distribution. If you do not intend to use the built-in version, uninstall it first.
(1) Uninstall the built-in JDK
View the built-in JDK:
# rpm -qa | grep gcj
The following information is displayed:
libgcj-4.1.2-44.el5
java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
Run the rpm -e --nodeps command to delete the packages found above:
# rpm -e --nodeps java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
(2) Uninstall the JDK version installed by rpm
View the installed JDK:
# rpm -qa | grep jdk
The following information is displayed:
java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
Uninstall:
# rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
(3) Install the JDK
First download the installation package from the Sun official website; the latest packages are at http://java.sun.com/javase/downloads/index.jsp
If you need an earlier version, you can find it at http://java.sun.com/products/archive/
Two package types are offered, for example jdk-6u7-linux-i586-rpm.bin and jdk-6u7-linux-i586.bin. The .bin file is a plain binary package, while the rpm is a Red Hat package, the standard installation format on Red Hat systems. The difference is that the rpm package is configured automatically during installation: generally lib is installed to /usr/lib and bin to /usr/bin, and even where it is not, soft links must be established under /usr/bin.
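If the links are not created automatically, for example when installing from the .bin package into a custom directory, you can create them yourself. The following is a sketch only, assuming the JDK was unpacked to /usr/java/jdk1.7.0_03:
# ln -s /usr/java/jdk1.7.0_03/bin/java /usr/bin/java
# ln -s /usr/java/jdk1.7.0_03/bin/javac /usr/bin/javac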
The following uses the newer jdk-7u3-linux-i586.rpm as an example:
Place the installation file in the /usr/java directory and modify its permissions. The commands are as follows (use the cd command to switch to that directory first):
# chmod +x jdk-7u3-linux-i586.rpm
Install the package:
# rpm -ivh jdk-7u3-linux-i586.rpm
(4) Configure Environment Variables
Modify the /etc/profile file and add:
export JAVA_HOME=/usr/java/jdk1.7.0_03
export PATH=$PATH:/usr/java/jdk1.7.0_03/bin
Save the file.
(5) Make the settings take effect
cd /etc
source profile
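Optionally, you can confirm that the variables are visible in the current shell before checking the Java version (a quick sanity check, assuming the paths from the example above):
echo $JAVA_HOME
which java
Both should point to the JDK directory configured in /etc/profile.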
(6) Verify that JDK is successfully installed
[root@localhost ~]# java -version
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) Server VM (build 21.0-b17, mixed mode)
2. Configure SSH password-free Login
First, make sure the network connection works. As root, modify the /etc/ssh/sshd_config file (this change is needed on both the client and the server):
#AuthorizedKeysFile .ssh/authorized_keys
Remove the # sign to enable it:
AuthorizedKeysFile .ssh/authorized_keys
If you need to log on as the root user through ssh, also remove the # sign before "#PermitRootLogin yes".
Still as root, restart the sshd service so the change takes effect:
/etc/rc.d/init.d/sshd restart
On the client, switch to the account that needs the ssh login and run:
ssh-keygen -t dsa
This generates a public/private key pair. Press Enter to save the key to the default file.
At the passphrase prompt, either enter a passphrase or press Enter to create none. A passphrase must contain at least five characters.
At the next prompt, repeat the passphrase, or press Enter again for none.
You can also run:
ssh-keygen -t dsa -P '' -f /home/account name/.ssh/id_dsa
Here ssh-keygen generates the key; -t (case sensitive) specifies the type of key to generate, and dsa means DSA key authentication, i.e. the key type; -P supplies the passphrase; -f specifies the file in which to save the generated key.
The public and private keys are generated in the .ssh directory of this account: id_dsa is the private key and id_dsa.pub is the public key.
On the server, switch to the account that will accept the ssh login.
If no public/private key pair has been set up on the server yet, first create one in the same way as on the client.
scp account name@client host name or IP address:/home/account name/.ssh/id_dsa.pub /home/account name/
This copies the public key of the client account to the server. Either the client's host name or its IP address can be used, but it must be consistent with the ssh login command. We recommend writing the host names and IP addresses into /etc/hosts and using host names (the same applies below).
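A sample /etc/hosts entry could look like the following (the IP addresses and host names are placeholders; substitute your own):
192.168.1.10    server1
192.168.1.11    client1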
Enter yes to add the client to known_hosts.
Then enter the password of the account on the client to complete the copy.
Append the content of id_dsa.pub to the authorized_keys file. The authorized_keys file name must match the setting in sshd_config:
cat /home/account name/id_dsa.pub >> /home/account name/.ssh/authorized_keys
Modify the permissions of the authorized_keys file; they must be 644 or stricter (for example 600), otherwise ssh password-less login will not work:
chmod 644 /home/account name/.ssh/authorized_keys
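In addition, sshd is usually strict about the permissions of the .ssh directory itself. If password-less login still fails, it may help to tighten it as well (a common setting, shown here for the same account):
chmod 700 /home/account name/.ssh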
On the client, switch to the account that logs in over ssh:
ssh server host name    (logs on to the server; if the account name on the server differs from that on the client, use:)
ssh account name@server host name
Enter yes to add the server to known_hosts.
If a passphrase was set when the public/private key pair was generated, you also need to enter the passphrase when logging in. For convenience, you can run ssh-add on the client and enter the passphrase once; after that, ssh logins no longer ask for the passphrase.
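A minimal sketch of using the agent, assuming ssh-agent is not already running in your session:
eval `ssh-agent`
ssh-add
After the passphrase has been entered once, subsequent ssh logins in this session do not prompt for it again.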
Verification:
[test@localhost ~]$ ssh -version
OpenSSH_4.3p2, OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008
Bad escape character 'rsion'.
It indicates that SSH has been installed successfully. Enter the following command:
ssh localhost
The following information is displayed:
[test@localhost .ssh]$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 2b:36:6b:23:4d:e4:71:2b:b0:79:69:36:0e:65:3b:0f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Last login: Tue Oct 30 13:08:33 2012 from localhost.localdomain
This indicates that the installation is successful. When you log on for the first time, you will be asked whether to continue connecting; enter yes to proceed.
In fact, whether password-less login is configured does not matter for the Hadoop installation itself. However, without it, every time you start Hadoop you have to enter a password to log on to the DataNode of each machine. Since Hadoop clusters usually consist of hundreds or thousands of machines, SSH password-less login is normally configured.
A way to understand public/private key pairs: a key pair is like a lock and a key; the public key is the lock and the private key is the key. The private key stays on the client and the public key is sent to the server. A server can hold many locks, while the client holds only one key. When the client logs on to the server over ssh, the server finds the lock issued by that client and asks the client for the matching key; if the key matches, the login succeeds, otherwise it fails. All of this applies to the same user; different users have different public/private key pairs, and each run of ssh-keygen generates a different pair.
Note:
- Enter superuser mode: type "su -"; the system asks for the superuser password, and after entering it you are in superuser mode.
- Add write permission to the file: enter the command "chmod u+w /etc/sudoers".
- Edit the /etc/sudoers file: enter the command "vim /etc/sudoers", press "i" to enter edit mode, find the line "root ALL=(ALL) ALL", add "xxx ALL=(ALL) ALL" below it (where xxx is your user name), then save and quit (press Esc and enter ":wq").
- Remove the write permission: enter the command "chmod u-w /etc/sudoers".
3. Install and run Hadoop
The roles Hadoop assigns to each node are as follows:
Hadoop divides hosts into roles from three perspectives. First, hosts are divided into master and slave. Second, from the HDFS perspective, hosts are divided into NameNode and DataNode (in a distributed file system, directory management is crucial, and the directory manager plays the role of the master; the NameNode is that directory manager). Third, from the MapReduce perspective, hosts are divided into JobTracker and TaskTracker (a job is usually split into multiple tasks, and from this perspective the relationship between the two is easy to understand).
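To make these roles concrete, in a small cluster the conf/masters and conf/slaves files simply list host names, one per line. The host names below are hypothetical; note that in Hadoop 1.x conf/masters designates the host that runs the secondary NameNode, while conf/slaves lists the DataNode/TaskTracker hosts:
$ cat conf/masters
master
$ cat conf/slaves
slave1
slave2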
Hadoop is available as the official Apache release and as the Cloudera distribution; Cloudera is a commercially supported version of Hadoop. The following describes how to install the official release.
Hadoop has three running modes: single-node (standalone) mode, pseudo-distributed mode, and cluster mode. At first glance, the first two do not reflect the advantages of cloud computing and have little significance in real applications, but they are still very useful when testing and debugging programs.
You can download the official release of Hadoop from the address below:
http://www.apache.org/dist/hadoop/core/
Download hadoop-1.0.4.tar.gz and decompress it into the user directory /home/[user]/:
tar -xzvf hadoop-1.0.4.tar.gz
- Single-node configuration
A single-node Hadoop installation requires no configuration. In this mode Hadoop is treated as a single Java process, which is often used for testing.
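As a quick test of standalone mode, you can run one of the bundled example jobs against local files (a sketch; the examples jar name below matches the 1.0.4 release layout and may differ in other versions):
$ cd /home/test/hadoop-1.0.4
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-examples-1.0.4.jar grep input output 'dfs[a-z.]+'
$ cat output/*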
- Pseudo-distributed configuration
Pseudo-distributed Hadoop can be regarded as a cluster with only one node; that node is both master and slave, both NameNode and DataNode, and both JobTracker and TaskTracker.
The pseudo-distributed configuration process is also very simple. You only need to modify several files, as shown below.
Go to the conf folder (under the decompressed directory) and modify the configuration file.
[test@localhost conf]$ pwd
/home/test/hadoop-1.0.4/conf
[test@localhost conf]$ ls hadoop-env.sh
hadoop-env.sh
[test@localhost conf]$ vim hadoop-env.sh
Add content:
export JAVA_HOME=/usr/java/jdk1.7.0
Specifies the JDK installation location
[test@localhost conf]$ pwd
/home/test/hadoop-1.0.4/conf
[test@localhost conf]$ ls core-site.xml
core-site.xml
Modify the file:
[test@localhost conf]$ vim core-site.xml
Add content:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
This is the core configuration file of hadoop. The address and port number of HDFS are configured here.
[test@localhost conf]$ ls hdfs-site.xml
hdfs-site.xml
Modify the file:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
This is the HDFS configuration in Hadoop. The default replication factor is 3; in the single-node version of Hadoop you need to change it to 1.
[test@localhost conf]$ ls mapred-site.xml
mapred-site.xml
[test@localhost conf]$ vim mapred-site.xml
Modify the file:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
This is the MapReduce configuration file in Hadoop, which configures the address and port of the JobTracker.
Note that if you are installing a version earlier than 0.20, there is only one configuration file, hadoop-site.xml.
Next, you need to format Hadoop's HDFS file system before starting Hadoop (just as in Windows, a newly partitioned volume always needs to be formatted). Enter the Hadoop folder and enter the following command:
[test@localhost hadoop-1.0.4]$ bin/hadoop namenode -format
12/11/01 00:20:50 INFO namenode.NameNode: STARTUP_MSG:
Re-format filesystem in /tmp/hadoop-test/dfs/name ? (Y or N) Y
12/11/01 00:20:55 INFO util.GSet: VM type = 32-bit
12/11/01 00:20:55 INFO util.GSet: 2% max memory = 17.77875 MB
12/11/01 00:20:55 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/11/01 00:20:55 INFO util.GSet: recommended=4194304, actual=4194304
12/11/01 00:20:55 INFO namenode.FSNamesystem: fsOwner=test
12/11/01 00:20:55 INFO namenode.FSNamesystem: supergroup=supergroup
12/11/01 00:20:55 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/11/01 00:20:55 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/11/01 00:20:55 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/11/01 00:20:55 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/11/01 00:20:56 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/11/01 00:20:56 INFO common.Storage: Storage directory /tmp/hadoop-test/dfs/name has been successfully formatted.
12/11/01 00:20:56 INFO namenode.NameNode: SHUTDOWN_MSG:
After the file system has been formatted, start Hadoop.
First, grant the user test the permission to use the hadoop folder:
[test@localhost ~]$ chown -hR test /home/test/hadoop-1.0.4
Enter the following command:
[test@localhost hadoop-1.0.4]$ bin/start-all.sh
starting namenode, logging to /home/test/hadoop-1.0.4/libexec/../logs/hadoop-test-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/test/hadoop-1.0.4/libexec/../logs/hadoop-test-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /home/test/hadoop-1.0.4/libexec/../logs/hadoop-test-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to /home/test/hadoop-1.0.4/libexec/../logs/hadoop-test-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to /home/test/hadoop-1.0.4/libexec/../logs/hadoop-test-tasktracker-localhost.localdomain.out
Use jps to view the started services:
[test@localhost ~]$ cd /home/test/hadoop-1.0.4
[test@localhost hadoop-1.0.4]$ jps
12657 SecondaryNameNode
12366 NameNode
12995 Jps
12877 TaskTracker
12739 JobTracker
12496 DataNode
Finally, verify that Hadoop has been installed successfully. Open a browser and enter the URLs:
http://localhost:50070/ (the HDFS web page)
http://localhost:50030/ (the MapReduce web page)
If both pages open, Hadoop has been installed successfully. For Hadoop, installing both MapReduce and HDFS is required; however, if necessary, you can start only HDFS or only MapReduce:
[test@localhost hadoop-1.0.4]$ bin/start-dfs.sh
[test@localhost hadoop-1.0.4]$ bin/start-mapred.sh
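Besides the web pages, you can also check the cluster from the command line once the daemons are running (a small sanity check; the directory name is arbitrary):
[test@localhost hadoop-1.0.4]$ bin/hadoop dfsadmin -report
[test@localhost hadoop-1.0.4]$ bin/hadoop fs -mkdir /test
[test@localhost hadoop-1.0.4]$ bin/hadoop fs -ls /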