Hadoop pseudo-distributed mode configuration and installation
The basic installation of Hadoop was introduced in the previous article on Hadoop standalone mode. This section describes how to simulate and deploy Hadoop in pseudo-distributed mode.
Software to install:
System: Linux 2.6.32-358.el6.x86_64
JDK: jdk-7u7-linux-i586.tar.gz
Hadoop version: hadoop-0.20.2-cdh3u4.tar.gz
Hardware environment:
Three hosts:
gdy192 192.168.61.192
gdy194 192.168.61.194
gdy195 192.168.61.195
The deployment plan is as follows:
Deploy on gdy192: NameNode and JobTracker
Deploy on gdy194: SecondaryNameNode
Deploy on gdy195: DataNode and TaskTracker
First, configure the hosts files of the three hosts so that they can reach one another by hostname instead of by IP address.
Start by editing the file on gdy192.
[root@gdy192 /]# vim /etc/hosts
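Based on the host list above, the entries added to /etc/hosts would look like this (one line per host):
192.168.61.192   gdy192
192.168.61.194   gdy194
192.168.61.195   gdy195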
Save and exit with :wq.
Copy the configured file to the other two hosts.
First copy it to gdy194.
[root@gdy192 ~]# scp /etc/hosts root@gdy194:/etc/
Enter the root password of gdy194.
The copy succeeds.
Go to gdy194 and check /etc/hosts to verify that it is the file we just modified.
[root@gdy194 /]# cat /etc/hosts
You can see that the copy is successful.
Copy it to gdy195 as well.
On gdy192, enter:
[root@gdy192 ~]# scp /etc/hosts root@gdy195:/etc/
The verification step is not repeated here.
Create the JDK and Hadoop installation directory /usr/gd on gdy192:
[root@gdy192 /]# mkdir /usr/gd/ -pv
Create the same JDK and Hadoop installation directory /usr/gd on gdy194.
Create the same JDK and Hadoop installation directory /usr/gd on gdy195 (the commands are shown below).
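The commands are the same as the one used on gdy192:
[root@gdy194 /]# mkdir /usr/gd/ -pv
[root@gdy195 /]# mkdir /usr/gd/ -pv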
Create the hduser user and set its password on gdy192, gdy194, and gdy195 respectively.
On gdy192:
[root@gdy192 /]# useradd hduser
[root@gdy192 /]# passwd hduser
On gdy194:
[root@gdy194 /]# useradd hduser
[root@gdy194 /]# passwd hduser
On gdy195:
[root@gdy195 /]# useradd hduser
[root@gdy195 /]# passwd hduser
Copy the prepared software packages (the JDK and Hadoop archives) to gdy192.
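If the two archives are still on another machine, they could be copied over with scp; a sketch, where the source location and the destination directory /ftp are assumptions (the /ftp directory only mirrors the prompt shown in the tar commands below):
scp jdk-7u7-linux-i586.tar.gz hadoop-0.20.2-cdh3u4.tar.gz root@gdy192:/ftp/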
After the files have been copied, decompress the two archives into the created directory /usr/gd/.
[root@gdy192 ftp]# tar -xf jdk-7u7-linux-i586.tar.gz -C /usr/gd/
[root@gdy192 ftp]# tar -xf hadoop-0.20.2-cdh3u4.tar.gz -C /usr/gd/
Use ls /usr/gd/ to view the extracted files.
Create soft links for the JDK and Hadoop in the /usr/gd directory.
[root@gdy192 ftp]# ln -s /usr/gd/jdk1.7.0_07/ /usr/gd/java
[root@gdy192 ftp]# ln -s /usr/gd/hadoop-0.20.2-cdh3u4/ /usr/gd/hadoop
[root@gdy192 ftp]# ll /usr/gd/
Configure the Java and Hadoop environment variables.
Configure the Java environment variables:
[root@gdy192 /]# vim /etc/profile.d/java.sh
Add the following information:
JAVA_HOME=/usr/gd/java
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME PATH
Save and exit with :wq.
Configure the Hadoop environment variables:
[root@gdy192 /]# vim /etc/profile.d/hadoop.sh
Add the following information:
HADOOP_HOME=/usr/gd/hadoop
PATH=$HADOOP_HOME/bin:$PATH
export HADOOP_HOME PATH
Save and exit with :wq.
Use scp to copy the two files to the /etc/profile.d/ directory on gdy194 and gdy195 respectively.
Copy to gdy194:
[root@gdy192 /]# scp /etc/profile.d/java.sh root@gdy194:/etc/profile.d/
[root@gdy192 /]# scp /etc/profile.d/hadoop.sh root@gdy194:/etc/profile.d/
Copy to gdy195:
[root@gdy192 /]# scp /etc/profile.d/java.sh root@gdy195:/etc/profile.d/
[root@gdy192 /]# scp /etc/profile.d/hadoop.sh root@gdy195:/etc/profile.d/
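To make the new variables take effect in the current shell, you can source the two scripts (or simply log out and back in) and then check the versions; a minimal check, assuming the paths configured above:
[root@gdy192 /]# source /etc/profile.d/java.sh
[root@gdy192 /]# source /etc/profile.d/hadoop.sh
[root@gdy192 /]# java -version
[root@gdy192 /]# hadoop version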
Change the owner and group of all files in the /usr/gd/ directory to hduser.
[root@gdy192 /]# chown -R hduser.hduser /usr/gd
Switch to hduser on gdy192.
[root@gdy192 /]# su - hduser
Use ssh-keygen and ssh-copy-id so that hduser on gdy192 can log in to the hduser accounts on gdy194 and gdy195 without a password.
Commands:
First create a key pair:
[hduser@gdy192 ~]$ ssh-keygen -t rsa -P ''
Press Enter to accept the defaults.
Use ssh-copy-id to copy the generated public key to hduser on gdy194 so that gdy192 can access gdy194 without a password.
[hduser@gdy192 ~]$ ssh-copy-id -i .ssh/id_rsa.pub hduser@gdy194
Enter yes.
Enter the hduser password of gdy194.
Use ssh-copy-id to copy the public key to hduser on gdy195 so that gdy192 can access gdy195 without a password.
[hduser@gdy192 ~]$ ssh-copy-id -i .ssh/id_rsa.pub hduser@gdy195
Use ssh-copy-id to copy the public key to hduser on gdy192 itself so that gdy192 can access gdy192 without a password.
Note: Hadoop starts and manages its daemons over SSH, and even when it connects to its own machine it will prompt for a password unless passwordless access has been configured. This is the same as when configuring Hadoop standalone mode: password-free access to the local host is required as well.
[hduser@gdy192 ~]$ ssh-copy-id -i .ssh/id_rsa.pub hduser@gdy192
Verify that gdy192 can access gdy194 without a password.
[hduser@gdy192 ~]$ ssh gdy194 'date'
If the date on gdy194 is displayed without a password prompt, the configuration is successful.
Verify that gdy192 can access gdy195 without a password.
[hduser@gdy192 ~]$ ssh gdy195 'date'
Verify that gdy192 can access itself without entering a password.
[hduser@gdy192 ~]$ ssh gdy192 'date'
Check whether the system time on the three machines is the same.
[hduser@gdy192 ~]$ ssh gdy194 'date'; ssh gdy195 'date'; ssh gdy192 'date'
Synchronize time on three nodes:
Note: Because the hduser account used by Hadoop does not have permission to change the system time, you need to configure passwordless root access to gdy194, gdy195, and gdy192 and then set the time in one step, or use some other method to keep the clocks synchronized. Time synchronization is required in a production deployment; skipping this step has little impact on the Hadoop pseudo-distributed mode, but it is still recommended.
The configuration code is as follows.
[hduser@gdy192 ~]$ exit
Exit hduser first.
[root@gdy192 /]# cd ~
Enter the root user's home directory.
Create a key pair:
[root@gdy192 ~]# ssh-keygen -t rsa -P ''
Copy the public key to gdy194 and gdy195:
[root@gdy192 ~]# ssh-copy-id -i .ssh/id_rsa.pub root@gdy194
[root@gdy192 ~]# ssh-copy-id -i .ssh/id_rsa.pub root@gdy195
Apart from the first connection, where root still has to confirm the host key with yes, root can now reach the other hosts without a password. This is configured only to make it easy to synchronize the time of the three computers; Hadoop itself does not use root access.
First check the time on the three computers:
[root@gdy192 ~]# ssh gdy194 'date'; ssh gdy195 'date'; date
Set the time on the three computers to the same value (the argument format is MMDDhhmmYY):
[root@gdy192 ~]# ssh gdy194 'date 0929235415'; ssh gdy195 'date 0929235415'; date 0929235415
View the time again:
[root@gdy192 ~]# ssh gdy194 'date'; ssh gdy195 'date'; date
You can see that the time is now synchronized.
On gdy192, switch back to hduser.
[root@gdy192 ~]# su - hduser
Check the time on the three computers:
[hduser@gdy192 ~]$ ssh gdy194 'date'; ssh gdy195 'date'; ssh gdy192 'date'
Next we will start editing the Hadoop configuration files.
Change to the directory that holds the Hadoop configuration files:
[hduser@gdy192 ~]$ cd /usr/gd/hadoop/conf/
The important files were explained in the Hadoop standalone mode configuration and are not described again here. For details, see "Hadoop standalone mode configuration and installation".
Edit the masters file:
[hduser@gdy192 conf]$ vim masters
Change the original localhost to gdy194.
Save and exit with :wq.
Note: As planned above, gdy194 serves as the SecondaryNameNode.
The masters file is where the secondary name node is configured.
Edit the slaves file:
[hduser@gdy192 conf]$ vim slaves
Change the original localhost to gdy195.
Save and exit with :wq.
The slaves file defines the data nodes.
Edit the file core-site.xml:
[hduser@gdy192 conf]$ vim core-site.xml
Add the following information directly inside the <configuration> element:
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/temp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://gdy192:8020</value>
</property>
Save and exit with :wq.
Note: fs.default.name defines the location of the master (NameNode). Because the configuration files are identical on every node, an IP address or hostname must be used here to point to the master node.
Because hadoop.tmp.dir defines a Hadoop cache directory under /hadoop, this directory has to be created on all three computers.
Switch to the root user.
[hduser@gdy192 conf]$ su - root
Create the /hadoop directory.
[root@gdy192 ~]# mkdir /hadoop/
Change the owner and group of the /hadoop directory to hduser so that hduser has write permission in this directory.
[root@gdy192 ~]# chown -R hduser.hduser /hadoop
Create the same directory on gdy194 and gdy195 and grant hduser the same permissions.
On gdy194:
[root@gdy194 /]# mkdir /hadoop
[root@gdy194 /]# chown -R hduser.hduser /hadoop
On gdy195:
[root@gdy195 /]# mkdir /hadoop
[root@gdy195 /]# chown -R hduser.hduser /hadoop
Back on gdy192, exit the current user to return to the previous hduser session.
[root@gdy192 ~]# exit
Note: Because root was entered with su from the hduser session, exiting returns you to that hduser session and the directory it was working in.
Edit the file mapred-site.xml:
[root@gdy192 conf]# vim mapred-site.xml
Add the following information inside the <configuration> element:
<property>
<name>mapred.job.tracker</name>
<value>gdy192:8021</value>
</property>
Save and exit with :wq.
Similarly, mapred.job.tracker defines the JobTracker, which our deployment plan places on gdy192.
Therefore the localhost used in standalone mode must be changed to an IP address or hostname here.
Edit the file hdfs-site.xml:
[root@gdy192 conf]# vim hdfs-site.xml
Add the following information inside the <configuration> element:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>The actual number of replications can be specified when the file is created.</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/data</value>
<final>true</final>
<description>The directories where the datanode stores blocks.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/hadoop/name</value>
<final>true</final>
<description>The directories where the namenode stores its persistent metadata.</description>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/hadoop/namesecondary</value>
<final>true</final>
<description>The directories where the secondarynamenode stores checkpoints.</description>
</property>
Save and exit with :wq.
Note: this file defines the locations of the other Hadoop directories; if they are not defined here, defaults under the cache directory defined in core-site.xml are used.
The Hadoop configuration files are now complete.
On gdy194 and gdy195, create the link files in the /usr/gd/ folder in the same way (the operation is the same as above). These are repeated operations and are not explained again; a sketch is given below.
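Assuming the two archives have already been extracted into /usr/gd/ on gdy194 and gdy195, just as on gdy192, the links would be created the same way:
[root@gdy194 /]# ln -s /usr/gd/jdk1.7.0_07/ /usr/gd/java
[root@gdy194 /]# ln -s /usr/gd/hadoop-0.20.2-cdh3u4/ /usr/gd/hadoop
[root@gdy195 /]# ln -s /usr/gd/jdk1.7.0_07/ /usr/gd/java
[root@gdy195 /]# ln -s /usr/gd/hadoop-0.20.2-cdh3u4/ /usr/gd/hadoop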
Next, copy the configuration files that have been set up on gdy192 to the corresponding location on gdy194 and gdy195.
The method is as follows:
On gdy192, copy the files to gdy194 and gdy195 as the hduser user.
[hduser@gdy192 hadoop]$ scp /usr/gd/hadoop/conf/* gdy194:/usr/gd/hadoop/conf/
[hduser@gdy192 hadoop]$ scp /usr/gd/hadoop/conf/* gdy195:/usr/gd/hadoop/conf/
Use the root user on gdy194 and gdy195 to give hduser ownership of the /usr/gd/hadoop folder respectively.
On gdy194:
[root@gdy194 /]# chown hduser.hduser /usr/gd/ -R
[root@gdy194 /]# ll /usr/gd/
On gdy195:
[root@gdy195 /]# chown hduser.hduser /usr/gd/ -R
[root@gdy195 /]# ll /usr/gd/
Hadoop's pseudo-distributed mode is now fully configured.
Start the Hadoop pseudo-distributed mode.
On the gdy192 host, log in as root again,
then switch to hduser.
Format Hadoop's file system, HDFS:
[hduser@gdy192 ~]$ hadoop namenode -format
Start Hadoop:
[hduser@gdy192 ~]$ start-all.sh
You can see that the NameNode and JobTracker processes start successfully on gdy192.
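A quick local check (not part of the original steps) is to run jps on gdy192:
[hduser@gdy192 ~]$ jps
The output should list the NameNode and JobTracker processes, each preceded by its process ID.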
Check whether the SecondaryNameNode started successfully on gdy194.
[hduser@gdy192 ~]$ ssh gdy194 'jps'
You can see that it started successfully.
Check whether the DataNode and TaskTracker started successfully on gdy195.
[hduser@gdy192 ~]$ ssh gdy195 'jps'
You can see that everything started successfully.
Use:
[hduser@gdy192 ~]$ netstat -nlpt
to view the ports Hadoop is listening on.
Among them, port 50030 is Hadoop's external web port, where you can view information about Hadoop MapReduce jobs.
Port 50070 serves the Hadoop NameNode information.
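If you only want to pick out these two web ports from the netstat output, a small convenience filter (not in the original steps) is:
[hduser@gdy192 ~]$ netstat -nlpt | grep -E ':(50030|50070)'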
To view Hadoop MapReduce job information, open http://192.168.61.192:50030/jobtracker.jsp in a browser.
To view Hadoop NameNode information, open http://192.168.61.192:50070/dfshealth.jsp in a browser.
Because the SecondaryNameNode is deployed on gdy194,
view the Hadoop process port information on gdy194:
[hduser@gdy192 ~]$ ssh gdy194 'netstat -nlpt'
Port 50090 is the external web port of Hadoop's SecondaryNameNode.
You can open http://192.168.61.194:50090/status.jsp
to access the SecondaryNameNode web interface.
Similarly, because the DataNode and TaskTracker are deployed on gdy195,
view the port information of gdy195 from gdy192:
[hduser@gdy192 ~]$ ssh gdy195 'netstat -nlpt'
Port 50060 shows the Hadoop TaskTracker node information.
Port 50075 shows the Hadoop DataNode node information.
They can be reached at the following addresses:
http://192.168.61.195:50075/
http://192.168.61.195:50060/tasktracker.jsp
Note: In an actual deployment, the IP address in front of these web ports is the address of the host you actually deployed on; the addresses listed here follow my deployment.
Run a Hadoop word count example.
Use machine gdy192.
Create a new test folder on Hadoop's distributed file system, HDFS (the command is shown below).
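The creation command itself does not appear in the original text; with the standard HDFS shell it would be:
[hduser@gdy192 ~]$ hadoop fs -mkdir /test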
View the created folder:
[hduser@gdy192 ~]$ hadoop fs -ls /
Upload a system file to the test folder.
[hduser@gdy192 ~]$ hadoop fs -put /etc/hosts /test
View the uploaded file:
[hduser@gdy192 ~]$ hadoop fs -ls /test
Run a word count over all the files in the /test directory on HDFS and write the results to the /word directory (the output directory must not already exist).
[hduser@gdy192 ~]$ hadoop jar /usr/gd/hadoop/hadoop-examples-0.20.2-cdh3u4.jar wordcount /test /word
While the job runs, you can monitor its progress at http://192.168.61.192:50030/jobtracker.jsp; the completed job is displayed there after it finishes.
View the output directory of the word count:
[hduser@gdy192 ~]$ hadoop fs -ls /word
View the output file part-r-00000 to see the word count results for the files in the /test directory:
[hduser@gdy192 ~]$ hadoop fs -cat /word/part-r-00000
This is the result of the word count run above.
This completes the installation and deployment of Hadoop in both standalone mode and pseudo-distributed mode.
In practice, after Hadoop is installed, HBase is usually installed on top of it to make data storage and management easier. How to deploy HBase on Hadoop in standalone mode and in pseudo-distributed mode will be covered in a later article.