1 Hadoop configuration
Caveat: turn off all firewalls (e.g. iptables) on every node.
Server | IP        | System
master | 10.0.0.9  | CentOS 6.0 x64
slave1 | 10.0.0.11 | CentOS 6.0 x64
slave2 | 10.0.0.12 | CentOS 6.0 x64
Hadoop version: hadoop-0.20.2.tar.gz
1.1 On master (the operations on slave1 and slave2 are the same as below)
# vi /etc/hosts    (the same configuration on all three machines)
10.0.0.9 master
10.0.0.11 slave1
10.0.0.12 slave2
# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=<host name of this machine>    (master, slave1 or slave2, respectively)
# reboot
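After the reboot, an optional quick check (assuming the host names and IPs from the table above) is to confirm that the new host name took effect and that every node can reach the others by name:
# hostname    (should print master, slave1 or slave2, depending on the machine)
# ping -c 1 master
# ping -c 1 slave1
# ping -c 1 slave2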
1.2 Log in as root and install scp (on master, slave1 and slave2)
# yum install openssh-clients
1.3 Log in as root and create the hadoop user (on master, slave1 and slave2)
# useradd hadoop
# passwd hadoop    (enter 111111 as the password)
1.4 Configuring SSH authentication
1.4.1 On master
# su - hadoop    (switch to the hadoop user and its home directory)
$ ssh-keygen -t rsa    (creates the .ssh directory; just press Enter at every prompt)
$ cd /home/hadoop/.ssh
1.4.2 On slave1
# su - hadoop    (switch to the hadoop user and its home directory)
$ ssh-keygen -t rsa    (creates the .ssh directory; just press Enter at every prompt)
1.4.3 On slave2
# su - hadoop    (switch to the hadoop user and its home directory)
$ ssh-keygen -t rsa    (creates the .ssh directory; just press Enter at every prompt)
1.4.4 On master
$ scp -r id_rsa.pub hadoop@slave1:/home/hadoop/.ssh/authorized_keys_m
This copies the master's public key to the hadoop user on slave1, saving it there as authorized_keys_m.
(enter the password: 111111)
$ scp -r id_rsa.pub hadoop@slave2:/home/hadoop/.ssh/authorized_keys_m
This copies the master's public key to the hadoop user on slave2, saving it there as authorized_keys_m.
(enter the password: 111111)
1.4.5 On slave1
$ cd /home/hadoop/.ssh
$ scp -r id_rsa.pub hadoop@master:/home/hadoop/.ssh/authorized_keys_s1
This uploads slave1's public key to the hadoop user on master, saving it there as authorized_keys_s1.
$ cat id_rsa.pub >> authorized_keys_m    (append the local public key to the copied file)
$ cp authorized_keys_m authorized_keys
$ rm -rf authorized_keys_m
1.4.6 On slave2
$ cd /home/hadoop/.ssh
$ scp -r id_rsa.pub hadoop@master:/home/hadoop/.ssh/authorized_keys_s2
This uploads slave2's public key to the hadoop user on master, saving it there as authorized_keys_s2.
$ cat id_rsa.pub >> authorized_keys_m    (append the local public key to the copied file)
$ cp authorized_keys_m authorized_keys
$ rm -rf authorized_keys_m
1.4.7 On master
$ cd /home/hadoop/.ssh
$ cat id_rsa.pub >> authorized_keys_s1    (append the local public key to authorized_keys_s1)
$ cat authorized_keys_s2 >> authorized_keys_s1    (merge slave2's key in as well)
$ cp authorized_keys_s1 authorized_keys
$ rm -rf authorized_keys_s1 authorized_keys_s2
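At this point passwordless SSH should work in both directions. An optional quick check from the hadoop user on master (the first connection may ask you to confirm the host key; answer yes):
$ ssh slave1 hostname    (should print slave1 without asking for a password)
$ ssh slave2 hostname    (should print slave2 without asking for a password)
and the same check back from slave1 and slave2:
$ ssh master hostname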
1.5 Log in as root and install the JDK (on master, slave1 and slave2)
# yum install java-1.6.0-openjdk-devel
With this installation, the Java executables are automatically placed in the /usr/bin/ directory.
Verify with the shell command java -version and check that it reports the expected version.
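The exact JDK directory is needed later for JAVA_HOME in conf/hadoop-env.sh. One optional way to find it (a sketch; it assumes the -devel package above put javac on the PATH) is to resolve the javac symlink:
# readlink -f $(which javac) | sed 's:/bin/javac::'    (prints something like /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64)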
1.6 Log in as root and install Hadoop (on master). (I also did this on slave1 and slave2 myself; I am not sure whether that is strictly necessary, since the configured installation is copied to the slaves later.)
# mount -t auto /dev/sdb /mnt    (only needed if the tarball is on a USB stick)
# cp hadoop-0.20.2.tar.gz /home/hadoop    (copy from wherever the tarball is, e.g. the mounted USB stick)
# cd /home/hadoop
# tar -vxzf hadoop-0.20.2.tar.gz    (extracts to /home/hadoop/hadoop-0.20.2)
Modify /etc/profile on master and add the following:
export HADOOP_HOME=/home/hadoop/hadoop-0.20.2
export PATH=$PATH:$HADOOP_HOME/bin
Then run:
# source /etc/profile    (to make it take effect)
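To confirm that the two variables took effect in the current shell (an optional check):
# echo $HADOOP_HOME    (should print /home/hadoop/hadoop-0.20.2)
# which hadoop    (should print /home/hadoop/hadoop-0.20.2/bin/hadoop)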
1.7 Log in as root and configure Hadoop (on master)
Configure the conf/hadoop-env.sh file.
Add: export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64
(change this to the installation location of your JDK)
Test the Hadoop installation (as the hadoop user, from the /home/hadoop/hadoop-0.20.2 directory):
$ hadoop jar hadoop-0.20.2-examples.jar wordcount conf/ /tmp/out
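Because the cluster has not been configured yet at this point, the job runs in Hadoop's local (standalone) mode and the result is written to the local file system. An optional way to check it (a sketch, assuming the /tmp/out output path used above):
$ cat /tmp/out/part-*    (shows the word counts for the files under conf/)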
1.8 Cluster configuration (the same on all nodes; alternatively, configure on master and copy to the other machines)
1.8.1 Configuration file: conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:49000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop_home/var</value>
</property>
</configuration>
1) fs.default.name is the URI of the NameNode: hdfs://hostname:port/.
2) hadoop.tmp.dir is Hadoop's default temporary path, and it is best to configure it. If a newly added node, or a DataNode in some other situation, inexplicably fails to start, delete the tmp directory under this path. Note, however, that if you delete this directory on the NameNode machine, you will have to re-run the NameNode format command.
1.8.2 Configuration file: conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:49001</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/hadoop_home/var</value>
</property>
</configuration>
1) mapred.job.tracker is the host (or IP) and port of the JobTracker, in the form host:port.
1.8.3 Configuration file: conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/name1</value>
<description> </description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/data1</value>
<description> </description>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
1) dfs.name.dir is the path on the local file system where the NameNode persistently stores the namespace and transaction log. When this value is a comma-delimited list of directories (e.g. /home/hadoop/name1,/home/hadoop/name2), the name table is replicated into all of the directories for redundancy.
2) dfs.data.dir is a comma-separated list of local file system paths where the DataNode stores block data. When this value is a comma-separated list of directories, the data is stored in all of them, usually on different devices.
3) dfs.replication is the number of replicas kept for each block; the default is 3. If this number is larger than the number of DataNodes in the cluster, some blocks will remain under-replicated.
Note: the name1, name2, data1 and data2 directories here must not be created in advance; Hadoop creates them automatically when the file system is formatted. Creating them beforehand causes problems.
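Because a stray character in any of these XML files will keep the daemons from starting, an optional well-formedness check can save time (a sketch; it assumes the xmllint tool from libxml2 is installed):
$ cd /home/hadoop/hadoop-0.20.2
$ xmllint --noout conf/core-site.xml conf/mapred-site.xml conf/hdfs-site.xml    (prints nothing if all three files are well formed)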
1.8.4 Configure the masters and slaves files (master and slave nodes)
Configure conf/masters and conf/slaves to set the master and slave nodes. It is best to use host names, and every host must be reachable by its host name.
vi masters, and enter:
node1
vi slaves, and enter:
node2
node3
(The names must match the host names configured in /etc/hosts; in the cluster above, node1 corresponds to master and node2/node3 to slave1 and slave2.)
After the configuration is finished, copy the configured Hadoop folder to the other cluster machines and make sure the configuration above is correct for them; for example, if the Java installation path on another machine differs, modify its conf/hadoop-env.sh accordingly.
$ scp -r /home/hadoop/hadoop-0.20.2 hadoop@slave1:/home/hadoop
$ scp -r /home/hadoop/hadoop-0.20.2 hadoop@slave2:/home/hadoop
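An optional check that the copy arrived intact, using the passwordless SSH set up in section 1.4 (assuming the slave1/slave2 host names above):
$ ssh slave1 ls /home/hadoop/hadoop-0.20.2/conf
$ ssh slave2 ls /home/hadoop/hadoop-0.20.2/conf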
1.8.5 As root, change the directory permissions (on master, slave1 and slave2) (this step is optional)
# cd /home/
# chown -R hadoop:hadoop hadoop
# chmod ugo+rwx hadoop
1.9 Starting Hadoop
Precautions:
Format and start Hadoop only on the master; the slaves are started automatically along with the master and need no manual operation.
1.9.1 (as the hadoop user) Format a new distributed file system (make sure every firewall, such as iptables, is turned off)
First, format a new distributed file system:
$ bin/hadoop namenode -format
On success, the system output looks like this:
12/02/06 00:46:50 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.203.0
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203 -r 1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
************************************************************/
12/02/06 00:46:50 INFO namenode.FSNamesystem: fsOwner=root,root
12/02/06 00:46:50 INFO namenode.FSNamesystem: supergroup=supergroup
12/02/06 00:46:50 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/02/06 00:46:50 INFO common.Storage: Image file of size 94 saved in 0 seconds.
12/02/06 00:46:50 INFO common.Storage: Storage directory /opt/hadoop/hadoopfs/name1 has been successfully formatted.
12/02/06 00:46:50 INFO common.Storage: Image file of size 94 saved in 0 seconds.
12/02/06 00:46:50 INFO common.Storage: Storage directory /opt/hadoop/hadoopfs/name2 has been successfully formatted.
12/02/06 00:46:50 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at v-jiwan-ubuntu-0/127.0.0.1
************************************************************/
Check the output to make sure the distributed file system was formatted successfully.
After execution, you can see the /home/hadoop/name1 directory (and name2, if two name directories were configured) on the master machine. Hadoop is then started on the master node, and the master starts Hadoop on all the slave nodes.
1.9.2 (as the hadoop user) Start all the nodes
Start method 1:
$ bin/start-all.sh    (starts both HDFS and Map/Reduce)
System output:
starting namenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-namenode-ubuntu.out
node2: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-ubuntu.out
node3: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-ubuntu.out
node1: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/logs/hadoop-hadoop-jobtracker-ubuntu.out
node2: starting tasktracker, logging to /usr/local/hadoop/logs/hadoop-hadoop-tasktracker-ubuntu.out
node3: starting tasktracker, logging to /usr/local/hadoop/logs/hadoop-hadoop-tasktracker-ubuntu.out
As you can see in the slave output above, each slave automatically formats its storage directory (specified by dfs.data.dir) if it is not formatted already, and also creates the directory if it does not yet exist.
After execution, you can see the /home/hadoop/data1 directory (and data2, if configured) on the DataNode machines, i.e. the slaves (node2 and node3).
Start method 2:
Starting a Hadoop cluster requires starting both the HDFS cluster and the Map/Reduce cluster.
On the designated NameNode, run the following command to start HDFS:
$ bin/start-dfs.sh    (starts the HDFS cluster by itself)
The bin/start-dfs.sh script reads the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on every listed slave.
On the designated JobTracker, run the following command to start Map/Reduce:
$ bin/start-mapred.sh    (starts Map/Reduce by itself)
The bin/start-mapred.sh script reads the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on every listed slave.
1.9.3 Stop all the nodes
Stop Hadoop from the master node; the master shuts down Hadoop on all the slave nodes.
$ bin/stop-all.sh
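If you prefer, HDFS and Map/Reduce can also be stopped separately with the scripts that ship alongside the start scripts used above:
$ bin/stop-dfs.sh    (stops the HDFS cluster: NameNode, DataNodes and SecondaryNameNode)
$ bin/stop-mapred.sh    (stops the Map/Reduce cluster: JobTracker and TaskTrackers)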
The logs of the Hadoop daemons are written to the ${HADOOP_LOG_DIR} directory (by default ${HADOOP_HOME}/logs).
${HADOOP_HOME} is the installation path.
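When a daemon fails to start, its log file is usually the quickest place to look. An optional sketch for checking the NameNode log on master (the file name follows the hadoop-<user>-<daemon>-<host>.log pattern):
$ tail -n 50 $HADOOP_HOME/logs/hadoop-hadoop-namenode-*.log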
1.10 Testing
Note: if you access the cluster from a Windows client, add the IP addresses and host names of master, slave1, etc. to the hosts file under C:\Windows (C:\Windows\System32\drivers\etc\hosts).
1) Browse the web interfaces of the NameNode and the JobTracker; by default their addresses are:
NameNode - http://node1:50070/
JobTracker - http://node1:50030/    (on the node running the JobTracker, i.e. the master in this setup)
2) Use netstat -nat to check whether ports 49000 and 49001 (the NameNode and JobTracker ports configured above) are listening.
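For example (an optional check; run it on master after the daemons have been started):
$ netstat -nat | grep -E '49000|49001'    (each port should show up in LISTEN state)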
3) Use jps to view the processes.
To check whether the daemons are running, use the jps command (the ps utility for JVM processes). On each machine it lists the Hadoop daemons running there together with their process identifiers; across the cluster there are five daemons in total (NameNode, SecondaryNameNode, JobTracker, DataNode and TaskTracker).
4) Copy the input files to the distributed file system:
$ bin/hadoop fs -mkdir input
$ bin/hadoop fs -put conf/core-site.xml input
Run the sample program provided with the release:
$ bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'
5) Supplement
Q: What does bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+' mean?
A: bin/hadoop jar (run a jar with Hadoop) hadoop-0.20.2-examples.jar (the name of the jar) grep (the example class to run, followed by its arguments) input output 'dfs[a-z.]+'
The whole command runs the grep example from the Hadoop samples against the input directory on HDFS and writes the result to the output directory.
Q: What is grep?
A: A Map/Reduce program that counts the matches of a regex in the input.
To view the output files:
Copy the output files from the distributed file system to the local file system and view them:
$ bin/hadoop fs -get output output
$ cat output/*
or view the output files directly on the distributed file system:
$ bin/hadoop fs -cat output/*
The result:
$ bin/hadoop fs -cat output/part-00000
3 dfs.class
2 dfs.period
1 dfs.file
1 dfs.replication
1 dfs.servers
1 dfsadmin
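Note: if you re-run the example, the job will fail when the output directory already exists on HDFS, so remove it first (an optional step, not shown above):
$ bin/hadoop fs -rmr output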
Other checks:
On the NameNode, view the processes with jps (the small tool that ships with Java):
[hadoop@master ~]$ /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin/jps
18978 JobTracker
21242 Jps
18899 SecondaryNameNode
18731 NameNode
View the processes on each DataNode:
[hadoop@slave1 ~]$ /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin/jps
17706 TaskTracker
20499 Jps
17592 DataNode
[hadoop@slave2 ~]$ /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin/jps
28550 TaskTracker
28436 DataNode
30798 Jps
View the cluster status on the NameNode:
[hadoop@master ~]$ hadoop dfsadmin -report
Configured Capacity: 123909840896 (115.4 GB)
Present Capacity: 65765638144 (61.25 GB)
DFS Remaining: 65765257216 (61.25 GB)
DFS Used: 380928 (372 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Datanodes available: 2 (2 total, 0 dead)
This completes the Hadoop configuration walkthrough.