Hadoop Learning < > -- Hadoop Installation and Environment Variable Settings


Hadoop core projects: HDFS (the Hadoop Distributed File System) and MapReduce (a parallel computing framework).
The master-slave structure of the HDFS architecture: the master node is a single NameNode, which receives user requests, maintains the directory structure of the file system, and manages the mapping between files and blocks and between blocks and DataNodes.
The slave nodes are the DataNodes, of which there are many. They store the files; files are split into blocks stored on disk (which makes them easier to manage and allows reads from multiple nodes), and each block is replicated many times to keep the data safe.
The master-slave structure of MapReduce: the master node is a single JobTracker, which receives computing jobs submitted by clients, splits them into tasks assigned to the TaskTrackers, and monitors the TaskTrackers' execution.
The slave nodes are the TaskTrackers, of which there are many. They execute the tasks assigned by the JobTracker, much like the relationship between a project manager (the JobTracker) and programmers (the TaskTrackers).

Modify the host name. To change the hostname for the current session: hostname zebra
To change the hostname in the configuration file: vi /etc/sysconfig/network and set HOSTNAME=zebra
Bind the hostname to the IP: run vi /etc/hosts and add a line such as the following:
192.168.1.120 zebra (the host's IP and hostname), then save and exit.
Verify: ping zebra
Shut down the firewall: run service iptables stop
Verify: service iptables status
Keep the firewall from starting automatically: run chkconfig iptables off
Verify: chkconfig --list | grep iptables
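
The steps above, gathered into one sketch (this assumes a CentOS-style system where service and chkconfig manage iptables, and the 192.168.1.120/zebra values used in this walkthrough):

hostname zebra                             # hostname for the current session only
vi /etc/sysconfig/network                  # set HOSTNAME=zebra to make it permanent
echo "192.168.1.120 zebra" >> /etc/hosts   # bind the IP to the hostname
ping zebra                                 # verify the binding
service iptables stop                      # shut down the firewall now
chkconfig iptables off                     # keep it from starting on boot
chkconfig --list | grep iptables           # verify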

JDK installation: copy the installer to the Linux system: cp /media/jdk-1.6.bin /home/zebra
cd /home/zebra
Grant execute permission on the file: chmod u+x jdk-1.6.bin
Run the self-extracting installer: ./jdk-1.6.bin
Configure the JDK environment variables: vi /etc/profile
Add two lines: export JAVA_HOME=/home/zebra/jdk1.6
export PATH=.:$JAVA_HOME/bin:$PATH (the existing $PATH must be appended at the end, otherwise some of the system's configuration may stop working)
Save and exit, then run source /etc/profile to make the configuration take effect.
Verify: javac -version
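
The same JDK steps as a single sketch, assuming the installer name jdk-1.6.bin and the /home/zebra paths used in this walkthrough:

cp /media/jdk-1.6.bin /home/zebra
cd /home/zebra
chmod u+x jdk-1.6.bin
./jdk-1.6.bin                      # self-extracting installer
# then append to /etc/profile:
export JAVA_HOME=/home/zebra/jdk1.6
export PATH=.:$JAVA_HOME/bin:$PATH
# reload the profile and verify:
source /etc/profile
javac -version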


How to find the default JDK installation path on an Ubuntu system.

Alternatively, look it up manually (your machine may give different results, but the approach is the same):
which javac
returns /usr/bin/javac
file /usr/bin/javac
returns /usr/bin/javac: symbolic link to `/etc/alternatives/javac'
then file /etc/alternatives/javac
returns /etc/alternatives/javac: symbolic link to `/usr/lib/jvm/java-6-sun/bin/javac'
then file /usr/lib/jvm/java-6-sun/bin/javac
returns /usr/lib/jvm/java-6-sun/bin/javac: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.2.5, not stripped
At this point we know the path is /usr/lib/jvm/java-6-sun/bin/, which is the path to set in Eclipse.
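
As a shortcut (an assumption on my part: GNU coreutils is available), readlink -f follows the whole symlink chain in one step:

readlink -f $(which javac)
# prints /usr/lib/jvm/java-6-sun/bin/javac on the machine above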

Installing Hadoop
Unpack the archive: tar -zxvf hadoop.1.0.tar.gz
Rename the directory: mv hadoop.1.0 hadoop
Set the environment variables: vi /etc/profile
export HADOOP_HOME=/home/zebra/hadoop
Modify the PATH line to: export PATH=.:$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
Save and exit.
Run source /etc/profile to make the configuration take effect.
Then modify the Hadoop configuration files in the $HADOOP_HOME/conf directory.
Four files need to be changed: hadoop-env.sh, core-site.xml, hdfs-site.xml, and mapred-site.xml.
The first is the Hadoop environment script hadoop-env.sh: change line 9 to export JAVA_HOME=/home/zebra/jdk1.6
Save and exit. This sets JAVA_HOME for Hadoop; note that the leading # (comment) symbol must be removed.
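
A sketch of the installation and environment steps above, assuming the archive name, paths, and hostname used in this walkthrough:

tar -zxvf hadoop.1.0.tar.gz
mv hadoop.1.0 hadoop
# append to /etc/profile:
export HADOOP_HOME=/home/zebra/hadoop
export PATH=.:$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
source /etc/profile
# in $HADOOP_HOME/conf/hadoop-env.sh, uncomment and set:
export JAVA_HOME=/home/zebra/jdk1.6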
The second is the Hadoop core configuration file, core-site.xml, which should contain the following:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/zebra/hadoop/tmp</value>
    <description>Directory for Hadoop's runtime temporary files</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://zebra:9000</value>
    <description>Change this to the access URI of your own HDFS</description>
  </property>
</configuration>
The third is the HDFS configuration file, hdfs-site.xml, which should contain the following:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of block replicas to store</description>
  </property>
</configuration>
The fourth is the MapReduce configuration file, mapred-site.xml, which should contain the following:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>zebra:9001</value>
    <description>Change this to the address of your own JobTracker</description>
  </property>
</configuration>
This is the minimal configuration for a pseudo-distributed installation. The next step is to format the file system.
HDFS is a file system, so it must be formatted before it is used for the first time. Run the command hadoop namenode -format (the script lives under $HADOOP_HOME/bin). Note: format only before the first start. If you really do need to format again, delete the files under the $HADOOP_HOME/tmp directory first.
After formatting completes, start Hadoop.
The scripts that start Hadoop all live under $HADOOP_HOME/bin/, so the commands below omit the full path.
There are three ways to start Hadoop:
First, start everything at once:
Run start-all.sh to start Hadoop. Watching the console output, you can see the processes being started: NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker, five in total. When the script finishes it does not mean all five processes started successfully, only that the system has begun starting them.
Use the JDK command jps to check whether the processes actually started. Run jps; if all five processes appear, Hadoop really did start successfully. If one or more are missing, you need to find out why.
The command to shut down Hadoop is stop-all.sh.
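
The whole one-shot cycle as a sketch (it assumes $HADOOP_HOME/bin is already on the PATH, as configured above):

hadoop namenode -format     # first start only
start-all.sh                # starts NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker
jps                         # verify that all five daemons (plus Jps itself) are listed
stop-all.sh                 # shut everything down again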
This is the simplest approach, starting and stopping all of the daemons at once. There are also commands that start them separately.
Second, start HDFS and MapReduce separately: run start-dfs.sh to start HDFS on its own. After it runs, jps shows that the NameNode, DataNode, and SecondaryNameNode processes have started; this suits scenarios that only use HDFS for storage and do not run MapReduce computations. The corresponding shutdown command is stop-dfs.sh.
Run start-mapred.sh to start the two MapReduce processes on their own; the shutdown command is stop-mapred.sh. You can also start MapReduce first and HDFS afterwards, which shows that the HDFS and MapReduce daemons are independent of each other and have no startup dependency.
Third, start each process individually:
# jps
14821 Jps
# hadoop-daemon.sh start namenode
# hadoop-daemon.sh start datanode
# hadoop-daemon.sh start secondarynamenode
# hadoop-daemon.sh start jobtracker
# hadoop-daemon.sh start tasktracker
# jps
14855 NameNode
14946 DataNode
15043 SecondaryNameNode
15196 TaskTracker
15115 JobTracker
15303 Jps
The command used here is hadoop-daemon.sh start [daemon name]. It is suitable for adding or removing nodes individually, as you will see when installing a cluster environment. It is also our verification method: use jps to see which daemons are running.
You can also view the NameNode through a browser at port 50070 of the host; the NameNode also runs a web server. Port 50030 shows the Map/Reduce (JobTracker) page.
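
With the hostname used in this walkthrough, the two pages would be the following (assuming the default Hadoop 1.x web UI ports):

http://zebra:50070          # NameNode / HDFS status
http://zebra:50030          # JobTracker / Map/Reduce status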

Resolving the warning: Warning: $HADOOP_HOME is deprecated.
Add the line export HADOOP_HOME_WARN_SUPPRESS=1 to /etc/profile; that single line is enough.

Common Hadoop startup errors:
When something goes wrong with Hadoop, first use the jps command to check whether the daemons you started are running, and then go look at the log files.
1. Incorrect hostname setting
Looking at the log, you will find an error like the following:
ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.UnknownHostException: Invalid hostname for server: master
This is caused by an incorrect hostname setting; check that the hostname used in the configuration files is correct.
First use the hostname command to check that the hostname is right;
then use more /etc/sysconfig/network to check that the hostname is recorded in that file;
finally, use more /etc/hosts to check that the IP-to-hostname mapping is set.
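
The three checks in one sketch, assuming the zebra/192.168.1.120 values from earlier in this walkthrough:

hostname                            # should print zebra
more /etc/sysconfig/network         # should contain HOSTNAME=zebra
more /etc/hosts                     # should contain a line like: 192.168.1.120 zebra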
2. Incorrect IP setting
Looking at the log, you will find an error like the following:
ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException:
Problem binding to zebra/192.168.1.100:9000: Cannot assign requested address
This is caused by an incorrect IP address; check that the host's actual IP matches the IP used in the configuration files.
Use more /etc/hosts to check whether the IP-to-hostname mapping is set correctly.
Also note that the hostname cannot start with a digit and cannot contain underscores;
a hostname containing an underscore will cause startup to fail as well.
After startup finishes, be sure to use jps to check whether all five daemons started successfully. If a daemon did not start, check the corresponding log file. The default log directory is $HADOOP_HOME/logs.

The log file names follow a pattern: hadoop-[user name]-[daemon name]-[hostname].log (we only look at the files ending in .log). For example, if the NameNode did not start, look at the hadoop-root-namenode-zebra.log file.
3. Formatting Hadoop multiple times
Symptom: hadoop-root-datanode-master.log contains errors like the following:
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
Incompatible namespaceIDs in
Cause: each namenode format creates a new namespaceID, but the directory configured by dfs.data.dir still holds the ID created by the previous format, so it no longer matches the ID in the directory configured by dfs.name.dir.
Formatting the NameNode clears the data under the NameNode but does not clear the data under the DataNodes, which makes startup fail. All that is needed is to empty the directory configured by dfs.data.dir before each format.
Then re-run the HDFS format command.
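
A sketch of the reformat procedure, assuming the pseudo-distributed setup above, where dfs.data.dir defaults to a subdirectory of hadoop.tmp.dir (/home/zebra/hadoop/tmp):

stop-all.sh                          # stop all daemons first
rm -rf /home/zebra/hadoop/tmp/*      # clears the old namespaceID held under dfs.data.dir
hadoop namenode -format              # reformat HDFS
start-all.sh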
4. The firewall was not shut down
If uploading a file from the local machine to HDFS throws an exception, the most likely cause is that the firewall was not shut down, so the DataNodes cannot communicate with the NameNode.
Use the service iptables status command to check the firewall, and service iptables stop to shut it down.
Even after it is stopped, the firewall may start again when the operating system reboots; disable that with chkconfig iptables off.
5. Errors caused by safe mode
The error looks like the following:
org.apache.hadoop.dfs.SafeModeException: Cannot delete ..., Name node is in safe mode
When the distributed file system starts, it first enters safe mode. While the file system is in safe mode, its contents cannot be modified or deleted until safe mode ends. Safe mode lets the system check the validity of the data blocks on each DataNode at startup and copy or delete blocks as needed according to policy. Safe mode can also be entered at runtime with a command. In practice, modifying or deleting files right after the system starts can also trigger the "not allowed in safe mode" error; usually you only need to wait a while.
If you are in a hurry, you can run hadoop dfsadmin -safemode leave to turn off safe mode.
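
A small sketch for checking and leaving safe mode (dfsadmin -safemode also accepts enter and wait):

hadoop dfsadmin -safemode get        # reports whether safe mode is ON or OFF
hadoop dfsadmin -safemode leave      # force the NameNode out of safe mode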
