Research on the Hadoop distributed computing platform and a three-server deployment

Reference articles


http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop1/index.html


http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop2/index.html


http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop3/


http://hi.baidu.com/zeorliu/blog/item/3633468235fce8a40cf4d23d.html





If you follow the developerWorks articles above, you can get Hadoop configured; I won't repeat that here. What follows are some notes from my own configuration process about the problems I ran into, for reference.





--------------20080819------------------


Installing Cygwin


http://bbs.wuyou.com/viewthread.php?tid=119296&extra=page%3D6
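The link above covers the Cygwin installer itself. As a rough sanity check (my own sketch, assuming the openssh package was selected during setup, which the pseudo-distributed steps below depend on), the ssh daemon can be configured and started from a Cygwin shell roughly like this; the exact prompts vary with the Cygwin version:

$ ssh -V                 # confirm openssh is installed
$ ssh-host-config -y     # generate host keys and register the sshd service
$ cygrunsrv -S sshd      # start the sshd service
$ ssh localhost date     # should connect (passwordless login is set up later, in the SSH key step)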








First run of Hadoop


$ cd /cygdrive/e/workspace/searchengine/hadoop/hadoop-0.18.0


$ mkdir test-in


$ cd test-in


# Create two text files in the test-in directory; the WordCount program will count how many times each word occurs


$ echo "Hello World Bye World" >file1.txt


$ echo "Hello Hadoop goodbye Hadoop" >file2.txt


$ cd ..


$ bin/hadoop jar hadoop-0.18.0-examples.jar wordcount test-in test-out


# When it finishes, check the results:


$ cd test-out


$ cat part-00000


Bye 1


Goodbye 1


Hadoop 2


Hello 2


World 2





---------------------20080822---------------------


Pseudo-distributed operation mode





This mode also runs on a single machine, but uses separate Java processes to simulate the various node roles of a distributed deployment (NameNode, DataNode, JobTracker, TaskTracker, Secondary NameNode). Note how these roles are split in a real distributed deployment:





From the perspective of distributed storage, the nodes in a cluster consist of one NameNode and several DataNodes, with a Secondary NameNode acting as a backup for the NameNode. From the perspective of distributed computation, the nodes consist of one JobTracker and several TaskTrackers; the JobTracker is responsible for task scheduling, while the TaskTrackers execute tasks in parallel. TaskTrackers must run on DataNodes so that computation stays close to the local data, whereas the JobTracker and NameNode do not need to be on the same machine.





(1) Modify conf/hadoop-site.xml as shown below. Note that conf/hadoop-default.xml holds Hadoop's default parameters; you can read it to see which parameters are available, but do not modify it. Change defaults by editing conf/hadoop-site.xml instead: values set in this file override the parameters of the same name in conf/hadoop-default.xml.





<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>





(2) Configure passwordless SSH as shown below:


$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa


$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
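A quick way to confirm the key setup worked (my own check, not from the original article) is to log in to localhost over SSH and make sure no password prompt appears:

$ ssh localhost
# should log in without a password prompt; if it still asks, check that
# ~/.ssh is mode 700 and ~/.ssh/authorized_keys is mode 600
$ exit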





$ cd /cygdrive/c/hadoop-0.16.0


$ bin/hadoop namenode -format





$ bin/start-all.sh


$ ps -ef





$ bin/hadoop dfs -put ./test-in input


# Copy the ./test-in directory from the local file system into the HDFS root directory, renaming it to input


# Run bin/hadoop dfs -help to see how to use the various HDFS commands.


$ bin/hadoop jar hadoop-0.18.0-examples.jar wordcount input output


# View the results:


# Copy the files from HDFS back to the local file system and view them there:


$ bin/hadoop dfs -get output output


$ cat output/*


# Or view them directly in HDFS:


$ bin/hadoop dfs -cat output/*


$ bin/stop-all.sh   # stop the Hadoop processes





Fault Diagnosis





(1) After $ bin/start-all.sh starts the Hadoop processes, 5 Java processes are running and five PID files are created in the /tmp directory to record their process IDs. From these files you can tell which Java process corresponds to the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker. When Hadoop seems to be misbehaving, first check whether these 5 Java processes are running correctly.
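For example, a quick check could look like the sketch below (the exact PID file names include the user name and depend on the Hadoop version, so treat the globs as indicative only; jps ships with the JDK):

$ ls /tmp/hadoop-*.pid
# expect one .pid file per daemon, e.g. hadoop-<user>-namenode.pid, hadoop-<user>-datanode.pid, ...
$ jps
# lists the running Java processes: NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker
$ kill -0 $(cat /tmp/hadoop-*-namenode.pid) && echo "namenode is alive"
# kill -0 only tests whether a process with that PID exists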





(2) Use the web interfaces. http://localhost:50030 shows the running state of the JobTracker, http://localhost:50060 shows the TaskTracker, and http://localhost:50070 shows the NameNode and the state of the whole distributed file system, where you can also browse files in the distributed file system and view the logs.
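If only a shell is available, the same checks can be scripted with curl against the default ports mentioned above (just an illustration, not from the original article):

$ for port in 50030 50060 50070; do
>   curl -s -o /dev/null -w "port $port -> HTTP %{http_code}\n" http://localhost:$port/
> done
# a 200 response on 50030 (JobTracker), 50060 (TaskTracker) and 50070 (NameNode/DFS)
# indicates the corresponding daemon's web interface is up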





(3) Check the log files in the ${HADOOP_HOME}/logs directory. The NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker each have their own log file, and every run of a compute task also produces a pair of application log files. Analyzing these logs helps locate the cause of a failure.
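A typical way to drill into the logs (a sketch; the file names also include the user and host name, so the globs below are only indicative):

$ ls ${HADOOP_HOME}/logs
# hadoop-<user>-namenode-<host>.log, hadoop-<user>-datanode-<host>.log, ...
$ tail -f ${HADOOP_HOME}/logs/hadoop-*-namenode-*.log          # follow the NameNode log
$ grep -i "error\|exception" ${HADOOP_HOME}/logs/*.log | less  # scan all daemon logs for problems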








---------------------20080825---------------------


Downloaded all the releases and settled on 0.18 as the version to study; JDK 1.6 needs to be downloaded for the build to compile and pass.
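Hadoop 0.18 is built with Apache Ant. A minimal build sketch, assuming ant is on the PATH and JAVA_HOME points at the JDK 1.6 install (the exact target names may differ slightly between releases):

$ export JAVA_HOME=/home/kevin/jdk1.6.0_10
$ cd hadoop-0.18.0
$ ant jar          # compile the core classes and build the core jar
$ ant examples     # build the examples jar used by the WordCount test above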








---------------------20080826---------------------------


Install IBM MapReduce Tools for Eclipse


1. Configure the Hadoop home directory; note that the *core.jar package must be present under this directory.


2. Configure a Run configuration to start the Hadoop server; the Hadoop home must be specified as a Cygwin-style directory path.





Test with 192.168.1.91~93: .91 is linux1, .92 is linux2, .93 is linux3.


linux1 logs in to linux2 and linux3 via trusted (passwordless) SSH to control their TaskTracker and DataNode processes, as sketched below.
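The trust relationship can be set up the same way as in the pseudo-distributed case, except that the public key is appended on each slave. A sketch, assuming the kevin account exists on all three machines:

# on linux1, generate a key once
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# append it to authorized_keys on each slave
$ for host in linux2 linux3; do
>   cat ~/.ssh/id_dsa.pub | ssh kevin@$host 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
> done
$ ssh linux2 hostname    # should print "linux2" without asking for a password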





Hadoop directory: /home/kevin/hadoop-0.18


JDK directory: /home/kevin/jdk1.6.0_10





>>>>> Configuration files


masters contains:


linux1





slaves contains:


linux2


linux3





hadoop-site.xml contains:





<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://linux1:9000/</value>
    <description>The name of the default file system. Either the literal string
    "local" or a host:port for DFS.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://linux1:9001/</value>
    <description>The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/kevin/hadoopfs/name</value>
    <description>Determines where on the local filesystem the DFS name node
    should store the name table. If this is a comma-delimited list of directories
    then the name table is replicated in all of the directories,
    for redundancy.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/kevin/hadoopfs/data</value>
    <description>Determines where on the local filesystem a DFS data node
    should store its blocks. If this is a comma-delimited list of directories,
    then data will be stored in all named directories, typically on different devices.
    Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication. The actual number of replications
    can be specified when the file is created. The default is used if replication
    is not specified at create time.</description>
  </property>
</configuration>
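With masters, slaves, and hadoop-site.xml in place, what remains is pointing Hadoop at the JDK and bringing the cluster up from linux1. A sketch using the directories listed above (the JAVA_HOME line goes into conf/hadoop-env.sh):

# in /home/kevin/hadoop-0.18/conf/hadoop-env.sh
export JAVA_HOME=/home/kevin/jdk1.6.0_10

# then, on linux1:
$ cd /home/kevin/hadoop-0.18
$ bin/hadoop namenode -format    # format the new HDFS once
$ bin/start-all.sh               # starts NameNode/JobTracker locally and, via the slaves file, DataNode/TaskTracker on linux2 and linux3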

