Research on the Hadoop distributed computing platform and a three-server deployment

Reference articles


http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop1/index.html


http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop2/index.html


http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop3/


http://hi.baidu.com/zeorliu/blog/item/3633468235fce8a40cf4d23d.html





If you follow the developerWorks articles above, you can get Hadoop configured; I won't repeat that here. What follows are some notes from my own configuration process about the problems I ran into, for reference.





--------------20080819------------------


Installing Cygwin


http://bbs.wuyou.com/viewthread.php?tid=119296&extra=page%3D6
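The link above covers the Cygwin installer itself. As a rough sanity check (my own sketch, assuming the openssh package was selected during setup, which the pseudo-distributed steps below depend on), the ssh daemon can be configured and started from a Cygwin shell roughly like this; the exact prompts vary with the Cygwin version:

$ ssh -V                 # confirm openssh is installed
$ ssh-host-config -y     # generate host keys and register the sshd service
$ cygrunsrv -S sshd      # start the sshd service
$ ssh localhost date     # should connect (passwordless login is set up later, in the SSH key step)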








First run of Hadoop


$ cd /cygdrive/e/workspace/searchengine/hadoop/hadoop-0.18.0


$ mkdir test-in


$ cd test-in


# Create two text files in the test-in directory; the WordCount program will count how many times each word occurs


$ echo "Hello World Bye World" >file1.txt


$ echo "Hello Hadoop goodbye Hadoop" >file2.txt


$ cd ..


$ bin/hadoop jar hadoop-0.18.0-examples.jar wordcount test-in test-out


# When it finishes, check the results:


$ cd test-out


$ cat part-00000


Bye 1


Goodbye 1


Hadoop 2


Hello 2


World 2





---------------------20080822---------------------


Pseudo-distributed operation mode





This mode also runs on a single machine, but uses separate Java processes to simulate the various node roles of a distributed deployment (NameNode, DataNode, JobTracker, TaskTracker, Secondary NameNode). Note how these roles are split in a real distributed deployment:





From the perspective of distributed storage, the nodes in a cluster consist of one NameNode and several DataNodes, with a Secondary NameNode acting as a backup for the NameNode. From the perspective of distributed computation, the nodes consist of one JobTracker and several TaskTrackers; the JobTracker is responsible for task scheduling, while the TaskTrackers execute tasks in parallel. TaskTrackers must run on DataNodes so that computation stays close to the local data, whereas the JobTracker and NameNode do not need to be on the same machine.





(1) Modify conf/hadoop-site.xml as shown below. Note that conf/hadoop-default.xml holds Hadoop's default parameters; you can read it to see which parameters are available, but do not modify it. Change defaults by editing conf/hadoop-site.xml instead: values set in this file override the parameters of the same name in conf/hadoop-default.xml.





<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>





(2) Configure passwordless SSH as shown below:


$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa


$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
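A quick way to confirm the key setup worked (my own check, not from the original article) is to log in to localhost over SSH and make sure no password prompt appears:

$ ssh localhost
# should log in without a password prompt; if it still asks, check that
# ~/.ssh is mode 700 and ~/.ssh/authorized_keys is mode 600
$ exit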





$ cd /cygdrive/c/hadoop-0.16.0


$ bin/hadoop namenode -format





$ bin/start-all.sh


$ ps -ef





$ bin/hadoop dfs -put ./test-in input


# Copy the ./test-in directory from the local file system into the HDFS root directory, renaming it to input


# Run bin/hadoop dfs -help to see how to use the various HDFS commands.


$ bin/hadoop jar hadoop-0.18.0-examples.jar wordcount input output


# View the results:


# Copy the files from HDFS back to the local file system and view them there:


$ bin/hadoop dfs -get output output


$ cat output/*


# Or view them directly in HDFS:


$ bin/hadoop dfs -cat output/*


$ bin/stop-all.sh   # stop the Hadoop processes





Fault Diagnosis





(1) After $ bin/start-all.sh starts the Hadoop processes, 5 Java processes are running and five PID files are created in the /tmp directory to record their process IDs. From these files you can tell which Java process corresponds to the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker. When Hadoop seems to be misbehaving, first check whether these 5 Java processes are running correctly.
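For example, a quick check could look like the sketch below (the exact PID file names include the user name and depend on the Hadoop version, so treat the globs as indicative only; jps ships with the JDK):

$ ls /tmp/hadoop-*.pid
# expect one .pid file per daemon, e.g. hadoop-<user>-namenode.pid, hadoop-<user>-datanode.pid, ...
$ jps
# lists the running Java processes: NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker
$ kill -0 $(cat /tmp/hadoop-*-namenode.pid) && echo "namenode is alive"
# kill -0 only tests whether a process with that PID exists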





(2) Use the web interfaces. http://localhost:50030 shows the running state of the JobTracker, http://localhost:50060 shows the TaskTracker, and http://localhost:50070 shows the NameNode and the state of the whole distributed file system, where you can also browse files in the distributed file system and view the logs.
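If only a shell is available, the same checks can be scripted with curl against the default ports mentioned above (just an illustration, not from the original article):

$ for port in 50030 50060 50070; do
>   curl -s -o /dev/null -w "port $port -> HTTP %{http_code}\n" http://localhost:$port/
> done
# a 200 response on 50030 (JobTracker), 50060 (TaskTracker) and 50070 (NameNode/DFS)
# indicates the corresponding daemon's web interface is up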





(3) Check the log files in the ${HADOOP_HOME}/logs directory. The NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker each have their own log file, and every run of a compute task also produces a pair of application log files. Analyzing these logs helps locate the cause of a failure.
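A typical way to drill into the logs (a sketch; the file names also include the user and host name, so the globs below are only indicative):

$ ls ${HADOOP_HOME}/logs
# hadoop-<user>-namenode-<host>.log, hadoop-<user>-datanode-<host>.log, ...
$ tail -f ${HADOOP_HOME}/logs/hadoop-*-namenode-*.log          # follow the NameNode log
$ grep -i "error\|exception" ${HADOOP_HOME}/logs/*.log | less  # scan all daemon logs for problems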








---------------------20080825---------------------


Downloaded all the releases and settled on 0.18 as the version to study; JDK 1.6 needs to be downloaded for the build to compile and pass.
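Hadoop 0.18 is built with Apache Ant. A minimal build sketch, assuming ant is on the PATH and JAVA_HOME points at the JDK 1.6 install (the exact target names may differ slightly between releases):

$ export JAVA_HOME=/home/kevin/jdk1.6.0_10
$ cd hadoop-0.18.0
$ ant jar          # compile the core classes and build the core jar
$ ant examples     # build the examples jar used by the WordCount test above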








---------------------20080826---------------------------


Install IBM MapReduce Tools for Eclipse


1. Configure the Hadoop home directory; note that the *core.jar package must be present under this directory.


2. Configure a Run configuration to start the Hadoop server; the Hadoop home must be specified as a Cygwin-style directory path.





Test with 192.168.1.91~93: .91 is linux1, .92 is linux2, .93 is linux3.


linux1 logs in to linux2 and linux3 via trusted (passwordless) SSH to control their TaskTracker and DataNode processes, as sketched below.
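The trust relationship can be set up the same way as in the pseudo-distributed case, except that the public key is appended on each slave. A sketch, assuming the kevin account exists on all three machines:

# on linux1, generate a key once
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# append it to authorized_keys on each slave
$ for host in linux2 linux3; do
>   cat ~/.ssh/id_dsa.pub | ssh kevin@$host 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
> done
$ ssh linux2 hostname    # should print "linux2" without asking for a password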





Hadoop directory: /home/kevin/hadoop-0.18


JDK directory: /home/kevin/jdk1.6.0_10





>>>>> Configuration files


masters contains:


linux1





slaves contains:


linux2


linux3





hadoop-site.xml contains:





<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://linux1:9000/</value>
    <description>The name of the default file system. Either the literal string
    "local" or a host:port for DFS.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://linux1:9001/</value>
    <description>The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/kevin/hadoopfs/name</value>
    <description>Determines where on the local filesystem the DFS name node
    should store the name table. If this is a comma-delimited list of directories
    then the name table is replicated in all of the directories,
    for redundancy.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/kevin/hadoopfs/data</value>
    <description>Determines where on the local filesystem a DFS data node
    should store its blocks. If this is a comma-delimited list of directories,
    then data will be stored in all named directories, typically on different devices.
    Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication. The actual number of replications
    can be specified when the file is created. The default is used if replication
    is not specified at create time.</description>
  </property>
</configuration>
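With masters, slaves, and hadoop-site.xml in place, what remains is pointing Hadoop at the JDK and bringing the cluster up from linux1. A sketch using the directories listed above (the JAVA_HOME line goes into conf/hadoop-env.sh):

# in /home/kevin/hadoop-0.18/conf/hadoop-env.sh
export JAVA_HOME=/home/kevin/jdk1.6.0_10

# then, on linux1:
$ cd /home/kevin/hadoop-0.18
$ bin/hadoop namenode -format    # format the new HDFS once
$ bin/start-all.sh               # starts NameNode/JobTracker locally and, via the slaves file, DataNode/TaskTracker on linux2 and linux3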

