Hadoop, an Open-Source Framework for Distributed Computing: Introduction to Hadoop in Practice (II)

In fact, the official Hadoop documentation is enough to get a distributed environment configured and running, but since I am writing this up anyway, there are a few details worth noting; these are the details that can otherwise cost you half a day of fumbling. Hadoop can run in standalone mode or as a cluster. Standalone mode needs little explanation: just follow the instructions in the demo and execute the commands. What I mainly want to cover here is the process of configuring and running a cluster.

Environment

7 ordinary machines, all running Linux. Memory and CPU are not worth dwelling on; one of Hadoop's big selling points is that it favors many ordinary machines over a few powerful ones. The JDK must be 1.5 or later; remember this. The machine names of the 7 machines must all be different; as discussed below, machine names have a great impact on MapReduce.
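Before going further, it is worth quickly verifying these two points on every machine; a minimal check, assuming the JDK is already on the PATH:

# run on each of the 7 machines
java -version    # must report 1.5 or later
hostname         # must be unique across all 7 machines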

Deployment considerations

As described above, the roles in a Hadoop cluster fall into two broad categories: master and slave. The former hosts the Namenode and Jobtracker roles, responsible for managing the distributed data and for breaking down and scheduling tasks; the latter hosts the Datanode and Tasktracker roles, responsible for distributed data storage and task execution. I originally wanted to see whether a single machine could be configured as both master and slave, but found that the machine-name configuration conflicted between Namenode initialization and Tasktracker execution (Namenode and Tasktracker had conflicting requirements on the hosts configuration: whether the line mapping the machine name to its IP or the localhost line comes first caused problems; it may also just have been my own mistake, so feel free to give me feedback based on your own results). In the end I settled on one master and six slaves; later comparisons involving more complex application development and testing will add more machines to the configuration.
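For reference, the hosts configuration mentioned above looks roughly like the following on each node; the slave names and addresses here are purely illustrative, and the ordering of the machine-name lines versus the localhost line is exactly the detail that caused me trouble:

# /etc/hosts (illustrative machine names and addresses)
10.2.224.46    hadoop-master    # Namenode + Jobtracker
10.2.226.40    hadoop-slave1    # Datanode + Tasktracker
10.2.226.41    hadoop-slave2    # one line per slave, six in total
127.0.0.1      localhost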

Implementation steps

Create the same directory on all machines. You can also create the same user on every machine and use that user's home directory as the Hadoop installation path. For example, I created /home/wenchu on all the machines.
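A minimal sketch of this step, assuming root access and the user name wenchu used throughout this article:

# run on every machine in the cluster (master and all six slaves)
useradd -m -d /home/wenchu wenchu    # creates the user with /home/wenchu as its home directory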

Download Hadoop and extract it on the master first. The version I downloaded here is 0.17.1. At this point the Hadoop installation path is /home/wenchu/hadoop-0.17.1.
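For example, something like the following on the master; the mirror URL is only an assumption, use whichever Apache mirror you prefer:

# on the master, as the wenchu user
cd /home/wenchu
wget http://archive.apache.org/dist/hadoop/core/hadoop-0.17.1/hadoop-0.17.1.tar.gz
tar -xzf hadoop-0.17.1.tar.gz    # yields /home/wenchu/hadoop-0.17.1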

After extraction, go into the conf directory. The main files that need to be modified are: hadoop-env.sh, hadoop-site.xml, masters, and slaves.
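Of these, masters and slaves are just plain lists of machine names, and hadoop-env.sh mainly needs JAVA_HOME pointed at your JDK. A minimal sketch, with illustrative machine names and an illustrative JDK path:

# conf/hadoop-env.sh : point Hadoop at a JDK 1.5 or later
export JAVA_HOME=/usr/java/jdk1.6.0

# conf/masters : the machine name of the master (Namenode / Jobtracker)
hadoop-master

# conf/slaves : one machine name per line, the six slaves (Datanode / Tasktracker)
hadoop-slave1
hadoop-slave2
hadoop-slave3
hadoop-slave4
hadoop-slave5
hadoop-slave6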

The underlying configuration file for Hadoop is hadoop-default.xml. From Hadoop's code you can see that when a job is created, a Config is created by default; it first reads the hadoop-default.xml configuration and then reads the hadoop-site.xml configuration (this file is initially empty). hadoop-site.xml is mainly where you put the system-level settings you need to override from hadoop-default.xml, as well as the custom settings you need in your own MapReduce process (for specific usage such as final, refer to the documentation).

The following is a simple hadoop-site.xml configuration:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <!-- Your Namenode configuration: machine name (or IP) plus port -->
    <name>fs.default.name</name>
    <value>hdfs://10.2.224.46:54310/</value>
  </property>
  <property>
    <!-- Your Jobtracker configuration: machine name (or IP) plus port -->
    <name>mapred.job.tracker</name>
    <value>hdfs://10.2.224.46:54311/</value>
  </property>
  <property>
    <!-- Number of replicas the data needs; the default is 3 -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <!-- Hadoop's default temporary path. It is best to configure this: if a newly added
         node (or some other situation) leaves a Datanode inexplicably unable to start,
         delete the tmp directory configured here. However, if this directory is deleted
         on the Namenode machine, the Namenode format command needs to be executed again. -->
    <name>hadoop.tmp.dir</name>
    <value>/home/wenchu/hadoop/tmp/</value>
  </property>
  <property>
    <!-- Java virtual machine parameters for child tasks can be configured here -->
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <property>
    <!-- Block size in bytes; it will come up later. It must be a multiple of 512,
         because CRC is used for file integrity checking and the default 512 is the
         smallest checksum unit. -->
    <name>dfs.block.size</name>
    <value>5120000</value>
    <description>The default block size for new files.</description>
  </property>
</configuration>
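The comment on hadoop.tmp.dir above refers to re-running the Namenode format command. For reference, in this version that is roughly the following, run on the master as the Hadoop user (note that it wipes any existing HDFS metadata):

cd /home/wenchu/hadoop-0.17.1
bin/hadoop namenode -format    # re-initializes the Namenode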
