In fact, the official Hadoop documentation already makes it fairly easy to set up the distributed framework's runtime environment. Since I am writing this anyway, I will add a bit more and call out some details, because it is exactly these details that can cost you a long time of trial and error. Hadoop can run standalone on a single machine, or it can be configured to run as a cluster. For the single-machine case there is not much to say: just follow the running instructions of the demo and execute the commands directly. Here I will focus on configuring and running a cluster.
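For reference, the standalone demo run mentioned above looks roughly like the example in the official quickstart. This is only a sketch; the jar name and regular expression are illustrative and depend on your Hadoop version:

```
# Standalone quickstart sketch, run from the unpacked Hadoop directory
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-0.17.1-examples.jar grep input output 'dfs[a-z.]+'
cat output/*
```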
Environment
Seven ordinary machines, all running Linux. Memory and CPU are not worth dwelling on; one of Hadoop's main characteristics is that it favors many machines over powerful ones. The JDK must be 1.5 or above; keep that in mind. The machine names of the seven machines must all be different; later I will talk about the impact of machine names on MapReduce.
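A minimal sanity check you can run on each of the seven machines (nothing Hadoop-specific is assumed here):

```
java -version    # must report 1.5 or later
hostname         # must be unique across the cluster
```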
Deployment considerations
As described above, a Hadoop cluster can be divided into two kinds of roles: master and slave. The former is configured with the NameNode and JobTracker roles and is responsible for managing the distributed data and dispatching the decomposed tasks; the latter is configured with the DataNode and TaskTracker roles and is responsible for distributed data storage and task execution. I originally intended to check whether a single machine could be configured as the master and also serve as a slave, but I found that the machine-name configuration conflicts during NameNode initialization and TaskTracker execution (NameNode and TaskTracker have some conflicts over the hosts configuration; whether the IP corresponding to the machine name or the IP corresponding to localhost should be put first is a bit of a problem, though it may also just be my own mistake, so feel free to give me feedback based on your own deployments). In the end I settled on one master and six slaves; more machines will be added later to compare results for more complex application development and testing.
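As a hedged illustration of the hosts issue mentioned above, every machine in the cluster ends up with an /etc/hosts along these lines. The host names and IP addresses here are made up, and the ordering question the paragraph above raises (machine-name entry versus localhost entry first) is exactly the part you may need to experiment with:

```
127.0.0.1    localhost
10.0.0.1     hadoop-master    # NameNode + JobTracker
10.0.0.2     hadoop-slave1    # DataNode + TaskTracker
10.0.0.3     hadoop-slave2
# ... one line per remaining slave, replicated on every machine in the cluster
```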
Procedure
After the steps above, the Hadoop distributed environment is up. Then, on the master, enter the Hadoop installation directory and execute hadoop jar hadoop-0.17.1-examples.jar wordcount <input path> <output path> to see the word count statistics. Both the input and output paths here refer to paths in HDFS, so you first need to create an input path in HDFS by copying a directory from the local file system into HDFS:
hadoop dfs -copyFromLocal /home/wenchu/test-in test-in. Here /home/wenchu/test-in is the local path and test-in is the path that will be created in HDFS. After the command completes, you can see via hadoop dfs -ls that the test-in directory now exists, and you can view its contents via hadoop dfs -ls test-in. The output path must not yet exist in HDFS. After the demo has run, you can view the contents of the output path via hadoop dfs -ls <output path>, and the contents of a specific file via hadoop dfs -cat <file name>.
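Putting the whole workflow together, it looks roughly like the sketch below. The input path is the author's example; test-out is a made-up output path, and part-00000 is simply the typical name of a reduce output file:

```
hadoop dfs -copyFromLocal /home/wenchu/test-in test-in             # copy the local directory into HDFS
hadoop dfs -ls                                                     # test-in should now show up
hadoop dfs -ls test-in                                             # inspect its contents
hadoop jar hadoop-0.17.1-examples.jar wordcount test-in test-out   # test-out must not exist yet
hadoop dfs -ls test-out                                            # list the output files
hadoop dfs -cat test-out/part-00000                                # view one result file
```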
Lessons learned and things to watch out for (these are detours that cost me some time during use):
- The conf configuration files on the master and slaves do not need to be fully synchronized. If you are sure you will always start and stop the cluster from the master, the configuration on the slave machines does not need to be maintained. However, if you want to be able to start or stop Hadoop from any machine, the configuration needs to be kept consistent everywhere.
- The /etc/hosts file on the master and slave machines must list every machine in the cluster, even if IP addresses are used in all the configuration files. I suffered quite a bit here: I assumed that if I configured IP addresses I would not need to configure the hosts, but the result was that jobs always got stuck during the reduce phase, unable to proceed during the copy step and retrying endlessly. Also, if two machines in the cluster have the same machine name, problems will occur as well.
- If a problem occurs when adding or removing a node, first delete the slave's hadoop.tmp.dir, then restart and try again. If that still does not work, simply delete the master's hadoop.tmp.dir as well (which means the data on DFS will also be lost). If you delete the master's hadoop.tmp.dir, you then need to re-run namenode -format (a sketch of these steps follows this list).
- Configure the number of map tasks and reduce tasks. As mentioned in the earlier discussion of the distributed file system design, when a file is put into the distributed file system it is divided into multiple blocks that are placed on the datanodes. By default dfs.block.size is 64 MB, which means that if the data you place on HDFS is smaller than 64 MB, there will be only one block, placed on a single datanode. You can run hadoop dfsadmin -report to see how data is stored on each node, or go directly to a datanode and look at the directory hadoop.tmp.dir/dfs/data/current to see the blocks. The number of blocks directly affects the number of maps. Of course, you can also configure the number of map and reduce tasks. By default the number of maps is the same as the number of HDFS blocks to be processed; you can also set the number of maps or a minimum split size, in which case the actual split size is max(min(block_size, data/#maps), min_split_size). The number of reduces can be estimated with the formula 0.95 * num_nodes * mapred.tasktracker.tasks.maximum (a worked example follows this list).
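As a worked example of the formulas above, with all numbers assumed purely for illustration rather than taken from the author's cluster:

```
# Assumptions: dfs.block.size = 64 MB (default), input data ~ 200 MB,
#              6 slave nodes, mapred.tasktracker.tasks.maximum = 2
#
# Maps:    200 MB / 64 MB -> 4 blocks -> about 4 map tasks by default
# Reduces: 0.95 * num_nodes * mapred.tasktracker.tasks.maximum = 0.95 * 6 * 2 ~ 11

hadoop dfsadmin -report                  # see how the blocks are spread over the datanodes
ls <hadoop.tmp.dir>/dfs/data/current     # or look directly on a datanode (substitute your hadoop.tmp.dir)
```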
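The node-reset steps from the list above, as a hedged sketch; <slave> and <hadoop.tmp.dir> are placeholders for your own machine name and the directory configured in hadoop-site.xml:

```
# From the master: stop the cluster, clear the problematic slave's hadoop.tmp.dir, then restart
bin/stop-all.sh
ssh <slave> 'rm -rf <hadoop.tmp.dir>'
bin/start-all.sh

# If the master's hadoop.tmp.dir has to be deleted as well, all data on DFS is lost
# and the namenode must be re-formatted before restarting:
bin/hadoop namenode -format
```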
In general, whenever a problem occurs, or when starting the cluster, it is best to check the logs so you know what is going on.
Summary of commands in hadoop
This part can be understood from the commands' own help output; here I will mainly introduce the few commands I use. The hadoop dfs command, followed by a parameter, performs HDFS operations in a way that is quite similar to Linux commands, for example (a few concrete invocations follow the list below):
- hadoop dfs -ls lists the contents of a directory such as /usr/root; by default, if no path is specified, the current user's path is used;
- hadoop dfs -rmr xxx deletes a directory; many of the commands are easy to pick up;
- hadoop dfsadmin -report gives you a global view of the datanodes;
- hadoop job followed by parameters operates on the currently running jobs, for example list and kill;
- hadoop balancer is the command, mentioned earlier, for balancing disk load across the datanodes.
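A few concrete invocations of the commands above; "test-out" is just an example directory name and the job id is made up for illustration:

```
hadoop dfs -ls                           # list the current user's HDFS directory
hadoop dfs -rmr test-out                 # remove a directory
hadoop dfsadmin -report                  # global status of the datanodes
hadoop job -list                         # list currently running jobs
hadoop job -kill job_200807181717_0001   # kill a job by its id
hadoop balancer                          # rebalance blocks across the datanodes
```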
I will not go into the other commands in detail.
Related reading: Introduction to the Open-Source Distributed Computing Framework Hadoop (Part 1).
About the author: Chen Wenchu is an architect in the platform division of the R&D center of Alibaba Software. His current work involves the design and implementation of the Alibaba Software development platform's service framework (ASF) and of the Service Integration Platform (SIP). He claims no particular specialty; the only thing his work has improved so far is his ability and speed of learning. Personal blog: http://blog.csdn.net/cenwenchu79.
(Source: InfoQ)