I. Purpose
This test examines how the speed of distributed computing on a Hadoop cluster relates to the data size and to the number of computing nodes.
II. Environment
Hardware: Inspur NF5220.
System: CentOS 6.1
The master node runs directly on the master machine (CentOS) and is allocated 4 CPUs and 13 GB of memory.
The remaining three slave nodes are KVM virtual machines hosted on the master machine, also running CentOS 6.1. Hardware configuration of each slave: 1 GB of memory, 4 CPUs, and a hard disk with a capacity of GB.
III. Steps and test results
First, put the TXT file containing the original data (size MB) into HDFS, and configure the Hive environment for data query testing. Because the raw data is too small and we need to test files larger than GB, the original data is copied 10, 50, 100, 200, 300, 400, and 500 times to create correspondingly large data files. (For example, 100 copies of the original data are combined into one large data set. One of its fields is an ID running from 1 to 100, which identifies the copy of the original data that a row belongs to.)
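The copying step could be scripted roughly as follows. This is only a minimal sketch: the function name `replicate_with_id`, the tab separator, and the choice to put the ID in the first column are assumptions, not details from the original test.

```python
def replicate_with_id(src_path, dst_path, copies, sep="\t"):
    """Write `copies` copies of the source file into one big file.

    Every row is prefixed with a copy ID (1..copies), so each row can be
    traced back to the copy it came from. The column layout is an assumption.
    """
    # The original file is small, so holding its rows in memory is fine.
    with open(src_path, "r", encoding="utf-8") as src:
        rows = [line.rstrip("\n") for line in src if line.strip()]

    with open(dst_path, "w", encoding="utf-8") as dst:
        for copy_id in range(1, copies + 1):
            for row in rows:
                dst.write(f"{copy_id}{sep}{row}\n")

    # Number of rows written to the big file.
    return len(rows) * copies
```

Calling it once per size (10, 50, 100, and so on) gives one file per data set, which can then be uploaded to HDFS and loaded into Hive.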
Run the same HiveQL query against each of these data sets, record the query results for the different data sizes, and plot them in a chart. Then add one more slave computing node, run the same HiveQL query against the same data sets, and record the corresponding results. The resulting chart:
Red is the trend line for four computing nodes, and black is the trend line for three. The X axis is the task completion time, and the Y axis is the corresponding data size, that is, the number of copies of the original data. The chart shows that:
1. As the data set grows, the efficiency of MapReduce increases, but beyond a certain data size the efficiency decreases.
2. With four computing nodes, the efficiency of MapReduce increases significantly as the data volume grows, but it also reaches a peak.
3. For data of the same size, four computing nodes are more efficient than three.
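The measurement procedure (run the same query against each data set and record the completion time) could be automated along these lines. This is a sketch under assumptions: the table names and the query template are hypothetical, and the query is launched through the Hive command line with `hive -e`, which may not be how the original test was run.

```python
import subprocess
import time


def run_hive_query(query):
    # Assumption: the `hive` CLI is on PATH; `-e` executes a query string.
    subprocess.run(["hive", "-e", query], check=True)


def time_call(run):
    """Return the wall-clock seconds taken by calling `run()`."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def benchmark(tables, query_template, runner=run_hive_query):
    """Time the same query on every table.

    `tables` maps the number of copies to a (hypothetical) table name;
    the result maps the number of copies to the elapsed seconds.
    """
    results = {}
    for copies, table in tables.items():
        query = query_template.format(table=table)
        results[copies] = time_call(lambda: runner(query))
    return results
```

Running `benchmark` once with three computing nodes and once with four yields the two series that the chart compares.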
Of course, there is a large error in the data. Due to limited conditions, the configuration of the last added node differs from that of the previous three. For example, its hard disk is relatively large, so the HDFS space I formatted on it is also relatively large. This means that this node holds more blocks, and more mappers are started on it when MapReduce runs. However, because its CPU and other hardware were not upgraded accordingly, the performance of this node is dragged down. Therefore, adding this node does not produce a linear increase in speed. It is still always better than three nodes.
In addition, by analyzing how the MapReduce nodes were working, the relationship between the number of mappers per task and performance was also verified.
In Hadoop, a task is used as the scheduling time slice. At any time, as long as there are idle CPU cores and tasks waiting to be scheduled, the tasks can be allocated to the idle CPU cores. Each node is divided into N slots, where N equals the number of CPU cores of the node; that is, a node can run at most as many tasks concurrently as it has CPU cores.
Slots are a bit like a resource pool. Each task must obtain a slot in order to run, so the number of slots can be understood as the number of concurrent tasks. Slots are divided into mapper slots and reducer slots, which correspond to the maximum number of mappers and reducers that can be executed in parallel.
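The slot model described here can be illustrated with a small sketch. The class and function names are hypothetical and this is not Hadoop's scheduler; it only mirrors the rule that a node has as many slots as CPU cores and that a task needs a free slot to run.

```python
class SlotPool:
    """A node's slots viewed as a simple resource pool."""

    def __init__(self, cpu_cores):
        # One slot per CPU core, as described in the text.
        self.total = cpu_cores
        self.free = cpu_cores

    def acquire(self):
        """Take a slot for a task; return False if none is idle."""
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        """Give a slot back when its task finishes."""
        if self.free < self.total:
            self.free += 1


def max_concurrent_tasks(node_cores):
    """Upper bound on tasks running at once across all nodes."""
    return sum(node_cores)
```

With the 4-CPU nodes used in this test, `max_concurrent_tasks([4, 4, 4])` gives 12 for three computing nodes and `max_concurrent_tasks([4, 4, 4, 4])` gives 16 for four.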
The total number of mappers does not change, because the same data query corresponds to the same number of splits. At first, with mapper slots set to 4, 19 tasks are involved. With mapper slots set to 8, 10 tasks are involved.
19 tasks
10 tasks
In the case of 19 tasks, the first 18 tasks each fill all 4 mapper slots, and the 19th is not full. Likewise, when mapper slots is changed to 8, 10 tasks are required, and the tenth task is still not full. It can also be seen that although there are fewer tasks, each task runs longer: within a task, the mappers cannot all obtain CPU resources for parallel computing at the same time, so some of them have to wait for scheduling, and the overall efficiency is still not improved. Therefore, to improve efficiency, we need to increase the number of computing nodes and reduce the maximum number of parallel mappers per task (this number should equal the number of CPU cores available for parallel scheduling). This increases the number of tasks, distributes them evenly across the nodes, and lets all mappers compute in parallel.
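The task counts above follow from simple arithmetic, sketched below. The sketch reads "task" as a batch that holds up to as many mappers as there are mapper slots, which is one interpretation of the text; the function names are hypothetical.

```python
import math


def batches(total_mappers, slots):
    """Number of batches needed when each batch holds up to `slots` mappers."""
    return math.ceil(total_mappers / slots)


def last_batch_size(total_mappers, slots):
    """Number of mappers in the final batch."""
    remainder = total_mappers % slots
    return remainder if remainder else slots


def consistent_mapper_counts(observations, upper=200):
    """Mapper totals that match every (slots, batches) observation,
    given that the last batch was not full in each case."""
    return [
        m for m in range(1, upper + 1)
        if all(
            batches(m, s) == b and last_batch_size(m, s) < s
            for s, b in observations
        )
    ]
```

Under this reading, `consistent_mapper_counts([(4, 19), (8, 10)])` returns `[73, 74, 75]`: any of these totals gives 18 full batches plus a partial 19th with 4 slots, and 9 full batches plus a partial tenth with 8 slots.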