The map_reduce mechanism of Ceilometer
Map/Reduce is an aggregation tool. For example, SQL's GROUP BY, COUNT, and DISTINCT, and MongoDB's group command are all aggregate operations.
Map/Reduce is actually a software framework that implements the idea of distributed computing: as long as your upper-layer code follows the framework's specifications, you get distributed computing, and all the partial results are aggregated into a single simple result. Applications written on top of Map/Reduce can run on clusters of thousands of servers and process data in parallel in a reliable, fault-tolerant manner.
The specific process is as follows:
Map/Reduce divides a task into many subtasks that can be processed in parallel. These subtasks are assigned to different servers for parallel computation; when every server has finished, the results are aggregated into a final result.
You define a map function that processes a key/value pair and generates a batch of intermediate key/value pairs, then define a reduce function that merges all intermediate values sharing the same key.
To put it simply, Map maps one group of data to another, element by element, with the mapping rule specified by a function: for example, mapping [1, 2, 3, 4] with "multiply by 2" yields [2, 4, 6, 8]. Reduce folds a group of data into a single value, with the folding rule specified by a function: for example, summing [1, 2, 3, 4] gives 10, while taking the product gives 24.
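These two operations can be reproduced directly with Python's built-in `map` and `functools.reduce`; this is a minimal sketch of the idea, not Ceilometer code:

```python
from functools import reduce

data = [1, 2, 3, 4]

# Map: apply "multiply by 2" to every element independently.
doubled = list(map(lambda x: x * 2, data))
print(doubled)  # [2, 4, 6, 8]

# Reduce: fold the list into one value with a combining function.
print(reduce(lambda a, b: a + b, data))  # 10 (sum)
print(reduce(lambda a, b: a * b, data))  # 24 (product)
```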
A Map operation processes each element independently: it generates a new data set while the original data stays unchanged, so it is highly parallel. Reduce is not as parallel as Map, but it always produces a comparatively simple result, and large-scale Reduce operations are relatively independent of one another, so Reduce also lends itself to parallel execution.
MapReduce jobs process key/value pairs. The framework converts each input record into a key/value pair, and each pair is fed to a Map task. A Map task outputs a set of key/value pairs: the input is a single key/value pair, but the output can be many pairs. The framework then groups and sorts the Map output by key, and calls the Reduce method once for each key together with its set of associated values. The Reduce method can output any number of key/value pairs, which are written to output files under the job's output directory. If the Reduce output keys are the same as the Reduce input keys, the final output remains sorted.
This framework provides two processes to manage MapReduce jobs:
- TaskTracker manages and executes individual Map and Reduce tasks on the compute nodes in the cluster.
- JobTracker accepts job submissions, provides job monitoring and control, manages the job, and assigns tasks to TaskTracker nodes.
Generally, each cluster has one JobTracker process, and each node in the cluster has one or more TaskTracker processes. JobTracker is a key module: if a TaskTracker runs into a problem, JobTracker schedules the work onto another TaskTracker process to retry it.
The Map/Reduce algorithm consists of the following steps:
1. Partition
Divide the data into N parts.
2. Map
Besides dividing the data, the code that computes on the data must also be mapped (distributed) to each compute node for concurrent execution. The N nodes each execute their own task and then return their results.
3. Partition (again)
The N execution results now need to be merged, so the data is partitioned once more.
4. Reduce
The Reduce code and the Reduce data are distributed to M nodes for execution, and each node returns its data when done. If another Reduce pass is needed, it can be run again; eventually Reduce yields a single aggregated result.
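The four steps above can be sketched in a few lines of Python. This is a toy in-process illustration using a thread pool, not a real distributed framework; the names `partition`, `map_task`, and `reduce_task` are invented for the sketch:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition(data, n):
    """Steps 1/3: split the data into at most n chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_task(chunk):
    """Step 2: each "node" squares the elements of its own chunk."""
    return [x * x for x in chunk]

def reduce_task(partials):
    """Step 4: merge the partial results into one total."""
    return reduce(lambda a, b: a + b, (sum(p) for p in partials))

data = list(range(1, 11))
chunks = partition(data, 4)                      # 4 simulated "nodes"
with ThreadPoolExecutor() as pool:               # run map tasks concurrently
    partials = list(pool.map(map_task, chunks))
print(reduce_task(partials))  # 385, the sum of squares of 1..10
```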
In fact, the code we need to write contains only two methods: a map method that says how to process each piece of data, and a reduce method that says how to merge the pieces. The framework sorts the output of the map step and feeds the result to the reduce tasks.
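As an illustration, here is a hypothetical in-process word-count "framework" in which only `map_fn` and `reduce_fn` are user code; the group-and-sort step inside `run_job` stands in for what a real MapReduce framework does between the two phases:

```python
from collections import defaultdict

def map_fn(line):
    """User code: turn one input record into (key, value) pairs."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """User code: merge all the values that share one key."""
    return (word, sum(counts))

def run_job(records):
    """Stand-in for the framework: map, group-and-sort by key, reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return [reduce_fn(key, grouped[key]) for key in sorted(grouped)]

print(run_job(["a b a", "b c"]))  # [('a', 2), ('b', 2), ('c', 1)]
```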
Reference diagram:
Summary:
The concept of map/reduce is very simple, which also means it can be implemented in any language. Google's map/reduce is famous not because it contains many clever ideas, but because it sums up distributed computing in a very simple model.
For any distributed computation, the core tasks are: 1. dividing the work, and 2. merging the data. If a task cannot be divided, no distributed framework will help. For example, in clustering computations on very large matrices, if the algorithm itself cannot be partitioned, it cannot be distributed at all. Division is therefore the most important issue in all distributed problems.