MapReduce: the Map and Reduce algorithm

Source: Internet
Author: User

MapReduce is a C++ programming tool developed by Google for parallel computation over large datasets (larger than 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, along with features borrowed from vector programming languages. [1]

In the current software implementation, the user specifies a Map function that maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function that merges all of the mapped key-value pairs sharing the same key.
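The key-value flow described above can be sketched in a few lines of single-process Python. This is only an illustration, not Google's C++ implementation: the names map_fn, reduce_fn and run_mapreduce are hypothetical, and word counting is simply a convenient example task.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: turn one (document name, text) pair into (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: merge every value that shares the same key."""
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Each reduce call sees exactly one key and all of its values.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


documents = [("doc1", "map reduce map"), ("doc2", "reduce the list")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

In a real deployment the grouping step would run across many machines; here it is a plain dictionary so that the data flow stays visible.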

Map and Reduce
Simply put, a map function applies a specified operation to every element of a conceptual list of independent elements (for example, a list of test scores). Continuing the example, if someone finds that every student's score was overstated by one point, they can define a "minus one" map function to correct the error. Each element is operated on independently, and the original list is not modified, because a new list is created to hold the new answers. This means the map operation can be highly parallel, which is very useful for applications with high performance requirements and for parallel computing.
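The "minus one" example might look like this in Python. It is a small sketch under assumed inputs: the score values and the name minus_one are made up, and a thread pool from the standard library stands in for a cluster of worker machines.

```python
from concurrent.futures import ThreadPoolExecutor


def minus_one(score):
    """Correct a single score; it depends on no other element."""
    return score - 1


scores = [91, 78, 85, 100]

# Each element can be handled by a different worker because the calls are
# independent. pool.map returns results in the order of the inputs.
with ThreadPoolExecutor(max_workers=4) as pool:
    corrected = list(pool.map(minus_one, scores))

print(corrected)  # the new list holding the corrected scores
print(scores)     # the original list, left unchanged
```

Because minus_one never reads or writes shared state, the work can be split across any number of workers without changing the result.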

A reduce operation merges the elements of a list in an appropriate way. Continuing the example, suppose someone wants to know the class average. They can define a reduce function that halves the list by adding each element to its neighbor, and apply it recursively until only one element remains; dividing that element by the number of students gives the average. Although reduce is not as parallel as map, it always produces a single simple answer and the large-scale computations are relatively independent, so reduce functions are also useful in highly parallel environments.
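The list-halving idea can be sketched as follows. The function names halve and reduce_to_sum and the sample scores are hypothetical; within one halving step the pairwise additions are independent of each other, which is where the parallelism would come from.

```python
def halve(values):
    """One reduce step: add each element to its neighbor, halving the list."""
    paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
    if len(values) % 2 == 1:
        paired.append(values[-1])  # an unpaired last element carries over
    return paired


def reduce_to_sum(values):
    """Apply halve recursively until a single element remains."""
    if not values:
        raise ValueError("cannot reduce an empty list")
    if len(values) == 1:
        return values[0]
    return reduce_to_sum(halve(values))


scores = [90, 77, 84, 99, 60]
total = reduce_to_sum(scores)
average = total / len(scores)  # divide the final element by the head count
print(total, average)
```

A list of n elements is reduced in roughly log2(n) halving steps, instead of the n - 1 sequential additions a simple running sum would need.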

Distribution and reliability
MapReduce achieves reliability by distributing large-scale operations on the dataset to the nodes of the network; each node periodically reports back its completed work and status updates. If a node stays silent for longer than a preset interval, the master node (similar to the master server in the Google File System) marks that node as dead and reassigns the data allocated to it to other nodes. Each operation uses atomic operations on named files to ensure that parallel threads do not conflict; when files are renamed, the system may also copy them under a name other than the task name (to avoid side effects).
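The timeout-and-reassign behavior can be sketched roughly as below. Everything here is an assumption made for illustration: the Master class, its method names, the round-robin reassignment policy and the explicit now arguments (used instead of a real clock so the example is deterministic) do not come from Google's implementation.

```python
class Master:
    """Hypothetical master that tracks node reports and reassigns work."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}    # node name -> time of its last report
        self.assignments = {}  # node name -> list of data chunks
        self.dead = set()

    def assign(self, node, chunk, now):
        self.last_seen.setdefault(node, now)
        self.assignments.setdefault(node, []).append(chunk)

    def heartbeat(self, node, now):
        """Record a status report from a node that is still considered alive."""
        if node not in self.dead:
            self.last_seen[node] = now

    def check(self, now):
        """Mark silent nodes as dead and hand their chunks to live nodes."""
        newly_dead = [n for n, seen in self.last_seen.items()
                      if n not in self.dead and now - seen > self.timeout_seconds]
        self.dead.update(newly_dead)
        live = sorted(n for n in self.last_seen if n not in self.dead)
        for node in newly_dead:
            orphaned = self.assignments.pop(node, [])
            if orphaned and not live:
                raise RuntimeError("no live node left to take over the work")
            for i, chunk in enumerate(orphaned):
                # Assumed policy: spread orphaned chunks round-robin.
                self.assignments.setdefault(live[i % len(live)], []).append(chunk)
        return newly_dead


master = Master(timeout_seconds=10)
master.assign("node-a", "chunk-1", now=0)
master.assign("node-b", "chunk-2", now=0)
master.heartbeat("node-a", now=8)    # node-a reports in; node-b stays silent
newly_dead = master.check(now=12)
print(newly_dead, master.assignments)
```

Since the data assigned to a dead node is simply handed to another node, the failed work is redone from its input rather than recovered from the failed machine.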

Reduce operations work in much the same way, but because reduce parallelizes less well, the master node tries to schedule a reduce operation on a single node, or on a node as close as possible to the data it operates on. This property suits Google's needs, because they have enough bandwidth and their internal network does not have that many machines.

Use
At Google, MapReduce is used in a very wide range of applications, including "distributed grep, distributed sort, web link-graph reversal, term vector per host, web access log analysis, inverted index construction, document clustering, machine learning, statistical machine translation ..." Notably, once MapReduce was implemented, it was used to regenerate Google's entire index, replacing the old ad hoc programs that updated the index.

MapReduce generates a large number of temporary files; to improve efficiency, it uses the Google File System to manage and access them.

This article is from a CSDN blog. When reproducing it, please credit the source: http://blog.csdn.net/kevin_long/archive/2007/11/08/1872841.aspx
