Understanding MapReduce Philosophy

Source: Internet
Author: User

Google engineers define MapReduce as a general "data processing process." For a long time I was unable to fully grasp what that means: why can MapReduce be called "general"?

Recently I have been studying Spark. Setting aside Spark's core in-memory computation, I care here only about what Spark does. All work in Spark revolves around datasets: creating new datasets, transforming datasets, and acting on datasets. For real-world data processing, these operations seem to be enough, enough to form a general data-processing flow. Indeed, Spark treats the dataset as the object of every operation, regardless of the type of the elements inside it: a simple idea!
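The dataset-centric style described above can be sketched in a few lines of plain Python. This is a toy illustration, not the actual Spark API (though PySpark's RDD methods look superficially similar); the `Dataset` class and its methods are hypothetical names chosen for this sketch.

```python
# Toy sketch of the dataset-centric style: every operation either
# transforms a dataset into a new dataset, or performs an action on it.
class Dataset:
    def __init__(self, items):
        self.items = list(items)           # create a new dataset

    def map(self, fn):                     # transformation: dataset -> dataset
        return Dataset(fn(x) for x in self.items)

    def filter(self, pred):                # transformation: dataset -> dataset
        return Dataset(x for x in self.items if pred(x))

    def collect(self):                     # action: dataset -> plain values
        return self.items

nums = Dataset([1, 2, 3, 4, 5])
result = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 1).collect()
# result == [1, 9, 25]
```

The point of the sketch is that the element type never matters to the framework; only the dataset-in, dataset-out shape of each operation does.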

What about MapReduce, then? Should MapReduce be abandoned? Hadoop's MapReduce framework is indeed criticized for its inefficiency in real-time queries built on Hadoop. What I would say is that this is not a problem of MapReduce itself, nor entirely a problem of Hadoop's MapReduce framework; more important is the improper use of MapReduce by systems such as Hive. MapReduce could innocently reply: "I am only responsible for a single round of map-reduce processing; you should carefully consider where the data comes from before that round and where it goes afterward."

Now consider MapReduce's philosophy. Real-world data is diverse, and before it enters an information system we cannot determine which data will be useful or useless for our query or analysis tasks; we can only store everything we collect in its most raw form. This is where MapReduce comes into its own.

The first step, Map: classify the data, attaching to each record a label that identifies which topic it belongs to, namely the key or part of the key. After the map phase, useless data has been filtered out, heterogeneous data has been given a uniform representation, and records are grouped by topic. If you then want to query or analyze the data of a particular topic, you can fetch one or more groups of data by topic.

The second step, Reduce: reduce the data, executing the query or analysis on the selected groups and outputting the results. The reduce phase can do a great deal, including recursively launching a new MapReduce process. As far as possible, do not return from the reduce phase to the user until the final query or analysis result has been produced.

Look at what Hive does: it translates a single SQL query command into multiple sequential MapReduce processes. Could all of the work not be done in one MapReduce process? Hive's failure is to treat MapReduce as a tool rather than a guiding philosophy: secular!
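The two steps above can be sketched as a single-round MapReduce in plain Python. This is a minimal illustration under my own assumptions, not Hadoop's API; `map_reduce`, `mapper`, and `reducer` are hypothetical names, and the word-count example is chosen only because it makes the map-then-group-then-reduce shape obvious.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """One round of map-reduce: label, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # Map: attach a topic key to each record
            groups[key].append(value)       # group records by topic (the shuffle)
    # Reduce: run the analysis on each topic's group of values
    return {key: reducer(key, values) for key, values in groups.items()}

# Example: counting words in log lines
logs = ["error disk", "ok", "error net"]

def mapper(line):
    for word in line.split():               # useless records could be filtered here
        yield word, 1                       # the word is the topic key

def reducer(key, values):
    return sum(values)                      # the "analysis" on one topic's group

counts = map_reduce(logs, mapper, reducer)
# counts == {"error": 2, "disk": 1, "ok": 1, "net": 1}
```

Note that the framework itself only labels, groups, and hands each group to the reducer; everything specific to the task lives in the mapper and reducer, which is exactly what makes the process "general."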

MapReduce and Spark are not mutually exclusive; they may well be combined. My personal idea is to use Spark inside the reduce phase of MapReduce to accomplish tasks that require multiple iterations over a dataset to produce a result, such as SQL queries.
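To make the idea concrete, here is a small sketch, under my own assumptions, of a reduce phase that performs iterative, multi-pass work over its group of values, the kind of job the author suggests delegating to a Spark-like engine. The function name and the gradient-descent example are hypothetical and chosen only to show repeated passes over one group's data.

```python
def iterative_reduce(values, lr=0.1, tol=1e-9, max_iter=10_000):
    """A reducer that needs many passes over its group: find the point
    minimizing the sum of squared distances to the values, by gradient
    descent. (The exact answer is simply the mean of the values.)"""
    estimate = 0.0
    for _ in range(max_iter):               # each iteration is a full pass over the group
        grad = sum(2 * (estimate - v) for v in values) / len(values)
        new_estimate = estimate - lr * grad
        if abs(new_estimate - estimate) < tol:
            break                           # converged
        estimate = new_estimate
    return estimate

center = iterative_reduce([1.0, 2.0, 3.0, 6.0])
# center ≈ 3.0, the mean of the values
```

A single-pass reducer cannot express this kind of computation, which is why an iterative engine such as Spark fits naturally inside the reduce phase.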
