Google engineers define MapReduce as a general "data processing process." For a long time I could not fully grasp the true meaning of MapReduce: why can MapReduce be called "general"?
Recently I have been studying Spark. Setting aside Spark's core in-memory computation, here I only care about what Spark does. All work in Spark revolves around datasets: creating new datasets, transforming datasets, and performing actions on datasets. For practical data-processing applications, these operations seem sufficient, enough to form a general data-processing workflow. Indeed, Spark treats the dataset as the object of every operation, regardless of what kind of data the dataset holds. A simple idea!
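The create/transform/act pattern described above can be sketched in plain Python. This is a hypothetical, single-machine stand-in for the dataset model, not actual Spark code; the function names `create_dataset`, `transform`, and `act` are illustrative inventions:

```python
from functools import reduce

# Hypothetical single-machine stand-in for Spark's dataset model:
# create a dataset, transform it into a new dataset, then act on it
# to produce a concrete result.
def create_dataset(records):
    return list(records)               # creation: load raw records

def transform(dataset, fn):
    return [fn(r) for r in dataset]    # transformation: dataset -> new dataset

def act(dataset, fn, init):
    return reduce(fn, dataset, init)   # action: dataset -> concrete value

data = create_dataset([1, 2, 3, 4])
squared = transform(data, lambda x: x * x)     # [1, 4, 9, 16]
total = act(squared, lambda a, b: a + b, 0)    # 1 + 4 + 9 + 16 = 30
```

In real Spark the same shape appears as `sc.parallelize(...)` for creation, `map`/`filter` as transformations, and `reduce`/`collect` as actions; the point is that every step takes a dataset and yields either a new dataset or a final value.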
What about MapReduce? Should MapReduce be abandoned? Hadoop's MapReduce framework has also been criticized for the inefficiency of real-time queries built on Hadoop. What I would like to say is that this is not a problem with MapReduce itself, nor entirely a problem with Hadoop's MapReduce framework; more importantly, it is a problem of improper use of MapReduce by tools such as Hive. MapReduce could innocently reply: "I am only responsible for a single round of the map-reduce process; you should carefully consider where the data for that process comes from and where its results go."
Now consider MapReduce's philosophy. Real-world data is diverse, and before it enters an information system we cannot determine which data will be useful or useless for our query or analysis tasks; we can only store all the data we can collect, in its most original form. Then comes the moment for MapReduce to show its power. The first step, Map: classify the data, attaching to each record a label that identifies which topic the record belongs to, as the key or part of the key. After the map process, useless data has been filtered out, heterogeneous data is represented uniformly, and the data is grouped by topic. If you then want to query or analyze the data for a particular topic, you can take one or more groups of data by topic. The second step, Reduce: the data is reduced; the query or analysis action is executed on the selected data, and the query or analysis results are output. The reduce process can do many things, including recursively launching a new MapReduce process. As far as possible, do not return from the reduce process to the user until the final query or analysis result has been produced. Look at what Hive does: Hive translates a single SQL query into multiple sequential MapReduce processes. Can't all of that work be done in one MapReduce process? Hive's failure is to treat MapReduce as a tool rather than as a guiding philosophy. How mundane!
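The single round described above (label by topic, filter, group, then analyze one group) can be sketched in plain Python. The record format and function names here are illustrative assumptions, not part of any real framework:

```python
from collections import defaultdict

# Minimal sketch of one MapReduce round over hypothetical "topic: payload"
# log records.
# Map: attach a topic key to each record, filter useless data, and
# represent the remaining records uniformly as (key, value) pairs.
def map_phase(records):
    for rec in records:
        topic, sep, payload = rec.partition(":")
        if sep and payload.strip():    # filter malformed / useless records
            yield topic.strip(), payload.strip()

# Shuffle: group the mapped pairs by topic (key).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: run the query/analysis on one selected topic's group,
# e.g. count the records under that topic.
def reduce_phase(groups, topic):
    return len(groups.get(topic, []))

records = ["sales: order#1", "sales: order#2", "errors: timeout", "garbage"]
groups = shuffle(map_phase(records))
print(reduce_phase(groups, "sales"))   # -> 2 (the malformed record was filtered)
```

A real reduce step would of course do more than count, and, as the text notes, it could itself launch a further MapReduce round over its group before returning anything to the user.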
MapReduce and Spark are not mutually exclusive; they may combine well. My own idea is to use Spark inside the reduce step of a MapReduce process to accomplish tasks, such as SQL queries, that require multiple iterations over a dataset to produce a result.
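As a toy illustration of that combination, here is a plain-Python sketch (not real Spark) in which the reduce step hands its grouped data to an iterative routine, the role Spark would play, instead of chaining several full MapReduce rounds. The multi-pass "keep the smallest half" analysis is an invented placeholder for any computation that must iterate over an in-memory dataset:

```python
# Hypothetical iterative routine standing in for a Spark job: it makes
# several passes over an in-memory working set, refining it each time.
def iterative_job(values, rounds=3):
    result = values
    for _ in range(rounds):
        result = sorted(result)
        result = result[: max(1, len(result) // 2)]  # keep the smallest half
    return result

# The reduce step of an outer MapReduce process delegates the
# iteration-heavy work to the routine above instead of launching
# further MapReduce rounds.
def reduce_phase(groups, topic):
    return iterative_job(groups[topic])

groups = {"latency_ms": [120, 45, 300, 80, 15, 220, 60, 90]}
print(reduce_phase(groups, "latency_ms"))  # -> [15]
```

The design point is only that iteration stays inside one reduce invocation, close to the data, rather than being unrolled into a chain of MapReduce processes the way Hive unrolls a SQL query.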