MapReduce: basic concepts and origin


1. What is MapReduce

MapReduce is a computational model, framework, and platform for the parallel processing of big data. The term carries three meanings:

1) MapReduce is a cluster-based, high-performance parallel computing platform (cluster infrastructure). It allows a distributed, parallel computing cluster of tens, hundreds, or even thousands of nodes to be built from ordinary commodity servers.

2) MapReduce is a software framework for parallel computing and job execution (software framework). It provides a large but well-designed parallel computing framework that automatically parallelizes computational tasks, automatically partitions the data and the computation, and automatically allocates and executes tasks on cluster nodes and collects the results. Many of the complex low-level system details involved in parallel computing, such as data distribution and storage, data communication, and fault-tolerant processing, are handled by the framework itself, greatly reducing the burden on software developers.

3) MapReduce is a parallel programming model and methodology (Programming Model & Methodology). Drawing on the design ideas of the functional programming language Lisp, it provides a simple parallel programming method: basic parallel computing tasks are expressed with the two functions Map and Reduce, and abstract operations and parallel programming interfaces are provided so that large-scale data programming and computation can be carried out simply and conveniently. A word-count sketch illustrating this model is shown below.
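To make the Map and Reduce abstraction concrete, the following is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API (Hadoop is introduced in the next section). It assumes a working Hadoop environment and input/output paths supplied on the command line: the map function emits a (word, 1) pair for every word it encounters, the framework groups the pairs by key, and the reduce function sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit a (word, 1) pair for every word in an input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path from the command line
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path from the command line
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job would typically be launched with hadoop jar wordcount.jar WordCount <input> <output>; the framework then takes care of input splitting, task scheduling, the shuffle between map and reduce, and fault tolerance, exactly as described in point 2) above.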

2. The origin of MapReduce

MapReduce was first proposed by Google as a parallel computing model and method for large-scale data processing. Google's original motivation for designing MapReduce was to parallelize the processing of large-scale web data in its search engine. After inventing MapReduce, Google first used it to rewrite the web document indexing system of its search engine. Because MapReduce is applicable to many kinds of large-scale data computation, Google has since applied it broadly to many other large-scale data processing problems. To date, thousands of different algorithmic problems and programs within Google have been handled with MapReduce.

In 2003 and 2004, Google published papers at international conferences on its distributed file system and on MapReduce, respectively, disclosing the basic principles and main design ideas of GFS and MapReduce. In 2004, Doug Cutting, founder of the open source projects Lucene (a search index library) and Nutch (a search engine), recognized that MapReduce was the technology needed to solve massive web data processing. Modeling it on Google's MapReduce, he designed and developed in Java an open source MapReduce parallel computing framework and system called Hadoop. Hadoop has since become one of the most important projects of the Apache open source organization; from its introduction it rapidly attracted widespread attention from academia and industry around the world and has been widely promoted and adopted.

The introduction of MapReduce has had a tremendous, even revolutionary, impact on big data parallel processing, making it the de facto standard for big data processing. Although MapReduce has many limitations, it is generally regarded as the most successful, most widely accepted, and easiest to use big data parallel processing technology to date. Its widespread adoption and influence have gone far beyond the original expectations of its inventors and the open source community. Jimmy Lin, a professor at the University of Maryland and author of "Data-Intensive Text Processing with MapReduce" (2010), argued that MapReduce has changed the way we organize computation at massive scale: it represents the first computational model distinct from the von Neumann architecture, the first major breakthrough in organizing large-scale computation at cluster scale rather than on a single machine, and the most successful abstraction based on large-scale computing resources seen so far.
