MapReduce: basic concepts and origin


1. What is MapReduce

MapReduce is a computational model, framework, and platform for the parallel processing of big data. The term carries three meanings:

1) MapReduce is a cluster-based, high-performance parallel computing platform (cluster infrastructure). It allows a distributed, parallel computing cluster of tens, hundreds, or even thousands of nodes to be built from ordinary commodity servers.

2) MapReduce is a software framework for parallel computing and job execution (software framework). It provides a large but well-designed parallel computing framework that automatically parallelizes computational tasks, automatically partitions the data and the computation, and automatically allocates and executes tasks on cluster nodes and collects the results. Many of the complex low-level system details involved in parallel computing, such as data distribution and storage, data communication, and fault-tolerant processing, are handled by the framework itself, greatly reducing the burden on software developers.

3) MapReduce is a parallel programming model and methodology (Programming Model & Methodology). Drawing on the design ideas of the functional programming language Lisp, it provides a simple parallel programming method: basic parallel computing tasks are expressed with the two functions Map and Reduce, and abstract operations and parallel programming interfaces are provided so that large-scale data programming and computation can be carried out simply and conveniently. A word-count sketch illustrating this model is shown below.
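To make the Map and Reduce abstraction concrete, the following is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API (Hadoop is introduced in the next section). It assumes a working Hadoop environment and input/output paths supplied on the command line: the map function emits a (word, 1) pair for every word it encounters, the framework groups the pairs by key, and the reduce function sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit a (word, 1) pair for every word in an input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path from the command line
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path from the command line
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job would typically be launched with hadoop jar wordcount.jar WordCount <input> <output>; the framework then takes care of input splitting, task scheduling, the shuffle between map and reduce, and fault tolerance, exactly as described in point 2) above.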

2. The origin of MapReduce

MapReduce was first proposed by Google as a parallel computing model and method for large-scale data processing. Google's original motivation for designing MapReduce was to parallelize the processing of large-scale web data in its search engine. After inventing MapReduce, Google first used it to rewrite the web document indexing system of its search engine. Because MapReduce is applicable to many kinds of large-scale data computation, Google has since applied it broadly to many other large-scale data processing problems. To date, thousands of different algorithmic problems and programs within Google have been handled with MapReduce.

In 2003 and 2004, Google published papers at international conferences on its distributed file system and on MapReduce, respectively, disclosing the basic principles and main design ideas of GFS and MapReduce. In 2004, Doug Cutting, founder of the open source projects Lucene (a search index library) and Nutch (a search engine), recognized that MapReduce was the technology needed to solve massive web data processing. Modeling it on Google's MapReduce, he designed and developed in Java an open source MapReduce parallel computing framework and system called Hadoop. Hadoop has since become one of the most important projects of the Apache open source organization; from its introduction it rapidly attracted widespread attention from academia and industry around the world and has been widely promoted and adopted.

The introduction of MapReduce has had a tremendous, even revolutionary, impact on big data parallel processing, making it the de facto standard for big data processing. Although MapReduce has many limitations, it is generally regarded as the most successful, most widely accepted, and easiest to use big data parallel processing technology to date. Its widespread adoption and influence have gone far beyond the original expectations of its inventors and the open source community. Jimmy Lin, a professor at the University of Maryland and author of "Data-Intensive Text Processing with MapReduce" (2010), argued that MapReduce has changed the way we organize computation at massive scale: it represents the first computational model distinct from the von Neumann architecture, the first major breakthrough in organizing large-scale computation at cluster scale rather than on a single machine, and the most successful abstraction based on large-scale computing resources seen so far.
