Hadoop White Paper: Introduction to the Distributed Computing Framework MapReduce

Source: Internet
Author: User

MapReduce is a high-performance distributed computing framework for batch processing, built for parallel analysis of massive data sets. Compared with traditional data warehousing and analysis techniques, MapReduce is suited to many types of data, including structured, semi-structured, and unstructured data. It targets data volumes at the terabyte (TB) and petabyte (PB) level, a scale at which traditional methods often cannot process the data at all. MapReduce divides an analysis job into two kinds of tasks: a large number of Map tasks that run in parallel across multiple servers, and a Reduce task that rolls up their results. The largest cluster currently deployed has 4,000 servers.

Processing tasks suited to MapReduce

Complex data: Business data does not fit neatly into the rows and columns of a database. Data may arrive in a variety of formats: multimedia, images, text, real-time data, sensor data, and so on. New data formats may appear whenever new data sources become available. Hadoop can store these raw formats as they are, and MapReduce can analyze them.

Large-scale data: Many companies simply discard a lot of valuable data because storing it costs too much. New data sources make the problem worse, as new systems and users bring in more data than ever before. Hadoop's innovative architecture uses low-cost commodity servers to store and process massive amounts of data.

New methods of analysis: Analyzing massive, complex data requires new methodologies. The new algorithms include natural language analysis, pattern recognition, and so on. Hadoop's architecture is designed to apply such algorithms to massive amounts of data easily and efficiently.

Core advantages of the MapReduce framework:

1. High scalability: compute nodes can be added or removed dynamically, giving truly elastic computing.

2. High fault tolerance: supports automatic task migration, retry, and speculative execution, so jobs are not affected by the failure of a compute node.

3. Fair scheduling: supports priorities and task preemption, balances long and short tasks, and effectively supports interactive tasks.

4. Locality-aware scheduling: tasks are scheduled on the node nearest to their data, which effectively reduces network bandwidth usage.

5. Dynamic, flexible resource allocation and scheduling: maximizes resource utilization so that compute nodes are neither idle nor overloaded, and supports resource quota management.

6. Proven in production: used and validated in a large number of real production environments, with the largest cluster reaching 4,000 compute nodes.
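The locality-aware scheduling in item 4 can be sketched as a greedy assignment. This is a simplified illustration under stated assumptions, not Hadoop's actual scheduler: the function assign_tasks, the block and node names, and the slot counts are all hypothetical. Each input block is given to a node that already holds a copy of it when that node has a free slot, and falls back to any free node otherwise.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each input block to a node, preferring nodes that hold the data.

    block_locations: dict of block id -> list of nodes storing a copy
    free_slots: dict of node -> number of free task slots
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is unchanged
    assignment = {}
    for block, replicas in block_locations.items():
        # First choice: a node that stores the block and has a free slot.
        local = [node for node in replicas if slots.get(node, 0) > 0]
        if local:
            node, is_local = max(local, key=lambda n: slots[n]), True
        else:
            # Fallback: any node with a free slot; the data must cross the network.
            candidates = [n for n, count in slots.items() if count > 0]
            if not candidates:
                raise RuntimeError("no free task slots left")
            node, is_local = max(candidates, key=lambda n: slots[n]), False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# Hypothetical cluster state for illustration only.
block_locations = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-a"],
    "block-3": ["node-c"],
}
free_slots = {"node-a": 2, "node-b": 1, "node-c": 0, "node-d": 1}
plan = assign_tasks(block_locations, free_slots)
```

In this example block-1 and block-2 run where their data lives, while block-3 has to run remotely because node-c has no free slot. Only remote assignments move input data over the network, which is where the bandwidth saving comes from.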
