MapReduce is a high-performance distributed computing framework for batch processing, designed for the parallel analysis of massive data sets. Unlike traditional data warehousing and analysis techniques, MapReduce can handle many kinds of data: structured, semi-structured, and unstructured. The data volumes run to terabytes and petabytes, a scale at which traditional methods often fail. MapReduce divides an analysis job into two kinds of tasks: a large number of parallel Map tasks, and Reduce tasks that aggregate their output. The Map tasks run across many servers; the largest cluster currently deployed has 4,000 servers.
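The Map/Reduce split described above can be illustrated with a minimal in-memory word-count sketch. This is not Hadoop code; it is a toy model in which ordinary Python functions stand in for the Map tasks, the framework-managed shuffle, and the Reduce tasks, and the input "splits" are just short strings.

```python
from collections import defaultdict

def map_phase(document):
    """Map task: emit (word, 1) pairs for each word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Shuffle: group intermediate values by key across all Map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_pairs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce task: aggregate all values collected for one key."""
    return key, sum(values)

# Each split would be handled by a separate Map task running in parallel.
splits = ["big data big compute", "big clusters"]
mapped = [map_phase(s) for s in splits]   # parallel Map tasks
grouped = shuffle(mapped)                 # framework-managed shuffle
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'big': 3, 'data': 1, 'compute': 1, 'clusters': 1}
```

On a real cluster the shuffle moves data over the network between nodes, and many Reduce tasks run in parallel, each owning a partition of the key space; the dataflow, however, is exactly the three stages shown here.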
Workloads suited to MapReduce
Complex data: business data does not fit the rows and columns of a relational database. Data may arrive in many formats: multimedia, images, text, real-time streams, sensor readings, and so on, and new formats appear as new data sources come online. MapReduce can store and analyze raw data in all of these formats.
Large-scale data: many companies simply discard valuable data because storing it costs too much, and new data sources make the problem worse, with new systems and users generating more data than ever before. Hadoop's architecture uses low-cost commodity servers to store and process massive amounts of data.
New methods of analysis: analyzing massive, complex data requires new methodologies, such as natural language processing and pattern recognition. Hadoop's architecture makes it straightforward and efficient to apply such new algorithms to massive data sets.
Core advantages of the MapReduce framework:
1. High scalability: compute nodes can be added or removed dynamically, enabling truly elastic computation.
2. High fault tolerance: automatic task migration, retries, and speculative execution keep jobs running despite compute-node failures.
3. Fair scheduling: supports priorities and task preemption, balances long and short jobs, and effectively supports interactive tasks.
4. Locality-aware scheduling: tasks are dispatched to the nodes closest to their data, substantially reducing network bandwidth usage.
5. Dynamic, flexible resource allocation and scheduling maximize utilization, so compute nodes are neither idle nor overloaded; resource quota management is also supported.
6. Proven in many production environments; the largest deployment runs on 4,000 compute nodes.
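The fault-tolerance behavior in point 2 can be sketched as a simple retry loop. This is a toy model, not the framework's actual scheduler: the hypothetical `flaky_map_task` deliberately fails on its first two attempts to simulate node failures, and the runner simply re-executes it, the way a real cluster would reschedule a failed task on another node.

```python
def run_with_retries(task, max_attempts=4):
    """Toy sketch of MapReduce-style fault tolerance: a failed task is
    simply re-executed (on a real cluster, rescheduled to another node)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError:
            continue  # the framework would reschedule the task elsewhere
    raise RuntimeError("task failed on every attempt")

attempt_log = []

def flaky_map_task():
    # Hypothetical task that fails on its first two attempts
    # (simulating two node failures) and then succeeds.
    attempt_log.append("attempt")
    if len(attempt_log) < 3:
        raise RuntimeError("simulated node failure")
    return "partition-0 done"

result, attempts = run_with_retries(flaky_map_task)
print(result, attempts)  # partition-0 done 3
```

Because Map and Reduce tasks are deterministic and side-effect-free with respect to the framework, re-running a task is always safe; this is the property that makes blind retry (and speculative execution of slow tasks) a sound recovery strategy.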