MapReduce/Hadoop and related data processing technologies are very popular at the moment, so I would like to summarize the advantages of MapReduce here and briefly compare it with the traditional parallel computing model based on HPC clusters. This post also serves to organize the MapReduce knowledge I picked up a while ago.
Introduction
As the volume of Internet data keeps growing, the demands on data processing capability keep rising. When the computing workload exceeds what a single machine can handle, parallel computing is the natural solution. Before MapReduce appeared, there were already very mature parallel computing frameworks such as MPI, so why did Google still need MapReduce? What advantages does MapReduce have over traditional parallel computing frameworks? That question is the concern of this article.
The article starts with a comparison table of traditional parallel computing and MapReduce, and then analyzes the items one by one.
| | Traditional parallel computing | MapReduce |
| --- | --- | --- |
| Cluster architecture/fault tolerance | Shared (shared memory/shared storage); poor fault tolerance | Shared-nothing; good fault tolerance |
| Hardware/price/scalability | Blade servers, high-speed networks, SAN; expensive; poor scalability | Ordinary PCs; cheap; good scalability |
| Programming/learning difficulty | "What" + "how": hard | "What" only: simple |
| Applicable scenarios | Real-time, fine-grained, compute-intensive | Batch processing, non-real-time, data-intensive |
Cluster architecture/fault tolerance
In traditional parallel computing, the computing resources are usually presented as one logically unified computer. An HPC cluster built from multiple blades and a SAN still appears to the programmer as a single computer, just one with a large number of CPUs and enormous memory and disk capacity. Physically, however, the computing and storage resources are separated, and data must travel from storage nodes over a data bus or a high-speed network to the computing nodes. This is not a problem for compute-intensive workloads that touch little data, but for data-intensive workloads the I/O between computing nodes and storage nodes becomes the performance bottleneck of the whole system: the shared architecture forces centralized data placement, which creates the I/O transmission bottleneck. In addition, the coupling and dependencies among cluster components give such clusters poor fault tolerance.
In fact, when data is large it tends to exhibit locality, so storing and reading all of it through one central point is not the best practice. MapReduce is dedicated to large-scale data processing, so data locality was considered from the very start of its design, and the locality principle is used to divide and conquer the problem. A MapReduce cluster consists of ordinary PCs in a shared-nothing architecture. Before processing, the dataset is distributed across the nodes; during processing, each node reads and processes its locally stored data (map), and the intermediate results are combined and sorted locally before being distributed to the reduce nodes. This avoids shipping large volumes of data across the network and improves processing efficiency; the sketch below traces this map → shuffle/sort → reduce dataflow. Another advantage of the shared-nothing architecture is that, combined with the replication policy, the cluster gains good fault tolerance: the failure of some nodes does not affect the normal operation of the cluster.
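To make that dataflow concrete, here is a minimal single-process simulation in Java. It is only an illustrative sketch under stated assumptions, not Hadoop code: the inner lists stand in for input splits that a real cluster would store on different nodes, and an in-memory TreeMap plays the role of the shuffle/sort step.

```java
import java.util.*;

// A minimal single-process simulation of the map -> shuffle/sort -> reduce
// dataflow. In a real cluster, each "split" lives on a different node and
// map tasks run where the data is stored.
public class MapReduceFlow {
    public static void main(String[] args) {
        // Each inner list stands in for a locally stored input split.
        List<List<String>> splits = List.of(
            List.of("the quick brown fox", "the lazy dog"),
            List.of("the dog barks"));

        // Map phase: each split is processed "where it lives", emitting (word, 1).
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (List<String> split : splits)
            for (String line : split)
                for (String word : line.split("\\s+"))
                    mapped.add(Map.entry(word, 1));

        // Shuffle/sort: group the intermediate pairs by key before reducing.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // Reduce phase: aggregate the values for each key.
        grouped.forEach((word, counts) ->
            System.out.println(word + "\t" + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}
```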
Hardware/price/scalability
Traditional HPC clusters are built from high-end hardware and are very expensive. To improve the performance of an HPC cluster, vertical scaling is usually adopted: faster CPUs, more blades, more memory, larger disks. This kind of expansion cannot sustain long-term growth in computing demand (the ceiling is reached quickly) and upgrades are costly. Compared with MapReduce clusters, HPC clusters therefore have poor scalability.
A MapReduce cluster consists of ordinary PCs, which offer much better cost effectiveness, so for the same computing power a MapReduce cluster is far cheaper. In addition, the nodes in a MapReduce cluster are connected over Ethernet, which gives good horizontal scalability: processing capacity can be increased simply by adding PC nodes. Yahoo! has the world's largest Hadoop cluster, with more than 4,000 nodes (Google's MapReduce clusters are presumably even larger, but no concrete figures seem to have been published; if any reader knows them, please don't hesitate to share).
Programming/learning difficulty
Traditional parallel computing models share the same logic as the multithreading model. The biggest problem with this programming model is that program behavior is hard to control. To guarantee correct results, the programmer must carefully control access to shared resources with a battery of synchronization primitives such as mutexes, semaphores, and locks, which in turn introduce problems such as race conditions, starvation, and deadlock. With a traditional parallel computing model, the programmer must consider not only what to do (the "what": describing the problem in the parallel model) but also the details of program execution (the "how": the many synchronization and communication issues that arise while the program runs), which makes parallel programming very difficult. Existing programming interfaces such as MPI, OpenCL, and CUDA are fairly low-level abstractions, leaving a great many execution details for the programmer to handle; the sketch below shows how much "how" even a trivial shared counter demands.
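As a minimal illustration of that burden (plain Java, a hypothetical example of my own, not from any particular framework), even incrementing a shared counter from two threads forces the programmer to manage locking by hand:

```java
import java.util.concurrent.locks.ReentrantLock;

// The "how" burden of shared-memory parallelism: correctness of even a
// trivial shared counter hinges on hand-written synchronization.
public class SharedCounter {
    private long count = 0;
    private final ReentrantLock lock = new ReentrantLock();

    void increment() {
        lock.lock();       // forget this, and updates are silently lost
        try {
            count++;       // read-modify-write on shared state
        } finally {
            lock.unlock(); // forget this, and other threads block forever
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SharedCounter c = new SharedCounter();
        Runnable work = () -> { for (int i = 0; i < 1_000_000; i++) c.increment(); };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(c.count); // 2000000 only because of the lock
    }
}
```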
MapReduce goes a step further: it is not only a programming model but also a runtime environment that executes MapReduce programs, and the many details of parallel execution, such as distribution, merging, synchronization, and monitoring, are all handled by the execution framework. When using MapReduce, the programmer only needs to consider how to describe the problem in the MapReduce model (the "what") and does not need to worry about how the program gets executed, which makes MapReduce easy to learn and use.
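The classic word-count example, essentially as it appears in the Hadoop tutorial, shows how little the programmer actually writes: just the per-record map logic and the per-key reduce logic (the job-submission boilerplate is omitted here). Splitting, shuffling, sorting, and fault tolerance are all the framework's business.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: for each input line, emit (word, 1) for every word in it.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts the framework has grouped under each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values)
                sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }
}
```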
Applicable scenarios
After so many kind words about MapReduce, is it a silver bullet?
The answer is no. One should never forget the original design goal of MapReduce: to solve large-scale, non-real-time data processing problems. "Large-scale" implies that the data has locality that can be exploited (it can be partitioned) and that it can be processed in batches. "Non-real-time" means the acceptable response time is long, leaving ample time for the program to run. For example, the following operations:
1. Updating search engine rankings (running the PageRank algorithm over the entire web graph; a rough sketch of one iteration follows this list)
2. Recommendation computation (recommendation results need not be updated in real time, so a periodic update is scheduled at fixed points in time)
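To suggest how such a batch job fits the model, here is a sketch of one PageRank iteration in plain Java, not a real Hadoop job; the toy graph and the damping factor 0.85 are illustrative assumptions of mine. The map step divides each page's current rank among its out-links; the reduce step sums the contributions arriving at each page.

```java
import java.util.*;

// One PageRank iteration expressed in map/reduce terms (single-process sketch).
public class PageRankStep {
    public static void main(String[] args) {
        // Toy web graph: page -> out-links (hypothetical data for illustration).
        Map<String, List<String>> links = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("C"),
            "C", List.of("A"));
        Map<String, Double> rank = new HashMap<>(Map.of("A", 1.0, "B", 1.0, "C", 1.0));

        // Map phase: emit (target, rank/outDegree) for every out-link.
        Map<String, List<Double>> contributions = new HashMap<>();
        links.forEach((page, outs) -> {
            double share = rank.get(page) / outs.size();
            for (String target : outs)
                contributions.computeIfAbsent(target, k -> new ArrayList<>()).add(share);
        });

        // Reduce phase: newRank = (1 - d) + d * sum(contributions).
        double d = 0.85;
        contributions.forEach((page, shares) -> rank.put(page,
            (1 - d) + d * shares.stream().mapToDouble(Double::doubleValue).sum()));

        // A production job would iterate this step over the whole web graph
        // until the ranks converge.
        System.out.println(rank);
    }
}
```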
The emergence of MapReduce had its background: with the development of the Web, especially of SNS and the IoT, the amount of data generated by users and sensors on the Web has exploded. Merely stored, data is dead data; only after analysis and processing can the information it contains be extracted, and knowledge be distilled from that information. Data is therefore important, and the capability to process data is equally important. Traditional HPC-cluster-based parallel computing could no longer meet the rapidly growing demand for data processing, and so MapReduce, built on ordinary PCs, was born: low-cost, high-performance, scalable, and reliable.