Advantages and Disadvantages of the MapReduce Distributed Processing Framework
Source: Internet
Author: User
Keywords: distributed processing, pros and cons
Google's data centers must process enormous volumes of data, such as the many web pages fetched by web crawlers. Because much of this data is at the petabyte (PB) scale, processing has to be parallelized as far as possible, and Google introduced the MapReduce distributed processing framework to address this problem.
Technology Overview
MapReduce is derived from functional programming languages and processes large-scale data sets in parallel through two main steps, "Map" and "Reduce". First, Map applies a specified operation to each element of a logical list made up of many independent elements. The original list is not modified; instead, multiple new lists are created to hold the results of the Map step. This also means that the Map operation is highly parallelizable. When Map has finished, the system shuffles (regroups) and sorts the newly generated lists and then applies Reduce to them, that is, it merges the elements of each list appropriately according to their key.
[Figure: the operating mechanism of MapReduce]
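In code, the three phases described above can be sketched roughly as follows. This is a minimal single-process illustration in Python, not Google's implementation: the function names (map_phase, shuffle_phase, reduce_phase) and the word-count task are assumptions chosen for the example, and a real framework would run the map and reduce calls on many machines.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter


def map_phase(documents):
    # Map: apply the same action to every element independently.
    # The input list is never modified; each element yields a new list of pairs.
    return [[(word, 1) for word in doc.split()] for doc in documents]


def shuffle_phase(mapped_lists):
    # Shuffle: flatten the per-element lists, sort by key,
    # and group all values that share a key.
    pairs = sorted(
        (pair for pairs_of_doc in mapped_lists for pair in pairs_of_doc),
        key=itemgetter(0),
    )
    return [
        (key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def reduce_phase(grouped):
    # Reduce: merge the values of each key into a single result.
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped}


documents = ["map then reduce", "reduce after map", "shuffle between map and reduce"]
counts = reduce_phase(shuffle_phase(map_phase(documents)))
print(counts)
```

Because each call inside map_phase touches only its own element, the calls could in principle run on different machines at the same time, which is the property the paragraph above refers to.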
The following MapReduce example may help make this clearer. A search engine crawler (spider) fetches a massive number of web pages from the Internet into a local distributed file system. The indexing system then runs Map in parallel over the pages stored in the distributed file system, generating many key-value pairs in which the key is the URL and the value is the HTML page. Next, the system shuffles these newly generated key-value pairs, and finally it uses the Reduce operation to merge the key-value pairs that share the same key (that is, the URL).
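The crawler example might look like the sketch below. The record layout (URL, fetch time, HTML) and the merge rule (keep the most recently fetched copy of each URL) are assumptions made for illustration; the article does not say how pages with the same URL are merged, and the URLs and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical crawl records: (url, fetch_time, html).
crawled = [
    ("http://example.com/a", 1, "<html>a v1</html>"),
    ("http://example.com/b", 1, "<html>b v1</html>"),
    ("http://example.com/a", 2, "<html>a v2</html>"),
]


def map_page(record):
    # Map: emit a key-value pair whose key is the URL
    # and whose value carries the HTML page.
    url, fetched_at, html = record
    return url, (fetched_at, html)


def shuffle(pairs):
    # Shuffle: bring together all values emitted for the same URL.
    groups = defaultdict(list)
    for url, value in pairs:
        groups[url].append(value)
    return groups


def reduce_pages(versions):
    # Reduce: merge the values for one URL.
    # Assumed rule: keep the HTML with the latest fetch time.
    _, html = max(versions, key=lambda version: version[0])
    return html


pairs = [map_page(record) for record in crawled]
merged = {url: reduce_pages(versions) for url, versions in shuffle(pairs).items()}
print(merged)
```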
Advantages and Disadvantages
MapReduce has two main advantages. First, as a distributed processing framework it not only handles large-scale data but also hides many tedious details, such as automatic parallelization, load balancing, and fault tolerance, which greatly simplifies the programmer's development work. Second, MapReduce scales very well: each server added to the cluster contributes almost all of its computing power, whereas most earlier distributed processing frameworks fall far short of MapReduce in scalability. The biggest disadvantage of MapReduce is that it is not suited to real-time applications. For this reason, in Caffeine, Google's latest real-time search engine, MapReduce's dominant position has been taken over by Percolator, a system capable of real-time processing. Its details will be covered in the next article in this series.
Related Products
Besides the MapReduce used internally at Google, there is Hadoop, an open-source implementation of MapReduce developed by a Yahoo team led by Doug Cutting, the creator of Lucene, and managed by Apache. It was warmly received by the industry as soon as it launched and has given rise to related products such as HDFS, ZooKeeper, HBase, Hive, and Pig.
Actual Use Cases
In real-world environments, the MapReduce distributed processing framework is commonly used for distributed grep, distributed sorting, web access log analysis, inverted index construction, document clustering, machine learning, data analysis, statistical machine translation, building the index for an entire search engine, and other large-scale data processing work. It has also been widely adopted inside many well-known Chinese Internet companies, such as Baidu and Taobao.
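As one illustration of these use cases, index construction can be expressed in the same map, shuffle, and reduce shape. The sketch below builds a small inverted index in a single Python process. The document IDs, sample texts, and function names are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def map_document(doc_id, text):
    # Map: emit one (word, doc_id) pair per distinct word in the document.
    return [(word, doc_id) for word in set(text.lower().split())]


def build_inverted_index(documents):
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word, emitted_id in map_document(doc_id, text):
            # Shuffle: collect every document ID emitted for the same word.
            postings[word].add(emitted_id)
    # Reduce: turn each word's set of IDs into a sorted posting list.
    return {word: sorted(doc_ids) for word, doc_ids in postings.items()}


documents = {
    "doc1": "Hadoop implements MapReduce",
    "doc2": "MapReduce scales out",
    "doc3": "Hadoop scales",
}
index = build_inverted_index(documents)
print(index["mapreduce"])
```

Looking up a word in the resulting dictionary returns the documents that contain it, which is the basic query a search index has to answer.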
Finally, if you are interested in MapReduce, you can download Hadoop from its official site and try it out.