MapReduce: A Major Step Backwards


This article was written by several database experts at The Database Column. It briefly introduces MapReduce, compares it with modern database management systems, and points out some of its shortcomings. This is purely a translation for learning purposes; it does not mean the translator fully agrees with the original. Please read it critically.

 

On January 8, a reader of The Database Column asked us about new distributed database research efforts. Here we offer our views on MapReduce. Now is a good time to discuss this issue, because the press has recently been filled with news of a new revolution called "cloud computing". This model requires a large number of (low-end) processors working in parallel to solve computing problems. In effect, it suggests building a data center out of a large number of low-end processors rather than a smaller number of high-end servers.

For example, IBM and Google have announced plans to make clusters of 1,000 processors available to a few universities so that students can learn to program them using a software tool called MapReduce. The University of California, Berkeley has even gone so far as to plan to teach its freshman programming course around the MapReduce framework. We are amazed at the hype MapReduce proponents have spread about how it enables the development of more scalable, data-intensive programs. MapReduce may be a good idea for certain types of general-purpose computations, but from the perspective of the database community, it is:

1. A giant step backwards in the programming paradigm for large-scale, data-intensive applications.
2. A sub-optimal implementation, in that it uses brute force instead of indexing.
3. Not novel at all: it implements a well-known technique developed 25 years ago.
4. Missing most of the features that are routinely included in current DBMSs.
5. Incompatible with all of the tools on which DBMS users currently depend.

First, we will briefly discuss what MapReduce is, and then we will examine each of the five points above in more depth.


What is MapReduce?

The basic idea of MapReduce is easy to understand. It consists of two user-written programs, called map and reduce, plus a framework that runs a large number of instances of each program on a computer cluster, assigning the various sub-tasks and then merging their results.

The map program reads a set of "records" from an input file, performs any desired filtering or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a split function partitions the records into M disjoint buckets by applying a function to the key of each record. The split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one per bucket.

In general, multiple instances of the map program run on different nodes of the computer cluster. The MapReduce scheduler gives each map instance a distinct portion of the input file to process independently. If N nodes participate in the map phase, then each of the N nodes stores M files on its local disk, for N * M files in total: F(i,j), 1 ≤ i ≤ N, 1 ≤ j ≤ M.

Note that every map instance must use the same hash function; this ensures that all output records with the same hash value are written to files with the same index.

In the second phase, the MapReduce framework executes M instances of the reduce program: R(j), 1 ≤ j ≤ M. The input for reduce instance R(j) consists of the files F(i,j), 1 ≤ i ≤ N. Note that all output records from the map phase that share the same hash value are processed by the same reduce instance, regardless of which map instance produced them. After the framework collects and organizes this data, the input records are grouped by key and fed to the reduce program. Like the map program, the reduce program can perform arbitrary computation over its input records; for example, it might compute additional values over the records' fields. Each reduce instance writes its records to an output file, which forms part of the overall result of the MapReduce computation.
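To make the mechanics concrete, here is a minimal single-process sketch of this model in Python. It is our own illustration, not Google's implementation: map_fn, reduce_fn, the value of M, and the word-count example are all invented for the purpose.

```python
from collections import defaultdict

M = 4  # number of buckets / reduce instances (illustrative value)

def map_fn(record):
    # Filter/transform one input record into (key, data) pairs;
    # here, a word-count map that emits (word, 1) for each word.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Arbitrary computation over all values that share a key.
    return (key, sum(values))

def run_mapreduce(records):
    # Map phase: each output record is assigned to one of M disjoint
    # buckets by a deterministic split (hash) function. Every map
    # instance must use the same function; Python's hash() is
    # deterministic within a single run, which suffices here.
    buckets = [defaultdict(list) for _ in range(M)]
    for record in records:
        for key, data in map_fn(record):
            buckets[hash(key) % M][key].append(data)

    # Reduce phase: reduce instance j reads bucket j; its records are
    # grouped by key before being handed to the reduce program.
    results = []
    for j in range(M):
        for key, values in buckets[j].items():
            results.append(reduce_fn(key, values))
    return results

print(run_mapreduce(["the quick brown fox", "the lazy dog the end"]))
```

In a real deployment, each bucket would be a file on a map node's local disk, and each iteration of the outer reduce loop would run as a separate reduce instance on its own node.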

Using SQL as an analogy, map is like the group-by clause of an aggregate query, and reduce is analogous to the aggregate function (for example, an average) computed over all the rows that share the same group-by attribute.
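The analogy can be seen in a few lines. The following sketch (our own example; the employees table and its columns are invented) computes the same result as the SQL query in the comment:

```python
from collections import defaultdict

# SQL analogy:  SELECT dept, AVG(salary) FROM employees GROUP BY dept;
employees = [("shoes", 10000), ("shoes", 12000), ("toys", 9000)]

groups = defaultdict(list)
for dept, salary in employees:
    groups[dept].append(salary)        # "map": the key is the group-by attribute

for dept, salaries in groups.items():  # "reduce": the aggregate per group
    print(dept, sum(salaries) / len(salaries))
```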

With this computing model in mind, we now turn to the five points raised above:

1. MapReduce is a step backwards in database access

As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons in the 40 years since IBM first released IMS in 1968:

* Schemas are good.
* Separation of the schema from the application is good.
* High-level access languages are good.

MapReduce has learned none of these lessons and represents a throwback to the 1960s, before modern database management systems were invented.

The most important lesson the DBMS community has learned about schemas is that the fields of records, and their data types, belong in the storage system. More importantly, the DBMS ensures at run time that all records conform to the schema. This is the best way to keep garbage out of a dataset. MapReduce has no such mechanism, and no controls to prevent garbage from being added to a dataset. A corrupted dataset can silently break every MapReduce program that reads it.
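As a hedged sketch of what such run-time enforcement looks like, consider the following illustration (entirely our own; the schema and records are invented). A DBMS-style loader rejects the malformed record at load time, whereas a schema-less MapReduce job would only discover the problem, if at all, deep inside someone's map function:

```python
# What run-time schema enforcement buys you (invented example).
schema = {"name": str, "dept": str, "salary": int}

def validate(record):
    # DBMS-style check at load time: reject any record that does not
    # conform to the schema, so garbage never enters the dataset.
    if set(record) != set(schema):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, ftype in schema.items():
        if not isinstance(record[field], ftype):
            raise ValueError(f"bad type for {field}: {record[field]!r}")
    return record

validate({"name": "Ann", "dept": "shoes", "salary": 10000})   # accepted
try:
    validate({"name": "Bob", "dept": "shoes", "salary": "n/a"})
except ValueError as err:
    print("rejected at load time:", err)
```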

Separating the schema from the application is also critical. A developer who wants to write a new program against a dataset must first understand the record structure. In a modern DBMS, the schema is stored in the system catalog, where any user can inspect it with SQL queries. By contrast, if the schema does not exist, or is buried inside the application, the developer must examine the source code to discover the record structure. This is not only tedious; the developer must first manage to locate that source code. Without a schema, this tedium is shared by every programmer who writes a MapReduce program.

In the 1970s, the DBMS community engaged in a "great debate" between advocates of relational databases and advocates of CODASYL (the Conference on Data Systems Languages). One of the key issues was how a program should access data in a DBMS:

* State what you want, as statements in a high-level language (the relational view)
* Provide an algorithm for data access (the CODASYL view)

The result of the debate is history: the market saw the value of high-level languages, and relational systems prevailed. Programs written in a high-level language are easier to write, easier to modify, and easier for a newcomer to understand. CODASYL was criticized as "the assembly language of DBMS access". A MapReduce programmer is somewhat like a CODASYL programmer, writing in a low-level language and performing low-level record manipulation. Nobody advocates a return to assembly language; similarly, nobody should be forced to program in MapReduce.

MapReduce advocates might counter that their datasets have no schema. Let us dismiss this assertion: in extracting a key from an input record, the map function relies on the existence of at least one data field in every input record; likewise, a reduce function computes its values from fields of the records it receives.

Writing MapReduce programs on top of Google's BigTable, or Hadoop's HBase, does not change the situation significantly. With a self-describing tuple format (row key, column name, value), different tuples within the same table can indeed have different schemas. However, BigTable and HBase provide no logical data independence, for example in the form of a view mechanism. Views significantly simplify program maintenance: when the logical schema changes, programs defined on views continue to run.

2. MapReduce is a poor implementation

All modern DBMSs use hash or B-tree indexes to accelerate data access. If a user is looking for a subset of a record set (for example, employees with a salary of 10,000, or employees in the shoe department), an index can efficiently narrow the search to just the relevant records. In addition, a query optimizer decides whether to use an index or to perform a brute-force sequential scan.

MapReduce has no indexes; brute force is its only processing option, regardless of whether an index would be the better access mechanism for the task at hand.
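The cost difference is easy to demonstrate. In the following sketch (our own illustration; the table and values are invented), a Python dict stands in for a hash index: the brute-force scan examines all one million records, while the indexed lookup touches only the matching ones:

```python
from collections import defaultdict

# One million invented employee records: (name, dept, salary).
employees = [("Ann", "shoes", 10000), ("Bob", "toys", 9000)] * 500_000

# Brute force (MapReduce's only option): examine every record.
shoes_scan = [e for e in employees if e[1] == "shoes"]

# Hash index (the DBMS option): built once, after which lookups touch
# only the matching records rather than the whole dataset.
index = defaultdict(list)
for e in employees:
    index[e[1]].append(e)              # index on the dept field

shoes_indexed = index["shoes"]         # cost ~ number of matches
assert shoes_scan == shoes_indexed
```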

One could argue that the value of MapReduce is that it automatically provides parallel execution on a cluster of computers. But this feature was explored by the DBMS research community in the 1980s, which built multiple prototypes, including Gamma, Bubba, and Grace. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.

In summary, high-performance, commercial, grid-oriented SQL engines (with schemas and indexes) have been available for the past 20 years. MapReduce does not fare well in comparison with these systems.

MapReduce also suffers from lower-level implementation problems, in particular with data interchange and data skew.

One factor that MapReduce proponents seem not to have noticed is data skew. As noted in "Parallel Database Systems: The Future of High Performance Database Systems", skew is a significant obstacle to building a parallel query system with good scalability. The problem arises in the map phase when there is wide variance in the number of records that share the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, so the running time of the whole computation is determined by the slowest reduce instance. The parallel database community has studied this problem extensively and has mature solutions that the MapReduce community might wish to adopt.
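The following small sketch (our own illustration, with an invented heavy-hitter key) shows how skew arises: hash partitioning balances keys, not records, so one reduce partition can receive the bulk of the data:

```python
from collections import Counter

M = 4  # reduce instances (illustrative)

# Invented skewed input: one heavy-hitter key dominates the data.
records = [("the", 1)] * 900_000 + \
          [(w, 1) for w in ["fox", "dog", "cat"] * 33_000]

# How many records does each reduce partition receive?
load = Counter(hash(key) % M for key, _ in records)
print(load)  # one partition holds ~900,000 records; the rest are small

# The job finishes only when the slowest reducer does:
print("slowest reducer handles", max(load.values()), "records")
```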

Another serious performance problem is glossed over by MapReduce proponents. Recall that each of the N map instances produces M output files, each destined for a different reduce instance. These files are written to the local disk of the node running the map instance. If N is 1,000 and M is 500, the map phase produces 500,000 local files. When the reduce phase begins, each of the 500 reduce instances must read its 1,000 input files, using an FTP-like protocol to pull each file from the node on which the corresponding map instance ran. With hundreds of reduce instances running simultaneously, it is inevitable that two or more of them will attempt to read their input files from the same map node at the same time, inducing large numbers of disk seeks and cutting the effective disk transfer rate by a factor of 20 or more. This is why parallel database systems do not materialize their split files, and push data to sockets instead of pulling it. Because MapReduce obtains its excellent fault tolerance precisely by materializing these split files, it is not clear whether the framework could be successfully modified to use the push model.
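The file-count arithmetic from this paragraph is easy to verify (illustrative values only):

```python
N, M = 1000, 500   # map instances, reduce instances
print(N * M)       # 500000 files materialized in the map phase
print(M * N)       # 500000 pull transfers in the reduce phase,
                   # each a potential seek on some map node's disk.
# In the worst case, many of the M reducers pull from the same map
# node at once, turning sequential writes into seek-bound reads.
```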

Pending careful experimental evaluation, we seriously doubt that MapReduce will perform well in large-scale applications. Would-be users of MapReduce would also do well to study the last 25 years of research literature on parallel DBMSs.

3. MapReduce is not novel

The MapReduce community seems to feel that it has discovered an entirely new paradigm for processing large datasets. In actuality, the techniques employed by MapReduce are more than 20 years old. The idea of partitioning a large dataset into smaller partitions was first proposed by Kitsuregawa in "Application of Hash to Data Base Machine and Its Architecture" as the basis for a new type of join algorithm. In "Multiprocessor Hash-Based Join Algorithms", Gerber demonstrated how Kitsuregawa's techniques could be extended to execute joins in parallel on a shared-nothing cluster, using a combination of partitioned tables, partitioned execution, and hash-based splitting. DeWitt showed how these techniques could be adopted to execute parallel aggregates, with and without group-by clauses. DeWitt and Gray described parallel database systems and how they process queries. Shatdal and Naughton explored alternative strategies for processing aggregates in parallel.
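For readers unfamiliar with hash-based splitting, the following sketch (our own illustration; the tables are invented) shows the core idea behind these join algorithms: both inputs are partitioned on the join key with the same hash function, so matching tuples land in the same partition, and each partition can then be joined independently, for example on a different node:

```python
from collections import defaultdict

P = 4  # number of partitions (illustrative)
emp  = [("Ann", "shoes"), ("Bob", "toys"), ("Cal", "shoes")]
dept = [("shoes", "Bldg 1"), ("toys", "Bldg 2")]

# Split phase: partition both inputs on the join key (dept name)
# using the same hash function.
emp_parts, dept_parts = defaultdict(list), defaultdict(list)
for name, d in emp:
    emp_parts[hash(d) % P].append((name, d))
for d, loc in dept:
    dept_parts[hash(d) % P].append((d, loc))

# Join phase: each partition can be joined locally, in parallel.
for p in range(P):
    lookup = dict(dept_parts[p])
    for name, d in emp_parts[p]:
        print(name, d, lookup[d])
```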

Teradata has been selling a commercial DBMS built on these techniques for more than 20 years; these are exactly the techniques the MapReduce team claims to have invented.

No doubt MapReduce advocates will claim that the ability to write map and reduce functions is what differentiates their software from a parallel SQL implementation. We would remind them that POSTGRES supported user-defined functions and user-defined aggregates in the mid-1980s, and that essentially all modern database systems have offered similar functionality since roughly 1995.

4. MapReduce is missing features

All of the following features are routinely provided by modern DBMSs, and all are absent from MapReduce:

* Bulk loader -- to transform input data into the desired format and load it into the database
* Indexing -- as discussed above
* Updates -- to change data in the dataset
* Transactions -- to support concurrent updates and recovery from failures during updates
* Integrity constraints -- to keep garbage out of the dataset
* Referential integrity -- foreign keys that, again, keep garbage out of the dataset
* Views -- so the underlying logical schema can change without programs having to be rewritten

In short, MapReduce provides only a small fraction of the functionality found in modern DBMSs.

5. MapReduce is incompatible with existing DBMS tools

A modern SQL DBMS has the following tools available:

* Report writers -- (e.g., Crystal Reports) that present data to people
* Business intelligence tools -- (e.g., Business Objects or Cognos) that allow ad-hoc queries against data warehouses
* Data mining tools -- (e.g., Oracle Data Mining) that let users discover patterns in large datasets
* Replication tools -- that let users replicate data from one database to another
* Database design tools -- that help users construct databases

MapReduce can use none of these tools, and it has none of its own. Until it becomes SQL-compatible, or until someone builds equivalent tools, MapReduce will remain very difficult to use in end-to-end tasks.

 

Summary

It is exciting to see a much larger community engaged in designing and implementing scalable query processing techniques. We assert, however, that this community should not overlook the lessons of more than 40 years of database technology: in particular, the many advantages that a data model, physical and logical data independence, and a declarative query language such as SQL bring to the design, implementation, and maintenance of application programs. Moreover, computer science communities should not operate in isolation; they should read one another's literature. We encourage everyone to study the parallel DBMS literature of the past 25 years. Finally, before MapReduce can match a modern DBMS, many missing features and necessary tools will have to be added.

We fully understand that database systems are not without problems of their own. The database community recognizes that database systems are too "hard" to use and is working to remedy this. The database community can also learn something valuable from the excellent fault tolerance that MapReduce provides its applications. Finally, we note that some database researchers have begun to explore using the MapReduce framework as the basis for building scalable database systems; the Pig project at Yahoo! Research is one such effort.


This translation is original; please credit the source when reprinting. The original English article is available at:
http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
