Using SQL with the MapReduce framework: high-level declarative interfaces that make Hadoop easier to use


Introduction

Over the past 20 years, the steady increase in computational power has produced a deluge of data, which in turn has driven a paradigm shift in computing architectures and large-scale data-processing mechanisms. For example, powerful telescopes in astronomy, particle accelerators in physics, and genome sequencers in biology have put massive volumes of data into the hands of scientists. Facebook collects 15 TB of data every day into a petabyte-scale data warehouse. Demand for large-scale data mining and data analysis applications is growing both in industry (for example, Web data analysis, clickstream analysis, and network-monitoring log analysis) and in the sciences (for example, analysis of data produced by large-scale simulations, sensor deployments, and high-throughput lab equipment). Although parallel database systems suit some of these analysis applications, they are expensive, difficult to administer, and lack fault tolerance for long-running queries.

MapReduce is a framework introduced by Google for programming commodity computer clusters to perform large-scale data processing in a single pass. The framework is designed so that a MapReduce cluster can scale to thousands of nodes in a fault-tolerant manner. However, the MapReduce programming model has its own limitations: its one-input, two-stage data flow is rigid, and it is very low-level. For example, you must write custom code even for the most common operations. As a result, many programmers find the MapReduce framework uncomfortable to use and prefer SQL as a high-level declarative language. Several projects (Apache Pig, Apache Hive, and HadoopDB) have been developed to simplify the programmer's task by providing high-level declarative interfaces on top of the MapReduce framework.

This article first takes a look at the MapReduce framework itself, and then at the capabilities of the different systems that provide high-level interfaces on top of it.

MapReduce Framework

A major advantage of the MapReduce approach is that it isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. In this model, a computation takes a set of input key/value pairs and produces a set of output key/value pairs. Users of the MapReduce framework express the computation with two functions: map and reduce. The map function takes an input pair and generates a set of intermediate key/value pairs. The MapReduce framework groups together all intermediate values associated with the same intermediate key I (shuffling) and passes them to the reduce function. The reduce function receives an intermediate key I with its set of values and merges them; typically, each reduce invocation produces either zero or one output value. The main advantages of this model are that it allows large computations to be parallelized easily and that re-execution can serve as the primary mechanism for fault tolerance.

The Apache Hadoop project is open-source Java software that supports data-intensive distributed applications by implementing the MapReduce framework. It was originally developed at Yahoo! as a clone of Google's MapReduce infrastructure and later became open source. Hadoop takes care of running your code across a cluster of machines. Generally speaking, when a dataset grows beyond the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. File systems that manage storage across a cluster of machines are called distributed file systems. Hadoop ships with a distributed file system called HDFS (Hadoop Distributed File System). In particular, HDFS stores files across all the nodes of a Hadoop cluster: it splits files into large blocks and distributes them to different machines, keeping multiple replicas of each block so that if any one machine fails, no data is lost.
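The block-splitting and replication idea can be illustrated with a toy sketch (this is a simplified model for intuition only, not HDFS's actual API; real HDFS placement is rack-aware, and the function and parameter names here are illustrative):

```python
import itertools

def place_blocks(file_size, block_size, nodes, replicas=3):
    """Toy model of HDFS-style storage: split a file of file_size bytes
    into fixed-size blocks, then assign each block to `replicas` distinct
    nodes in round-robin fashion."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    starts = itertools.cycle(range(len(nodes)))
    for block_id in range(num_blocks):
        start = next(starts)
        placement[block_id] = [nodes[(start + i) % len(nodes)]
                               for i in range(replicas)]
    return placement

# A 350 MB file with 100 MB blocks needs 4 blocks; each lives on 3 of 4 nodes,
# so losing any single node leaves every block with surviving replicas.
placement = place_blocks(350, 100, ["node1", "node2", "node3", "node4"])
```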

The MapReduce program in Listing 1, expressed in pseudocode, counts the number of times each word appears in a sequence of text lines. The map function emits each word together with an associated occurrence count, while the reduce function sums all the counts emitted for a given word.

Listing 1. MapReduce Program

map(String key, String value):
    // key: line number, value: line text
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word, values: a list of counts
    int wordCount = 0;
    for each v in values:
        wordCount += ParseInt(v);
    Emit(AsString(wordCount));
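The pseudocode above can be sketched as runnable Python (a minimal translation for illustration: generators stand in for the emit-style interface, and the function names are my own, not Hadoop's API):

```python
def map_fn(key, value):
    """key: line number, value: line text.
    Emits an intermediate (word, 1) pair for each word in the line."""
    for word in value.split():
        yield (word, 1)               # EmitIntermediate(w, "1")

def reduce_fn(key, values):
    """key: a word, values: an iterable of counts for that word.
    Emits a single (word, total) pair."""
    return (key, sum(values))         # Emit(AsString(wordCount))
```

For example, `map_fn(1, "This is Code Example")` yields one `(word, 1)` pair per word, and `reduce_fn("is", [1, 1, 1])` returns `("is", 3)`.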

Now apply these functions to the input lines in Listing 2.

Listing 2. Input sequence

1, This is Code Example
2, Example Color is Red
3, car Color is Green

Listing 3 shows the output of the map function for the input.

Listing 3. The output of the map function


('This', 1), ('is', 1), ('Code', 1), ('Example', 1)
('Example', 1), ('Color', 1), ('is', 1), ('Red', 1)
('car', 1), ('Color', 1), ('is', 1), ('Green', 1)

Listing 4 shows the output of the reduce function (the result).

Listing 4. The output of the reduce function

('car', 1), ('Code', 1), ('Color', 2), ('Example', 2), ('Green', 1), ('Red', 1), ('This', 1), ('is', 3)
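The shuffle step that turns the map output of Listing 3 into the reduce input behind Listing 4 can be sketched in Python: group the intermediate pairs by key, then sum each group. This is only a stand-in for the framework's sort-and-group phase, not Hadoop's actual implementation:

```python
from collections import defaultdict

# Intermediate pairs as emitted by the map phase (Listing 3).
intermediate = [
    ('This', 1), ('is', 1), ('Code', 1), ('Example', 1),
    ('Example', 1), ('Color', 1), ('is', 1), ('Red', 1),
    ('car', 1), ('Color', 1), ('is', 1), ('Green', 1),
]

# Shuffle: collect all values that share the same intermediate key.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce: sum the counts for each word (sorted case-insensitively for display).
result = sorted(((word, sum(counts)) for word, counts in groups.items()),
                key=lambda pair: pair[0].lower())
print(result)
```

Running this reproduces the counts in Listing 4: 'is' appears 3 times, 'Color' and 'Example' twice each, and every other word once.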
 

One of the main attractions of the MapReduce framework for programmers is that they need only write two primitives, map and reduce, in a programming language of their choice, without worrying about the details of parallel execution. On the other hand, the MapReduce programming model has its own limitations:

Its one-input, two-stage data flow is rigid. To perform tasks with a different data flow (for example, joins or n stages), you must devise inelegant workarounds.

Even the most common operations, such as projection and filtering, require custom code, which is often difficult to reuse and maintain.

The opaque nature of the map and reduce functions limits the system's ability to perform optimizations.

Moreover, many programmers are unfamiliar with the MapReduce framework and prefer to express their tasks in SQL (in which they are more proficient) as a high-level declarative language, leaving all execution-tuning details to the back-end engine. It is also undeniable that high-level language abstractions allow the underlying system to perform automatic optimization more effectively.

Let's now look at the languages and systems designed to address these issues by layering SQL-like languages on top of the MapReduce framework.
