Massive data processing: starting from the Hadoop framework and the MapReduce model


A few weeks ago, when I first heard of Hadoop and MapReduce, I was a little excited. They seemed mysterious to me, and mysteries tend to arouse my interest. After reading some articles and papers about them, I came to feel that Hadoop is a fun and challenging technology, and that it touches on a topic I am even more interested in: massive data processing.

So in my recent spare time I have been reading papers on Hadoop, MapReduce and massive data processing. While reading them, though, I kept feeling that the papers only scratch the surface: they stop just as a topic gets interesting, which is unsatisfying and leaves me with some "resentment".

Although I still do not know much about Hadoop and MapReduce, I want to keep a record of my learning process, and perhaps studying them will push me to eventually write a series of articles like the "Classic Algorithm Research" series.

OK, enough small talk. This article starts from the basics, the MapReduce model and the Hadoop framework, then follows their architecture to the topic of massive data processing, and finally discusses the technical architecture of Taobao's massive data products. I have tried to keep it both accessible and in-depth, and I hope readers will like and support it. Thank you.

Because this is my first contact with these two things, corrections to any problems in the article are welcome. OK, let's get started.

Part One: The MapReduce model and the Hadoop framework
Architecture overview

To follow this article, the reader should first be clear about the following points, which the rest of the content builds on: MapReduce is a model. Hadoop is a framework. More precisely, Hadoop is an open-source distributed parallel programming framework that implements the MapReduce model.

Now that you know what MapReduce is, what Hadoop is, and the simplest connection between the two, the main thrust of this article can be summarized in one sentence: using MapReduce within the Hadoop framework to process massive amounts of data. Below, we look at MapReduce and Hadoop in turn.

The MapReduce model

As I said before, MapReduce is a model. What kind of model? It is a core computing model of cloud computing, a distributed computing technique, and a simplified distributed programming model. It mainly addresses how programs are developed, and it also gives developers a way to decompose problems.

OK, a description without a picture says little. As shown in Figure 1, the main idea of the MapReduce model is to automatically split the problem to be executed (for example, a program) into two steps, map and reduce:

After the data is split, the program's map function maps it into different chunks, which are assigned to the machines of a cluster for processing; this achieves distributed computing. The program's reduce function then consolidates the results and outputs the result the developer needs.

MapReduce borrows design ideas from functional programming languages. In its software implementation, the user specifies a map function that maps key/value pairs into new key/value pairs, forming a series of intermediate key/value results. These are then passed to the reduce function, which merges all values that share the same intermediate key. The map and reduce functions are therefore related. Their descriptions are shown in Table 1:
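As a rough illustration of the two functions, here is a minimal single-machine sketch in Python that counts words. It is not Hadoop's API: the names `map_fn`, `reduce_fn` and `run_mapreduce` are made up for this example, and the "cluster" is just a loop.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: (document name, text) -> list of intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce: (word, [1, 1, ...]) -> (word, total count)."""
    return key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: produce intermediate key/value pairs and
    # group the values that share the same intermediate key.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: merge the values of each key, visiting keys in sorted order.
    return dict(reduce_fn(k, intermediate[k]) for k in sorted(intermediate))

documents = [("doc1", "hadoop runs mapreduce"), ("doc2", "mapreduce is a model")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
```

Here `word_counts` maps each word to the number of times it appears across both documents. In a real cluster the map calls would run on different machines, and the grouping step would be done by the framework.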

MapReduce is designed to solve large-scale data processing problems, so its design takes the principle of locality into account from the start and uses it to divide and conquer the whole problem. A MapReduce cluster is built from ordinary PCs in a shared-nothing architecture. Before processing, the dataset is distributed across the nodes. During processing, each node reads the data stored locally and processes it (map), merges the processed data (combine), sorts it (shuffle and sort), and then distributes it to the reduce nodes. This avoids transferring large amounts of data and improves processing efficiency. Another benefit of the shared-nothing architecture is that, combined with a replication strategy, the cluster has good fault tolerance: the failure of some nodes does not affect the normal operation of the cluster.
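To see why the local combine step cuts down on data transfer, here is a small sketch under simplifying assumptions: two pretend nodes, a toy partition function, and hypothetical helper names (`map_words`, `combine`, `shuffle`, `reduce_all`). It only counts how many key/value pairs would have to leave each node with and without combining.

```python
from collections import Counter, defaultdict

def map_words(lines):
    """Map on one node: emit (word, 1) for every word stored locally."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Combine on the same node: pre-aggregate before anything is sent out."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def shuffle(per_node_pairs, num_reducers):
    """Route every key to one reducer (toy deterministic partition function)."""
    partitions = defaultdict(list)
    for pairs in per_node_pairs:
        for key, value in pairs:
            partitions[sum(key.encode()) % num_reducers].append((key, value))
    return partitions

def reduce_all(partitions):
    """Reduce: sum the values of each key across all partitions."""
    totals = Counter()
    for pairs in partitions.values():
        for key, value in pairs:
            totals[key] += value
    return dict(totals)

node_data = {"node1": ["a b a a", "b a"], "node2": ["a a c", "c a"]}
mapped = [map_words(lines) for lines in node_data.values()]
combined = [combine(pairs) for pairs in mapped]
sent_without_combiner = sum(len(pairs) for pairs in mapped)
sent_with_combiner = sum(len(pairs) for pairs in combined)
result = reduce_all(shuffle(combined, num_reducers=2))
```

With this input, far fewer pairs cross the network after combining, while the final counts stay the same.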

OK, take a quick look at the next figure. It comes from a write-up on Hadoop job tuning parameters and their principles. The left side of the figure shows how a MapTask runs, and the right side shows how a ReduceTask runs:

As the figure shows, when a map task starts running and produces intermediate data during the map phase, it does not simply write that data straight to disk. It first caches the output in a memory buffer and does some pre-sorting in that buffer to optimize the performance of the whole map. The reduce phase on the right side of the figure goes through three stages: copy -> sort -> reduce. We can clearly see that the sort here is a merge sort, that is, a merge.
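The buffer-then-merge idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Hadoop's code: the buffer size counts records rather than bytes, the "spills" are in-memory lists instead of disk files, and the function names are my own.

```python
import heapq

def map_side_spills(records, buffer_size):
    """Collect map output in a buffer; sort each full buffer and spill it as one run."""
    spills, buffer = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) == buffer_size:
            spills.append(sorted(buffer))  # pre-sort in memory before writing out
            buffer = []
    if buffer:
        spills.append(sorted(buffer))      # spill whatever is left at the end
    return spills

def merge_runs(runs):
    """Merge already sorted runs into one sorted list (the merge step of merge sort)."""
    return list(heapq.merge(*runs))

map_output = [("pear", 1), ("apple", 1), ("fig", 1), ("apple", 1), ("kiwi", 1)]
spills = map_side_spills(map_output, buffer_size=2)
merged = merge_runs(spills)
```

Because every run is already sorted, the merge only needs to look at the head of each run, which is why the sort stage on the reduce side can be done as a merge.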

Now that we know what MapReduce is, let's look at Hadoop, the open-source framework that implements the MapReduce model.

Hadoop Framework

As I said before, Hadoop is a framework. What kind of framework? Hadoop is an open-source distributed parallel programming framework that implements the MapReduce computing model. Programmers can use Hadoop to write programs that run on computer clusters and process massive amounts of data.

In addition, Hadoop provides a distributed file system (HDFS) and a distributed database (HBase) for storing data on, or deploying it to, the individual compute nodes. So you can roughly think of it as: Hadoop = HDFS (file system, data storage) + HBase (database) + MapReduce (data processing). The Hadoop framework is shown in Figure 2:

With the Hadoop framework and MapReduce, a core cloud technology, data can be computed and stored. The HDFS distributed file system and the HBase distributed database are integrated into the cloud computing framework to provide distributed, parallel computing and storage, along with a solid ability to process large-scale data.

Components of Hadoop

We already know that Hadoop is a Java implementation of Google's MapReduce. MapReduce is a simplified distributed programming model that lets programs be automatically distributed to a large cluster of ordinary machines and executed concurrently. Hadoop is mainly made up of HDFS, MapReduce and HBase. The detailed composition of Hadoop is shown below:

From the above figure, we can see:

1. Hadoop HDFS is an open-source implementation of Google's GFS storage system. Its main use is as a basic component of the parallel computing environment (MapReduce), and it is also the underlying distributed file system of Bigtable-like systems (such as HBase and Hypertable). HDFS uses a master/slave architecture. An HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode is a central server that manages the file system namespace and client access to files. There is typically one DataNode per node in the cluster, and it manages the storage attached to that node. Internally, a file is split into one or more blocks, which are stored on a set of DataNodes. As shown in the following figure (HDFS architecture diagram):
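To make the block idea concrete, here is a toy sketch of the bookkeeping a NameNode-like component has to do. Everything in it is an assumption for illustration: the sizes are arbitrary units, the DataNode names are invented, and the round-robin placement is a stand-in (real HDFS uses its own rack-aware placement policy).

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) for every block of a file."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]

def place_blocks(blocks, datanodes, replication):
    """Toy metadata table: block index -> DataNodes holding a replica (round-robin)."""
    placement = {}
    for index, _ in enumerate(blocks):
        placement[index] = [datanodes[(index + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

# A 300-unit file with 128-unit blocks, stored on four hypothetical DataNodes.
blocks = split_into_blocks(file_size=300, block_size=128)
placement = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"], replication=3)
```

The point of the sketch is only the division of labor: the metadata (which block lives where) sits in one place, while the block contents are spread over many DataNodes, each block on several of them.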

2. Hadoop MapReduce is an easy-to-use software framework. Applications written with it can run on large clusters of thousands of commodity machines and process datasets of a terabyte and more in parallel in a reliable, fault-tolerant way.

A MapReduce job typically splits the input dataset into independent chunks, which are processed by map tasks in a completely parallel manner. The framework first sorts the output of the maps and then feeds the result to the reduce tasks. The input and output of a job are usually stored in the file system. The framework takes care of scheduling and monitoring tasks and of re-executing tasks that have failed. As shown in the following figure (Hadoop MapReduce processing flowchart):

3. Hive is a Hadoop-based data warehousing tool with strong processing power and low cost.

Main features:

Storage works by mapping structured data files to database tables. Hive provides a SQL-like language that offers full SQL query functionality. SQL statements can be converted into MapReduce jobs to run, which makes Hive well suited to statistical analysis in data warehouses.

Shortcomings:

Data is stored and read row by row (SequenceFile). This is inefficient: to read a single column of a table, all the data has to be fetched and the column then extracted from it. It also takes up more disk space.

Because of these shortcomings, someone (Dr. Charlie) changed the storage structure of the distributed data processing system from record-based (row) units to column-based units, which reduces the number of disk accesses and improves query performance. Since values of the same attribute share a data type and have similar characteristics, compressing storage by attribute value achieves a higher compression ratio and saves more storage space. As shown in the figure below (comparison of row and column storage):
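The difference between the two layouts is easy to show with a tiny in-memory example. This is only a sketch of the idea: the table, its column names and the "values read" counter are invented for illustration and say nothing about how Hive's file formats are actually implemented.

```python
rows = [
    {"id": 1, "city": "Hangzhou", "amount": 30},
    {"id": 2, "city": "Beijing", "amount": 45},
    {"id": 3, "city": "Hangzhou", "amount": 25},
]

def to_columns(rows):
    """Re-arrange row records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def sum_amount_row_store(rows):
    """Row layout: every whole record is fetched just to use one field."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        total += row["amount"]
    return total, values_read

def sum_amount_column_store(columns):
    """Column layout: only the one column that is needed is touched."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)

columns = to_columns(rows)
row_total, row_values_read = sum_amount_row_store(rows)
col_total, col_values_read = sum_amount_column_store(columns)
```

Both layouts give the same total, but the column layout reads only a third of the values here. The `city` column also shows why column data compresses well: its values share one type and repeat often.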

4. HBase

HBase is a distributed, column-oriented, open-source database. It differs from ordinary relational databases in that it is suited to storing unstructured data. Another difference is that HBase is column-based rather than row-based. HBase uses the same data model as Bigtable. Users store data rows in a table. A data row has a key and any number of columns. One or more columns form a column family, and the columns of one family are stored in the same HFile, which makes the data easy to cache. Tables are loosely structured, so users can define different columns for different rows. In HBase, data is sorted by primary key, and a table is split by primary key into multiple HRegions, as shown in the following figure (HBase table structure):
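Here is a toy model of that data layout: rows addressed by key, columns grouped into families, and the key space cut into ranges. It is a sketch under my own assumptions, not the HBase client API; the class `RegionTable`, its methods and the split keys are all hypothetical.

```python
import bisect

class RegionTable:
    """Toy table: rows live in key ranges (regions) and are scanned in key order."""

    def __init__(self, split_keys):
        self.split_keys = sorted(split_keys)  # boundaries between regions
        self.regions = [dict() for _ in range(len(self.split_keys) + 1)]

    def region_index(self, row_key):
        """Find the region whose key range contains row_key."""
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key, family, column, value):
        row = self.regions[self.region_index(row_key)].setdefault(row_key, {})
        row.setdefault(family, {})[column] = value

    def get(self, row_key):
        return self.regions[self.region_index(row_key)].get(row_key)

    def scan(self):
        """Yield row keys in sorted order, region by region."""
        for region in self.regions:
            for key in sorted(region):
                yield key

table = RegionTable(split_keys=["g", "n"])
table.put("kiwi", "info", "price", 5)
table.put("apple", "info", "price", 3)
table.put("zebra", "stats", "views", 9)   # this row has a different column
table.put("apple", "stats", "views", 7)
scanned_keys = list(table.scan())
```

Because a lookup only needs the region boundaries to find the right region, a table can grow by adding regions without changing how rows are addressed.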

OK, having written this much, it may look like a lot of ground has been covered, but it is not my intention to burden the reader. From here on, I will avoid piling up complex terminology that might put readers off.

Let me summarize the Hadoop framework and its components described above with one picture. The figure below shows the internal structure of Hadoop. We can see that massive amounts of data are handed to Hadoop for processing. Inside Hadoop, as mentioned above, the distributed file system (HDFS) and the distributed database (HBase) store the data on, or deploy it to, the individual compute nodes. The data is finally processed internally with the MapReduce model, and the results are output:

Part Two: Interpreting the technical architecture of Taobao's massive data products, and learning from its experience in massive data processing

In the first part of this article, we gained a fairly thorough understanding of the MapReduce model and the Hadoop framework. However, if a thing or a concept is never put into actual use, your understanding stays at the level of theory and never moves toward practice.

OK. In this second part, we take Taobao's Data Cube technical architecture as the basis and, by introducing the technical architecture of Taobao's massive data products, further learn from its experience in massive data processing.

Taobao massive data product technology architecture

Figure 2-1 below shows the technical architecture of Taobao's massive data products. Our analysis and interpretation will center on this architecture.

Readers who have carefully read the other articles on this blog will notice that Figure 2-1 first appeared in an earlier article here, "Gleaning a little experience in massive data processing from a few architecture diagrams". The figure was originally published in the August issue of "Programmer" magazine; its author is Peng Chun.

Before going further, I must explain that most of what follows is drawn from Mr. Peng Chun's article, "Analysis of the Taobao Data Cube technical architecture". My own work here is to interpret that article and extract its key techniques and content, so that readers can better understand the technical architecture of Taobao's massive data products. At the same time, I can share my own thoughts and impressions from reading it, and learn something along the way. Why not?

OK. Unlike that earlier article on this blog, which only touched on the architecture diagram, this article explains the architecture in detail. I have also done a good deal of preparation (for example, printing out Figure 2-1 and studying it often):

Figure 2-1 Taobao massive data product technology architecture

OK. As shown above, the technical architecture of Taobao's massive data products is divided into five layers. From top to bottom they are: the data source layer, the computing layer, the storage layer, the query layer and the product layer. Let's look at each in turn:

1. Data source layer. It stores the transaction data of Taobao's shops. Data generated at the data source layer is transferred via DataX, DbSync and TimeTunnel to the "ladder" described in item 2 below.

2. Computing layer. In this layer Taobao uses a Hadoop cluster, which it calls the "ladder"; it is the main component of the computing layer. On the ladder, the system runs different MapReduce computations over the data products every day.

3. Storage layer. In this layer Taobao uses two things, one called MyFox and the other called Prom. MyFox is a distributed relational database cluster based on MySQL. Prom is a NoSQL storage cluster based on Hadoop HBase (recall that in the first part we introduced HBase, the distributed open-source database that is one of Hadoop's components).

4. Query layer. In this layer there is a component called Glider, which exposes a RESTful interface over HTTP. A data product obtains the data it wants through a unique URL. Data queries, in turn, go through MyFox. The MyFox data query process is described in detail below.

5. Product layer. This one is easy to understand, so I will not say much about it.

Next, let's focus on MyFox and Prom in the third layer, the storage layer; then look at the technical architecture of Glider; and finally discuss caching. With that, the article will come to an end.

We know that relational databases are widely used in industry today, including Oracle, MySQL, DB2, Sybase, SQL Server and so on.


Taobao chose MySQL's MyISAM engine as the underlying data storage engine. To cope with massive data, they designed MyFox, a query proxy layer for the distributed MySQL cluster.

The following figure shows MyFox's data query process:

Figure 2-2 MyFox's data query process

Each MyFox node stores two kinds of node data: hot-node data and cold-node data. As the names suggest, hot nodes hold the newest, most frequently accessed data, while cold nodes hold relatively old, less frequently accessed data. To store these two kinds of data, hardware and storage cost considerations naturally lead to choosing two different kinds of hard disks, one for each access frequency. As shown in the following illustration:

Figure 2-3 MyFox node structure

For hot nodes, 15,000 rpm SAS hard drives were chosen. Counting two machines per node, the storage cost per unit of data is about 4.5W/TB (W here presumably stands for ten thousand yuan). For cold data, 7,500 rpm SATA hard drives were chosen; a single disk can hold more data, and the storage cost is about 1.6W/TB.


To keep this article to a reasonable length, I will not go into detail about Prom. The following two figures show Prom's storage structure and Prom's query process, respectively:

Figure 2-4 Prom Storage structure

Figure 2-5 Prom Query process

Glider's technical architecture

Figure 2-6 Glider's technical architecture

In this layer, the query layer, Taobao's main idea is to use a middle tier to isolate the front end from the back end. Glider, as this middle tier, is responsible for join and union computations across heterogeneous tables, for isolating the front-end products from the back-end storage, and for providing a unified data query service.


Besides integrating data across heterogeneous "tables" and isolating the front end from the back end, another important role of Glider is cache management. One thing to note is that, within a given period of time, the data in a data product can be considered read-only; this is the theoretical basis for using a cache to improve performance.

In Figure 2-6 above, we can see that Glider has two levels of cache: a second-level cache based on each heterogeneous "table" (datasource), and a first-level cache based on individual requests after the results have been merged. In addition, each heterogeneous "table" may have its own internal caching mechanism.
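A two-level arrangement like this can be sketched as follows. This is my own simplified reading, not Glider's code: the class `TwoLevelCache`, the request format (a tuple of datasource/query pairs) and the fake datasources are all assumptions, and expiry is left out entirely.

```python
class TwoLevelCache:
    """Level 1 caches merged results per request; level 2 caches per datasource."""

    def __init__(self, datasources):
        self.datasources = datasources  # name -> callable(query) -> list of rows
        self.level1 = {}                # request -> merged result
        self.level2 = {}                # (datasource name, query) -> rows

    def _fetch(self, name, query):
        key = (name, query)
        if key not in self.level2:      # only a level 2 miss reaches the backend
            self.level2[key] = self.datasources[name](query)
        return self.level2[key]

    def query(self, request):
        """request: tuple of (datasource name, query) pairs to merge."""
        if request not in self.level1:
            merged = []
            for name, sub_query in request:
                merged.extend(self._fetch(name, sub_query))
            self.level1[request] = merged
        return self.level1[request]

calls = []  # records every backend access, to show what the caches absorb

def fake_source(name):
    def run(query):
        calls.append((name, query))
        return [f"{name}:{query}"]
    return run

cache = TwoLevelCache({"myfox": fake_source("myfox"), "prom": fake_source("prom")})
first = cache.query((("myfox", "sales"), ("prom", "attrs")))
again = cache.query((("myfox", "sales"), ("prom", "attrs")))  # level 1 hit
partial = cache.query((("myfox", "sales"),))                  # level 1 miss, level 2 hit
```

The third query is a new request, so the first level misses, but the per-datasource level still spares the backend a second access.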

Figure 2-7 Cache control System

Figure 2-7 shows the Data Cube's design approach to cache control. A user's request must carry a cache-control "command", which consists of the query string in the URL and the "If-None-Match" field in the HTTP header. This cache-control "command" is always passed down layer by layer until it reaches the heterogeneous "table" modules of the underlying storage.
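For readers unfamiliar with "If-None-Match", here is a minimal sketch of how that header works in plain HTTP terms. It shows the general conditional-request mechanism only; how Glider computes its tags or passes them to the storage layer is not described in the source, so the hashing scheme and the function names below are assumptions.

```python
import hashlib

def make_etag(body: bytes) -> str:
    """Derive a tag from the response body (one possible scheme, assumed here)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'

def handle_request(headers, load_body):
    """Return (status, response headers, body) for one request."""
    body = load_body()
    etag = make_etag(body)
    if headers.get("If-None-Match") == etag:
        # The client already holds this exact version: send no body.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body

# First request: the client has nothing cached.
status1, headers1, body1 = handle_request({}, lambda: b"report data")
# Second request: the client sends back the tag it received.
status2, headers2, body2 = handle_request(
    {"If-None-Match": headers1["ETag"]}, lambda: b"report data"
)
```

This simplified handler still loads the body to compute the tag; a real system would try to answer the validation from cached metadata instead.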

Caching systems usually have to face two problems: cache penetration, and the avalanche effect when caches expire.

1. Cache penetration means querying data that is certain not to exist. Because the cache is written passively on a miss, and because, for fault-tolerance reasons, nothing is written to the cache when the storage layer returns no data, every request for such non-existent data goes to the storage layer, and the cache loses its purpose. There are several ways to deal with cache penetration effectively. The most common is a Bloom filter (this thing, in my article described:): all data that could possibly exist is hashed into a sufficiently large bitmap, so a query for data that certainly does not exist is intercepted by the bitmap, sparing the underlying storage system the query load.
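A filter that hashes every existing item into a large bitmap is commonly known as a Bloom filter. Here is a minimal sketch of one, with parameters and names chosen by me for illustration (1024 bits, 3 hash positions derived from SHA-256, item IDs like "item-1").

```python
import hashlib

class BloomFilter:
    """Bitmap plus several hash functions; answers "definitely not" or "maybe"."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means the item was never added; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

known_ids = BloomFilter(num_bits=1024, num_hashes=3)
for item_id in ("item-1", "item-2", "item-3"):
    known_ids.add(item_id)
```

A request is sent on to the cache and storage only if `might_contain` returns True. The filter never misses an item that was added, but it can occasionally let a non-existent one through, so it reduces penetration rather than eliminating it.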

In the Data Cube, Taobao adopted a simpler and cruder method: if a query returns empty data (whether because the data does not exist or because of a system failure), the empty result is still cached, but with a very short expiration time of no more than five minutes.
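This approach can be sketched in a few lines. The five-minute ceiling comes from the text above; everything else is assumed for illustration, including the class name, the one-hour lifetime for normal results, and the injectable clock used to simulate the passage of time.

```python
import time

EMPTY_TTL = 300     # seconds; empty results live at most five minutes
NORMAL_TTL = 3600   # seconds; assumed lifetime for non-empty results

class ShortLivedEmptyCache:
    """Cache that also stores empty results, but only for a short time."""

    def __init__(self, load, clock=time.monotonic):
        self.load = load          # callable(key) -> result from the storage layer
        self.clock = clock
        self.entries = {}         # key -> (value, expires_at)
        self.storage_hits = 0     # how often the storage layer was queried

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]
        self.storage_hits += 1
        value = self.load(key)
        ttl = NORMAL_TTL if value else EMPTY_TTL
        self.entries[key] = (value, self.clock() + ttl)
        return value

now = [0.0]  # fake clock so the example does not have to wait
cache = ShortLivedEmptyCache(load=lambda key: [], clock=lambda: now[0])
cache.get("missing")
cache.get("missing")                      # answered from the cached empty result
hits_before_expiry = cache.storage_hits
now[0] = 301.0                            # just past five minutes
cache.get("missing")
hits_after_expiry = cache.storage_hits
```

The short lifetime is the trade-off: repeated queries for missing data stop hammering the storage layer, while data that appears later (or a failure that clears up) becomes visible within minutes.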

2. The avalanche effect of cache expiration has a frightening impact on the underlying system. Unfortunately, there is currently no perfect solution to this problem. Most system designers use locks or queues to ensure that the cache is written by a single thread (or process), so that a large number of concurrent requests do not fall onto the underlying storage system when an entry expires.
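The locking variant can be sketched like this. It is a deliberately simple version under my own assumptions: one global lock for all keys, no expiry, and a slow fake loader standing in for the storage system.

```python
import threading
import time

class SingleWriterCache:
    """On a miss, only one thread loads from storage; the others wait for it."""

    def __init__(self, load):
        self.load = load
        self.values = {}
        self.lock = threading.Lock()
        self.load_calls = 0

    def get(self, key):
        if key in self.values:              # fast path, no locking on a hit
            return self.values[key]
        with self.lock:
            if key not in self.values:      # re-check: another thread may have filled it
                self.load_calls += 1
                self.values[key] = self.load(key)
            return self.values[key]

def run_concurrent(cache, key, num_threads):
    """Ask for the same key from several threads at once."""
    results = [None] * num_threads

    def worker(index):
        results[index] = cache.get(key)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

def slow_load(key):
    time.sleep(0.01)  # stand-in for an expensive storage query
    return key.upper()

cache = SingleWriterCache(slow_load)
results = run_concurrent(cache, "report", num_threads=8)
```

All eight threads get the same value, yet the storage layer is queried once. A single global lock also makes misses on unrelated keys wait for each other, which is why production designs usually lock per key or queue the writes instead.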

In the Data Cube, the cache expiration mechanism designed by Taobao can, in theory, spread the expiration times of the data for each client evenly along the time axis, which to some extent avoids the avalanche effect caused by caches expiring simultaneously.

References for this article:
Cloud-based massive data storage model, Hou Jian et al.
Massive log data processing based on Hadoop, Wang Xiaosen.
Large-scale data processing system based on Hadoop, Wang Li Bing.
Analysis of the Taobao Data Cube technical architecture, Peng Chun.
Hadoop job tuning parameters and principles, Guili.

Note: this article is reprinted from July's blog. Original address:
