Detailed analysis of the Hadoop framework

Source: Internet
Author: User
Tags: table, sort, hadoop, mapreduce

What kind of model is MapReduce? It is the core computing model of cloud computing: a distributed computing technique and, at the same time, a simplified distributed programming model. It is mainly aimed at the problem of how to develop programs for large-scale data processing, and it takes that burden off developers.

As shown in the following illustration, the main idea of the MapReduce pattern is to automatically split the problem to be solved (for example, a program) into two steps, map and reduce, as shown in the flowchart of Figure 1:

After the data is split, the map function distributes the blocks across the machines of the cluster for parallel processing, achieving the effect of distributed computation; the reduce function then aggregates the intermediate results and outputs the result the developer needs.

MapReduce borrows its design from functional programming languages. In its software implementation, the user specifies a map function that maps input key/value pairs into a set of intermediate key/value pairs; these intermediate pairs are then passed to a reduce function, which merges all values that share the same intermediate key. The map and reduce functions are therefore closely related. The functions are described in Table 1:
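To make the key/value flow concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (the class names are ours, not from the original article): the map function turns each input line into intermediate (word, 1) pairs, and the reduce function merges the values that share the same key.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (offset, line) -> a series of intermediate (word, 1) pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);            // emit intermediate (word, 1)
        }
    }
}

// reduce: (word, [1, 1, ...]) -> (word, total)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                       // merge values that share the same key
        }
        context.write(key, new IntWritable(sum));
    }
}
```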

MapReduce is aimed at large-scale data processing, so data locality was considered from the very start of its design: the locality principle is used to divide the overall problem. A MapReduce cluster is built from ordinary PCs in a shared-nothing architecture, and the dataset is distributed across the nodes before processing. At run time, each node processes the data stored locally (map), merges the processed data (combine), sorts and redistributes it (shuffle and sort), and hands it to the reduce nodes; this avoids transferring large amounts of data and improves processing efficiency. Another benefit of the shared-nothing architecture is that, combined with a replication policy, the cluster gains good fault tolerance: if some machines go down, the rest of the cluster keeps working normally.

Next, take a quick look at the following diagram. It covers the run-time behaviour and tuning parameters of Hadoop: the left side shows how a MapTask runs, and the right side shows how a ReduceTask runs:

As shown above, in the map phase, when a map task starts running and produces intermediate data, it does not simply write it straight to disk. It first caches the output in an in-memory buffer and performs a pre-sort there in order to improve the performance of the whole map task. The reduce stage on the right side of the diagram goes through three phases: copy -> sort -> reduce. Note that the sort here is a merge sort.
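For reference, here is a small sketch of how the buffer and shuffle parameters mentioned above can be set through the standard Hadoop Configuration API; the property names are the stock Hadoop 2.x/3.x ones, and the values are only illustrative, not tuning recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // size of the in-memory map-side sort buffer, in MB
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // fill ratio at which the buffer is pre-sorted and spilled to disk
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // number of parallel copiers in the reduce-side copy phase
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        return conf;
    }
}
```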

Hadoop is an open-source distributed parallel programming framework that implements the MapReduce computing model. Programmers can write MapReduce programs with Hadoop and run them on a cluster of machines, thereby processing massive amounts of data.

In addition, Hadoop provides a distributed file system (HDFS) and a distributed database (HBase) for storing data and deploying it to the individual compute nodes. Roughly speaking, you can think of it as: Hadoop = HDFS (file system, data storage) + HBase (database) + MapReduce (data processing). The Hadoop framework is shown in Figure 2:

By using the Hadoop framework and MapReduce, the core technology of cloud computing, to compute and store data, and by integrating the HDFS distributed file system and the HBase distributed database into the framework, the distributed, parallel computation and storage of cloud computing is realized, and large-scale data can be handled well.

The components of Hadoop

We already know that Hadoop is a Java implementation of Google's MapReduce. MapReduce is a simplified distributed programming model that lets programs be distributed automatically across a large cluster of ordinary machines and executed concurrently. Hadoop is mainly composed of HDFS, MapReduce and HBase. The composition of Hadoop is shown in the following figure:

From the above figure, we can see:

1. Hadoop HDFS is the open-source implementation of Google's GFS storage system. Its main role is to serve as the basic storage component of the parallel computing environment (MapReduce) and as the underlying distributed file system for BigTable-like systems (such as HBase and Hypertable). HDFS adopts a master/slave architecture: an HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode is a central server responsible for managing the file system namespace and client access to files. A DataNode is typically one node of the cluster and is responsible for managing the storage attached to that node. Internally, a file is split into one or more blocks, which are stored across the set of DataNodes. This is shown in the following illustration (HDFS architecture diagram):
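A minimal sketch of how a client program talks to HDFS through the Java FileSystem API; the NameNode address and file path below are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS must point at the NameNode; the address below is a placeholder
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/hello.txt");

        // write: the client asks the NameNode for metadata, the blocks go to DataNodes
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // read the file back
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```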

2. Hadoop MapReduce is a simple software framework. Applications written on it can run on large clusters of thousands of commodity machines and process terabyte-scale datasets in a reliable, fault-tolerant way.

A MapReduce job typically splits the input dataset into independent blocks of data, which are processed by the map tasks in a completely parallel manner. The framework sorts the output of the maps first and then feeds the result into the reduce tasks. Usually both the input and the output of a job are stored in the file system. The framework is responsible for scheduling and monitoring tasks and for re-executing tasks that have failed. This is shown in the following figure (Hadoop MapReduce processing flowchart):
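Below is a minimal driver sketch that wires such a job together with the stock Hadoop API; it reuses the hypothetical WordCountMapper and WordCountReducer classes from the earlier sketch, and the input and output paths come from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // the map and reduce classes from the earlier sketch
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // local merge (combine) before the shuffle
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // the input is split into blocks for the map tasks; the output lands in the file system
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```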

3. Hive is a data warehouse tool built on top of Hadoop; it offers high processing capability at low cost.

Main Features:

It maps structured data files onto database tables and provides an SQL-like language that implements most SQL query functionality. The SQL statements are converted into MapReduce tasks, which makes Hive very suitable for statistical analysis in a data warehouse.
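As an illustration of that SQL-like interface (this is an assumed setup, not from the original article: the HiveServer2 address, credentials and the orders table are placeholders), a query can be issued from Java over JDBC and Hive will compile it into MapReduce tasks:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, user and table name are placeholders
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL, but the query is executed as MapReduce tasks
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM orders GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```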

Deficiencies:

Data is stored and read row by row (for example as SequenceFile). This is inefficient when you only want to read one column of a table: all of the data has to be read and the column extracted from it afterwards. It also takes up more disk space.

Because of the shortcomings above, someone (Dr. Charlie) introduced a storage structure that organizes the records of a distributed data processing system column by column, which reduces the number of disk accesses and improves query performance. Moreover, because values of the same attribute share the same data type and similar characteristics, compressing by attribute value is more effective and saves more storage space. This is shown in the following figure (a comparison of row storage and column storage):

4. HBase

HBase is a distributed, column-oriented, open-source database. It differs from a typical relational database in that it is well suited to storing unstructured data. Another difference is that HBase is column-based rather than row-based. HBase uses a data model very similar to BigTable's: users store data rows in tables; a data row has a sortable key and an arbitrary number of columns; one or more columns form a column family, and the columns of a family are stored together in an HFile, which makes it easy to cache data. Tables are stored sparsely, so users can define many different columns for each row. In HBase the data is sorted by primary key, and a table is split by primary key into multiple HRegions, as shown in the following illustration (HBase data table diagram):
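A minimal sketch of that data model in code, using the standard HBase Java client; the table name, column family, qualifier and row key are made up for the example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {

            // write one cell: row key "user-001", column family "info", qualifier "city"
            Put put = new Put(Bytes.toBytes("user-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Hangzhou"));
            table.put(put);

            // read it back by row key; rows are kept sorted by key inside each HRegion
            Result row = table.get(new Get(Bytes.toBytes("user-001")));
            byte[] city = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}
```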

The following illustration shows the internal structure of Hadoop. Massive amounts of data are handed to Hadoop; internally, as noted above, Hadoop provides a distributed file system (HDFS) and a distributed database (HBase) for storage and deployment across the compute nodes, and ultimately uses the MapReduce pattern to process the data and output the results:

Figure 2-1 Technology architecture of the mass data products

As shown above, the technology architecture of the mass data products is divided into five layers. From top to bottom they are: the data source layer, the compute layer, the storage layer, the query layer and the product layer. Let's look at each of these five layers:

The data source layer. This layer stores the transaction data. The data generated here is transmitted, via DataX, DBSync and TimeTunnel, to the "ladder" described in the next point (the compute layer).

The compute layer. This layer uses a Hadoop cluster, which we call the "ladder"; it is the main component of the compute layer. On the ladder, the system runs different MapReduce computations for the data products every day.

The storage layer. Two things are used in this layer: one is called MyFox, the other Prom. MyFox is a distributed relational database cluster based on MySQL, and Prom is a NoSQL storage cluster based on HBase, the Hadoop technology (recall from the first section above that HBase, one of the components of Hadoop, is a distributed open-source database).

The query layer. This layer contains a component called glider. Glider exposes a RESTful interface externally over HTTP, and each data product obtains the data it needs through a unique URL. Data queries, in turn, go through MyFox. The MyFox data query process is described in detail below.

The product layer. This one is easy to understand, so it is not introduced further.

MyFox

MySQL's MyISAM engine serves as the underlying data storage engine. To cope with the massive data volume, a distributed MySQL cluster query proxy layer, MyFox, was designed.

The following illustration shows the MyFox data query process:

Figure 2-2 MyFox Data query process

Each MyFox node holds two kinds of data: hot-node data and cold-node data. As the names suggest, hot nodes hold the latest, most frequently accessed data, while cold nodes store relatively old, less frequently accessed data. To store these two kinds of data, and with hardware and storage costs in mind, two different kinds of hard drive are naturally chosen for the two different access frequencies. As shown in the following illustration:

Figure 2-3 MyFox node structure

"Hot node", select 15000 rpm per minute SAS hard drive, according to one node two machines to calculate, unit data storage cost is about 4.5W/TB. Correspondingly, "Cold data" We selected 7500 rpm drives per minute, which can store more data on a single disc and cost about 1.6W/TB.

Prom

For reasons of length, this article does not go into Prom in detail. The two figures below show the storage structure of Prom and the Prom query process, respectively:

Figure 2-4 The storage structure of Prom

Figure 2-5 The Prom query process

The technical framework of glider

Figure 2-6 Glider's technical framework

This layer, the query layer, is mainly based on the idea of a middle tier that separates the front end from the back end. Glider is this middle tier: it is responsible for join and union calculations over data from heterogeneous "tables", and it isolates the front-end products from the back-end storage while providing a unified data query service.

Cache

Besides separating the front end from the back end and integrating data from heterogeneous "tables", another role of glider that cannot be overlooked is cache management. One thing to understand is that, within a given time slice, we treat the data in a data product as read-only; this is the theoretical basis for using caching to improve performance.

As we see in figure 2-6 above, there are two tiers of caching in glider: a second-level cache based on each heterogeneous "table" (datasource), and a first-level cache, after consolidation, based on the individual request. In addition, each heterogeneous "table" may also have its own internal caching mechanism.

Figure 2-7 Caching control system

Figure 2-7 shows the cache-control design of the data cube. A user's request must carry a cache-control "command", which consists of the query string in the URL and the "If-None-Match" information in the HTTP header. This cache-control "command" must be passed down through every layer, all the way to the heterogeneous "table" module at the bottom of the storage stack.
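The following is not glider's actual code, just a JDK-only sketch of the idea: the If-None-Match value sent by the client is compared with the ETag of the current result, and an unchanged result is answered with 304 instead of a full response body:

```java
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

import com.sun.net.httpserver.HttpServer;

public class EtagCacheDemo {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/data", exchange -> {
            String body = "{\"pv\": 12345}";                 // stand-in for a query result
            String etag = "\"" + Integer.toHexString(body.hashCode()) + "\"";

            // the cache-control "command" travels in the If-None-Match header
            String clientTag = exchange.getRequestHeaders().getFirst("If-None-Match");
            if (etag.equals(clientTag)) {
                exchange.sendResponseHeaders(304, -1);       // client copy is still valid
                return;
            }
            exchange.getResponseHeaders().set("ETag", etag);
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
    }
}
```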

Caching systems usually have to face and consider two problems: cache penetration, and the avalanche effect of cache expiration.

1. Cache penetration means querying data that does not exist. Because the cache is only written passively on a miss and, for fault tolerance, nothing is cached when the storage layer returns no data, every request for that nonexistent data goes all the way to the storage layer, defeating the purpose of the cache. The most common way to solve cache penetration effectively is to use a Bloom filter: hash all possibly existing data into a sufficiently large bitmap, so that a query for data that cannot exist is intercepted by the bitmap and the query pressure never reaches the underlying storage system.
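As an illustration only (the article does not say which implementation is used), a Bloom filter such as Guava's can play the role of that bitmap: every key that can possibly exist is registered up front, and a definite "no" from the filter stops the request before it reaches storage:

```java
import java.nio.charset.StandardCharsets;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class PenetrationGuard {
    // all keys that can possibly exist are hashed into a large bit array up front
    private final BloomFilter<String> knownKeys =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

    public void register(String key) {
        knownKeys.put(key);
    }

    public boolean mightExist(String key) {
        // a negative answer is definite, so such a request never reaches the storage layer
        return knownKeys.mightContain(key);
    }
}
```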

The data cube takes a simpler, cruder approach: if a query returns an empty result (whether the data genuinely does not exist or a system failure occurred), the empty result is still cached, but its expiration time is short, at most five minutes.
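A minimal sketch of that approach, with made-up TTL values apart from the five-minute cap mentioned above: empty results are cached just like real ones, only with a much shorter lifetime:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class EmptyTolerantCache {
    private static final long EMPTY_TTL_MS = 5 * 60 * 1000;   // empty results live at most five minutes
    private static final long NORMAL_TTL_MS = 60 * 60 * 1000; // illustrative TTL for real results

    private static final class Entry {
        final Optional<String> value;
        final long expiresAt;
        Entry(Optional<String> value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    public Optional<String> get(String key) {
        Entry e = cache.get(key);
        if (e == null || System.currentTimeMillis() > e.expiresAt) {
            Optional<String> fetched = queryStorage(key);      // may legitimately be empty
            long ttl = fetched.isPresent() ? NORMAL_TTL_MS : EMPTY_TTL_MS;
            e = new Entry(fetched, System.currentTimeMillis() + ttl);
            cache.put(key, e);                                 // empty results are cached too
        }
        return e.value;
    }

    private Optional<String> queryStorage(String key) {
        return Optional.empty();                               // stand-in for the real storage lookup
    }
}
```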

2. The avalanche effect of cache expiration can have a terrible impact on the underlying system, and unfortunately there is no perfect solution for it at the moment. Most system designers use a lock or a queue to guarantee that the cache is rewritten by a single thread (or process), so that a large number of concurrent requests do not fall onto the underlying storage system when an entry expires.

In the data cube, the expiration mechanism can, in theory, distribute the expiration times of each client's data evenly along the time axis, which avoids the avalanche effect caused by many cache entries expiring at the same moment.
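One common way to spread expiration times like this is to add a random jitter to each entry's TTL; the sketch below only illustrates the idea and is not necessarily how the data cube implements it:

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitteredExpiry {
    // Instead of letting every cached entry expire at the same moment, add a random
    // offset so that expirations are spread out along the time axis.
    public static long expireAt(long baseTtlMs) {
        long jitter = ThreadLocalRandom.current().nextLong(baseTtlMs / 5 + 1); // up to ~20% extra
        return System.currentTimeMillis() + baseTtlMs + jitter;
    }
}
```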
