Experts Explain Hadoop: HBase Loose Data Storage Design

Source: Internet
Author: User

Http://developer.51cto.com/art/201006/203833.htm


This section looks at the loose data storage design of Hadoop's HBase. I hope it helps you understand how that design works.

Preliminary Knowledge of Hadoop: HBase Loose Data Storage Design

I have been focusing on Hadoop recently, so I have also been looking at Hadoop-related projects. HBase is an open-source project built on Hadoop and an implementation of Google's BigTable.
What is BigTable? Google's paper describes it in full. Literally it is a big table, but it differs somewhat from the traditional database table we usually imagine. Loose data can be described as data that sits between a map entry (key and value) and a database row. When I use memcache, I sometimes need to store more than a single value per key: I need to store multiple attributes, much like a database table structure, but without the many inter-table relationships that a traditional database has to support. This kind of data is what is called loose data. At its simplest, BigTable is a very large table whose attributes can be added dynamically as needed, with no need for queries that relate one table to another.

One of the defining features of Internet applications is speed: however powerful a feature is, if it is slow it will be abandoned. High-traffic sites therefore put caches in front of and behind their services to improve performance and response time. For map-entry data there are many centralized and distributed caches to choose from, and for traditional relational data everything from MySQL to Oracle offers good support. Only loose data is poorly served, because neither of those two kinds of solution can reach its full processing capacity on it. That is why BigTable exists.

As an Apache open-source project, HBase is still at an early stage, and the Hadoop it depends on cannot be called mature either. There is therefore plenty of room for development, which in turn gives open-source enthusiasts like us more room to contribute. This article mainly covers HBase's architectural design and some of its characteristics. Whether or not you end up using HBase to solve problems at work, a good design always sparks ideas for developers and architects.

HBase Design Introduction

Data model
Every table in HBase is called a BigTable. A BigTable stores a series of row records defined by three basic elements: row key, timestamp, and column. The row key is the unique identifier of a row in the BigTable. The timestamp is the time associated with each data operation and can be thought of as a version, similar to a revision in SVN. A column is defined as <FAMILY>:<LABEL>, and together the two parts uniquely identify the column in which a piece of data is stored. Defining or modifying a family requires an HBase DDL operation, whereas a label can be used directly without being defined, which provides a way to customize columns dynamically. A family also serves to optimize reads and writes at the physical storage level: data in the same family is stored physically close together, so you can take advantage of this in your business design.
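The data model above can be illustrated with a small in-memory sketch. This is not the HBase API: the class name LooseTable and its put and get methods are hypothetical names chosen for illustration, and the rule that a read returns the newest version at or before a given timestamp is an assumption about how versions would be used.

```python
import time
from collections import defaultdict


class LooseTable:
    """Toy table: row key -> "family:label" column -> list of (timestamp, value)."""

    def __init__(self, families):
        # Families are fixed up front (standing in for the DDL step);
        # labels inside a family are free-form.
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp=None):
        family, _, _label = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        ts = time.time() if timestamp is None else timestamp
        versions = self.rows[row_key][column]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row_key, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        for ts, value in self.rows.get(row_key, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


table = LooseTable(families=["info", "stats"])
table.put("user1", "info:name", "Alice", timestamp=1)
table.put("user1", "info:name", "Alicia", timestamp=2)
table.put("user1", "stats:logins", 7, timestamp=2)  # new label, no DDL needed
```

Here "stats:logins" is written without any prior definition of the label, while writing to a family that was never declared raises an error, mirroring the split between families (DDL) and labels (dynamic).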

HBase uses Hadoop's HDFS as its storage base, so its structure resembles Hadoop's master-slave pattern. HBaseMasterServer manages all HRegionServers but stores none of HBase's data itself. A logical HBase table is divided into regions, which are stored on HRegionServers; one HRegionServer hosts many regions. Each HRegion is physically divided into three parts, HMemcache, HLog, and HStore, which act as the cache, the log, and the persistence layer respectively. The role of these three parts is easiest to see through an update process:


[Figure: submit-update and flush-cache process]

As the process shows, a submitted update is written to two places, HMemcache and HLog. HMemcache is an in-memory cache that improves efficiency by keeping the most recently manipulated data available for fast reads and modifications. HLog is the transaction log used to keep HMemcache and HStore in sync. When HRegionServer periodically issues the flushcache command, the data in HMemcache is persisted to HStore and HMemcache is emptied. This is a relatively simple strategy for caching and synchronizing data; a more sophisticated one could borrow from the Java garbage collection mechanism.
When reading from a region, HBase reads HMemcache first and falls back to HStore only if the data is not found there.
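The write and read paths described above can be sketched roughly as follows. This is a simplified model rather than HBase's implementation: the Region class and its attribute names are hypothetical, the periodic flush is replaced by an explicit flush_cache call, and clearing the log after a flush is an assumption about how the log and the store stay in sync.

```python
class Region:
    """Toy region with a memory cache, a log, and a list of flushed store files."""

    def __init__(self):
        self.memcache = {}     # recent writes, held in memory
        self.log = []          # record of writes not yet flushed
        self.store_files = []  # flushed snapshots, oldest first

    def put(self, key, value):
        # Every update goes to both the log and the memory cache.
        self.log.append((key, value))
        self.memcache[key] = value

    def flush_cache(self):
        # Persist the cache as a new store file, then empty the cache.
        if self.memcache:
            self.store_files.append(dict(self.memcache))
            self.memcache.clear()
            self.log.clear()   # assumption: flushed edits no longer need the log

    def get(self, key):
        # Read the cache first, then fall back to the store, newest file first.
        if key in self.memcache:
            return self.memcache[key]
        for store_file in reversed(self.store_files):
            if key in store_file:
                return store_file[key]
        return None


region = Region()
region.put("a", 1)
region.put("b", 2)
region.flush_cache()   # stands in for the periodic flush
region.put("a", 3)     # newer value stays in the cache until the next flush
```

After these calls, a read of "a" is served from the cache and a read of "b" from the flushed store file, which is the read order the text describes.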

A few details:

1. Every flushcache produces an HStoreFile, so the number of files in HStore keeps growing, which affects performance. When the number of files reaches a configured threshold, they are merged into one large file.

2. The cache size and the flush interval need to be set with both memory consumption and performance impact in mind.

3. Each time an HRegionServer restarts, the data in HLog that has not yet been flushed to HStore is loaded back into HMemcache, so an overly large HMemcache directly slows startup.

4. HStoreFile stores its data using a B-tree algorithm, which supports fast lookup of the column and family data described earlier.

5. An HRegion can be merged or split, depending on its size. While either operation is in progress, the HRegion is locked and unavailable.

6. HBaseMasterServer obtains HRegionServer and region information from the meta info table. The topmost region of the meta table is a virtual region called RootRegion, through which the actual regions below it can be found.

7. The client obtains the HRegionServer that hosts a region from HBaseMasterServer and then interacts with that HRegionServer directly. HRegionServers do not communicate with one another; each interacts only with HBaseMasterServer, which monitors and manages it. This concludes the introduction to the Hadoop HBase loose data storage design.
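Details 1 and 3 in the list above lend themselves to a short sketch. The functions compact and replay_log are hypothetical helpers, not HBase functions; store files are modelled as plain dictionaries ordered oldest first, and the rule that newer files win on conflicting keys is an assumption.

```python
def compact(store_files, threshold):
    """Merge all store files into one once their count reaches `threshold`."""
    if len(store_files) < threshold:
        return store_files
    merged = {}
    for store_file in store_files:  # oldest first, so newer values overwrite
        merged.update(store_file)
    return [dict(sorted(merged.items()))]


def replay_log(log):
    """Rebuild the memory cache on restart from edits that were never flushed."""
    memcache = {}
    for key, value in log:          # later entries overwrite earlier ones
        memcache[key] = value
    return memcache


files = [{"a": 1, "b": 2}, {"b": 20}, {"c": 3}]
compacted = compact(files, threshold=3)
untouched = compact(files[:2], threshold=3)
recovered = replay_log([("a", 1), ("a", 2), ("b", 5)])
```

Because replay_log walks every unflushed entry, its cost grows with the amount of data waiting in the log, which is why a large cache slows down a restart.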

"Edit Recommendation" two modes run Hadoop distributed Parallel Program experts guide How to do Hadoop distributed cluster configuration Hadoop cluster and Hadoop performance optimization hadoophbase implement simple stand-alone environment Hadoop concept and its usage expert explanation
