Changing the data storage model from a relational database to HBase

Source: Internet
Author: User

The title is a bit of clickbait: this article has little to do with HBase specifically, which serves here only as a representative of column-family databases. As things stand, the name HBase undoubtedly draws more attention than BigTable. A more accurate title would be: changing the data storage model from an RDBMS to a column-family database.

BigTable-style (column-family) databases are used more and more widely, and they are very capable. But many people still use them as if they were relational databases, applying relational thinking to table design, storage, and queries. This article uses HBase to illustrate how the data schema should change.

Traditional relational databases (MySQL, Oracle) mainly store data as follows:

Figure 1

The figure above shows a typical way to store data. I divide each record into 3 parts: the primary key, the record attributes, and the indexed fields. We build indexes on the indexed fields to get secondary indexes.

But as the business grows, query conditions become more and more complex, more indexed fields are needed, and many of their values are missing, as in the following figure:

Figure 2

The figure above shows 6 indexed fields. In practice there can be hundreds or more, and queries have to filter on several indexed fields at once. Query performance keeps dropping until it can no longer meet requirements. The limitations of relational databases start to show, so many people turn to NoSQL.

Column-family databases are very capable, so many people want to move their data from MySQL to HBase and store it the same way as in Figure 1 or Figure 2: the primary key becomes the rowkey, and every other field is stored as a separate column under one column family. However, the indexed fields then cannot be queried, because there is currently no good secondary-index scheme built on BigTable.
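To make the limitation concrete, here is a minimal sketch that models the table as a plain Python dictionary rather than using any HBase client API. The column names and sample rows are hypothetical. A lookup by rowkey is a direct hit, while a lookup by any other field has to examine every row.

```python
# In-memory stand-in for a table keyed by primary key: rowkey -> {column: value}.
# Column names and sample rows are hypothetical, for illustration only.
table = {
    "1": {"cf:province": "Zhejiang", "cf:type": "fixed", "cf:amount": "30"},
    "2": {"cf:province": "Zhejiang", "cf:type": "mobile", "cf:amount": "108"},
    "6": {"cf:province": "Jiangsu", "cf:type": "mobile", "cf:amount": "52"},
}


def get_by_rowkey(rowkey):
    """Point lookup by rowkey: the access path the store is built for."""
    return table.get(rowkey)


def find_by_field(column, value):
    """Lookup by a non-rowkey field: every row has to be examined."""
    return sorted(k for k, row in table.items() if row.get(column) == value)
```

With this layout, `find_by_field("cf:type", "mobile")` walks the whole table, which is what a full scan amounts to when no secondary index exists.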

At this point you can change your way of thinking and turn the data upside down, as in the following figure:

Figure 3

Each value of an indexed field is used as a rowkey, and the primary keys and attribute values of the matching records are stored, in a fixed order, in the value of that rowkey. In the simplest form, shown in the figure above, there is only one column family. Each record in the value can be a fixed-length byte[], so individual records within the set can be located quickly by offset.
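The fixed-length idea can be sketched as follows. The record layout (one 4-byte primary key followed by three 4-byte attributes) and the sample values are assumptions made for illustration; the article does not specify a layout. Because every record has the same size, the i-th record sits at byte offset i times the record size.

```python
import struct

# Hypothetical record layout: 4-byte primary key + three 4-byte attributes,
# big-endian unsigned ints, so every record is exactly 16 bytes.
RECORD = struct.Struct(">IIII")


def pack_records(records):
    """Concatenate fixed-length records into one value, ordered by primary key."""
    return b"".join(RECORD.pack(*r) for r in sorted(records))


def record_count(value):
    """Number of records stored in a packed value."""
    return len(value) // RECORD.size


def record_at(value, i):
    """Read the i-th record (0-based) directly by byte offset."""
    return RECORD.unpack_from(value, i * RECORD.size)


# rowkey = index field value, value = packed records that carry that value
index_table = {
    "Zhejiang": pack_records([(5, 50, 51, 52), (1, 10, 11, 12), (2, 20, 21, 22)]),
}
```

For example, `record_at(index_table["Zhejiang"], 1)` reads bytes 16 to 31 of the value and decodes only that one record.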

However, the layout above only suits queries on a single indexed field. To query several indexed fields at once, the Figure 3 approach has to fetch every value in full. For example, to query "Zhejiang" and "mobile" you need to fetch both values, parse them, and intersect their primary keys. If each record has hundreds of attributes, this has a significant impact on performance.
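A rough sketch of that cost, again with an in-memory dictionary and a hypothetical 16-byte record layout: both values are read and decoded in full even though only one record survives the intersection.

```python
import struct

# Hypothetical layout: 4-byte primary key + three 4-byte attributes.
RECORD = struct.Struct(">IIII")


def pack(records):
    """Concatenate fixed-length records, ordered by primary key."""
    return b"".join(RECORD.pack(*r) for r in sorted(records))


def unpack_all(value):
    """Decode every record in a value, keyed by primary key."""
    return {rec[0]: rec for rec in RECORD.iter_unpack(value)}


# Sample data is made up; only the primary key sets mirror the article.
index_table = {
    "Zhejiang": pack([(1, 10, 11, 12), (2, 20, 21, 22), (5, 50, 51, 52)]),
    "mobile": pack([(2, 20, 21, 22), (6, 60, 61, 62)]),
}


def query_all(*rowkeys):
    """Fetch the full value of every rowkey, then intersect on primary key."""
    decoded = [unpack_all(index_table[k]) for k in rowkeys]
    bytes_read = sum(len(index_table[k]) for k in rowkeys)
    common = set(decoded[0]).intersection(*decoded[1:])
    return [decoded[0][pk] for pk in sorted(common)], bytes_read
```

Here `query_all("Zhejiang", "mobile")` reads all five stored records to return a single one. With wide records, the bytes read grow with the number of attributes, not with the size of the answer.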

The next change solves the problem of querying multiple indexed fields. We store the primary keys and the attribute fields separately, in different column families. A multi-index query then only needs to read column family 1, intersect the primary keys, and fetch the required values from column family 2 of the smallest set. The storage layout is shown in Figure 4:

Figure 4

Using the data in the figure above as an example, query "Zhejiang" and "mobile":

1. Take out the column family 1 data of "Zhejiang" and "mobile", that is {1,2,5} and {2,6}.

2. Intersect them to get {2}, which meets both conditions. The index of {2} in "mobile" (the smallest set) is {1}.

3. Take out the column family 2 data of "mobile" and, using the index from step 2, take out the result {108,2,22234,12}.

Why use different column families, rather than two columns under one column family?

Data files in a column-family database are split by column family. Fetching data from a column family reads the data of all columns in that family. For the intersection we do not need the record details, so that part of the data goes into a separate column family.
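The three query steps for Figure 4 can be sketched with a small in-memory model. The key sets {1,2,5} and {2,6} and the record {108,2,22234,12} come from the example in the text. The other records are zero-filled placeholders, and the 1-based position is an assumption chosen to match the index {1} in step 2.

```python
# In-memory stand-in for the Figure 4 layout. cf1 holds only primary keys;
# cf2 holds the attribute records in the same order as cf1.
# Records filled with zeros are placeholders, not data from the article.
table = {
    "Zhejiang": {
        "cf1": [1, 2, 5],
        "cf2": [(0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 0, 0)],
    },
    "mobile": {
        "cf1": [2, 6],
        "cf2": [(108, 2, 22234, 12), (0, 0, 0, 0)],
    },
}


def query(*rowkeys):
    # Step 1: read only column family 1 (primary keys) for each rowkey.
    keys = {k: table[k]["cf1"] for k in rowkeys}
    # Step 2: intersect, then locate the hits inside the smallest set.
    common = set.intersection(*(set(v) for v in keys.values()))
    smallest = min(rowkeys, key=lambda k: len(keys[k]))
    # Positions are 1-based here (an assumption, to match the text).
    positions = [
        i for i, pk in enumerate(keys[smallest], start=1) if pk in common
    ]
    # Step 3: read column family 2 of the smallest set at those positions.
    cf2 = table[smallest]["cf2"]
    return smallest, positions, [cf2[i - 1] for i in positions]
```

Only the small primary key lists are read for the intersection. The wider attribute records in column family 2 are touched once, for the smallest set, and only at the matching positions.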

Next comes the expansion of column family 2: it stores more columns, which are used for various kinds of filtering and computation, as in the following figure:

Figure 5

Taken further, this feels more and more like search ...

This is a very typical scheme that trades space for time: it improves query performance through a large amount of data redundancy. That redundancy also brings the problem of data consistency. So the application scenario for this scheme is real-time computation over massive historical data. For more on the application scenario, see an article I wrote earlier: real-time computing application scenarios

Data that is updated in real time or modified frequently is still difficult to handle. You are welcome to discuss these problems with us or to join our team and help solve them.
