I admit the title is clickbait: this article actually has little to do with HBase in particular, which stands in here as a representative column-family database. As a title word, HBase is simply more eye-catching than BigTable. A more fitting title would be: the change in data storage model from RDBMS to column-family stores.
BigTable-style (column-family) databases are used more and more widely, and they are very powerful. But many people still use them as if they were relational databases, carrying over relational thinking to create tables, store data, and run queries. This article uses HBase to illustrate how the data schema should change.
Traditional relational databases (MySQL, Oracle) mainly store data as follows:
Figure 1
The figure above shows a typical way to store data. I divide each record into 3 parts: the primary key, the record attributes, and the index fields. We build indexes on the index fields to get secondary-index lookups.
But as the business grows, the query conditions become more and more complex, more index fields are needed, and many of the values are empty, as in the following figure:
Figure 2
The figure above has 6 index fields; in practice there can be hundreds or more, and queries need to filter on several index fields at once. Query performance keeps dropping until it can no longer meet the requirements. The limitations of relational databases start to show, so many people begin to look at NoSQL.
Column-family databases are very powerful, and many people want to move their data from MySQL to HBase, storing it the same way as in Figure 1 or Figure 2: the primary key becomes the rowkey, and each of the other fields is stored as a separate column under one column family. However, this leaves no way to query by the index fields. There is no good secondary-index scheme for BigTable-style stores at present, so the index fields cannot be queried.
At this point we can change our way of thinking and turn the data upside down, as in the following figure:
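To make the problem concrete, here is a minimal sketch of that naive layout. It uses a plain Python dict as a stand-in for an HBase table, not a real HBase client, and the rows, column names, and helper functions are made up for illustration.

```python
# Hypothetical stand-in for an HBase table: rowkey -> {column: value}.
# All rows and column names are made-up sample data.
table = {
    b"1": {b"cf:province": b"Zhejiang", b"cf:category": b"book"},
    b"2": {b"cf:province": b"Zhejiang", b"cf:category": b"mobile"},
    b"6": {b"cf:province": b"Jiangsu", b"cf:category": b"mobile"},
}


def get(rowkey):
    """Point lookup by rowkey: the access pattern this layout serves well."""
    return table.get(rowkey)


def find_by_column(column, value):
    """Lookup by a non-key field: every row has to be visited (a full scan)."""
    return sorted(rk for rk, cols in table.items() if cols.get(column) == value)


print(get(b"1"))
print(find_by_column(b"cf:category", b"mobile"))
```

The lookup by rowkey touches one row, while the lookup by a field value touches every row, which is what makes this layout unusable for index-field queries on large tables.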
Figure 3
Each value of an index field is used as a rowkey, and the primary keys and attribute values of the matching records are stored, in a fixed order, in the value under that rowkey. The simplest way, as in the figure above, uses only one column family. Each record in the value can be a fixed-length byte[], so any record among many can be located quickly by its byte offset.
However, the layout above only suits queries on a single index field. To query several index fields at once, the Figure 3 approach has to fetch the entire value for each of them. For example, querying "Zhejiang" and "mobile" means fetching two values, parsing the primary keys out of each, and intersecting them. If every record has hundreds of attributes, this has a significant impact on performance.
The next change solves the problem of querying multiple index fields. We store the primary key fields and the attribute fields separately, in different column families. A multi-index query then only needs to fetch the data in column family 1, intersect it, and read the wanted values from column family 2 of the smallest set. The storage is shown in Figure 4:
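The fixed-length idea can be sketched as follows. The record layout (a primary key plus three integer attributes, 4 bytes each) and all sample values are assumptions chosen for illustration; the figure's actual layout is not known here.

```python
import struct

# Assumed record layout (hypothetical): primary key + 3 integer attributes,
# each a 4-byte big-endian unsigned int, so every record is exactly 16 bytes.
RECORD = struct.Struct(">IIII")


def pack_records(records):
    """Concatenate fixed-length records into one cell value."""
    return b"".join(RECORD.pack(*r) for r in records)


def record_count(value):
    """Number of records stored in a cell value."""
    return len(value) // RECORD.size


def record_at(value, i):
    """Return the i-th record (0-based) by computing its byte offset."""
    return RECORD.unpack_from(value, i * RECORD.size)


# Inverted table: the rowkey is an index-field value, the cell holds the records.
# The record contents are made-up sample data.
inverted = {
    b"mobile": pack_records([(2, 10, 20, 30), (6, 11, 21, 31)]),
}

value = inverted[b"mobile"]
print(record_count(value), record_at(value, 1))
```

Because every record has the same size, the i-th record starts at byte i * 16, so no delimiter scanning or per-record length prefix is needed.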
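A rough sketch of that cost, assuming 16-byte records made of a primary key and three attributes (a hypothetical layout). The primary key sets {1,2,5} and {2,6} follow the example in this article; the attribute values are made up.

```python
import struct

# Assumed layout (hypothetical): primary key + 3 attributes, 4 bytes each.
RECORD = struct.Struct(">IIII")


def pack(records):
    return b"".join(RECORD.pack(*r) for r in records)


# Single-column-family inverted table; attribute values are made-up sample data.
inverted = {
    b"Zhejiang": pack([(1, 10, 20, 30), (2, 11, 21, 31), (5, 12, 22, 32)]),
    b"mobile": pack([(2, 11, 21, 31), (6, 13, 23, 33)]),
}


def primary_keys(value):
    """Parse every record in a value just to pull out its primary key."""
    return {pk for pk, *_ in RECORD.iter_unpack(value)}


def query_all(*keys):
    """Return the matching primary keys and the number of bytes fetched."""
    values = [inverted[k] for k in keys]  # whole values, attributes included
    bytes_read = sum(len(v) for v in values)
    matched = set.intersection(*(primary_keys(v) for v in values))
    return matched, bytes_read


print(query_all(b"Zhejiang", b"mobile"))
```

Here all five records are fetched and parsed to find a single match. With hundreds of attributes per record, most of the bytes read would be attribute data that the intersection never looks at.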
Figure 4
Using the example data in the figure above, query "Zhejiang" and "mobile":
1. Fetch the column family 1 data for "Zhejiang" and "mobile", that is {1,2,5} and {2,6}.
2. Intersect them to get {2}, the records that meet the conditions. The position of {2} within "mobile" (the smallest set) is {1}.
3. Fetch the column family 2 data for "mobile" and, using the position from step 2, take out the result {108,2,22234,12}.
Why use two different column families rather than two columns under one column family?
The data files of a column-family database are split by column family. When data is fetched, all the columns of a column family are read together. During the intersection step we do not need the record details, so that part of the data goes into a separate column family.
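The three query steps above can be sketched in Python. The table is a plain dict standing in for HBase, with "cf1" and "cf2" as hypothetical names for the two column families. The primary key lists and the record (108, 2, 22234, 12) come from the example above; the other records are made up. Positions are 0-based in the code, whereas the text counts from 1.

```python
# cf1 holds only primary keys; cf2 holds the record details in the same order.
# Records other than (108, 2, 22234, 12) are made-up sample data.
table = {
    "Zhejiang": {
        "cf1": [1, 2, 5],
        "cf2": [(7, 1, 100, 3), (108, 2, 22234, 12), (9, 5, 300, 8)],
    },
    "mobile": {
        "cf1": [2, 6],
        "cf2": [(108, 2, 22234, 12), (11, 6, 400, 2)],
    },
}


def query_all(table, *keys):
    # Step 1: read only column family 1 (primary keys) for each index value.
    key_lists = {k: table[k]["cf1"] for k in keys}
    # Step 2: intersect, then locate the matches inside the smallest list.
    matched = set.intersection(*(set(v) for v in key_lists.values()))
    smallest = min(keys, key=lambda k: len(key_lists[k]))
    positions = [i for i, pk in enumerate(key_lists[smallest]) if pk in matched]
    # Step 3: read column family 2 of the smallest row only, at those positions.
    details = table[smallest]["cf2"]
    return [details[i] for i in positions]


print(query_all(table, "Zhejiang", "mobile"))
```

Only the short primary key lists are read for every index value; the bulky record details are read once, from the row with the fewest entries.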
The next step is to extend column family 2: it can store more columns, which are used for all kinds of filtering and computation, as in the following figure:
Figure 5
By this point, the more I play with it, the more it looks like a search engine ...
This is a very typical scheme for trading space for time: it improves query performance through a large amount of data redundancy. It also brings data consistency problems. So the scenario this scheme fits is real-time computation over massive historical data. For more on the application scenarios, see an article I wrote earlier: real-time computing application scenarios
Data that is updated in real time or modified frequently remains hard to handle. You are welcome to discuss this, or to join our team and help solve these problems.