[DB] Basic Concepts of HBase

Source: Internet
Author: User
Tags: cassandra

What is HBase?
Before talking about HBase, let's look at two concepts: row-oriented storage and column-oriented storage. Row-oriented storage should be familiar to everyone, because the RDBMSs we use every day are of this type. Row-oriented databases are mainly suited to workloads with strict transactional requirements; in other words, they are a good fit for OLTP. However, according to the CAP theorem, a traditional RDBMS that keeps data strongly consistent through strict ACID transactions pays for it with much lower availability and scalability. Many NoSQL products give up part of that consistency, often settling for eventual consistency, in exchange for high availability and scalability. HBase makes a slightly different trade: it gives up multi-row ACID transactions, but reads and writes of a single row remain strongly consistent. So what is column-oriented storage? HBase, Cassandra, and Bigtable are all column-oriented (more precisely, column-family-oriented) distributed storage systems. If you still don't understand what HBase is, it doesn't matter; here is a short summary:

HBase is a column-oriented distributed storage system that supports high-performance concurrent reads and writes. HBase also splits data transparently, which gives the storage layer horizontal scalability.


HBase Data Model
HBase and Cassandra have similar data models. Both take their ideas from Google's Bigtable, so the data models of all three are very similar. The only difference is that Cassandra has the concept of a Super Column Family, which I have not found in HBase so far. Let's take a look at HBase's data model.

There are two main concepts in HBase: the row key and the column family. Let's look at the column family first. A column family must be defined in advance, before the system starts running. Within a column family, any number of columns can exist, each identified by a "qualifier". The example below should make this clear.

Suppose the system has a User table. In a traditional RDBMS the columns of the User table are fixed: for example, the schema defines attributes such as name, age, and sex, and user attributes cannot be added dynamically. With a column store such as HBase, we can define the User table with a single column family, info. A user's data is then stored as info:name = zhangsan, info:age = 30, info:sex = male, and so on. If you want to add another property later, you simply write info:newProperty.
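The User example can be sketched in a few lines of Python. This is a toy in-memory model of the data model only, not the HBase client API; the ToyTable class and its put and get methods are hypothetical names introduced for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of one table: row key -> {"family:qualifier": value}.

    Hypothetical helper for illustration; it is not the HBase client API.
    """

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)   # column families are fixed up front
        self.rows = defaultdict(dict)   # qualifiers inside a family are free-form

    def put(self, row_key, column, value):
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if not qualifier:
            raise ValueError("column must look like 'family:qualifier'")
        self.rows[row_key][column] = value

    def get(self, row_key):
        # Return a copy so callers cannot modify the stored row by accident.
        return dict(self.rows.get(row_key, {}))


user = ToyTable("User", families=["info"])
user.put("user1", "info:name", "zhangsan")
user.put("user1", "info:age", "30")
user.put("user1", "info:sex", "male")
# A new attribute needs no schema change, only a new qualifier.
user.put("user1", "info:newProperty", "some value")
```

Writing to a family that was never declared (for example "other:x") raises an error in this sketch, which mirrors the rule that column families are pre-defined while the columns inside them are not.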

If the previous example is not clear enough, here is another one. Anyone familiar with SNS (social networking sites) knows the friend feed. A feed is generally designed around the pattern "someone did something to XX at some time", but we usually also reserve extra fields, because some feeds need a URL and others need an image attribute. In other words, the attributes of a feed are not known in advance, which makes a traditional relational database awkward to use. A relational database also wastes space on null cells. A column store does not have this problem: in HBase, a cell that has no value occupies no space. The following two figures illustrate the difference:





The first figure is a Feed table as designed for a traditional RDBMS. The number of columns is fixed and cannot grow, and null columns waste space. The second figure is the data model used by HBase, Cassandra, and Bigtable: columns in the Feed table can be added dynamically and empty columns are not stored, which saves a great deal of space. The key point is that new kinds of feeds keep appearing as the system runs. We cannot predict how many kinds there will be, so we cannot fix the number of columns in the Feed table in advance. The column-based data model of HBase, Cassandra, and Bigtable is therefore a very good fit for this scenario. HBase has another important advantage here: the Feed table is split automatically. When the data in the Feed table exceeds a threshold, HBase splits it for us, so queries stay scalable as the table grows. Combined with HBase's weak transactional guarantees, this makes writes to HBase very fast.
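The automatic splitting described above can be illustrated with a toy model. It is a sketch under simplifying assumptions: a "region" (HBase's term for one contiguous slice of a table) is just a sorted Python list, and the split threshold is a row count. HBase itself decides based on the size of the stored data rather than a row count. MAX_ROWS_PER_REGION and insert_row are hypothetical names.

```python
# Hypothetical threshold, chosen small so that splits are easy to see.
MAX_ROWS_PER_REGION = 4


def insert_row(regions, row_key):
    """Insert row_key into the region that covers it; split on overflow.

    `regions` is a list of sorted lists of row keys, kept in key order.
    """
    # Pick the last region whose first key is <= row_key, or the first
    # region if row_key sorts before every existing key.
    index = 0
    for i, region in enumerate(regions):
        if region and region[0] <= row_key:
            index = i
    region = regions[index]
    if row_key not in region:
        region.append(row_key)
        region.sort()
    # Cut an oversized region in half at its middle key.
    if len(region) > MAX_ROWS_PER_REGION:
        middle = len(region) // 2
        regions[index:index + 1] = [region[:middle], region[middle:]]
    return regions


regions = [[]]  # a new table starts as a single empty region
for n in range(10):
    insert_row(regions, f"feed{n:02d}")
```

After the ten inserts the single starting region has been split several times, yet the keys are still globally sorted across regions, so a lookup only needs to touch the one region whose key range covers it.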




The column family was covered above, so what is the row key mentioned earlier? You can think of the row key as the primary key of a row in an RDBMS. However, HBase does not support conditional queries or ORDER BY queries, so the row key has to be designed around the queries your system needs. Take the Feed table again as an example. We usually query a person's most recent feeds, so the row key of a feed can consist of three parts: <userId><timestamp><feedId>. To query one person's latest feeds, we specify the start row key as <userId><0><0> and the end row key as <userId><Long.MAX_VALUE><Long.MAX_VALUE>. Because records in HBase are sorted by row key, this query is very fast.
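The row key design above can be sketched as follows. The sketch stores row keys in a sorted Python list and uses binary search to stand in for a range scan; it does not call HBase. The helper names row_key and scan, the ":" separator, and the zero-padding scheme are assumptions made for illustration (the padding keeps string order equal to numeric order, and user IDs are assumed not to contain ":"). The toy scan treats the end key as inclusive.

```python
import bisect

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE
WIDTH = 19            # number of decimal digits in LONG_MAX


def row_key(user_id, timestamp, feed_id):
    """Build <userId><timestamp><feedId> with zero-padded numeric parts."""
    return f"{user_id}:{timestamp:0{WIDTH}d}:{feed_id:0{WIDTH}d}"


def scan(sorted_keys, start_key, stop_key):
    """Return every key k with start_key <= k <= stop_key."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_right(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


# Row keys are kept sorted, as records are in HBase.
keys = sorted([
    row_key("alice", 1000, 1),
    row_key("alice", 2000, 2),
    row_key("alice", 3000, 3),
    row_key("bob", 1500, 4),
])

# Start at <userId><0><0>, stop at <userId><Long.MAX_VALUE><Long.MAX_VALUE>.
alice_feeds = scan(keys, row_key("alice", 0, 0),
                   row_key("alice", LONG_MAX, LONG_MAX))
latest_first = list(reversed(alice_feeds))
```

Because all of one user's keys share a prefix and are ordered by timestamp, the scan touches only a contiguous slice of the sorted keys and never looks at other users' rows.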


Advantages and Disadvantages of HBase
Advantages of HBase:

1. Columns can be added dynamically, and empty columns are not stored, which saves storage space.

2. HBase automatically splits data, which gives data storage horizontal scalability.

3. HBase supports highly concurrent reads and writes.

Disadvantages of HBase:

1. Conditional queries are not supported; data can only be queried by row key.

2. Currently, failover of the Master server is not supported. When the Master node goes down, the entire storage system fails.



Some further reading on database scalability:
http://www.jurriaanpersyn.com/archives/2009/02/12/database-sharding-at-netlog-with-mysql-and-php/

http://adam.blog.heroku.com/past/2009/7/6/sql_databases_dont_scale/
