Interpreting the selection and design of BigTable NoSQL Databases

Source: Internet
Author: User
Tags cassandra

Data Scale

BigTable-style database systems (such as HBase and Cassandra) are designed for massive data storage. "Massive" here means a single table that holds terabytes or petabytes of data and can comprise billions of rows and millions of columns. On the subject of scale, the four most popular systems in the NoSQL market today are MongoDB, Redis, Cassandra, and HBase. Cassandra and HBase are both BigTable-style systems and both are well known, with strong backing from Facebook, Yahoo, and Twitter. So why is MongoDB the most popular? Is HBase not good enough? I think the reason is simple: most companies do not have data on the scale of Facebook or Yahoo, and MongoDB is enough to meet their needs. Its auto-sharding and schema-less features solve the problems that companies at that scale run into with an RDBMS.

Data Model

The data model of BigTable-style systems is also relatively simple and generally does not involve multi-table JOIN operations. At this scale, traditional RDBMS applications become increasingly constrained, and the cost of maintenance and upgrades keeps rising. In addition, because of its shared-storage design, a traditional RDBMS does not scale out well; turning a shared-storage RDBMS into a distributed database requires users to develop their own proxy layer. These problems force us to consider NoSQL storage solutions such as BigTable when facing massive data. DBAs who are used to designing schemas for an RDBMS need a different way of thinking when they migrate to a BigTable-style system. This article describes how to design table schemas in BigTable-style systems, and what to watch for when migrating traditional RDBMS applications to them as the data scale grows.

NoSQL databases put scalability first, which inevitably leads to some data redundancy. Relationships that an RDBMS expresses across separate tables can instead be expressed through data redundancy. In addition, BigTable-style systems do not provide complex SQL queries or the associated optimizations; they provide only massive data storage. That is why, as in Facebook's unified messaging system, a single row is often used to store all the information about one user. In a BigTable-style system, a single row can hold a very large amount of data. Some time ago there were rumors on Weibo that Apple's Siri uses HBase on the back end. If that is true, I would expect each user's personal assistant information to be stored in a single row as well. What is more interesting is how well Apple keeps its secrets through misdirection: if it really uses HBase, its job postings nevertheless list Cassandra and MongoDB experience as a plus.

Schema design in a BigTable-style system also has to take column families into account. Because these systems access data by column family, the columns within one family usually hold data of the same type. Data of the same type compresses well, which reduces I/O between disk and memory; this is an advantage common to all column-oriented storage systems. When deciding what to store in a row, we can place each attribute in a column family according to its data type. Because BigTable is a sparse table, an attribute present in one row may be absent from other rows, but wherever it does appear it has the same data type (such as int), and attributes of the same column family are stored together physically.
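
As a rough illustration of the grouping described above, the following Python sketch models a sparse row as nested dictionaries keyed by column family. The family names (`text`, `counters`), the sample rows, and the `put` helper are hypothetical and exist only to show attributes of the same data type landing in the same family. This is an in-memory model, not an HBase or BigTable client.

```python
from collections import defaultdict

# Hypothetical mapping from Python type to column family name.
FAMILY_BY_TYPE = {str: "text", int: "counters"}


def put(table, row_key, attributes):
    """Store each attribute under the column family chosen by its type."""
    row = table[row_key]
    for qualifier, value in attributes.items():
        family = FAMILY_BY_TYPE[type(value)]
        row[family][qualifier] = value


# table -> row key -> column family -> qualifier -> value
table = defaultdict(lambda: defaultdict(dict))

put(table, "user_001", {"name": "Alice", "city": "Paris", "logins": 42})
# Sparse: this row has no "city" and adds "posts"; no placeholder is stored.
put(table, "user_002", {"name": "Bob", "posts": 7})

print(dict(table["user_002"]))
```

Nothing is written for an attribute that a row lacks, which is the sense in which the table is sparse.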

Denormalization

Denormalization comes up often in data modeling for NoSQL systems. A simple example is to take a relationship between two entities that an RDBMS stores in separate tables and store it in a single table in NoSQL. For example, normalized RDBMS modeling gives two tables: Student (StudentID, StudentName, Tutor, CourseID) and Course (CourseID, CourseName). In a BigTable-style system there is only one table: Student (StudentID, StudentName, Tutor, CourseID, CourseName). In a traditional RDBMS you therefore have to read two tables and JOIN them to obtain or aggregate a given user's information, whereas in a NoSQL system a single read is enough.
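
The Student/Course example can be sketched in a few lines of Python. The sample rows and the function names `read_with_join` and `read_denormalized` are made up for illustration, and plain dictionaries stand in for the database in both cases.

```python
# Normalized (RDBMS-style): two tables linked by CourseID.
students = [
    {"StudentID": "s1", "StudentName": "Li", "Tutor": "Dr. Wu", "CourseID": "c1"},
    {"StudentID": "s2", "StudentName": "Chen", "Tutor": "Dr. Xu", "CourseID": "c2"},
]
courses = {"c1": {"CourseName": "Databases"}, "c2": {"CourseName": "Networks"}}


def read_with_join(student_id):
    """Two lookups: find the student, then fetch the course by CourseID."""
    student = next(s for s in students if s["StudentID"] == student_id)
    return {**student, **courses[student["CourseID"]]}


# Denormalized (BigTable-style): CourseName is copied into each student row.
student_rows = {s["StudentID"]: read_with_join(s["StudentID"]) for s in students}


def read_denormalized(student_id):
    """One lookup by row key returns everything, at the cost of redundancy."""
    return student_rows[student_id]


print(read_denormalized("s1"))
```

The trade-off is that a course name now lives in every row that references it, so renaming a course means rewriting all of those rows.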

Row Key

Another point worth attention in schema design for BigTable-style systems is the natural ordering of rows. The system treats every row key as a string and keeps rows sorted in lexicographic order of those strings, and schema design can take advantage of this. For example, if an application often needs to look up data by one attribute or by a combination of attributes, that attribute or combination can be used as the row key. This is very similar to an index or composite index in an RDBMS, except that it comes built in. Note that when a combination of attributes is used as the row key in HBase, the components must be joined with a delimiter character; "/" cannot be used as the delimiter between attributes in the row key, but "_" can.
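
To make the composite row key idea concrete, here is a minimal Python sketch that joins attributes with "_" and uses the sorted order for a prefix lookup. The helper names `build_row_key` and `prefix_scan`, the key layout (user, day, sequence number), and the sample data are assumptions for illustration. A sorted Python list stands in for the table, and the sketch assumes ASCII components that do not themselves contain the delimiter.

```python
import bisect

DELIMITER = "_"  # the separator suggested in the text


def build_row_key(user_id, day, seq):
    """Join components into one key; zero-pad numbers so string order matches numeric order."""
    return DELIMITER.join([user_id, day, f"{seq:06d}"])


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, using the sorted order like a range scan."""
    start = bisect.bisect_left(sorted_keys, prefix)
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]


keys = sorted(
    build_row_key(u, d, n)
    for u, d, n in [
        ("alice", "20130501", 2),
        ("bob", "20130501", 1),
        ("alice", "20130502", 10),
        ("alice", "20130501", 1),
    ]
)

print(prefix_scan(keys, "alice_20130501_"))
```

Zero-padding matters because keys are compared as strings: without it, a sequence number of 10 would sort before 2.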

Data Consistency and Transactions

In a traditional RDBMS, each column can be constrained with NOT NULL, UNIQUE, or CHECK, and the RDBMS enforces these data consistency requirements on the user's behalf. BigTable-style systems do not enforce them at the database layer; user-level programs must do so. Open-source HBase provides row-level consistency and atomicity, and because all of one user's information is generally stored in one row, the cost of maintaining consistency is relatively small. If the schema is designed poorly and leads to complicated data redundancy, however, the cost of maintaining consistency at the application layer is high.
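
One way to picture constraints moving to the application layer is the following Python sketch. The `validated_put` function, the `ConstraintError` class, the column names, and the in-memory `table` and `unique_emails` dictionaries are all hypothetical; the code only shows where NOT NULL, CHECK, and UNIQUE rules would have to be checked once the database no longer does it.

```python
class ConstraintError(ValueError):
    """Raised when a row violates an application-level constraint."""


def validated_put(table, unique_emails, row_key, row):
    """Check NOT NULL, CHECK and UNIQUE rules, then write the whole row at once."""
    for column in ("name", "email"):                  # NOT NULL
        if row.get(column) is None:
            raise ConstraintError(f"{column} must not be null")
    if not 0 <= row.get("age", 0) <= 150:             # CHECK
        raise ConstraintError("age out of range")
    owner = unique_emails.get(row["email"])
    if owner is not None and owner != row_key:        # UNIQUE
        raise ConstraintError("email already in use")
    unique_emails[row["email"]] = row_key
    table[row_key] = dict(row)


table, unique_emails = {}, {}
validated_put(table, unique_emails, "user_001",
              {"name": "Alice", "email": "a@example.com", "age": 30})

try:
    validated_put(table, unique_emails, "user_002",
                  {"name": "Bob", "email": "a@example.com"})
except ConstraintError as err:
    print("rejected:", err)
```

In this single-process sketch the uniqueness lookup and the write cannot interleave with other writers. In a real cluster they touch different rows, so row-level atomicity alone would not make the UNIQUE check safe.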

Transaction support in BigTable-style systems is complicated. Put simply, HBase supports only row-level locks; implementing RDBMS-like transactions requires combining HBase with ZooKeeper. I will not go into detail here. I plan to write a separate article on Google's papers about Percolator and External Store, which mainly discuss how to implement transactions on top of a NoSQL system and how to bridge NoSQL and SQL.

Index

Indexing is a problem every database system has to address. According to the BigTable paper, it maintains a dedicated single-column index for each column and also lets you create multi-column indexes. BigTable maintains these indexes automatically and selects among them automatically at query time, which is close to how an RDBMS behaves. Open-source HBase uses the automatically ordered row key as its index and, beyond that, offers little in the way of automatically maintained secondary indexes; the application layer must decide which index a query uses. There are many ways to implement a secondary index in HBase, and recent approaches appear to build on coprocessors. For details, refer to http://kenwublog.com/hbase-secondary-index-and-join. HBase can also be used alongside Lucene indexes stored on the file system. For details about how HBase works with Lucene, refer to http://www.infoq.com/articles/javasehbase.
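
A common pattern for an application-maintained secondary index is a second table that maps an attribute value back to row keys. The Python sketch below illustrates that pattern with in-memory dictionaries. The names `data_table`, `city_index`, `put_user`, and `find_by_city` are hypothetical, and this is not how any particular HBase index implementation works.

```python
from collections import defaultdict

data_table = {}                     # row key -> row (primary access path)
city_index = defaultdict(set)       # hypothetical index table: city -> row keys


def put_user(row_key, row):
    """Write the row and keep the city index in step with it."""
    old = data_table.get(row_key)
    if old is not None and old["city"] != row["city"]:
        city_index[old["city"]].discard(row_key)   # drop the stale index entry
    data_table[row_key] = dict(row)
    city_index[row["city"]].add(row_key)


def find_by_city(city):
    """The application picks the index: look up row keys, then fetch the rows."""
    return [data_table[key] for key in sorted(city_index.get(city, ()))]


put_user("user_001", {"name": "Alice", "city": "Paris"})
put_user("user_002", {"name": "Bob", "city": "Paris"})
put_user("user_001", {"name": "Alice", "city": "Rome"})   # update moves the index entry

print(find_by_city("Paris"))
```

Every write now touches two structures, which is the maintenance cost the application takes on when the database does not manage the index itself.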
