HBase Introduction (1)---data model

Source: Internet
Author: User

http://blog.csdn.net/heyutao007/article/details/5766896

What is BigTable? Google's paper describes it in full. Literally it is a "big table", but it differs somewhat from the traditional database table we are used to. It holds what can be called loose data: data that sits somewhere between a map entry (key and value) and a database row. When I use memcache, I sometimes need to store more than a single value per key. I want multi-attribute storage that resembles a database table, but without the many inter-table relationships a traditional database requires. This kind of data is what is called loose data. BigTable is, most obviously, a very large table whose attributes can be added dynamically on demand, but which has no need for queries that relate one table to another.

One of the defining requirements of Internet applications is speed: however powerful a feature is, if it is slow it will be abandoned. High-traffic sites therefore place caches both in front of and behind the application to improve performance and response time. For map-entry data there are many centralized and distributed caches to choose from, and for traditional relational data, databases from MySQL to Oracle provide very good support. Only loose data is served poorly, because neither of these two solutions can bring out its full processing power. That is why BigTable exists.

HBase is a scalable, distributed, column-oriented database with a dynamic schema for structured data. It manages large-scale data (gigabytes or more) effectively and reliably across thousands of commodity servers. HBase is modeled on Google's Bigtable database and is a subproject of the Hadoop project of the Apache Software Foundation.

The data best suited to HBase is very sparse data (unstructured or semi-structured). HBase handles it well because it uses a column-oriented storage mechanism, whereas the RDBMSs we are familiar with use row-oriented storage (frustratingly, none of the many introductions to relational databases I have read ever mention the concept of row-oriented storage). In column-oriented storage, null values take up no space. For example, if a table has 10 columns but a row stores data in only one of them, the other 9 null columns occupy no storage (how much space would an ordinary database such as MySQL use for them?).

Another reason HBase suits unstructured, sparse data is its column family mechanism. Consider the difference between dynamic languages such as Ruby and Python and compiled languages such as C++ and Java. For me, the most obvious one is that you do not have to declare a variable's type in advance. HBase brings this feature to future DBAs: you only need to say which column family your data is stored in, and you do not specify a concrete type such as char, varchar, int, tinyint, or text.
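
The point about null values taking no space can be illustrated with a small sketch. It is a hypothetical comparison of two in-memory layouts, not HBase's actual storage format; the names COLUMNS, row_oriented, and sparse are made up for the illustration.

```python
# Hypothetical illustration only; this is not how HBase lays out data on disk.
COLUMNS = [f"c{i}" for i in range(10)]

# Row-oriented layout: every row reserves a slot for every column,
# so 9 of the 10 slots below hold a null placeholder.
row_oriented = {"row1": ["hello"] + [None] * (len(COLUMNS) - 1)}

# Sparse layout in the spirit of HBase: only cells that exist are stored,
# keyed by (row key, column).
sparse = {("row1", "c0"): "hello"}


def stored_cells_row_oriented(table):
    """Count the slots reserved, including null placeholders."""
    return sum(len(slots) for slots in table.values())


def stored_cells_sparse(table):
    """Count the cells actually present."""
    return len(table)


print(stored_cells_row_oriented(row_oriented))
print(stored_cells_sparse(sparse))
```

The row-oriented layout reserves ten slots for the row, while the sparse layout holds a single entry.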

HBase has other distinctive characteristics as well. For example, join queries are not supported, but you can work around this by storing the data as parent-child tuples.

Note: At the time of this writing, the latest version of HBase is V0.19.3. The information provided in this article applies to this version.

Data model

HBase data is modeled as a multidimensional map, in which values (table cells) are indexed by 4 keys:

value = Map(TableName, RowKey, ColumnKey, Timestamp)

where

    • TableName is a string.
    • RowKey and ColumnKey are binary values (Java type byte[]).
    • Timestamp is a 64-bit integer (Java type long).
    • value is an uninterpreted byte array (Java type byte[]).
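
The four-key map above can be sketched with an ordinary Python dictionary. This is a minimal model of the idea, not the HBase client API; the helper names put and get are assumptions made for the illustration, and the sample cell is taken from Table 1 later in this article.

```python
from typing import Dict, Optional, Tuple

# (table name, row key, column key, timestamp) -> uninterpreted bytes
CellKey = Tuple[str, bytes, bytes, int]

cells: Dict[CellKey, bytes] = {}


def put(table: str, row: bytes, column: bytes, ts: int, value: bytes) -> None:
    """Store one cell under its four-part key."""
    cells[(table, row, column, ts)] = value


def get(table: str, row: bytes, column: bytes, ts: int) -> Optional[bytes]:
    """Return the cell value, or None when no such cell was ever stored."""
    return cells.get((table, row, column, ts))


put("Persons", b"000001", b"name:first", 2, b"Jeffrey")

print(get("Persons", b"000001", b"name:first", 2))
print(get("Persons", b"000002", b"contact:http", 4))
```

The second lookup uses a key that was never written, so it returns None instead of raising an error.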

Binary data is encoded as BASE64 for transmission over the network.
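
As a rough illustration of the Base64 step, the sketch below round-trips an arbitrary binary key with Python's standard base64 module. The sample bytes are made up, and the sketch says nothing about which HBase interface performs the encoding.

```python
import base64

# A made-up binary row key containing bytes that are not printable text.
raw = b"\x00\x01row-key\xff"

# Encode to an ASCII-safe string for transport, then decode on the other side.
encoded = base64.b64encode(raw).decode("ascii")
decoded = base64.b64decode(encoded)

print(encoded)
print(decoded == raw)
```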

A row key is the primary key of a table and is usually a string. Rows are sorted by row key in lexicographic (dictionary) order.
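
Lexicographic ordering has a practical consequence for numeric keys, which the following sketch shows with plain Python byte strings (the sample keys are hypothetical). Zero-padded keys such as 000001 sort in numeric order, while unpadded ones do not.

```python
# Byte strings compare lexicographically, like row keys do.
unpadded = [b"2", b"10", b"1"]
padded = [b"000002", b"000010", b"000001"]

sorted_unpadded = sorted(unpadded)  # b"10" sorts before b"2"
sorted_padded = sorted(padded)      # padding restores numeric order

print(sorted_unpadded)
print(sorted_padded)
```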

Information in a table is organized into column families, which you can think of as categories. Each column family can have any number of members, each identified by a label (or qualifier). A column key consists of the family name, a colon (:), and the label. For example, for the family info and the member date, the column key is info:date.
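
The family:label structure of a column key can be sketched as a small parsing helper. The function name split_column_key is hypothetical and is not part of any HBase API.

```python
from typing import Tuple


def split_column_key(column_key: str) -> Tuple[str, str]:
    """Split 'family:label' at the first colon into (family, label)."""
    family, sep, label = column_key.partition(":")
    if not sep or not family:
        raise ValueError(f"not a family:label column key: {column_key!r}")
    return family, label


print(split_column_key("info:date"))
```

Splitting at the first colon only means a label may itself contain colons.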

An HBase table schema defines the column families, but an application can create new members at run time when it inserts a row. For a given column family, different rows in the table can have different numbers of members. In other words, HBase supports a dynamic schema.
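
A toy model may help show what "dynamic schema" means here: the families are fixed when the table is created, while labels can be introduced freely on insert. The class SketchTable and the label name:middle are hypothetical and exist only for this illustration.

```python
class SketchTable:
    """Toy table: families are fixed at creation, labels are free-form."""

    def __init__(self, families):
        self.families = frozenset(families)
        self.rows = {}  # row key -> {column key -> value}

    def put(self, row_key, column_key, value):
        family, _, _label = column_key.partition(":")
        if family not in self.families:
            # A new family would need a schema change first.
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column_key] = value


persons = SketchTable(families=["name", "contact"])
persons.put("000001", "name:first", "Jeffrey")
persons.put("000001", "contact:http", "research.google.com/people/jeff/")
persons.put("000002", "name:first", "Gabriel")
# A label no other row uses (hypothetical); accepted without any schema change.
persons.put("000002", "name:middle", "n/a")

print(sorted(persons.rows["000002"]))
```

Writing to a family that was not declared, for example address:city, raises KeyError in this sketch.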

Table 1 shows a simple example of an HBase table named Persons , which has two column families: name and contact .

Table 1. The Persons table

Row Key   Time Stamp   Column Family name      Column Family contact
000001    t3                                   contact:http = research.google.com/people/jeff/
          t2           name:first = Jeffrey
          t1           name:last = Dean
000002    t5           name:first = Gabriel
          t4           name:last = Mateescu

An empty cell is one that has no value associated with its key. In Table 1, the cell with the key (000002, contact:http, t4) is empty. Empty cells are not stored in HBase, and reading one is like looking up a nonexistent key in a map. This is how HBase tables accommodate sparse rows.

For any row, you can access only one member of a column family at a time (unlike a relational database, where a single query can access cells from multiple columns in a row). You can think of a column family member in a row as a sub-row.

A table is split into multiple regions, the equivalent of Bigtable tablets. A region contains the rows in a given range. Splitting a table into regions is the key mechanism for handling large tables efficiently.
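
Range-based splitting can be sketched with a sorted list of region start keys and a binary search. The boundaries below are made up, and the lookup is only a model of the idea, not how HBase actually locates regions.

```python
import bisect

# Hypothetical boundaries: region i holds rows with
# start_keys[i] <= row key < start_keys[i + 1]; the last region is open-ended.
start_keys = [b"", b"g", b"n", b"t"]


def region_for(row_key: bytes) -> int:
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(start_keys, row_key) - 1


print(region_for(b"com.cnn.www"))
print(region_for(b"g"))
print(region_for(b"zebra"))
```

Because rows are kept sorted by key, each region covers a contiguous key range and a single binary search is enough to find the right one.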

Each table in HBase follows the Bigtable model. A table stores a series of row records defined by three basic elements: row key, timestamp, and column. The row key uniquely identifies a row. The timestamp is attached to each data operation and can be viewed as a version number, much like an SVN revision. A column is written as <family>:<label>, and these two parts together uniquely identify a storage column. Defining or modifying a family requires a DDL-like operation on HBase, whereas a label needs no prior definition and can be used directly, which is how dynamic, custom columns are supported. A family also serves to optimize physical storage for reads and writes: data in the same family is stored physically close together, which is worth exploiting during schema design.
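
The idea of timestamps as versions can be sketched as follows: one cell holds several values keyed by timestamp, and a read returns the newest value at or before a requested time. The function read and the placeholder values are assumptions made for this illustration; the timestamps 3, 5, and 6 mirror t3, t5, and t6 in the example that follows.

```python
from typing import Dict, Optional

# One cell (row "com.cnn.www", column "contents:") with three versions.
# The values are placeholders, not real page contents.
versions: Dict[int, bytes] = {3: b"page-v1", 5: b"page-v2", 6: b"page-v3"}


def read(cell_versions: Dict[int, bytes], as_of: Optional[int] = None) -> Optional[bytes]:
    """Return the newest value with timestamp <= as_of (newest overall if as_of is None)."""
    candidates = [ts for ts in cell_versions if as_of is None or ts <= as_of]
    if not candidates:
        return None
    return cell_versions[max(candidates)]


print(read(versions))
print(read(versions, as_of=5))
print(read(versions, as_of=2))
```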

Take a look at the logical data model:

Row Key         Time Stamp   Column "contents:"   Column "anchor:"                  Column "mime:"
"com.cnn.www"   t9                                "anchor:cnnsi.com"   "CNN"
                t8                                "anchor:my.look.ca"  "CNN.com"
                t6           "<html>..."                                            "text/html"
                t5           "<html>..."
                t3           "<html>..."

The table above contains a single row, uniquely identified by the row key com.cnn.www. Every logical modification is associated with a timestamp, and the row has four column definitions in total: <contents:>, <anchor:cnnsi.com>, <anchor:my.look.ca>, and <mime:>. In traditional terms, Bigtable can be thought of as a database schema in which each row is itself a table: the row key is the table name, the table holds multiple versions of its different columns, and each version carries the timestamp of the operation that wrote it.

Take a look at the physical data model of HBase:

Row Key         Time Stamp   Column "contents:"
"com.cnn.www"   t6           "<html>..."
                t5           "<html>..."
                t3           "<html>..."

Row Key         Time Stamp   Column "anchor:"
"com.cnn.www"   t9           "anchor:cnnsi.com"   "CNN"
                t8           "anchor:my.look.ca"  "CNN.com"

Row Key         Time Stamp   Column "mime:"
"com.cnn.www"   t6           "text/html"

The physical data model essentially splits each row of the logical model into separate stores, one per column family.
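
The split from logical row to per-family stores can be sketched in a few lines. The dictionary below restates the com.cnn.www example with numeric timestamps; the function split_by_family is hypothetical and only models the grouping, not HBase's on-disk files.

```python
from collections import defaultdict

# Logical view of one row: (column key, timestamp) -> value.
logical_row = {
    ("contents:", 6): "<html>...",
    ("contents:", 5): "<html>...",
    ("contents:", 3): "<html>...",
    ("anchor:cnnsi.com", 9): "CNN",
    ("anchor:my.look.ca", 8): "CNN.com",
    ("mime:", 6): "text/html",
}


def split_by_family(cells):
    """Group the cells of one row into one store per column family."""
    stores = defaultdict(dict)
    for (column, ts), value in cells.items():
        family = column.split(":", 1)[0]
        stores[family][(column, ts)] = value
    return dict(stores)


stores = split_by_family(logical_row)
for family in sorted(stores):
    print(family, len(stores[family]))
```

Each family ends up with only the cells that exist for it, so the empty positions of the logical table never appear in any store.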

For operations on the Bigtable data model, the row is locked, so operations on a single row are guaranteed to be atomic.
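
Row-level locking can be modeled with one lock per row key. The sketch below uses Python threads and a made-up counter cell; it shows only the general technique of serializing read-modify-write operations on a row, not HBase's internal implementation.

```python
import threading

ROW = "com.cnn.www"

# One lock and one counter cell per row key, created before any thread starts.
row_locks = {ROW: threading.Lock()}
counters = {ROW: 0}


def increment(row_key, times):
    """Increment the row's counter, holding the row lock for each update."""
    for _ in range(times):
        with row_locks[row_key]:
            counters[row_key] += 1


threads = [threading.Thread(target=increment, args=(ROW, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counters[ROW])
```

Because every update to the row happens while its lock is held, no increment is lost even though four threads write concurrently.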
