[Reprint] A detailed explanation of HBase, the simple database in Hadoop


Reprinted from http://www.csdn.net/article/2010-11-28/282614

Data model

The HBase data model is very similar to Bigtable's. Users store many rows of data in a table; each row has a sortable key and an arbitrary number of columns. Tables are sparse, so rows in the same table can have wildly different columns if the user wishes.

A column name has the form "<family>:<label>", where <family> and <label> can be arbitrary strings. A table's set of <family> values (its set of "column families") is fixed; changing a table's column families requires administrator privileges. However, new <label> values can be added at any time. HBase stores data on disk grouped by column family, so all items in a column family should have similar read/write characteristics.
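As a toy illustration (not the HBase API), a "<family>:<label>" column name can be split into its two parts like this:

```python
def split_column(column):
    """Split a "<family>:<label>" column name into its two parts.

    The family set is fixed per table; the label may be any string,
    possibly containing further ':' characters, so split only once.
    """
    family, _, label = column.partition(":")
    return family, label

print(split_column("anchor:my.look.ca"))  # ('anchor', 'my.look.ca')
```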

Writes are row-locked, and only one row can be locked at a time. All writes to a row are atomic by default.

Every database update carries a timestamp. For each data cell, HBase stores only a specified number of the most recent versions. A client can query for "the most recent data as of a certain time", or retrieve all data versions at once.

Conceptual model

Conceptually, a table is a collection of rows. Each row contains a row key (and an optional timestamp) plus some columns that may, sparsely, hold data. The following example illustrates this well:

Physical model

Conceptually the table is a sparse row/column matrix, but physically it is stored by column. This is one of our important design considerations.

The above "conceptual" table is physically stored as follows:

Note that the diagram above stores no empty cells. Therefore a query for "contents:" at timestamp t8 returns null, and likewise a query for "anchor:my.look.ca" at timestamp t9 returns null.

However, if no timestamp is specified, the most recent value of the specified column is returned; and because values are sorted by timestamp, the newest value is also the first one found. Thus a query for "contents:" without a timestamp returns the value from time t6, and a query for "anchor:my.look.ca" without a timestamp returns the value from time t8.
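The versioning rules above can be sketched as a small model: a cell keeps only the newest few versions, sorted newest first, and a timestamped query returns the newest value at or before that time. This is an illustrative sketch, not HBase code:

```python
class VersionedCell:
    """Toy model of one data cell: keep only the most recent
    `max_versions` values, newest first, as the article describes."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        del self.versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, timestamp=None):
        """Latest value overall, or latest value at or before `timestamp`.
        Because versions are newest first, the first match is the answer."""
        for ts, value in self.versions:
            if timestamp is None or ts <= timestamp:
                return value
        return None
```

With `max_versions=2`, writing values at times 6, 8 and 9 keeps only the two newest; a query with no timestamp finds the time-9 value first, and a query at time 8 finds the time-8 value.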

Example

To show how the data is stored on disk, consider the following example:

The program first writes row "[0-9]", column "anchor:foo", then writes row "[0-9]", column "anchor:bar", and finally writes row "[0-9]", column "anchor:foo" again. After the memcache is flushed to disk and the store is compacted, the resulting file might look like this:

Note that the column "anchor:foo" is stored twice (with different timestamps), and the newer timestamp comes first (so the newest value is always found first).

HRegion (tablet) server

To the user, a table is a collection of data tuples sorted by row key. Physically, a table is split into multiple HRegions (that is, sub-tables, or tablets). A sub-table is identified by the name of the table it belongs to plus a "first/last key" pair; it holds the rows in the half-open range [first-key, last-key). The whole table is the union of its sub-tables, each of which is stored wherever is appropriate.
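A hypothetical sketch of how a row key maps to the sub-table holding it, using the half-open [first-key, last-key) ranges described above (the region list here is invented for illustration):

```python
import bisect

# Invented (start_key, region_name) pairs for one table, sorted by start
# key; "" stands for the open first key of the table's first region.
regions = [("", "region-0"), ("g", "region-1"), ("p", "region-2")]

def find_region(row_key):
    """Each region holds the half-open row range [start_key, next_start_key),
    so the owning region is the one with the greatest start_key <= row_key."""
    starts = [start for start, _ in regions]
    index = bisect.bisect_right(starts, row_key) - 1
    return regions[index][1]

print(find_region("hadoop"))  # region-1
```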

All physical data is stored on Hadoop's DFS, and a number of sub-table servers provide data service; usually one computer runs exactly one sub-table server program. At any moment, a sub-table is managed by exactly one sub-table server.

When a client wants to perform an update, it connects to the relevant sub-table server and commits the change to the sub-table. The committed data is added to the sub-table's HMemcache and to the sub-table server's HLog. The HMemcache holds the most recent updates in memory and serves as a cache; the HLog is an on-disk log file recording every update. The client's commit() call does not return until the update has been written to the HLog.

When serving reads, a sub-table first checks its HMemcache. If the value is not there, it checks the HStores on disk. Each column family in a sub-table corresponds to one HStore, and an HStore consists of multiple on-disk HStoreFiles. Each HStoreFile has a B-tree-like structure that allows fast lookups.

We periodically call HRegion.flushcache() to write the contents of the HMemcache to the HStores on disk, which adds a new HStoreFile to each HStore. The HMemcache is then emptied, and a special marker is written to the HLog indicating that it has been flushed.

At startup, each sub-table checks whether the HLog contains writes not yet applied as of the last flushcache() call. If not, all of the sub-table's data is in the HStoreFiles on disk. If so, the sub-table re-applies those updates from the HLog, writes them to the HMemcache, and calls flushcache(). Finally the sub-table deletes the HLog and begins serving data.
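The commit / flush / recover cycle described in the last few paragraphs can be sketched as a toy model (the names follow the article; the "disk" here is just Python data structures):

```python
class RegionSketch:
    """Toy model of one sub-table's write, flush and recovery cycle."""

    FLUSH_MARK = ("flush", None, None)  # special marker written after a flush

    def __init__(self):
        self.hlog = []       # append-only write-ahead log
        self.hmemcache = {}  # recent updates, held in memory
        self.hstore = {}     # "on-disk" store: flushed snapshots

    def commit(self, row, column, value):
        # commit() returns only after the update is in the log
        self.hlog.append(("put", row, column, value))
        self.hmemcache[(row, column)] = value

    def flushcache(self):
        # write the memcache to the store, empty it, and mark the log
        self.hstore.update(self.hmemcache)
        self.hmemcache = {}
        self.hlog.append(self.FLUSH_MARK)

    def get(self, row, column):
        # reads check the memcache first, then fall back to the store
        key = (row, column)
        return self.hmemcache.get(key, self.hstore.get(key))

    def recover(self):
        # at startup: re-apply log entries written after the last flush mark
        last_flush = -1
        for i, entry in enumerate(self.hlog):
            if entry == self.FLUSH_MARK:
                last_flush = i
        for _op, row, column, value in self.hlog[last_flush + 1:]:
            self.hmemcache[(row, column)] = value
        self.flushcache()
        self.hlog = []  # per the article: delete the HLog, then serve data
```

For example, an update committed after the last flush survives a simulated crash (memory wiped) because recover() replays it from the log.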

So, the less often flushcache() is called, the less work there is, but the more memory the HMemcache occupies and the more time the HLog needs to recover data at startup. Calling flushcache() more often shrinks the HMemcache and speeds up HLog recovery, but the cost of flushcache() itself must also be considered.

Each flushcache() call adds an HStoreFile to every HStore, and a read from an HStore may have to examine all of its HStoreFiles. Since that is time-consuming, we periodically merge multiple HStoreFiles into one by calling HStore.compact().
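A minimal sketch of what a compaction must do: merge several store files, each already sorted by (row, column) ascending and timestamp descending, into one file in the same order, so the newest version is still found first. Illustrative only, not the HStore.compact() implementation:

```python
import heapq

def compact(store_files):
    """Merge sorted store files of (row, column, timestamp, value) entries
    into one list ordered by (row, column) ascending, timestamp descending,
    so the newest version of each cell always comes first."""
    def sort_key(entry):
        row, column, timestamp, _value = entry
        return (row, column, -timestamp)
    # heapq.merge streams the inputs; each file must already be sorted
    return list(heapq.merge(*store_files, key=sort_key))
```

Merging a file holding an old "anchor:foo" with a newer file keeps both versions, newest first, matching the flushed-file layout shown earlier.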

Google's Bigtable paper is somewhat vague in its description of major and minor compactions; we noticed only two things:

1. A flushcache() writes all updates from memory to disk. With flushcache(), we can reduce the log-rebuild time at startup to zero. Each flushcache() adds an HStoreFile to every HStore.

2. A compact() turns all the HStoreFiles into one.

Unlike Bigtable, Hadoop's HBase can shrink the window between "commit" and "write to log" to zero, i.e. every "commit" must be written to the log. This is not hard to achieve when it is really needed.

We can call HRegion.closeAndMerge() to merge two sub-tables into one. In the current version, both sub-tables must be in the "offline" state to be merged.

When a sub-table grows beyond a specified size, the sub-table server calls HRegion.closeAndSplit() to divide it into two new sub-tables. The new sub-tables are reported to the master, which decides which sub-table servers take them over. The split is very fast, mainly because each new sub-table just keeps references into the old sub-table's HStoreFiles: one referencing the first half of an HStoreFile, the other the second half. Once the references are established, the old sub-table is marked "offline" and lingers until the new sub-tables' compactions clear the references to it, at which point the old sub-table is deleted.

To summarize:

1. Clients access the data in tables.

2. A table is split into many sub-tables.

3. Sub-tables are maintained by sub-table servers; a client connects to a sub-table server to access the rows within a sub-table's key range.

4. A sub-table also includes:

A. An HMemcache, an in-memory buffer holding the most recent updates.

B. An HLog, a log of the most recent updates.

C. HStores, a group of efficient disk files; one HStore per column family.

The master server for HBase

Each sub-table server stays in contact with a single master server. The master tells each sub-table server which sub-tables it should load and serve.

The master keeps an activity mark for each sub-table server at all times. If the connection between the master and a sub-table server times out, then:

A. The sub-table server "kills" itself and restarts in a blank state.

B. The master assumes the sub-table server is dead and assigns its sub-tables to other sub-table servers.

Note that unlike Google's Bigtable, where sub-table servers can keep serving even when cut off from the master, we must tie the sub-table servers to the master, because we have no separate lock-management system as Bigtable does. In Bigtable, the master assigns sub-tables while the lock manager (Chubby) guarantees a sub-table server atomic access to its sub-tables. HBase uses a single core component to manage all sub-table servers: the master.

There is nothing wrong with Bigtable's approach. Both systems depend on one core component (HMaster or Chubby): as long as the core keeps running, the whole system runs. Chubby may have some special advantages, but that is beyond HBase's current scope.

When sub-table servers "report in" to a new master, the master has each of them load zero or more sub-tables. When a sub-table server dies, the master marks its sub-tables as unassigned and then tries to hand them to other sub-table servers.

Each sub-table is identified by the name of the table it belongs to and its key range. Since key ranges are contiguous and the first and last keys are null, a key range can actually be identified by its first key alone.

But things are not quite that simple. Because of merge() and split(), we may (temporarily) have two entirely different sub-tables with the same name. If the system crashes at that unfortunate moment, both sub-tables may exist on disk at once, and the arbiter that decides which one is "correct" is the metadata. To distinguish different versions of the same sub-table, we add a unique region id to the sub-table name.

The final form of a sub-table identifier is therefore: table name + first key + region id. For example, with table name hbaserepository, first key w-nk5ynz8tbb2uwfirjo7v== and region id 6890601455914043877, the unique identifier is:

hbaserepository,w-nk5ynz8tbb2uwfirjo7v==,6890601455914043877
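A sketch of building such an identifier (the comma-separated layout follows the example above; the helper name is invented for illustration):

```python
def region_name(table, first_key, region_id):
    """Build the unique sub-table identifier described above:
    table name + first key + region id, comma-separated."""
    return f"{table},{first_key},{region_id}"

print(region_name("hbaserepository", "w-nk5ynz8tbb2uwfirjo7v==",
                  6890601455914043877))
```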

Metadata table

We can use this identifier as a row key and store each sub-table's metadata as a row in another sub-table. We call this table, which maps sub-table identifiers to physical sub-table server locations, the metadata table.

The metadata table can grow and may itself be split into multiple sub-tables. To locate the pieces of the metadata table, we keep the metadata of all metadata sub-tables in a ROOT table. The ROOT table is always a single sub-table.

At startup, the master immediately scans the ROOT table (since there is only one ROOT table, its name is hard-coded). This may require waiting for the ROOT table to be assigned to a sub-table server.

Once the ROOT table is available, the master scans it to get the locations of all metadata sub-tables, and then scans the metadata table. Again, the master may have to wait for all metadata sub-tables to be assigned to sub-table servers.

Finally, after scanning the metadata sub-tables, the master knows the locations of all sub-tables and assigns them to sub-table servers.

The master keeps the set of currently available sub-table servers in memory. There is no need to persist this information to disk, because if the master dies, the whole system dies.

Bigtable does this differently, storing the "sub-table" to "sub-table server" mapping in Google's distributed lock server, Chubby. We store this information in the metadata table, since Hadoop has no Chubby equivalent.

Thus the "info:" column family of each row in the metadata and ROOT tables contains three members:

1. info:regioninfo contains a serialized HRegionInfo object.

2. info:server contains the serialized output string of HServerAddress.toString(). This string can be passed to the HServerAddress constructor.

3. info:startcode is a serialized long integer generated when the sub-table server starts. The sub-table server sends this integer to the master, and the master uses it to determine whether the information in the metadata and ROOT tables is stale.
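An illustrative model of one meta-table row and the master's staleness check (the field values here are invented; only the three "info:" member names come from the article):

```python
# Hypothetical meta-table row for one region: the three "info:" members.
meta_row = {
    "info:regioninfo": "<serialized HRegionInfo>",  # placeholder bytes
    "info:server": "10.0.0.7:60020",       # HServerAddress.toString() output
    "info:startcode": 1290900000000,       # generated when the server started
}

def is_stale(row, reported_startcode):
    """The master compares the startcode a sub-table server reports with
    the one recorded in meta; a mismatch means the meta entry is obsolete."""
    return row["info:startcode"] != reported_startcode
```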

So a client only needs to know the location of the ROOT table; it never has to contact the master. The master's load is therefore fairly small: it handles sub-table server timeouts, scans the ROOT table and metadata sub-tables at startup, and provides the ROOT table's location (as well as load balancing among sub-table servers).

The HBase client is quite complex, and often needs to combine the ROOT table and the metadata sub-tables to satisfy a user's scan of a table. If a sub-table server dies, or a sub-table that should be on it is missing, the client waits and retries. The sub-table to sub-table-server mapping is especially likely to be wrong at startup, or shortly after a sub-table server has died.
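The client's lookup chain can be sketched as three steps, using invented table contents (real clients also cache and retry, as noted above):

```python
# Hypothetical lookup chain: ROOT table -> metadata table -> region server.
root = {"meta-region-1": "server-A"}    # ROOT maps meta sub-tables to servers
meta = {"usertable,,42": "server-B"}    # meta maps user sub-tables to servers

def locate(region_id):
    """A client only needs the ROOT table's location; it never asks the master."""
    meta_server = root["meta-region-1"]  # step 1: read ROOT (location is known)
    # step 2: ask meta_server (here simulated) for the region's meta row
    return meta[region_id]               # step 3: the sub-table's server

print(locate("usertable,,42"))  # server-B
```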

Conclusion:

1. Sub-table servers provide access to sub-tables; a sub-table is managed by only one sub-table server at a time.

2. Sub-table servers must "report in" to the master.

3. If the master dies, the whole system dies.

4. Only the master knows the current set of sub-table servers.

5. The sub-table to sub-table-server mapping is stored in two special sub-tables, which are assigned to sub-table servers like any other sub-table.

6. The ROOT table is special: the master always knows its location.

7. Stitching all of this together is the client's job.
