A Detailed Explanation of HBase, the Simple Database in Hadoop


HBase is a simple database in Hadoop. It is very similar to Google's Bigtable, but there are many differences.

Data Model

The HBase data model is very similar to Bigtable's. Users store many rows of data in a table. Each row has a sortable row key and an arbitrary number of columns. Tables are sparse, so rows in the same table may have very different columns if the user so chooses.

A column name has the form "<family name>:<label>", where <family name> and <label> can be arbitrary strings. A table's set of <family name>s (also called its "column family" set) is fixed and can only be changed with administrator privileges; however, new <label>s can be added at any time. HBase stores data on disk by column family, so all items in the same column family should have similar read/write characteristics.
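The sparse, column-family-oriented model can be sketched with a plain Python dictionary; the row keys, families, and labels below are made up for illustration:

```python
# A sparse table: each row maps "family:label" column names to values.
# Rows need not share the same columns.
table = {
    "com.example.www": {                     # row key
        "contents:": "<html>...</html>",     # family "contents", empty label
        "anchor:my.look.ca": "example.com",  # family "anchor", label "my.look.ca"
    },
    "org.example.www": {
        "contents:": "<html>...</html>",     # this row has no "anchor:" columns
    },
}

def family(column_name):
    """A column name has the form '<family>:<label>'."""
    return column_name.split(":", 1)[0]

# Columns in a row can be grouped by family, mirroring how HBase
# groups storage by column family on disk.
families = {family(c) for c in table["com.example.www"]}
```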

Write operations lock a single row; you cannot lock multiple rows at once. All writes to a row are atomic by default.

Every database update carries a timestamp. For each cell, HBase stores only a specified number of the most recent versions. Clients can query for "the latest data as of a certain time", or retrieve all versions at once.
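These versioning rules can be sketched as follows: each cell keeps at most N timestamped versions, newest first, and a query may ask for the newest value as of a given time. The class and method names here are illustrative, not the HBase API:

```python
MAX_VERSIONS = 3  # HBase keeps only a configured number of versions per cell

class Cell:
    def __init__(self):
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))
        # keep newest first, and only the latest MAX_VERSIONS versions
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        del self.versions[MAX_VERSIONS:]

    def get(self, as_of=None):
        """Newest value overall, or newest value at or before `as_of`."""
        for ts, value in self.versions:   # scanned newest first
            if as_of is None or ts <= as_of:
                return value
        return None  # no version existed at that time

cell = Cell()
cell.put(6, "v6")
cell.put(9, "v9")
cell.put(3, "v3")
```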

Conceptual Model

Conceptually, a table is a collection of rows, each with a row key (and an optional timestamp) and some columns that may hold data (sparsely). The following example illustrates this well:

Physical Model

Conceptually the table is a sparse row/column matrix, but physically it is stored by column. This is one of our important design considerations.

The "conceptual" table above is physically stored in the following ways:

Please note that in the figure above, empty cells are not stored at all. Therefore, a query for "contents:" at timestamp t8 returns null, as does a query for the "anchor:my.look.ca" value at timestamp t9.

However, if no timestamp is specified, the most recent value for the column is returned; and because entries are sorted by time, newest first, the most recent value is also the first one found in the table. Therefore, a query for "contents:" without a timestamp returns the data from time t6, and a query for "anchor:my.look.ca" without a timestamp returns the data from time t8.

Example

To show how data is stored on disk, consider the following example:

The program first writes rows row[0-9] with column "anchor:foo", then rows row[0-9] with column "anchor:bar", and finally rows row[0-9] with "anchor:foo" again. When the memcache is flushed to disk and the store is compacted, the corresponding file may look like this:

row=row0, column=anchor:bar, timestamp=1174184619081
row=row0, column=anchor:foo, timestamp=1174184620720
row=row0, column=anchor:foo, timestamp=1174184617161
row=row1, column=anchor:bar, timestamp=1174184619081
row=row1, column=anchor:foo, timestamp=1174184620721
row=row1, column=anchor:foo, timestamp=1174184617167
row=row2, column=anchor:bar, timestamp=1174184619081
row=row2, column=anchor:foo, timestamp=1174184620724
row=row2, column=anchor:foo, timestamp=1174184617167
row=row3, column=anchor:bar, timestamp=1174184619081
row=row3, column=anchor:foo, timestamp=1174184620724
row=row3, column=anchor:foo, timestamp=1174184617168
row=row4, column=anchor:bar, timestamp=1174184619081
row=row4, column=anchor:foo, timestamp=1174184620724
row=row4, column=anchor:foo, timestamp=1174184617168
row=row5, column=anchor:bar, timestamp=1174184619082
row=row5, column=anchor:foo, timestamp=1174184620725
row=row5, column=anchor:foo, timestamp=1174184617168
row=row6, column=anchor:bar, timestamp=1174184619082
row=row6, column=anchor:foo, timestamp=1174184620725
row=row6, column=anchor:foo, timestamp=1174184617168
row=row7, column=anchor:bar, timestamp=1174184619082
row=row7, column=anchor:foo, timestamp=1174184620725
row=row7, column=anchor:foo, timestamp=1174184617168
row=row8, column=anchor:bar, timestamp=1174184619082
row=row8, column=anchor:foo, timestamp=1174184620725
row=row8, column=anchor:foo, timestamp=1174184617169
row=row9, column=anchor:bar, timestamp=1174184619083
row=row9, column=anchor:foo, timestamp=1174184620725
row=row9, column=anchor:foo, timestamp=1174184617169

Note that the column "anchor:foo" is stored twice (with different timestamps), and the newer timestamp comes first (so the newest version is always the first to be found).
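The on-disk ordering shown above (by row, then column, then descending timestamp) can be expressed as a sort key. This is a sketch of the idea, not HBase's actual comparator code:

```python
def store_key(entry):
    row, column, timestamp = entry
    # Sort by row and column ascending, but timestamp descending,
    # so the newest version of a cell is always encountered first.
    return (row, column, -timestamp)

entries = [
    ("row0", "anchor:foo", 1174184617161),
    ("row0", "anchor:bar", 1174184619081),
    ("row0", "anchor:foo", 1174184620720),
]
entries.sort(key=store_key)
```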

HRegion (Tablet) Server

To the user, a table is a collection of data tuples sorted by row key. Physically, a table is divided into multiple HRegions (that is, child tables, or tablets). A child table is identified by the name of the table it belongs to and a "first/last" key pair: given a first key A and a last key B, the child table contains the rows in the range [A, B). The whole table consists of a collection of child tables, each stored in an appropriate place.
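Locating the child table that holds a given row key amounts to finding the range [first, last) containing it. A sketch using Python's bisect module, with made-up region names:

```python
import bisect

# Child tables are identified by their first key; ranges are contiguous.
# An empty string stands in for the null first key of the first region.
region_first_keys = ["", "g", "p"]   # regions: [,"g"), ["g","p"), ["p",)
region_names = ["region-A", "region-B", "region-C"]

def find_region(row_key):
    # The containing region is the one with the greatest first key <= row_key.
    i = bisect.bisect_right(region_first_keys, row_key) - 1
    return region_names[i]
```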

All physical data is stored on Hadoop's DFS, and a number of child table servers provide the data service; usually one computer runs only one child table server program. A child table is managed by only one child table server at a time.

When a client wants to perform an update, it first connects to the relevant child table server and submits the change to the child table. The submitted data is added to the child table's HMemcache and the child table server's HLog. The HMemcache keeps the most recent updates in memory and serves as a cache; the HLog is an on-disk log file recording every update operation. The client's commit() call does not return until the update has been written to the HLog.

When serving reads, the child table checks the HMemcache first. If the data is not there, it then checks the HStores on disk. Each column family in a child table corresponds to one HStore, and an HStore consists of multiple HStoreFiles on disk. Each HStoreFile has a B-tree-like structure that allows quick lookups.

We periodically call HRegion.flushcache(), which writes the contents of the HMemcache to the HStore files on disk, adding a new HStoreFile to each HStore. The HMemcache is then emptied, and a special marker is added to the HLog indicating that the HMemcache has been flushed.
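The write path described in the last few paragraphs (append to the HLog, buffer in the HMemcache, flush to a new store file, and read memcache-first) can be sketched as a toy model; the class and attribute names below are illustrative, not real HBase code:

```python
class MiniRegion:
    """Toy model of the commit/flush cycle; not real HBase code."""
    def __init__(self):
        self.hlog = []         # stands in for the on-disk log
        self.memcache = {}     # in-memory buffer of recent updates
        self.store_files = []  # each flush adds one immutable "file"

    def commit(self, row, column, value):
        # The log write happens before commit() returns.
        self.hlog.append((row, column, value))
        self.memcache[(row, column)] = value

    def flushcache(self):
        # Write the memcache out as a new store file, then empty it
        # and mark the log so recovery can skip flushed edits.
        self.store_files.append(dict(self.memcache))
        self.memcache.clear()
        self.hlog.append("FLUSH-MARKER")

    def get(self, row, column):
        # Check the memcache first, then the store files, newest first.
        if (row, column) in self.memcache:
            return self.memcache[(row, column)]
        for sf in reversed(self.store_files):
            if (row, column) in sf:
                return sf[(row, column)]
        return None

r = MiniRegion()
r.commit("row0", "anchor:foo", "old")
r.flushcache()
r.commit("row0", "anchor:foo", "new")
```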

At startup, each child table checks whether any write operations in the HLog were left unapplied after the last flushcache() call. If not, all of the child table's data is already in the HStore files on disk. If so, the child table re-applies the updates in the HLog to the HMemcache and then calls flushcache(). Finally, the child table deletes the HLog and starts serving data.

Therefore, the less often flushcache() is called, the less work is done, but the more memory the HMemcache occupies and the more time the HLog needs to recover data at startup. The more often flushcache() is called, the less memory the HMemcache consumes and the faster the HLog recovery, but the cost of flushcache() itself must also be considered.

Each flushcache() call adds an HStoreFile to every HStore, and reading from an HStore may require accessing all of its HStoreFiles. This is time-consuming, so we periodically merge multiple HStoreFiles into one by calling HStore.compact().
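Compaction can be sketched as merging many store files into one, keeping the newest value for each cell. A toy model with made-up contents:

```python
def compact(store_files):
    """Merge many store files into one; later (newer) files win per cell."""
    merged = {}
    for sf in store_files:   # oldest first, so newer files overwrite
        merged.update(sf)
    return [merged]

files = [
    {("row0", "anchor:foo"): "old", ("row1", "anchor:bar"): "x"},
    {("row0", "anchor:foo"): "new"},
]
files = compact(files)
```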

Google's Bigtable paper is a little vague about major and minor compactions; we have only noticed two things:

1. A flushcache() writes all in-memory updates to disk. Through flushcache(), we can shrink the log-rebuild time to zero. Each flushcache() adds an HStoreFile to every HStore.

2. A compact() merges all HStoreFiles into one.

Unlike Bigtable, Hadoop's HBase can shrink the window between "commit" and "write to log" to zero; that is, a "commit" is not complete until it has been written to the log. This is not hard to achieve, as long as it is really needed.

We can call HRegion.closeAndMerge() to merge two child tables into one. In the current version, both child tables must be in the "offline" state to be merged.

When a child table grows beyond a specified size, the child table server calls HRegion.closeAndSplit() to divide it into two new child tables. The new child tables are reported to the master, which decides which child table servers take them over. The split is very fast, mainly because the new child tables hold only references to the old child table's HStoreFiles: one references the first half of each HStoreFile, the other the second half. Once the references are established, the old child table is marked "offline" and persists until the new child tables' compactions have removed all references to it, at which point the old child table is deleted.

Summary:

1. The client accesses the data in the table.

2. The table is divided into many child tables.

3. Child tables are maintained by child table servers. A client connects to a child table server to access the rows within a child table's key range.

4. The child table also includes:

A. HMemcache, an in-memory buffer storing the most recent updates.

B. HLog, an on-disk log of the most recent updates.

C. HStore, a group of efficient disk files. There is one HStore per column family.

The HBase Master Server

Each child table server maintains a connection to a unique primary server. The primary server tells each child table server which child tables should be loaded and serviced.

The primary server keeps track of which child table servers are alive at all times. If the connection between the primary server and a child table server times out, then:

A The child table server "kills" itself and restarts in a blank state.

B The primary server assumes that the child table server is "dead" and assigns its child tables to other child table servers.

Note that this differs from Google's Bigtable, where a tablet server can continue serving even if its connection to the master is broken. We must "tie" the child table server to the primary server because we do not have an additional lock-management system like Bigtable's. In Bigtable, the master is responsible for allocating tablets, and the lock manager (Chubby) guarantees atomic access by tablet servers. HBase uses only one core component to manage all child table servers: the primary server.

There is nothing wrong with Bigtable's approach. Both systems depend on a core component (HMaster or Chubby), and as long as that core is still running, the entire system can run. Perhaps Chubby has some special advantages, but that is beyond HBase's current goals.

When a child table server "reports" to a new primary server, the primary server tells it to load zero or more child tables. When a child table server dies, the primary server marks its child tables as unassigned and then tries to assign them to other child table servers.

Each child table is identified by the table name it belongs to and its key range. Since key ranges are contiguous, and the first and last keys may be null, it is sufficient to identify a key range by its first key alone.

But it is not quite that simple. Because of merge() and split(), we may (temporarily) have two completely different child tables with the same name. If the system crashes at that unfortunate moment and both child tables are on disk, the arbiter that decides which child table is "correct" is the metadata. To distinguish different versions of the same child table, we also add a unique region id to the child table name.

In this way, the final form of our child table identifier is: table name + first key + region id. For example, with table name hbaserepository, first key w-nk5ynz8tbb2uwfirjo7v==, and region id 6890601455914043877, the unique identifier is:

hbaserepository,w-nk5ynz8tbb2uwfirjo7v==,6890601455914043877
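Building the identifier is just joining the three parts with commas; a small sketch (not HBase's actual naming code):

```python
def region_name(table, first_key, region_id):
    # child table identifier: table name + first key + region id
    return f"{table},{first_key},{region_id}"

name = region_name("hbaserepository", "w-nk5ynz8tbb2uwfirjo7v==",
                   6890601455914043877)
```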

The Metadata Table

We can use this identifier as the row key for a child table's metadata, so the metadata of child tables is itself stored in another child table. We call the table that maps child table identifiers to physical child table server locations the metadata table.

The metadata table itself may grow and be split into multiple child tables. To locate the pieces of the metadata table, we keep the metadata of all metadata child tables in a ROOT table. The root table is always a single child table.

At startup, the primary server immediately scans the root table (since there is only one root table, its name is hard-coded). This may require waiting for the root table to be assigned to a child table server.

Once the root table is available, the primary server scans it to learn all the metadata child table locations, and then scans the metadata table. Again, the primary server may have to wait for all the metadata child tables to be assigned to child table servers.

Finally, once the primary server has scanned the metadata child tables, it knows the locations of all child tables and assigns them to child table servers.
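The startup scan just described is a two-step lookup: the root table locates the metadata child tables, and the metadata child tables locate every user child table. A sketch with entirely made-up table contents and server names:

```python
# Root table: maps metadata child-table identifiers to their servers.
root_table = {"meta-region-1": "server-a", "meta-region-2": "server-b"}

# Metadata child tables: map user child-table identifiers to servers.
meta_tables = {
    "meta-region-1": {"usertable,,1": "server-c"},
    "meta-region-2": {"usertable,m,2": "server-d"},
}

def scan_all_regions():
    """What the primary server learns after scanning root, then metadata."""
    assignments = {}
    for meta_region in root_table:        # step 1: scan the root table
        # step 2: scan each metadata child table it points to
        assignments.update(meta_tables[meta_region])
    return assignments
```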

The primary server maintains the set of currently available child table servers in memory. There is no need to save this information on disk, because if the primary server dies, the entire system dies with it.

Bigtable does this differently: it stores the "tablet"-to-"tablet server" mapping in Google's distributed lock server, Chubby. We store this information in the metadata table, because Hadoop has no equivalent of Chubby.

In this way, the "INFO:" column family for each row of metadata and root tables contains 3 members:

1. info:regioninfo contains a serialized HRegionInfo object.

2. info:server contains the serialized output string of HServerAddress.toString(). This string can be passed to the HServerAddress constructor.

3. info:startcode is a serialized long integer generated when the child table server starts. The child table server sends this integer to the primary server, which uses it to decide whether the information in the metadata and root tables is stale.

So, as long as a client knows the location of the root table, it does not need to connect to the primary server. The primary server's load is relatively small: it handles timed-out child table servers, scans the root table and metadata child tables at startup, and provides the root table's location (as well as load balancing across child table servers).

The HBase client is fairly complex; it often has to combine the root table and metadata child tables to satisfy a user's scan of a table. If a child table server dies, or a child table that should be on it is missing, the client can only wait and retry. The child-table-to-server mapping may be incorrect at startup, or just after a child table server has died.

Conclusion:

1. Child table servers provide access to child tables; a child table is managed by only one child table server at a time.

2. The child table server needs to "report" to the primary server.

3. If the primary server is dead, the entire system is dead.

4. Only the primary server knows the current set of child table servers.

5. The mapping of child tables to child table servers is stored in two special child tables, which are assigned to child table servers like any others.

6. The root table is special, and the primary server always knows its location.

7. Gluing all of this together is the client's job.
