A Brief Analysis of HBase Principles

Source: Internet
Author: User

1. Overview

HBase is a distributed, column-oriented, open-source key-value database.

Launched in 2006 as an open-source clone of BigTable, it became a sub-project of Hadoop in 2007 and an Apache top-level project in 2010.

Contributions from many communities have steadily improved HBase, and it is now used at many companies.

HBase's stability has long been a concern, yet it is still widely used. The main reason is that its storage model matches real business data well.

2. HBase Features

---Key-value database

It is only suitable for operations on a single key (write, read, delete) and for scans (scanning the whole table or a range of keys).

Data is stored in lexicographic (dictionary) order of key, so the storage layout is not complex.
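The single-key operations and range scans described above can be sketched with a toy in-memory store. This is only an illustration of sorted-key storage, not HBase code; the class `SortedKV` and its method names are hypothetical.

```python
import bisect


class SortedKV:
    """Toy key-value store that keeps its keys in lexicographic order."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while keeping order
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def delete(self, key):
        if key in self._values:
            del self._values[key]
            self._keys.remove(key)

    def scan(self, start=None, stop=None):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = 0 if start is None else bisect.bisect_left(self._keys, start)
        hi = len(self._keys) if stop is None else bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKV()
for k in ["row3", "row1", "row10", "row2"]:
    store.put(k, k.upper())

full_scan = [k for k, _ in store.scan()]
range_scan = [k for k, _ in store.scan("row1", "row2")]
```

Note that the order is lexicographic, so "row10" sorts before "row2"; keys that should sort numerically are usually zero-padded.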

---Column storage

Unlike traditional databases, where the schema must be fully defined when the table is created, HBase has few schema restrictions.

HBase originates from BigTable, which was built to store web page data. Because web page data is sparse (sparse key-value storage), there is a particular need to add fields freely. This makes HBase very useful for semi-structured data.

---Linear expansion

An advantage over traditional databases, and a characteristic of any distributed database system;

Can handle data at petabyte (PB) scale

---High availability

Designed to run on inexpensive PCs with no single point of failure

---Strong consistency

Different from eventual consistency; strong consistency requires low latency.

3. Data Model

Row: all data corresponding to the same row key; there is no limit on the number of rows

Column families:

---Columns holding similar data are usually grouped into one column family (how best to divide them?)

---Limited in number and specified when the table is created; unlike columns, they are not added dynamically at write time

---Each can store any number of columns

Columns: column names are determined at write time, and there can be very many of them

Cell and timestamp (version)

---A cell is the unit that stores the data, and each cell can have any number of versions;

---Because a cell can hold more than one version, you can set how many versions to keep for each column family when you create a table
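The row / column family / column / version hierarchy above can be pictured as nested maps. The following is a minimal sketch under that assumption; `VersionedTable` and the `"family:qualifier"` column naming are illustrative choices, not the HBase client API.

```python
from collections import defaultdict


class VersionedTable:
    """Toy table: row key -> "family:qualifier" -> versions, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts):
        versions = self._rows[row][column]
        # A write with an existing timestamp replaces that version.
        versions[:] = [tv for tv in versions if tv[0] != ts]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, column, versions=1):
        cells = self._rows.get(row, {}).get(column, [])
        return cells[:versions]


table = VersionedTable(max_versions=2)
table.put("com.example/index", "content:html", "<v1>", ts=1)
table.put("com.example/index", "content:html", "<v2>", ts=2)
table.put("com.example/index", "content:html", "<v3>", ts=3)
# Rows are sparse: another row may carry entirely different columns.
table.put("com.example/about", "anchor:home", "About us", ts=1)

latest = table.get("com.example/index", "content:html")
history = table.get("com.example/index", "content:html", versions=5)
```

With `max_versions=2`, the oldest of the three writes is discarded, which mirrors the per-column-family version limit described above.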

Table: (an abstract collection of data; there are no associations between tables)

---Similar to a traditional DBMS table; rows are stored in lexicographic order of row key (from smallest to largest), and the data is organized into several regions, that is, a table is cut into several regions;

Region: (the smallest unit of load balancing and scheduling in HBase, and the reason it can provide distributed service)

The storage unit that a table is split into; the rows inside a region are ordered.
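Because rows are ordered and each region covers a contiguous key range, finding the region for a row key is a binary search over region start keys. A small sketch, with hypothetical start keys and region names:

```python
import bisect

# Hypothetical table cut into four regions. Each region covers the keys from
# its start key (inclusive) up to the next region's start key (exclusive).
region_starts = ["", "g", "n", "t"]
region_names = ["region-0", "region-1", "region-2", "region-3"]


def find_region(row_key):
    """Return the name of the region whose key range contains row_key."""
    idx = bisect.bisect_right(region_starts, row_key) - 1
    return region_names[idx]


first = find_region("apple")
boundary = find_region("g")
last = find_region("zebra")
```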

4. API

Put/Get: a write or read operation for a single key; individual columns can be written independently

Scan: a sequential scan through HBase's native API; it is performed by a single client

MapReduce: a concurrent scan of the table, executed through MR tasks;

Bulk Load: a fast way to import large quantities of data; files are first generated in the format HBase itself stores, then loaded into the table in batch

Replication: uses the log (HLog) to back up data and keep it robust; an area HBase continues to improve;

5. System composition

Three main components: MasterServer / RegionServer / ZooKeeper

ZooKeeper cluster: a distributed lock service that provides multi-machine coordination in a distributed environment without a single point of failure;

It also provides notification services (which servers are still alive?), locating the root region, and other functions;

MasterServer: load balancing, error recovery, and control of metadata operations

RegionServer: slave node; handles data reads and writes and completes tasks assigned by the master, such as splits.

HBase is deployed together with Hadoop: RegionServers are placed on the same machines as HDFS DataNodes, which lets them share HDFS and improves read and write performance when data locality is good;

The Master is deployed separately;

6. Data organization and storage method

Data organization:

Three-level table structure: Root, Meta, user table

Root region table: its location is determined through ZooKeeper; it has at most one region and records the location of each region of the Meta table

Meta table: may contain multiple regions; it records which server hosts each region of each table

In practice, the client caches metadata from the Meta table and refreshes it when an error occurs.
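The three-level lookup with client-side caching described above might be sketched as follows. All cluster state here (`ZOOKEEPER`, `ROOT`, `META`, server names) is made up for illustration, and the cache is keyed by exact row for brevity rather than by region range.

```python
import bisect


def locate(entries, key):
    """entries: sorted list of (start_key, value); return the entry covering key."""
    starts = [s for s, _ in entries]
    return entries[bisect.bisect_right(starts, key) - 1]


# Hypothetical cluster state. Keys are (table, row) tuples.
ZOOKEEPER = {"root_location": "server-1"}
ROOT = [(("", ""), "meta-0"), (("orders", ""), "meta-1")]
META = {
    "meta-0": [(("events", ""), "server-2"), (("events", "m"), "server-3")],
    "meta-1": [(("orders", ""), "server-4"), (("orders", "5"), "server-2")],
}


class Client:
    def __init__(self):
        self.cache = {}        # (table, row) -> server
        self.full_lookups = 0  # how many times all three levels were walked

    def find_server(self, table, row):
        key = (table, row)
        if key not in self.cache:
            self.full_lookups += 1
            _root_server = ZOOKEEPER["root_location"]   # 1. ZooKeeper -> Root
            _, meta_region = locate(ROOT, key)          # 2. Root -> Meta region
            _, server = locate(META[meta_region], key)  # 3. Meta -> server
            self.cache[key] = server
        return self.cache[key]

    def on_error(self, table, row):
        """Drop a stale cache entry so the next call looks it up again."""
        self.cache.pop((table, row), None)


client = Client()
server_first = client.find_server("events", "q")
server_again = client.find_server("events", "q")  # served from the cache
```

The point of the cache is that repeated requests skip the Root and Meta lookups entirely; only an error (for example, a region that has moved) forces a fresh walk.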

Storage mode:

Sparse matrix, column store: an HBase table is divided by column family, and each column family is stored in its own files, so columns that belong to the same row but to different families are not stored together.

Based on HDFS: (HDFS supports only append writes, not random writes, so implementing a randomly updatable database on such an immutable file system is complex)

Based on inexpensive hardware

Supports high write throughput (write performance is higher than read performance)

LSM tree: manages newly written data. Where traditional databases use B+ trees, an LSM tree handles frequent writes by continually merging small trees into larger ones

Data unit:

The sparse structured data is essentially stored as: row key + column family + qualifier + version

These atomic data items are stored sequentially and densely

Each row of data consists of several atomic data items
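One way to picture the atomic data unit is as a tuple whose sort order determines the on-disk order. The sketch below assumes cells sort by row key, then column family, then qualifier, with newer versions first; the sample cells are invented.

```python
# Each cell: (row key, column family, qualifier, version, value)
cells = [
    ("row2", "info", "name", 5, "bob"),
    ("row1", "info", "name", 3, "alice-old"),
    ("row1", "info", "name", 7, "alice"),
    ("row1", "stats", "visits", 4, "12"),
    ("row1", "info", "age", 4, "30"),
]


def sort_key(cell):
    row, family, qualifier, version, _value = cell
    # Negating the version puts the newest version of a cell first.
    return (row, family, qualifier, -version)


ordered = sorted(cells, key=sort_key)
ordered_values = [cell[4] for cell in ordered]
```

Storing only the cells that exist, each carrying its full key, is what makes the layout dense on disk even though the logical table is sparse.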

Underlying file structure:

HFile: stores data in blocks; the block index is kept resident in memory; the default block size is 64 KB, and a different block size can be specified when the table is created;
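The idea of a block index can be sketched as follows: sorted entries are packed into fixed-size blocks, the first key of each block is kept in memory, and a lookup binary-searches the index before reading a single block. This is a simplified illustration, not the HFile format; the block size is 64 bytes here only to keep the example small.

```python
import bisect

BLOCK_SIZE = 64  # bytes in this toy example; the default in the text is 64 KB


def build_blocks(sorted_items, block_size=BLOCK_SIZE):
    """Pack sorted (key, value) pairs into blocks; index the first key of each."""
    blocks, index = [], []
    current, used = [], 0
    for key, value in sorted_items:
        size = len(key) + len(value)
        if current and used + size > block_size:
            blocks.append(current)  # close the full block
            current, used = [], 0
        if not current:
            index.append(key)       # first key of a new block
        current.append((key, value))
        used += size
    if current:
        blocks.append(current)
    return blocks, index


def lookup(blocks, index, key):
    """Use the in-memory index to pick one block, then search only that block."""
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None
    for k, v in blocks[i]:
        if k == key:
            return v
    return None


items = [(f"k{i:03d}", f"value{i:03d}") for i in range(20)]
blocks, index = build_blocks(items)
found = lookup(blocks, index, "k007")
```

Smaller blocks make the index larger but each random read cheaper, which is the trade-off behind choosing a block size at table creation.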

7. Main operation

Flush:

Memory capacity is limited, so in-memory data must be flushed to disk periodically

Each flush produces one HFile for each column family of each region

On a read, the RegionServer merges data from multiple HFiles and selects the result according to version
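The flush and read-merge behaviour above can be modelled with a toy store for one column family of one region. The names `Store`, `memstore`, and `flush_threshold` are illustrative assumptions; a real flush is triggered by memory size, not by entry count.

```python
class Store:
    """Toy model of one column family in one region."""

    def __init__(self, flush_threshold=3):
        self.memstore = {}  # (row, column) -> (timestamp, value)
        self.hfiles = []    # immutable sorted lists, oldest first
        self.flush_threshold = flush_threshold

    def put(self, row, column, value, ts):
        key = (row, column)
        old = self.memstore.get(key)
        if old is None or ts >= old[0]:
            self.memstore[key] = (ts, value)
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the in-memory data out as one new sorted, immutable file."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, row, column):
        """Merge candidates from memory and every file; newest version wins."""
        key = (row, column)
        candidates = []
        if key in self.memstore:
            candidates.append(self.memstore[key])
        for hfile in self.hfiles:
            for k, cell in hfile:
                if k == key:
                    candidates.append(cell)
        if not candidates:
            return None
        return max(candidates, key=lambda c: c[0])[1]


s = Store(flush_threshold=2)
s.put("r1", "c", "v1", ts=1)
s.put("r2", "c", "x", ts=1)   # reaches the threshold and triggers a flush
s.put("r1", "c", "v2", ts=2)  # newer version stays in memory for now
value_before = s.get("r1", "c")
s.flush()
value_after = s.get("r1", "c")
```

Every extra file is one more place a read has to check, which is why the number of files needs to be kept under control.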

Compaction:

Flushes produce more and more HFiles, which need to be merged to control the number of files;

Old data is cleaned up in the process;
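A compaction can be sketched as a k-way merge of sorted files that keeps only the newest versions of each key. The function below is a simplified illustration using Python's `heapq.merge`; the file contents and the `max_versions` parameter are assumptions for the example.

```python
import heapq


def compact(hfiles, max_versions=1):
    """Merge sorted files into one, keeping the newest max_versions per key.

    Each file is a list of (key, timestamp, value) tuples, sorted by key and
    then by timestamp descending.
    """
    merged = heapq.merge(*hfiles, key=lambda c: (c[0], -c[1]))
    out, last_key, kept = [], None, 0
    for key, ts, value in merged:
        if key != last_key:
            last_key, kept = key, 0
        if kept < max_versions:
            out.append((key, ts, value))
            kept += 1
        # Older versions beyond the limit are dropped here.
    return out


f1 = [(("r1", "c"), 1, "a"), (("r2", "c"), 1, "b")]
f2 = [(("r1", "c"), 3, "a3"), (("r3", "c"), 2, "z")]
f3 = [(("r1", "c"), 2, "a2")]

compacted = compact([f1, f2, f3], max_versions=1)
compacted_two = compact([f1, f2, f3], max_versions=2)
```

Because every input is already sorted, the merge streams through the files once and never needs random writes, which fits the append-only file system underneath.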

Split:

Region data keeps growing

Splits are required to achieve load balancing

When a split completes, the master must be notified before it actually takes effect.
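A split can be pictured as cutting a region's sorted rows at a middle key and then reporting the two daughter regions to the master. The sketch below is a toy model; the row-count threshold, the dictionary layout, and the `Master.report_split` method are all hypothetical.

```python
def split_region(region, max_rows):
    """Split a region in two at its middle row key once it exceeds max_rows.

    region: dict with "start", "end" (None means unbounded) and a sorted
    "rows" list of (key, value) pairs.
    """
    rows = region["rows"]
    if len(rows) <= max_rows:
        return [region]
    mid = len(rows) // 2
    split_key = rows[mid][0]
    left = {"start": region["start"], "end": split_key, "rows": rows[:mid]}
    right = {"start": split_key, "end": region["end"], "rows": rows[mid:]}
    return [left, right]


class Master:
    """Holds the region boundaries as the master currently knows them."""

    def __init__(self):
        self.regions = {}  # start key -> end key

    def report_split(self, parent, daughters):
        # Only after this report does the new layout become visible.
        self.regions.pop(parent["start"], None)
        for d in daughters:
            self.regions[d["start"]] = d["end"]


parent = {"start": "", "end": None, "rows": [(f"k{i}", i) for i in range(6)]}
master = Master()
master.regions[parent["start"]] = parent["end"]

daughters = split_region(parent, max_rows=4)
layout_before_report = dict(master.regions)
master.report_split(parent, daughters)
layout_after_report = dict(master.regions)
```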

8. Error recovery (execution mechanism)

The Master tracks the status of each RegionServer through ZooKeeper

When a RegionServer goes down, recovery is done through the HLog (stored on HDFS); each RegionServer maintains one HLog

The Master splits the log data into the storage directories of the respective regions, and the new RegionServers use it to rebuild the data.
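Log-based recovery can be sketched in two steps: split the failed server's log by region, then replay each region's edits in order. The log format below (sequence number, region, row, value) and the function names are assumptions made for the example.

```python
from collections import defaultdict


def split_log(hlog):
    """Group a failed server's log entries by the region they belong to."""
    per_region = defaultdict(list)
    for seq, region, row, value in hlog:
        per_region[region].append((seq, row, value))
    return per_region


def replay(edits):
    """Rebuild a region's in-memory state by applying edits in sequence order."""
    memstore = {}
    for _seq, row, value in sorted(edits, key=lambda e: e[0]):
        memstore[row] = value
    return memstore


# One shared log for all regions hosted by the failed server.
hlog = [
    (1, "region-a", "r1", "v1"),
    (2, "region-b", "r9", "x"),
    (3, "region-a", "r1", "v2"),
    (4, "region-a", "r2", "y"),
]

edits_by_region = split_log(hlog)
recovered_a = replay(edits_by_region["region-a"])
recovered_b = replay(edits_by_region["region-b"])
```

Splitting first means each region can be reassigned to a different server and recovered independently of the others.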




Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
