Hadoop White Paper (2): Introduction to Distributed Database HBase

Source: Internet
Author: User

HBase is a column-oriented distributed database. It is not a relational database; its design goal is to overcome the theoretical and practical limitations of relational databases in processing massive data. Traditional relational databases were designed in the 1970s for transaction systems, with data consistency (ACID) as the priority; they did not consider scaling out as data volume grows, or reliability when a single system fails. Years of development have produced extensions to relational databases (parallel databases), but because of theoretical and implementation constraints their scalability has never exceeded 40 server nodes. HBase, by contrast, was designed from the outset for storing terabytes to petabytes of data with high-speed reads and writes. It can be distributed across thousands of commodity servers and accessed by a large number of concurrent users.

HBase has been adopted by a growing number of online service companies since its first commercial deployments in 2008. The largest deployment is Facebook's new online messaging system, which integrates email, SNS, chat, and short messages (SMS).

Does this business look similar to that of China Mobile?

Characteristics and advantages of distributed database HBase

High scalability

HBase scales horizontally in a truly linear fashion. Once the amount of data reaches a configurable threshold, HBase automatically splits it horizontally and assigns the resulting parts to different servers, so the data can be spread across thousands of ordinary servers. This has two benefits. First, a large number of ordinary servers can form a large cluster that stores large amounts of data (from several TB to dozens of PB). Second, when the data volume approaches the system's design capacity, capacity can be increased simply by adding servers. This expansion requires no downtime: HBase keeps running and serving reads and writes throughout, so capacity grows dynamically and seamlessly.
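The automatic split described above can be illustrated with a small sketch. This is a toy model, not HBase code: the `Region` and `Table` classes and the row-count threshold are assumptions made for illustration (HBase itself splits on a configurable size), and the assignment of regions to servers is left out.

```python
from bisect import bisect_right


class Region:
    """A contiguous, sorted range of row keys (hypothetical model)."""

    def __init__(self, start_key, rows=None):
        self.start_key = start_key      # inclusive lower bound of the range
        self.rows = dict(rows or {})    # row key -> value


class Table:
    def __init__(self, max_rows_per_region=4):
        # Assumed threshold, counted in rows to keep the sketch short.
        self.max_rows = max_rows_per_region
        self.regions = [Region("")]     # one region covers the whole key space

    def _locate(self, row_key):
        # Find the last region whose start key is <= row_key.
        starts = [r.start_key for r in self.regions]
        return bisect_right(starts, row_key) - 1

    def put(self, row_key, value):
        idx = self._locate(row_key)
        region = self.regions[idx]
        region.rows[row_key] = value
        if len(region.rows) > self.max_rows:
            self._split(idx)

    def _split(self, idx):
        # Move the upper half of the key range into a new region.
        region = self.regions[idx]
        keys = sorted(region.rows)
        mid = keys[len(keys) // 2]
        upper = Region(mid, {k: region.rows.pop(k) for k in keys if k >= mid})
        self.regions.insert(idx + 1, upper)

    def get(self, row_key):
        return self.regions[self._locate(row_key)].rows.get(row_key)


table = Table(max_rows_per_region=4)
for i in range(20):
    table.put(f"row{i:03d}", i)
```

Reads and writes keep working while the table grows from one region to several, because every request is routed by row key to whichever region currently owns that key range.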

Performance

One of HBase's goals is to support high-speed read and write access for a large number of concurrent users. It achieves this in two ways. First, data rows are partitioned horizontally and distributed across multiple servers, so requests from a large number of users are spread over different servers. The capacity of each server is limited, but thousands of servers together provide very high aggregate throughput. Second, HBase has an efficient caching mechanism that raises the cache hit rate and so improves access performance.
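Both ideas can be sketched in a few lines. The server names, split keys, and the row-level `LRUCache` below are assumptions for illustration only; they are not HBase's actual routing or cache implementation.

```python
from bisect import bisect_right
from collections import Counter, OrderedDict

# Hypothetical layout: three servers, each owning one sorted key range.
SPLIT_KEYS = ["g", "p"]   # ranges: keys < "g", "g" <= keys < "p", keys >= "p"
SERVERS = ["server-1", "server-2", "server-3"]


def server_for(row_key):
    """Route a request to the server that owns the row key's range."""
    return SERVERS[bisect_right(SPLIT_KEYS, row_key)]


class LRUCache:
    """Least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.items:
            self.items.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        value = load(key)                    # stands in for a disk read
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used
        return value


requests = ["alice", "henry", "peter", "alice", "henry", "zoe", "alice", "peter"]

# Requests spread over the servers instead of hitting a single machine.
load_per_server = Counter(server_for(key) for key in requests)

cache = LRUCache(capacity=4)
for key in requests:
    cache.get(key, load=lambda k: f"data-for-{k}")
# With this request sequence: cache.hits == 4 and cache.misses == 4
```

In this example the eight requests are split 3/2/3 across the three servers, and repeated reads of hot keys are served from memory rather than from storage.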

High Availability

HBase is built on HDFS, which provides automatic data replication and fault tolerance. HBase stores both its log and its data on HDFS, so even if a server fails (hard disk, memory, network, and so on), the log is not lost and the data can be recovered from it automatically. HBase automatically assigns other servers to take over and recover the failed server's data. Once data has been written successfully, it is therefore guaranteed to be persisted and replicated, which keeps the whole system highly available.
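The log-then-recover idea can be sketched as follows. `SharedLog` stands in for a log file kept on replicated storage, and `Server` for a server holding recent writes in memory; both are hypothetical names used only for this illustration.

```python
class SharedLog:
    """Stands in for a log on replicated storage that outlives any one server."""

    def __init__(self):
        self.entries = []

    def append(self, row_key, value):
        self.entries.append((row_key, value))


class Server:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.memory = {}   # in-memory data, lost if this server crashes

    def put(self, row_key, value):
        # Write to the log first, so the change survives a crash.
        self.log.append(row_key, value)
        self.memory[row_key] = value

    def recover_from_log(self):
        # Replay every logged write in order; later writes win.
        for row_key, value in self.log.entries:
            self.memory[row_key] = value


log = SharedLog()
primary = Server("server-1", log)
primary.put("row1", b"a")
primary.put("row2", b"b")
primary.put("row1", b"c")

# Simulate a crash: the server and its memory are gone, the shared log is not.
del primary

replacement = Server("server-2", log)
replacement.recover_from_log()
# replacement.memory == {"row1": b"c", "row2": b"b"}
```

Because every write reaches the shared log before it is acknowledged, a replacement server can rebuild the lost state by replaying the log.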

Data model and its characteristics

HBase is a column-oriented, sparse, distributed, persistent, multidimensional sorted map. The map is indexed by row key, column family name, column qualifier, and timestamp; each value is an uninterpreted array of bytes.
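A minimal in-memory sketch of such a map is shown below. The `SortedCellMap` class and the sample rows are assumptions for illustration. Storing the timestamp negated so that the newest version sorts first is a choice made for this sketch.

```python
from bisect import bisect_left, insort


class SortedCellMap:
    """Toy sorted map: (row, family, qualifier, timestamp) -> bytes."""

    def __init__(self):
        self._keys = []     # sorted list of (row, family, qualifier, -timestamp)
        self._values = {}   # key tuple -> bytes

    def put(self, row, family, qualifier, timestamp, value: bytes):
        key = (row, family, qualifier, -timestamp)
        if key not in self._values:
            insort(self._keys, key)   # keep the keys in sorted order
        self._values[key] = value

    def get_latest(self, row, family, qualifier):
        prefix = (row, family, qualifier)
        # A shorter tuple sorts before any longer tuple sharing its prefix,
        # so this finds the first (newest) version of the cell, if any.
        idx = bisect_left(self._keys, prefix)
        if idx < len(self._keys) and self._keys[idx][:3] == prefix:
            return self._values[self._keys[idx]]
        return None

    def scan(self):
        # Yield cells in sorted key order, newest version first per cell.
        for key in self._keys:
            row, family, qualifier, neg_ts = key
            yield (row, family, qualifier, -neg_ts), self._values[key]


cells = SortedCellMap()
cells.put("user2", "info", "name", 1, b"Bob")
cells.put("user1", "info", "name", 1, b"Alice")
cells.put("user1", "info", "name", 2, b"Alicia")
cells.put("user1", "msg", "subject", 1, b"hello")

latest = cells.get_latest("user1", "info", "name")   # b"Alicia"
```

Values are plain bytes: the map does not interpret them, so encoding and decoding are left to the application.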

Column-oriented: all data in the same column family is stored together in the same files, which reduces disk I/O when reading and writing and improves the compression ratio, because similar data is stored together. Compressed data typically occupies 1/3 to 1/5 of its original size, which saves a lot of storage space.
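The sketch below groups cells by column family and compresses one family's data with `zlib` from the Python standard library. The family names, sample values, and the `|`-separated serialization are assumptions for illustration; the ratio actually achieved depends on the data and the codec, so no particular figure is implied.

```python
import zlib
from collections import defaultdict

# One store per column family (hypothetical in-memory stand-in for files).
stores = defaultdict(list)   # family -> list of (row, qualifier, value)


def put(row, family, qualifier, value):
    stores[family].append((row, qualifier, value))


for i in range(1000):
    put(f"user{i:04d}", "profile", "country", b"CN")
    put(f"user{i:04d}", "activity", "last_page", f"/page/{i % 7}".encode())


def read_family(family):
    """Reading one family touches only that family's store."""
    return stores[family]


def serialized(family):
    # Simple line-based encoding, used here only to have bytes to compress.
    return b"\n".join(
        b"|".join((row.encode(), qualifier.encode(), value))
        for row, qualifier, value in stores[family]
    )


raw = serialized("profile")
packed = zlib.compress(raw)
# Similar values stored side by side compress well: len(packed) < len(raw)
```

A query that needs only the `profile` family never reads the `activity` data, which is where the I/O saving comes from.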

Multidimensional table: this is a major extension of the traditional two-dimensional relational table. A traditional table has two dimensions, rows and columns. Columns must be fixed when the table structure is designed, while rows can be added dynamically, so only one dimension can change dynamically. An HBase multidimensional table has four dimensions: row, column family, column, and time. Column families must be defined in advance when the table structure is designed, while rows, columns, and timestamped versions can be added dynamically, so three dimensions can change dynamically. This structure is well suited to describing data with nested relationships. The ability to add and remove columns dynamically is also convenient for businesses that are constantly evolving and whose need for new fields keeps growing: the table structure can change at any time to follow business needs.
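The difference between the fixed dimension and the dynamic ones can be sketched as follows. `FamilyTable` and its methods are hypothetical names for this illustration and do not mirror the HBase client API.

```python
class FamilyTable:
    """Toy table: column families are fixed, everything else is dynamic."""

    def __init__(self, families):
        self.families = frozenset(families)   # defined once, at design time
        self.cells = {}   # (row, family, qualifier) -> {timestamp: value}

    def put(self, row, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # New rows, new qualifiers, and new versions need no schema change.
        self.cells.setdefault((row, family, qualifier), {})[timestamp] = value

    def qualifiers(self, family):
        return sorted({q for (_, f, q) in self.cells if f == family})

    def versions(self, row, family, qualifier):
        found = self.cells.get((row, family, qualifier), {})
        return sorted(found.items(), reverse=True)   # newest first


table = FamilyTable(["info"])
table.put("user1", "info", "email", b"a@example.com", timestamp=1)
table.put("user1", "info", "phone", b"555-0100", timestamp=2)       # new column
table.put("user1", "info", "email", b"b@example.com", timestamp=3)  # new version

try:
    table.put("user1", "billing", "plan", b"pro", timestamp=4)
    rejected = False
except KeyError:
    rejected = True   # "billing" was never declared as a column family
```

Adding the `phone` column required no change to the table definition, while writing to the undeclared `billing` family was refused.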

Sparse tables: because the columns of a multidimensional table can be added dynamically, a given column will inevitably be empty in many rows, which means the table is sparse. Unlike traditional relational databases, HBase does not store empty values and keeps only the cells that have content, so it can support very large sparse tables with no storage overhead for empty cells. This also changes the traditional approach to table structure design considerably.
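A small sketch makes the saving concrete. The rows and column names below are made-up sample data; the point is only to compare the number of cells actually stored with the number a fixed-width table would need.

```python
# Each row stores only the cells that have content (hypothetical sample data).
rows = {
    "user1": {"info:name": b"Alice", "info:email": b"alice@example.com"},
    "user2": {"info:name": b"Bob", "info:phone": b"555-0100"},
    "user3": {"info:name": b"Carol", "info:fax": b"555-0199"},
}

# Every column that appears in at least one row.
all_columns = sorted({column for cells in rows.values() for column in cells})

stored_cells = sum(len(cells) for cells in rows.values())   # 6
dense_cells = len(rows) * len(all_columns)                  # 3 rows * 4 columns = 12


def get(row, column):
    """An absent cell simply reads as None; nothing was stored for it."""
    return rows.get(row, {}).get(column)
```

Here a dense layout would reserve 12 cells, while the sparse layout holds only the 6 that carry data, and the gap widens as more rarely used columns are added.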
