HBase Introduction 2

Last Update: 2018-06-05 Source: Internet

Author: User

Tags: columnar database


What is HBase?
HBase began as a sub-project of Apache Hadoop and is now a top-level Apache project. It relies on Hadoop's HDFS as its underlying storage layer, so you can use Hadoop's DFS tool to inspect the directory structure in which its data is stored. You can also use the MapReduce framework to operate on HBase data.

HBase also bundles Jetty and starts it in embedded mode during HBase startup. As a result, you can view the current running status of HBase through a lightweight web interface.

Why HBase?
Unlike a general relational database, HBase is suited to storing unstructured data. Here, unstructured storage means that HBase organizes data by column rather than by row, which makes it easier to read and write large volumes of data.

HBase sits somewhere between a map entry (key and value) and a database row. It is somewhat similar to the popular Memcached, but it is not limited to one key mapping to one value. You may need to store a structure with multiple attributes, yet without the many relationships found between tables in a traditional database. This is what is called loose data.

Simply put, a table you create in HBase is one large table whose attributes can be added dynamically as needed, and there are no join queries between tables in HBase. You only need to tell HBase which column family your data is stored in. You do not need to specify a type such as char, varchar, int, tinyint, or text. Note, however, that HBase does not provide general transaction support.

Apache HBase is very similar to Google Bigtable. Each data row has a key by which it can be selected and any number of columns. Tables are stored loosely (sparsely), so different rows can define different columns. This feature is very useful for large projects because it simplifies design and reduces upgrade costs.
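The loose table model described above can be illustrated with a small in-memory sketch. This is a toy model only, not the HBase client API: the `SparseTable` class, its `put` and `get` methods, and the `info` column family are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory model of a loosely structured table (not the HBase API)."""

    def __init__(self, column_families):
        # Column families are declared up front; columns inside them are not.
        self.column_families = set(column_families)
        # row key -> {"family:qualifier": value stored as bytes}
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        family = column.partition(":")[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Values are kept as untyped bytes; no char/varchar/int declaration.
        self.rows[row_key][column] = str(value).encode("utf-8")

    def get(self, row_key):
        # Return a copy of the columns that exist for this row.
        return dict(self.rows.get(row_key, {}))


table = SparseTable(column_families=["info"])
table.put("emp1", "info:lastname", "Smith")
table.put("emp1", "info:salary", 40000)
# A second row may carry a column that the first row never defined.
table.put("emp2", "info:lastname", "Jones")
table.put("emp2", "info:nickname", "MJ")

print(table.get("emp1"))
print(table.get("emp2"))
```

In this sketch only the column family is fixed in advance. Each row keeps just the columns that were actually written to it, which is the sense in which the table is loose.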

Columnar Database

A columnar database stores data in a column-oriented storage architecture. It is mainly suited to batch data processing and ad hoc queries. Its counterpart is the row-based database, which stores data in a row-oriented architecture, is mainly suited to processing small batches of data, and is often used for online transaction processing.

Description

A database presents data as a two-dimensional table of rows and columns, for example, the following table:

EmpId	Lastname	Firstname	Salary
1	Smith	Joe	40000
2	Jones	Mary	50000
3	Johnson	Cathy	44000

This simple table includes the employee ID (EmpId), name fields (Lastname and Firstname), and salary (Salary).

This table is held in the computer's memory (RAM) and storage (hard disk). Although memory and hard disks work differently, the operating system stores data on both in a similar way. The database must convert the two-dimensional table into a one-dimensional series of bytes, which the operating system writes to memory or to the hard disk.

A row-based database stores all the data values of one row together, then the values of the next row, and so on.

      1,Smith,Joe,40000;

      2,Jones,Mary,50000;

      3,Johnson,Cathy,44000;

A columnar database stores all the data values of one column together, then the values of the next column, and so on.

      1,2,3;

      Smith,Jones,Johnson;

      Joe,Mary,Cathy;

      40000,50000,44000;
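The two layouts above can be reproduced with a short sketch. It assumes the example table is held as a list of tuples; the function names are hypothetical, and real databases use binary formats rather than comma-separated text.

```python
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def serialize_row_oriented(rows):
    # Write every value of one row, then move on to the next row.
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)


def serialize_column_oriented(rows):
    # Transpose the table so each group holds all values of one column.
    columns = zip(*rows)
    return "".join(",".join(str(v) for v in col) + ";" for col in columns)


row_layout = serialize_row_oriented(rows)
column_layout = serialize_column_oriented(rows)

print(row_layout)     # 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
print(column_layout)  # 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
```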

This is only a simplified description. In real applications, partitioning, indexing, caching mechanisms, views, multidimensional datasets for online analysis, and transaction mechanisms such as write-ahead logging and multiversion concurrency control all play a role. In general, systems that focus on online transaction processing (OLTP) are better suited to row-based databases, while systems that focus on online analytical processing (OLAP) must find an appropriate balance between row-based and columnar databases.

Features

Because hard disk seek time is slow compared with the speed of the other components in a computer, row-based and columnar databases are often compared by their disk access performance under the same workload. In general, sequential reads are faster than random access [1]. In addition, hard disk seek times have improved much more slowly than CPU speeds (see Moore's Law), and this is likely to remain the case for some time in systems that use hard disks as their storage medium. The considerations for choosing between a row-based and a columnar database are listed briefly below. Of course, if the data fits entirely in memory, an in-memory database will perform better.

1. Organizing data by column is more efficient when only a few columns need to be aggregated. Only part of the data has to be read, which is faster than reading all of it.

2. Organizing data by column is more efficient when only one column value needs to be modified, because the value can be located and changed directly in its column without touching the other columns of the row.

3. Organizing data by row is more efficient when many columns of the same row are needed. If the row is small, all of its data can be fetched with a single disk seek.

4. Organizing data by row is more efficient when adding a row in which every column has a value, because only one disk seek is needed to write all the data in the row.
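The first consideration above can be made concrete with a rough sketch that counts how many stored values each layout must read to total the Salary column of the example table. It assumes a simplified cost model in which a row store fetches whole rows and a column store fetches only the requested column; the function names are hypothetical.

```python
COLUMNS = ["EmpId", "Lastname", "Firstname", "Salary"]
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]
# Column layout: one list of values per column name.
columns = {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def sum_salary_row_store(rows):
    """Sum Salary from a row layout; every value of each row is read."""
    salary_index = COLUMNS.index("Salary")
    total = 0
    values_read = 0
    for row in rows:
        values_read += len(row)  # the whole row is fetched
        total += row[salary_index]
    return total, values_read


def sum_salary_column_store(columns):
    """Sum Salary from a column layout; only that column is read."""
    salary = columns["Salary"]
    return sum(salary), len(salary)


row_total, row_reads = sum_salary_row_store(rows)
col_total, col_reads = sum_salary_column_store(columns)

print(row_total, row_reads)  # 134000 12
print(col_total, col_reads)  # 134000 3
```

Both layouts give the same total, but under this cost model the column layout touches 3 values where the row layout touches 12. The gap widens as the table gains columns that the query does not use.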

In practice, a row-oriented storage architecture is better suited to OLTP workloads with frequent interactive transactions. A column-oriented storage architecture is better suited to OLAP workloads such as data warehouses, where a limited number of complex queries run over massive data volumes (up to the terabyte range; 1 TB = 1000 GB).

Benefits of column storage:

1. Because the selection rules in a query are defined by column, the entire database is in effect automatically indexed;

2. The data for each field is stored together by column, so a query that touches only a few fields reads far less data;

3. Because the data for one field is stored together, it is easier to design an effective compression and decompression algorithm for that clustered storage.
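The third benefit can be sketched with run-length encoding, one simple compression scheme that works well when the values of a single column are stored next to each other. The `department` column below is hypothetical sample data, and run-length encoding is used here only as an illustration, not as the scheme any particular database uses.

```python
from itertools import groupby


def run_length_encode(values):
    # Collapse each run of identical adjacent values into (value, count).
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    # Expand (value, count) pairs back into the original sequence.
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column: adjacent values of one field often repeat.
department = ["Sales", "Sales", "Sales", "Ops", "Ops", "Sales"]

encoded = run_length_encode(department)
print(encoded)  # [('Sales', 3), ('Ops', 2), ('Sales', 1)]
```

In a row layout the same values would be interleaved with names and salaries, so runs like these would not form.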

[Figure: differences between traditional row-store and column-store]

This article is an English version of an article originally published in Chinese on aliyun.com.
