This article gives you a comprehensive overview of all the key knowledge points of the HBase database. Worth bookmarking!


Welcome to the big data and AI technical articles released by the public account Qing Research Academy, where you can read the carefully organized notes of Night White (the author's pen name). Let us make a little progress every day, so that excellence becomes a habit!

I. The basic concept of HBase: a column-oriented database

In the Hadoop ecosystem, HBase sits on top of HDFS (Hadoop Distributed File System) and does not depend on MapReduce. What would we do without a NoSQL database like HBase? Traditional relational databases are limited in how much data they can store, and their distributed deployments do not scale beyond about 100 nodes; a distributed Oracle database, for example, can only be deployed on about 100 nodes. It is in this context of massive data that column-oriented databases emerged, and the two most common ones are HBase and Cassandra. A column-oriented database, as the name implies, stores data by column, which means the fields of an HBase table can grow dynamically; this is why HBase is a NoSQL database.

II. The relationship between HBase, HDFS, and Hive/Pig:

HDFS is a distributed file system used to store data; it does not support real-time access or random reads/writes. HBase does support real-time access and random reads/writes, so HBase is mainly used for online data queries while HDFS is mainly used for data storage. Hive and Pig serve as data analysis engines; because they depend on MapReduce underneath, they have high latency and are mainly used for offline data queries.

III. Basic knowledge of HBase tables:

1. Table: tables are used to store and manage data, and a table consists of rows and columns.

2. Row key (Rowkey): unique and non-empty, it serves as the index of the HBase table. Features: cells with the same row key belong to the same record, and row keys are sorted in dictionary (lexicographic) order.

3. Column family: a collection of columns. Column families are defined when the table is created, e.g. create 'students', 'info', 'grade', where info and grade are two column families and students is the table name. Columns themselves are added dynamically when records are inserted (see the shell sketch after this list).

4. Timestamp: an attribute of each column value.

5. Cell: a cell can store multiple pieces of data, each carrying a timestamp attribute, which gives the data a version characteristic (versions are distinguished by timestamp). This is unique to the HBase table structure; in a relational database, a cell can store only one value. The shell sketch after this list shows a versioned read.

6. The records of an HBase table are divided into regions by row key, each region covering a contiguous range of row keys. Different regions are distributed across different RegionServers, so a query on the table is converted into parallel queries on multiple RegionServers. By sacrificing storage space for time, HBase is ideal for simple second-level queries over big data (the range scan in the sketch below exploits this layout).

7. A region consists of multiple stores, one store per column family. A store consists of one MemStore and zero or more StoreFiles. The MemStore holds the latest batch of data updates; when HBase writes data, it writes the data to the MemStore first.

(The region is the smallest unit of distributed storage and load balancing; the HFile is the smallest unit of storage.)
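
A minimal HBase shell sketch tying points 3, 5, and 6 together. The students table and its info/grade column families come from the example above; the row keys, column names, values, and the VERSIONS setting are made up for illustration:

    create 'students', 'info', 'grade'
    alter 'students', {NAME => 'info', VERSIONS => 3}      # keep up to 3 versions per cell
    put 'students', 's001', 'info:name', 'Tom'             # column info:name is created on the fly
    put 'students', 's001', 'info:age', '20'               # another dynamically added column
    put 'students', 's001', 'grade:math', '95'             # a column in the grade family
    put 'students', 's001', 'info:name', 'Thomas'          # same cell, newer timestamp
    get 'students', 's001', {COLUMN => 'info:name', VERSIONS => 3}   # versioned read, newest first
    scan 'students', {STARTROW => 's001', STOPROW => 's100'}         # range scan over sorted row keys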

IV. The HBase table:
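
Conceptually, the students table above can be pictured like this (all values are illustrative; empty cells are simply not stored):

    Row key | info:name | info:age | grade:math | grade:english
    s001    | Tom       | 20       | 95         |
    s002    | Jerry     |          |            | 88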

V. HBase table features:

1. Big: a table can consist of billions of rows and millions of columns.

2. Column-oriented: HBase tables store data by column.

3. Sparse: empty columns in an HBase table do not occupy storage space.

4. Schema-free: rows of an HBase table may have completely different columns, because columns are added dynamically when records are inserted.

5. Single data type: there is only one data type, the string.

VI. The system architecture of HBase:

HMaster: 1. Allocates regions to the RegionServers.

2. Responsible for RegionServer load balancing.

3. Discovers failed RegionServers and redistributes the regions on them.

4. Receives client requests: create, delete, and query operations on HBase tables.

RegionServer: 1. Maintains regions and handles client I/O requests to those regions.

2. Responsible for splitting regions that have grown too large.

3. Regularly reports heartbeat information to ZooKeeper.

ZooKeeper: 1. Stores the structure information of the HBase cluster and of the root table (-ROOT-) and meta table (.META.).

2. Monitors the RegionServers in real time and notifies the HMaster.

3. Implements the HA (high availability) function of HBase.

(HBase ships with its own ZooKeeper.)
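
You can inspect what HBase keeps in ZooKeeper with the ZooKeeper CLI that HBase bundles (a minimal sketch; the exact znodes listed vary with the HBase version):

    hbase zkcli        # open the bundled ZooKeeper command-line client
    ls /hbase          # list HBase znodes, e.g. master, rs, meta-region-server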

VII. Installing and configuring HBase:

1. Installation: tar -zxvf hbase-1.3.1-bin.tar.gz -C ~/training

2. Configure the HBASE_HOME environment variable: export HBASE_HOME=/root/training/hbase-1.3.1

export PATH=$HBASE_HOME/bin:$PATH

VIII. Installation modes for HBase (similar to Hadoop):

1. Local mode: the machine does not virtualize any extra nodes; there is only an HMaster and no RegionServer, and data is stored locally. Modify two configuration files: hbase-env.sh and hbase-site.xml.

2. Pseudo-distributed mode: a single machine virtualizes multiple nodes and provides all the functions of HBase. Modify two configuration files: hbase-env.sh and hbase-site.xml (see the sketch after this list).

3. Fully distributed mode: at least three machines. Modify three configuration files: hbase-env.sh, hbase-site.xml, and regionservers.

(One more file, regionservers, than in pseudo-distributed mode.)
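
A minimal sketch of what hbase-site.xml might contain in pseudo-distributed mode. The property names are standard HBase settings; the HDFS address and host name are assumptions for a single-node setup:

    <configuration>
      <!-- where HBase stores its data on HDFS (assumed NameNode address) -->
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
      </property>
      <!-- run in distributed mode, even on a single machine -->
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
      <!-- use the ZooKeeper bundled with HBase on this host -->
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>localhost</value>
      </property>
    </configuration>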

Note: the HTTP service (web UI) port of HBase is 16010.

IX. The read and write process of HBase:

1. Write process: the records of an HBase table are divided into regions by row key, and different regions are distributed across different RegionServers. A region consists of multiple stores, one per column family, and each store consists of one MemStore and zero or more StoreFiles. Written data first goes to the MemStore, which holds the latest batch of updates. When the MemStore fills up (128 MB), it spills to disk and forms a StoreFile; when the number of StoreFiles reaches a certain threshold, they are merged into a single StoreFile; and when a StoreFile grows beyond 256 MB, the region automatically splits and the HMaster assigns the new regions to other RegionServers. Ultimately, the StoreFile data is persisted as HFile files (in 128 MB blocks) on the DataNodes.

2. Read process: the client first obtains the location of the root table (-ROOT-) from ZooKeeper, reads the root table to find the meta table (.META.), and reads the meta table to obtain the meta-information of the target region. It then goes to that region and looks for the data in the MemStore; if the data is not there, it searches the StoreFiles.

(The HBase read/write process in one sentence: addressing goes through ZooKeeper; data reads and writes go through the RegionServers.)
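
The whole path can be exercised from the HBase shell (a minimal sketch reusing the students table; flush is the standard shell command that forces the MemStore to disk):

    put 'students', 's002', 'info:name', 'Jerry'   # write: lands in the MemStore first
    get 'students', 's002'                         # read: served from the MemStore
    flush 'students'                               # force the MemStore out as a StoreFile
    get 'students', 's002'                         # read: now served from the StoreFile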

X. Filters on HBase: implementing complex queries.
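
Two shell-level sketches of filters (PrefixFilter and ValueFilter are standard HBase filters; the table, column, and values reuse the students example):

    scan 'students', {FILTER => "PrefixFilter('s0')"}    # rows whose row key starts with 's0'
    scan 'students', {COLUMNS => 'info:age', FILTER => "ValueFilter(=, 'binary:20')"}   # cells in info:age equal to 20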

XI. MapReduce on HBase: the input to map is a record in HBase, and the output of reduce is a record in HBase.
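
In Java such jobs are written with HBase's TableMapper/TableReducer classes; HBase also ships ready-made MapReduce jobs that can be launched from the command line, for example the row counter (the table name reuses the students example):

    # runs a MapReduce job that takes the records of an HBase table as map input
    hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'students'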

XII. HBase HA: start an additional HMaster separately: hbase-daemon.sh start master

Li Jinze (Allenli), master's student at Tsinghua University. Research direction: big data and artificial intelligence.

