Big Data Learning Series: HBase

Last Update:2018-10-02 Source: Internet

Author: User

Tags: hadoop, mapreduce, hadoop ecosystem


Hadoop ecosystem

ZooKeeper: responsible for coordination; HBase depends on ZooKeeper

Flume: log collection tool

Sqoop: transfers data between HDFS and relational databases (DBMS)

About HBase

HBase is the Hadoop database.

It is a highly reliable, high-performance, column-oriented, scalable distributed database that supports real-time reads and writes.
It uses Hadoop HDFS as its file storage system, Hadoop MapReduce to process the massive amounts of data stored in HBase, and ZooKeeper as its distributed coordination service.
It is used primarily to store unstructured and semi-structured, loosely organized data (it is a NoSQL database).

HBase Data Model

Row Key

Uniquely identifies a row of data
Rows are sorted by row key in lexicographic (byte) order
A row key can hold at most 64 KB of bytes
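
A small sketch of what byte-order sorting means in practice. It uses plain Python byte strings rather than HBase itself, and the key names are made up for illustration. The point is that keys compare byte by byte, so numeric suffixes only sort numerically when they are zero-padded.

```python
# Toy illustration (not the HBase API): row keys are byte strings that are
# compared byte by byte, so b"row10" sorts before b"row2".
unpadded = sorted([b"row2", b"row10", b"row1"])

# Zero-padding the numeric part makes byte order match numeric order.
padded = sorted([b"row002", b"row010", b"row001"])

print(unpadded)  # [b'row1', b'row10', b'row2']
print(padded)    # [b'row001', b'row002', b'row010']
```

This is why row key design usually pads or otherwise normalizes numeric components before they are written.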

Column Family and Qualifier

Each column in an HBase table belongs to a column family, and column families must be declared as part of the table schema when the table is created, for example with a create 'test', ... statement in the HBase shell, which names the table followed by its column families.
Column names are prefixed with their column family, and each column family can have multiple column members (qualifiers), such as test:testfirst. New column members can be added later, on demand and dynamically.
Access control, storage, and tuning are all done at the column family level.
HBase stores the data of one column family in the same directory, spread across several files.
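
The difference between fixed column families and free-form qualifiers can be sketched with a toy in-memory table. This is an illustrative model only, not the HBase client API; the class name `ToyTable`, the family name `cf`, and the qualifier names are all assumptions made for the example.

```python
class ToyTable:
    """Toy model: families are fixed at creation, qualifiers are free-form."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)  # declared up front, part of the schema
        self.rows = {}                 # row key -> {"family:qualifier": value}

    def put(self, row, column, value):
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {})[f"{family}:{qualifier}"] = value


table = ToyTable("test", families=["cf"])
table.put(b"row1", "cf:first", b"a")
table.put(b"row1", "cf:added_later", b"b")  # new qualifier, no schema change

try:
    table.put(b"row1", "other:x", b"c")     # family was never declared
    rejected = False
except KeyError:
    rejected = True

print(sorted(table.rows[b"row1"]), rejected)
```

The second `put` succeeds without touching the schema, while the write to an undeclared family is refused.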

Timestamp

In HBase, each cell can hold multiple versions of the same data. Versions are distinguished by a unique timestamp and are sorted in reverse chronological order, so the most recent version comes first.
The timestamp is a 64-bit integer.
The timestamp can be assigned by HBase automatically when the data is written, in which case it is the current system time with millisecond precision.
The timestamp can also be assigned explicitly by the client. If the application wants to avoid version conflicts, it must generate its own unique timestamps.
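
The versioning rules above can be sketched as a toy cell that keeps every version and returns them newest first. This is a simplified model, not HBase code; the class `VersionedCell` and its methods are hypothetical names. In this toy, writing twice with the same timestamp overwrites the earlier value, which is the kind of conflict that unique timestamps are meant to avoid.

```python
import time


class VersionedCell:
    """Toy cell that keeps every version, newest first."""

    def __init__(self):
        self._versions = {}  # timestamp in milliseconds -> value

    def put(self, value, timestamp=None):
        # Without an explicit timestamp, use the current time in milliseconds.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self._versions[timestamp] = value  # same timestamp overwrites
        return timestamp

    def versions(self):
        # Reverse chronological order: most recent version first.
        return sorted(self._versions.items(), reverse=True)

    def latest(self):
        return self.versions()[0][1]


cell = VersionedCell()
cell.put(b"v1", timestamp=1000)
cell.put(b"v3", timestamp=3000)
cell.put(b"v2", timestamp=2000)
print(cell.versions())  # [(3000, b'v3'), (2000, b'v2'), (1000, b'v1')]
print(cell.latest())    # b'v3'
```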

Cell

A cell is determined by the intersection of row and column coordinates;
The cell is versioned;
The contents of a cell are an uninterpreted array of bytes;

A cell is the unit uniquely identified by {row key, column (= <family> + <qualifier>), version}.
The data in a cell has no type; it is stored as raw bytes.
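
Because cells hold untyped bytes, it is the application that chooses an encoding and decodes values on the way out. The sketch below models this with a plain dictionary keyed by the {row key, column, version} coordinates; it is not HBase code, and the column names and the big-endian 8-byte integer encoding are assumptions chosen for the example.

```python
import struct

# Toy cell store keyed by (row key, "family:qualifier", version timestamp).
cells = {}


def put(row, column, version, value):
    # The store accepts bytes only; it never interprets them.
    if not isinstance(value, bytes):
        raise TypeError("cell values must be bytes")
    cells[(row, column, version)] = value


# The writer picks an encoding for each column.
put(b"row1", "cf:count", 1, struct.pack(">q", 42))   # 8-byte big-endian integer
put(b"row1", "cf:name", 1, "hbase".encode("utf-8"))  # UTF-8 text

# The reader must know the same encoding to get the value back.
raw = cells[(b"row1", "cf:count", 1)]
count = struct.unpack(">q", raw)[0]
name = cells[(b"row1", "cf:name", 1)].decode("utf-8")
print(len(raw), count, name)  # 8 42 hbase
```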

HLog (WAL)

The HLog file is an ordinary Hadoop SequenceFile. The key of each SequenceFile record is an HLogKey object, which records where the written data belongs: in addition to the table and region names, it includes a sequence number and a timestamp. The timestamp is the write time. The sequence number starts at 0, or at the last sequence number that was persisted to the file system.
The value of each HLog SequenceFile record is an HBase KeyValue object, which corresponds to the KeyValue in an HFile.
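
The shape of a log record described above can be sketched as a key (table, region, sequence number, write time) paired with a value. This is a toy in-memory model, not the real HLog or SequenceFile format; `ToyLogKey`, `ToyWal`, and the `start_sequence` parameter are hypothetical names, and how a starting sequence number is recovered from the file system is not modeled.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLogKey:
    table: str
    region: str
    sequence: int
    write_time_ms: int


class ToyWal:
    def __init__(self, start_sequence=0):
        # 0 for a fresh log, or a value recovered from earlier persisted data.
        self.next_sequence = start_sequence
        self.entries = []  # list of (ToyLogKey, value bytes) in write order

    def append(self, table, region, value):
        key = ToyLogKey(table, region, self.next_sequence, int(time.time() * 1000))
        self.entries.append((key, value))
        self.next_sequence += 1
        return key


wal = ToyWal()
first = wal.append("test", "region-a", b"put row1")
second = wal.append("test", "region-a", b"put row2")
print(first.sequence, second.sequence)  # 0 1
```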

HBase Architecture

Client

Provides the interface for accessing HBase and maintains caches to speed up access to HBase

ZooKeeper

Ensures that there is only one active master in the cluster at any time
Stores the addressing entry point for all regions
Monitors region servers coming online and going offline in real time, and notifies the master in real time
Stores the HBase schema and table metadata

Master

Assigns regions to region servers
Is responsible for load balancing across region servers
Detects failed region servers and reassigns the regions they were hosting
Manages user requests to add, delete, and modify tables
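
The assignment and failover duties listed above can be sketched with two small functions. This is a deliberately simple model, not HBase's actual balancer: round-robin placement and "move to the least-loaded survivor" are assumptions chosen for the example, and the region and server names are made up.

```python
def assign(regions, servers):
    """Spread regions over servers round-robin (a deliberately simple balancer)."""
    assignment = {server: [] for server in servers}
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append(region)
    return assignment


def reassign_after_failure(assignment, failed):
    """Move the failed server's regions to the least-loaded surviving servers."""
    orphaned = assignment.pop(failed)
    for region in orphaned:
        target = min(assignment, key=lambda server: len(assignment[server]))
        assignment[target].append(region)
    return assignment


layout = assign(["r1", "r2", "r3", "r4", "r5", "r6"], ["rs1", "rs2", "rs3"])
print(layout)  # {'rs1': ['r1', 'r4'], 'rs2': ['r2', 'r5'], 'rs3': ['r3', 'r6']}

layout = reassign_after_failure(layout, "rs3")
print(layout)  # {'rs1': ['r1', 'r4', 'r3'], 'rs2': ['r2', 'r5', 'r6']}
```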

RegionServer

A region server maintains its regions and handles I/O requests to them
A region server is responsible for splitting regions that have grown too large during operation

Region

HBase automatically divides a table horizontally into regions, and each region holds a contiguous range of the table's rows
Each table starts with a single region. As data is inserted, the region grows, and once it reaches a threshold it is split into two new regions
As the number of rows in the table grows, so does the number of regions, so a complete table ends up stored across multiple region servers.
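
The split behaviour described above can be sketched with a toy model in which each region holds a contiguous, sorted range of row keys and is cut in half once it grows too large. This is not HBase code: the class `ToyRegion`, the `put` helper, and the use of a row-count threshold (`max_rows`) are simplifying assumptions for illustration.

```python
class ToyRegion:
    def __init__(self, rows=None):
        self.rows = dict(rows or {})  # row key -> value

    def key_range(self):
        keys = sorted(self.rows)
        return keys[0], keys[-1]


def put(regions, row, value, max_rows=4):
    """Insert into the region covering `row`; split that region if it grows too large."""
    # Regions are kept in key order: pick the last one whose first key <= row.
    index = 0
    for i, region in enumerate(regions):
        if region.rows and min(region.rows) <= row:
            index = i
    region = regions[index]
    region.rows[row] = value
    if len(region.rows) > max_rows:
        keys = sorted(region.rows)
        mid = len(keys) // 2
        left = ToyRegion({k: region.rows[k] for k in keys[:mid]})
        right = ToyRegion({k: region.rows[k] for k in keys[mid:]})
        regions[index:index + 1] = [left, right]  # one region becomes two


regions = [ToyRegion()]  # a new table starts with a single region
for key in [b"a", b"b", b"c", b"d", b"e", b"f", b"g", b"h"]:
    put(regions, key, b"x")

ranges = [region.key_range() for region in regions]
print(ranges)  # [(b'a', b'b'), (b'c', b'd'), (b'e', b'h')]
```

Each resulting region still covers a contiguous key range, which is what allows the pieces to be placed on different servers.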

MemStore and StoreFile

A region is made up of multiple stores, and each store corresponds to one column family (CF)
A store consists of an in-memory MemStore and on-disk StoreFiles. Writes go to the MemStore first. When the data in the MemStore reaches a threshold, the HRegionServer starts a flashcache (flush) process that writes it out to a StoreFile, and each flush produces a separate StoreFile
When the number of StoreFiles grows past a threshold, the system merges them (minor and major compaction). Version merging and deletion are carried out during the merge (major compaction), producing a larger StoreFile
When the size and number of all StoreFiles in a region exceed a threshold, the region is split in two and the HMaster assigns the resulting regions to region servers, which achieves load balancing
When a client reads data, it looks in the MemStore first and then in the StoreFiles
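
The write path, flush, compaction, and read order described above can be sketched as a toy store. This is a simplified in-memory model, not HBase's implementation: the class `ToyStore`, the entry-count thresholds, and the single "merge everything" compaction are assumptions made for the example (real thresholds are size based, and minor and major compactions differ).

```python
class ToyStore:
    def __init__(self, flush_threshold=3, compact_threshold=3):
        self.memstore = {}    # in-memory writes: key -> value
        self.storefiles = []  # flushed snapshots, oldest first
        self.flush_threshold = flush_threshold
        self.compact_threshold = compact_threshold

    def put(self, key, value):
        self.memstore[key] = value  # writes always land in memory first
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Each flush produces one new, separate store file.
        self.storefiles.append(dict(self.memstore))
        self.memstore = {}
        if len(self.storefiles) >= self.compact_threshold:
            self.compact()

    def compact(self):
        # Merge all files into one; newer files win on duplicate keys.
        merged = {}
        for storefile in self.storefiles:
            merged.update(storefile)
        self.storefiles = [merged]

    def get(self, key):
        # Read path: memstore first, then store files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for storefile in reversed(self.storefiles):
            if key in storefile:
                return storefile[key]
        return None


store = ToyStore()
for i in range(7):
    store.put(f"k{i}", f"v{i}")
store.put("k0", "v0-new")
print(len(store.storefiles), store.get("k0"), store.get("k4"))  # 2 v0-new v4

store.put("k7", "v7")  # third flush, which triggers a compaction
print(len(store.storefiles), store.get("k0"))  # 1 v0-new
```

In the first print, the updated `k0` is still served from the memstore while `k4` comes from a store file; after the compaction, one merged file holds the newest value of every key.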

HRegion

An HRegion is the smallest unit of distributed storage and load balancing in HBase. Being the smallest unit means that different HRegions can be placed on different HRegionServers, but a single HRegion is never split across servers.
An HRegion consists of one or more stores, and each store holds one column family
Each store is made up of one MemStore and zero or more StoreFiles. StoreFiles are stored on HDFS in HFile format.


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.