Hadoop Cluster (Phase 12): HBase Introduction and Installation


About HBase

HBase is a database in the Apache Hadoop ecosystem that provides random, real-time read and write access to very large datasets; it is an open-source implementation of Google's Bigtable. The goal of HBase is to store and process big data: specifically, to handle very large tables made up of huge numbers of rows and columns using only ordinary (commodity) hardware configurations.
HBase is an open-source, distributed, multi-versioned, column-oriented store. It can use the local file system directly or Hadoop's HDFS file storage system; to improve data reliability, system robustness, and HBase's ability to handle large datasets, it is better to use HDFS as the file storage system. HBase stores sparse data: its data model sits between a pure key/value mapping and a relational model. Logically, the data stored in HBase forms one large table whose columns can be added dynamically as needed, and each cell can hold multiple versions of its data, distinguished by timestamp. HBase relies on the storage layer below it (HDFS) and serves the computation layers above it (such as MapReduce).

HBase Architecture

HBase follows a simple master-slave server architecture consisting of an HRegionServer farm and an HBase Master server. The HBase Master manages all HRegionServers, and all the RegionServers in HBase coordinate through ZooKeeper, which also handles errors that may occur while an HBase server is running. The HBase Master server itself does not store any HBase data: a logical HBase table is divided into multiple regions stored on the HRegionServer farm, and the HBase Master server stores the mapping from data (regions) to HRegionServers. The components of this architecture are described below.

1) Client
The HBase client uses HBase's RPC mechanism to communicate with the HMaster and the HRegionServers: for management-class operations the client performs RPC with the HMaster, and for data read/write operations the client performs RPC with the HRegionServers, as the sketch below illustrates.
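A minimal sketch in the HBase shell (the table name 'test' and column family 'cf' are illustrative, and a running cluster is assumed): administrative commands are served through the HMaster, while reads and writes go to the HRegionServer that holds the target row's region.

    # administrative operations -- routed through the HMaster
    create 'test', 'cf'                    # create a table with one column family
    disable 'test'                         # take the table offline
    enable 'test'                          # bring it back online

    # data operations -- served by the HRegionServer hosting the row's region
    put 'test', 'row1', 'cf:a', 'value1'   # write one cell
    get 'test', 'row1'                     # read the row back

These commands are entered inside an hbase shell session.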
2) ZooKeeper
In addition to storing the address of the -ROOT- table and the address of the HMaster, the ZooKeeper quorum holds an ephemeral node registered by each HRegionServer, so that the HMaster can sense the health of every HRegionServer at any time. ZooKeeper also helps avoid making the HMaster a single point of failure.
3) HMaster
Every HRegionServer communicates with the HMaster, whose main task is to tell each HRegionServer which regions it maintains.
When a new HRegionServer registers with the HMaster, the HMaster tells it to wait for regions to be allocated to it. When an HRegionServer dies, the HMaster marks the regions it was responsible for as unassigned and then assigns them to other HRegionServers.
HBase has addressed the HMaster single point of failure (SPOF) problem: multiple HMasters can be started, and ZooKeeper ensures that exactly one of them is active at any time (a sketch of starting a standby master follows the list below). Functionally, the HMaster is primarily responsible for managing tables and regions, including:

    • Managing users' add, delete, modify, and query operations on tables
    • Managing HRegionServer load balancing and adjusting the distribution of regions
    • Assigning the new regions produced when a region splits
    • Migrating the regions of a failed HRegionServer to other HRegionServers after an outage
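A minimal sketch of a standby HMaster, assuming the /usr/local/hbase layout used later in this article and a second node already configured like the first: start another master process there, and ZooKeeper's election decides which one is active.

    # on a second node (hypothetical hostname 'master2'):
    /usr/local/hbase/bin/hbase-daemon.sh start master
    # if the active HMaster dies, ZooKeeper lets this standby take over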

4) HRegion
When a table grows beyond a configured size, HBase automatically divides it into different regions, each containing a subset of the table's rows. To the user, each table is one collection of data rows distinguished by row key (primary key). Physically, a table is split into multiple pieces, each of which is an HRegion, identified by table name + start/end row key. Each HRegion holds a contiguous range of a table's rows, from its start key to its end key, so a complete table is stored across multiple HRegions.

In other words, as a table grows with the number of records, it gradually splits into multiple regions, each represented by an interval [startKey, endKey); different regions are assigned by the Master to the corresponding RegionServers for management, which can be inspected as sketched below.
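On HBase 0.94 the region boundaries and their assigned servers can be read from the catalog table in the HBase shell (column names per the 0.94 .META. schema):

    # list each region's boundaries and the server it is assigned to
    scan '.META.', {COLUMNS => ['info:regioninfo', 'info:server']}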

5) HRegionServer
All HBase data is normally stored on the Hadoop distributed file system, and users obtain this data through the HRegionServers. A machine generally runs only one HRegionServer, and each region is maintained by only one HRegionServer at a time. The internal storage structure of an HRegionServer is described below.

The HRegionServer is the core module of HBase: it is primarily responsible for responding to user I/O requests and for reading and writing data to the HDFS file system. Internally, an HRegionServer manages a series of HRegion objects, each corresponding to one region of a table, and each HRegion consists of multiple HStores. Each HStore stores one column family of the table, so every column family is in effect a separate, centralized storage unit; it is therefore most efficient to place columns with similar I/O characteristics in the same column family.
The HStore is the core of HBase storage and consists of two parts: a MemStore and a set of StoreFiles. The MemStore is a sorted memory buffer. Data written by the user is first placed in the MemStore; when the MemStore fills up, it is flushed into a StoreFile (whose underlying format is HFile). When the number of StoreFiles grows past a threshold, a compaction is triggered that merges multiple StoreFiles into one; version merging and data deletion are performed during compaction, so HBase in fact only ever appends data, and all updates and deletes happen in later compactions. This lets a user's write return as soon as it reaches memory, guaranteeing high I/O performance. As StoreFiles are compacted they gradually grow larger; when a single StoreFile exceeds a threshold, a split is triggered: the current region is split into two, the parent region is taken offline, and the two new child regions are assigned by the HMaster to the appropriate HRegionServers, so the load of the original region is spread across two regions. These compaction and split mechanics can be exercised by hand, as sketched below.
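For experimentation, the HBase shell exposes these mechanisms directly (using the illustrative table 'test' from earlier; in normal operation the cluster triggers them automatically):

    flush 'test'           # force the MemStores to flush into new StoreFiles
    major_compact 'test'   # merge all StoreFiles of each store into one
    split 'test'           # request a split of the table's region(s)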

Having understood the basic workings of the HStore, we also need to understand the HLog, because the HStore mechanism above is only safe while the system runs normally. In a distributed environment, system errors and outages cannot be avoided, and if an HRegionServer exits unexpectedly the in-memory data in its MemStores is lost; this is why the HLog is introduced. Each HRegionServer has one HLog object, a class implementing a write-ahead log: every user write appends a copy of the data to the HLog before it enters the MemStore. The HLog periodically rolls over to a new file and deletes old files whose data has already been persisted to StoreFiles. When an HRegionServer terminates unexpectedly, the HMaster learns of it through ZooKeeper and first processes the leftover HLog files, splitting their log entries by region and placing each piece in the corresponding region's directory, and then redistributes the failed regions. When an HRegionServer loads one of these regions, it finds that there is a historical HLog to process, replays the HLog entries into its MemStore, and then flushes to StoreFiles, completing the data recovery.
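With hbase.rootdir set to hdfs://master:9000/hbase (as configured later in this article), the 0.94 layout keeps the write-ahead logs under the .logs directory, one subdirectory per region server; the subdirectory name below is a placeholder:

    hadoop fs -ls /hbase/.logs                          # one directory per HRegionServer
    hadoop fs -ls /hbase/.logs/<host,port,startcode>    # that server's HLog files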
6) HBase storage format
All data files in HBase are stored on the Hadoop HDFS file system, including the two file types mentioned above:

    • HFile: the storage format for KeyValue data in HBase. HFile is a Hadoop binary-format file; a StoreFile is a lightweight wrapper around an HFile, i.e. the bottom layer of a StoreFile is an HFile.
    • HLogFile: the storage format of HBase's WAL (write-ahead log), which is physically a Hadoop SequenceFile.
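HBase ships a pretty-printer for inspecting HFiles; a sketch (the HDFS path is hypothetical, so pick a real StoreFile from your table's directory; the flags are those of the 0.94 tool):

    # print metadata (-m), key/values (-p) and verbose info (-v) for one HFile
    hbase org.apache.hadoop.hbase.io.hfile.HFile -v -p -m -f \
        hdfs://master:9000/hbase/test/<region-id>/cf/<storefile-name>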

7) -ROOT- table and .META. table
The region metadata of user tables is stored in the .META. table; as the number of regions grows, the data in the .META. table also grows and splits into multiple regions. To locate the regions of the .META. table itself, the metadata of all .META. regions is stored in the -ROOT- table, and the location of the -ROOT- table is in turn recorded in ZooKeeper. Before accessing user data, a client must first query ZooKeeper to obtain the -ROOT- location, use -ROOT- to find the location of the relevant .META. region, and use the information in .META. to determine where the user data resides. The -ROOT- table is never split, so it has exactly one region, which guarantees that any region can be located in at most three hops. To speed up access, all regions of the .META. table are kept in memory. If each row of the .META. table occupies about 1KB of memory and each region is limited to 128MB, this three-tier structure can address (128MB/1KB) × (128MB/1KB) = 2^17 × 2^17 = 2^34 regions.
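Both catalog tables can be inspected from the 0.94 HBase shell (on a small cluster, -ROOT- typically points at a single .META. region):

    scan '-ROOT-'    # rows describe the regions of the .META. table
    scan '.META.'    # rows describe the regions of user tables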

HBase Data Model

HBase is a distributed database similar to Bigtable: a sparse, persistent (stored on disk), multidimensional, sorted mapping table. Users store data rows in tables; each row has a sortable row key and an arbitrary number of columns. Because storage is sparse, rows in the same table can have very different sets of columns.
A column name has the format <family>:<qualifier> (column family:qualifier), and both parts are strings. Each table has a set of column families, which is fixed and can only be changed by altering the table structure; the qualifier, however, can vary from row to row. HBase stores the data of one column family under the same directory, and HBase writes are atomic per row: each row is an atomic element that can be locked. Every update in HBase carries a timestamp; each update creates a new version of the cell, and HBase retains a configurable number of versions. An example of altering the family set follows.
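For example, in the HBase shell (continuing the illustrative 'test' table; on 0.94 a table usually must be disabled before altering it), adding a column family changes the table structure, whereas a new qualifier needs no declaration at all:

    disable 'test'
    alter 'test', NAME => 'cf2'                 # add a new column family
    enable 'test'
    put 'test', 'row1', 'cf2:anything', 'v1'    # new qualifier, no declaration needed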

1) Logical Model
You can think of a table as a large mapping. Specific data can be located by row key, by row key + timestamp, or by row key + column (column family:qualifier). Because HBase stores data sparsely, some columns can be empty. As an example, consider a logical view of the data for the www.cnn.com website: the table holds a single row, uniquely identified by the row key "com.cnn.www", and each logical modification of the row is associated with a timestamp. There are four columns in the table: contents:html, anchor:cnnsi.com, anchor:my.look.ca, and mime:type, each prefixed by the column family it belongs to.

The row key is the unique identifier of a data row in a table and serves as the primary key for retrieving records. There are only three ways to access rows in an HBase table: by a single row key, by a range of row keys, or by a full table scan. A row key can be any string (maximum length 64KB), and rows are stored in lexicographic (dictionary) order; for rows that are often read together, the keys need to be designed carefully so that those rows are stored together. This example is recreated in the shell sketch below.
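A sketch of this logical view in the HBase shell, recreating the single-row example with explicit timestamps (the table name, the small integer timestamps t6/t8/t9, and the cell values are illustrative):

    create 'webtable', 'contents', 'anchor', 'mime'
    put 'webtable', 'com.cnn.www', 'contents:html', '<html>...</html>', 6
    put 'webtable', 'com.cnn.www', 'mime:type', 'text/html', 6
    put 'webtable', 'com.cnn.www', 'anchor:my.look.ca', 'CNN.com', 8
    put 'webtable', 'com.cnn.www', 'anchor:cnnsi.com', 'CNN', 9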

2) Conceptual Model
HBase is a sparse row/column matrix stored by column: the physical model actually cuts up each row of the conceptual model and stores it by column family, which must be kept in mind when designing tables and developing programs. When physically stored, the logical view above is broken into one store per column family, each cell kept as a (row key, column, timestamp, value) entry; cells that were never written simply do not exist on disk.

Note that null values are not stored: a query for "contents:html" with timestamp t8 returns nothing, and likewise a query for "anchor:my.look.ca" with timestamp t9 returns nothing. If no timestamp is specified, the most recent data for the specified column is returned; because cells are sorted by timestamp, the newest value is found first, so querying "contents:html" without a timestamp returns the data written at t6. A further advantage of this model is that new columns can be added to any column family of a table at any time, without declaring them beforehand. These semantics can be checked directly, as below.
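Continuing the shell sketch above (timestamps as before):

    get 'webtable', 'com.cnn.www', {COLUMN => 'contents:html', TIMESTAMP => 8}   # no cell exists at t8
    get 'webtable', 'com.cnn.www', {COLUMN => 'contents:html'}                   # newest version, written at t6
    get 'webtable', 'com.cnn.www', {COLUMN => 'anchor', VERSIONS => 3}           # all stored anchor cells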

HBase Installation

HBase can be installed in three modes: standalone mode, pseudo-distributed mode, and fully distributed mode; only fully distributed mode is described here.
Step one: download the installation package, unzip it to the appropriate location, and assign ownership to the hadoop user (the account that runs Hadoop).
This article uses HBase 0.94.6 with a Hadoop 1.0.4 cluster; unzip the package to /usr/local and rename it to hbase:

sudo cp hbase-0.94.6.tar.gz /usr/local
cd /usr/local
sudo tar -zxf hbase-0.94.6.tar.gz
sudo mv hbase-0.94.6 hbase
sudo chown -R hadoop:hadoop hbase

Step two: configure the relevant files
(1) Configure hbase-env.sh; the file is located in /usr/local/hbase/conf.
Set the following values:

export JAVA_HOME=/usr/local/java/jdk1.6.0_27    # Java installation path
export HBASE_CLASSPATH=/usr/local/hadoop/conf   # HBase classpath (points at the Hadoop conf directory)
export HBASE_MANAGES_ZK=true                    # HBase starts and stops ZooKeeper itself

(2) Configure hbase-site.xml; the file is located in /usr/local/hbase/conf.

<configuration>
    <property>
        <name>hbase.master</name>
        <value>master:6000</value>
    </property>
    <property>
        <name>hbase.master.maxclockskew</name>
        <value>180000</value>
    </property>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://master:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>master</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/home/${user.name}/tmp/zookeeper</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

The meaning of these properties:

    • hbase.master specifies the server and port on which the HMaster runs.
    • hbase.master.maxclockskew raises the allowed clock skew, to prevent RegionServer start-up failures caused by inconsistent time between HBase nodes; the default value is 30000 (ms).
    • hbase.rootdir specifies HBase's storage directory on HDFS.
    • hbase.cluster.distributed sets the cluster to distributed mode.
    • hbase.zookeeper.quorum sets the hostnames of the ZooKeeper nodes; there must be an odd number of them.
    • hbase.zookeeper.property.dataDir sets the ZooKeeper data directory; the default is /tmp.
    • dfs.replication sets the number of data replicas and needs to be lowered when the cluster has fewer than three nodes; this test uses a single node, so it is set to 1.

(3) Configure regionservers; the file is located in /usr/local/hbase/conf.
This file lists the machines that run HRegionServers, one machine per line, similar to the slaves file in Hadoop. This experiment uses only one machine, so set it to master, for example as below.
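For the single-node case in this article (run as the hadoop user, who owns the directory):

    echo master > /usr/local/hbase/conf/regionservers
    cat /usr/local/hbase/conf/regionservers    # should print: master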

(4) Set the HBase environment variables; the file is /etc/profile.
Add at the end of the file:

# HBase Env
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin

Make it take effect: source /etc/profile
Step three: run the test

After starting Hadoop, enter start-hbase.sh in the terminal, then check the running processes, as sketched below:
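With HBASE_MANAGES_ZK=true, ZooKeeper runs inside HBase as HQuorumPeer; jps plus a quick HBase shell session makes a simple smoke test (expected process names per this article's configuration):

    jps          # besides the Hadoop daemons, expect: HMaster, HRegionServer, HQuorumPeer
    hbase shell  # then, inside the shell:
    status       # shows the number of live region servers
    list         # lists existing tables
    exit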

To shut down: stop-hbase.sh

