HBase Application Development Review and Summary, Part One: Overview of HBase Design Specifications

Tags: compact, deprecated, table definition

Overview

I have been studying HBase for about half a year now. My knowledge is neither deep nor systematic, but I am at least thoroughly hooked. As my department's pathfinder for big data technology, I also carry the responsibility of sharing what I learn, so throughout this exploration I have kept summarizing and testing. Along the way I slowly accumulated a body of material, which I have organized into a series of technical documents tentatively titled "HBase Application Development Review and Summary". None of it is inscrutable technology, but in the spirit of open source and sharing I am happy to publish it. In addition, I consider "HBase: The Definitive Guide" a rather good HBase book, and recommend it to everyone.

Here is an introduction to the catalogue of this series of documents:

Chapter 1: HBase Design Specifications

Introduces the design specifications recommended for HBase application development, aimed mainly at the development level.

Chapter 2: Rowkey Design Specifications

Introduces some characteristics of rowkeys and their design specifications. Of course, concrete rowkey design must still follow the specific business, with the help of accumulated design experience.

Chapter 3: Rowkey Generator

Describes a rowkey generator that lets you interactively build a rowkey-generation strategy, serialize the strategy to a local file, or deserialize a local strategy file back into a strategy object that can dynamically generate row keys in bulk.

Chapter 4: HBase Configuration Management Interface Design

Designs an HBase configuration utility class, covering how to load and read HBase-related configuration files, build configuration objects, and so on.

Chapter 5: HBase Table Information Management Interface Design

Designs HBase table-information management utility classes, including management interfaces for namespaces, table information, column families, and so on.

Chapter 6: HBase Table Write Interface Design

Designs several HBase write-data model classes that make it easy to organize and write data to HBase during development.

Chapter 7: HBase Table Read Interface Design

Designs several HBase read-data model classes and introduces several data-retrieval schemes, including batch retrieval, range retrieval, version retrieval, and so on.

Chapter 8: HBase Filter Application Design

Introduces several commonly used filters and the details to watch for when using them, including the page filter, prefix filter, and so on.

Chapter 9: HBase Lightweight ORM Design

Mimicking Hibernate's object mapping, a lightweight ORM design for HBase; it does not go far into usability and is just a prototype test.

Chapter 10: HBase Table Data Browser

A comprehensive application of the preceding chapters: the design of an HBase table data browser, with table-information navigation, conditional paging queries, multi-version queries, and other features.

1. HBase Design Specifications

Talking about "design specifications" here is admittedly somewhat presumptuous. After all, I am only a beginner in big data technology and would flatly not dare to lay down standards, so please forgive my arrogance: these specifications are ones I made for myself, and they bind no one else.

HBase's official documentation and many skilled practitioners have already summarized parts of an HBase design specification. I have collected and organized that material and, adding my own understanding and enrichment, compiled a set of norms that I feel my own development should follow.

The logical model of an HBase table structure involves the following vocabulary: namespace, table, column family, column, row key, version, and so on. These are all elements used to build an HBase table, and the specifications below are organized around these key words.

1.1. Namespace Design

In layman's terms, a namespace can be thought of as a table group (similar to a tablespace in Oracle). The grouping basis is not fixed: tables can be grouped by business type, or by time period. For example, for power-meteorology data tables you could create a power-meteorology namespace named DLQX and organize all related tables under it. The advantage of introducing namespaces is that they make tables easier to organize and manage.

HBase's default namespace is named "default"; if you do not explicitly specify a namespace when creating a table, the table is created under the default namespace. If a table belongs to a non-default namespace, you must specify the namespace whenever you reference the table (for example, when reading its data), or you will get a "table not found" error. The full table name takes the format "namespace:table", for example "DLQX:System_user". Under the default namespace the "default:" prefix can be omitted, and the bare table name (System_user) used directly.

The relationship between namespaces and tables can be described as follows:

A namespace has a one-to-many relationship with tables: one namespace can contain multiple HBase tables, but an HBase table can belong to only one namespace. When you create a table without specifying a namespace (or with an empty namespace), the table is placed under the default namespace.

Also, before deleting a namespace you must first delete all HBase tables under it; otherwise the namespace cannot be deleted.
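The full-name rule can be sketched as a small helper (illustrative only; `full_table_name` is not part of any HBase API — in the Java client this job is done by `TableName.valueOf`):

```python
def full_table_name(table: str, namespace: str = "default") -> str:
    """Build the full 'namespace:table' name.

    The 'default:' prefix may be omitted, so it is dropped here
    (illustrative sketch, not HBase code).
    """
    if namespace in ("", "default"):
        return table
    return f"{namespace}:{table}"
```

For example, `full_table_name("System_user", "DLQX")` yields `"DLQX:System_user"`, while a table in the default namespace keeps its bare name.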

1.2. Table Design

HBase has several advanced features you can make use of when designing tables. These properties are not necessarily related to schema or row-key design, but they define aspects of the table's behavior.

1.2.1 The Ideal HBase Table

According to the official guidance, HBase, as a column-oriented database, is better at handling "tall and thin" tables than "short and fat" ones. A "tall and thin" table has few columns but an enormous number of rows, giving it a tall, thin shape; a "short and fat" table has many columns but a limited number of rows. Although an HBase table is said to accommodate millions of columns, that is only a theoretical limit. In practice, please build "tall and thin" tables, and test the effect of the column count, to avoid excessive impact on read and write performance.

1.2.2 Pre-Create partitions

By default, a single region is created automatically when an HBase table is created, and during data import all HBase clients write to that one region until it is large enough to split. One way to speed up bulk writes is to pre-create a number of empty regions, so that when data is written to HBase, the writes are load-balanced across the cluster according to the region partitioning.
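The idea can be sketched by generating evenly spaced split keys for a fixed-width hex rowkey space (an illustrative sketch; `split_keys` is not an HBase function, and real split points must match your actual key distribution):

```python
def split_keys(num_regions: int, key_width: int = 8) -> list:
    """Evenly spaced lowercase-hex split keys for pre-creating regions.

    Assumes row keys are fixed-width hex strings; num_regions regions
    need num_regions - 1 split points (illustrative sketch).
    """
    span = 16 ** key_width          # size of the whole keyspace
    step = span // num_regions      # width of each region
    return [f"{i * step:0{key_width}x}" for i in range(1, num_regions)]
```

The resulting keys would then be handed to the admin API's `createTable(descriptor, splitKeys)` overload, or listed under `SPLITS` in the shell.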

1.2.3 Number of Column families

Do not define too many column families in a single table. HBase currently cannot handle tables with more than two or three column families well, because column family flushes are correlated: when one column family is flushed, its neighboring column families are triggered to flush as well, causing the system to generate more I/O. So, per the official recommendation, create a single column family per HBase table.

1.2.4 Configurable data block size

The HFile data block size can be set at the column-family level. This data block is different from the HDFS block; its default size is 65,536 bytes (64 KB). The data block index stores the starting key of each HFile data block, so the block size setting affects the size of the index. The smaller the blocks, the larger the index, and the more memory it occupies; at the same time, random-lookup performance is better, because the blocks loaded into memory are smaller. If instead you need better sequential-scan performance, it is more reasonable to load more HFile data at a time, which means setting the block size to a larger value; the index correspondingly shrinks, and you pay the price in random-read performance.

1.2.5 Data Block Cache

Reads place data into the block cache, but many workloads gain no performance from this. For example, if a table or column family is only accessed by sequential scans, or is rarely accessed at all, you will not mind whether a get or scan takes slightly longer, and you can choose to turn the cache off for those column families. If you only ever perform long sequential scans, you churn through the cache repeatedly, and may abuse it by squeezing out data that should stay cached for real performance gains. Turning caching off not only avoids this, it also leaves more cache for other tables and for the other column families of the same table.

1.2.6 Aggressive Caching

You can select certain column families and give them higher priority in the (LRU) block cache. If you expect one column family to receive more random reads than another, this feature will come in handy sooner or later.

The IN_MEMORY parameter defaults to false. Because HBase provides no guarantee beyond treating that column family more aggressively than the others in the block cache, setting the parameter to true does not change too much in practice.

When creating a table, you can mark a column family for the RegionServer cache with HColumnDescriptor.setInMemory(true), increasing the likelihood of cache hits on reads.

1.2.7 Bloom Filters

The block index provides an efficient way to find the HFile block that should be read when accessing a specific row, but its usefulness is limited. The default HFile data block size is 64 KB, and it should not be resized too far from that.

If you are looking for a short row, indexing only the starting row key of each data block does not give you fine-grained information. For example, if your rows occupy 100 bytes of storage each, a 64 KB data block holds (64 * 1024) / 100 = 655.36, roughly 655 rows, of which only the starting row appears in the index. The row you are looking for may fall within the key range of a particular block, yet not actually be stored there: the row may not exist in the table at all, or may be stored in another HFile, or may even still be in the MemStore. In those cases, reading the block from disk incurs I/O overhead and pollutes the block cache. This hurts performance, especially when you face a huge dataset with many concurrent readers.
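The granularity of the block index can be checked with a quick calculation (an illustrative sketch; `rows_per_block` is not an HBase function):

```python
BLOCK_SIZE = 64 * 1024  # default HFile data block size, in bytes

def rows_per_block(row_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """How many rows fit in one HFile data block; only the first row
    of each block is recorded in the block index (illustrative)."""
    return block_size // row_bytes
```

With 100-byte rows, one index entry covers about 655 rows, which is why the index alone cannot say whether a given row is actually present in the block.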

A Bloom filter lets you apply a negative test to the data stored in each data block. When a row is requested, the filter is checked first to see whether the row is definitely not in the block: a Bloom filter either determines that the row is absent, or answers that it does not know. That is why we call it a negative test. Bloom filters can also be applied to the cells within a row, using the same negative test first when a column qualifier is accessed.

Bloom filters are not free: storing this extra level of indexing takes extra space, and they grow as the number of entries they index grows, so a row-level filter occupies less space than a column-qualifier-level filter. When space is not a problem, they can help you squeeze out the system's full performance potential.

You can enable a Bloom filter on a column family as follows:

hbase(main)> create 'mytable', {NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}

The BLOOMFILTER parameter defaults to NONE. ROW enables a row-level Bloom filter; ROWCOL enables a column-qualifier-level one. A row-level filter checks whether a particular row key is absent from a data block; a qualifier-level filter checks whether the combination of row key and column qualifier is absent. ROWCOL Bloom filters cost more than ROW Bloom filters.

1.2.8 Time to Live (TTL)

Application systems often need to remove old data from the database, since a database can hardly grow past a certain size, and traditional databases have built in many flexible mechanisms for this. For example, in TwitBase you would not want to delete any twits users generated while using the application: these are user-generated data that will someday be useful when you perform advanced analysis. But you do not need every twit available for real-time access, so twits older than some cutoff can be archived to flat files.

HBase lets you set a TTL, in seconds, at the column-family level. Data older than the TTL is deleted during the next major compaction. If a cell holds multiple time versions, the versions older than the TTL are deleted. You can disable the TTL, keeping data forever, by setting it to Integer.MAX_VALUE (2147483647), which is the default. You can set the TTL when creating a table, as follows:

hbase(main)> create 'mytable', {NAME => 'colfam1', TTL => '18000'}

This command sets the TTL on the colfam1 column family to 18,000 seconds = 5 hours. Data in colfam1 older than 5 hours is deleted when the next major compaction occurs.
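The TTL rule can be modeled in a few lines (an illustrative model, not HBase code; in the real system expired cells physically disappear only at the next major compaction):

```python
def is_expired(cell_ts_ms: int, now_ms: int, ttl_s: int = 18_000) -> bool:
    """True if a cell's age exceeds the column family's TTL.

    cell_ts_ms: the cell's version timestamp in milliseconds.
    ttl_s:      TTL in seconds, defaulting to the 5-hour shell example.
    """
    return (now_ms - cell_ts_ms) / 1000.0 > ttl_s
```

A cell written exactly 5 hours ago is still kept; one a millisecond older is eligible for removal.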

1.2.9 data compression

HFiles can be stored compressed on HDFS. This helps save disk I/O, but compressing and decompressing data on reads and writes drives up CPU utilization. Compression is part of the table definition and can be set when the table is created or its schema is changed. Unless you are certain you will not benefit from it, we recommend enabling compression; only when the data cannot be compressed, or the servers' CPU utilization is limited for some reason, might you leave it off.

HBase can use several compression codecs, including LZO, Snappy, and GZIP; LZO and Snappy are the two most popular. Snappy was released by Google in 2011, and shortly afterwards the Hadoop and HBase projects began to support it; before then, LZO was the usual choice. The LZO native libraries used by Hadoop fall under GPLv2 licensing and cannot be bundled with any Hadoop or HBase distribution; they must be installed separately. Snappy, on the other hand, is BSD-licensed, which makes it easier to bundle with Hadoop and HBase distributions. LZO and Snappy have comparable compression ratios and compression/decompression speeds.

When creating a table, you can enable compression on a column family as follows:

hbase(main)> create 'mytable', {NAME => 'colfam1', COMPRESSION => 'SNAPPY'}

Note that data is compressed only on disk. It is kept uncompressed in memory (MemStore and BlockCache) and over the network.

Changing the compression codec should not happen often, but if you do need to change a column family's codec, you can do so directly: alter the table definition to set the new codec. From then on, the HFiles produced by compactions are compressed with the new codec, so the process does not require creating a new table and copying data. You must, however, make sure the old codec's libraries are not removed from the cluster until every old HFile has been compacted after the change.

1.2.10 Data Splitting

In HBase, an update is first written to the WAL log (HLog) and to memory (the MemStore), where the data is kept sorted. When the MemStore accumulates to a certain threshold, a new MemStore is created and the old one is added to a flush queue; a separate thread flushes it to disk, where it becomes a StoreFile. At the same time, a redo point is recorded in ZooKeeper to indicate that the changes before this moment have been persisted.

Once created, a StoreFile is read-only and can never be modified, so updating HBase is really a matter of continually appending. When the StoreFiles in a store reach a certain threshold, a merge (major compaction) is performed, combining the changes to each key into one large StoreFile. When that StoreFile's size reaches a certain threshold, the StoreFile is split into two StoreFiles.

Because updates to a table are continually appended, a read request must access all the StoreFiles and the MemStore in a store and merge them by row key. Since the StoreFiles and the MemStore are each sorted, and StoreFiles carry in-memory indexes, the merge process is usually fairly fast.

In practice, you can trigger a major compaction manually when necessary, merging the modifications of each row key into large StoreFiles. You can also set the StoreFile size threshold higher to reduce the occurrence of splits.

1.2.11 Cell Time Versions

By default, HBase maintains three time versions of each cell. This property can be changed; if you need only one version, it is recommended to keep only one when setting up the table, so that the system does not retain multiple time versions of updated cells. The version count is also set at the column-family level, when the table is created:

hbase(main)> create 'mytable', {NAME => 'colfam1', VERSIONS => 1}

You can specify multiple properties for the column family in the same create statement, as follows:

hbase(main)> create 'mytable', {NAME => 'colfam1', VERSIONS => 1, TTL => '18000'}

You can also specify the minimum number of time versions that the column family stores, as follows:

hbase(main)> create 'mytable', {NAME => 'colfam1', VERSIONS => 5, MIN_VERSIONS => '1'}

This is also useful in combination with a TTL on the column family: if every version currently stored is older than the TTL, at least the newest MIN_VERSIONS versions are kept. This ensures your queries still return a result when all the data is older than the TTL.
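The interaction of VERSIONS, MIN_VERSIONS, and TTL can be sketched as follows (an illustrative model of major-compaction retention, not HBase internals; `retained` is a hypothetical name):

```python
def retained(ts_ms, now_ms, versions=5, min_versions=1, ttl_s=None):
    """Which versions of one cell survive a major compaction.

    ts_ms: the cell's version timestamps in milliseconds, newest first.
    Illustrative model: VERSIONS caps the count, TTL drops stale
    versions, and MIN_VERSIONS keeps a floor of the newest ones.
    """
    kept = ts_ms[:versions]                         # VERSIONS caps the count
    if ttl_s is not None:
        fresh = [t for t in kept if (now_ms - t) / 1000.0 <= ttl_s]
        if len(fresh) < min_versions:               # MIN_VERSIONS floor
            fresh = kept[:min_versions]
        kept = fresh
    return kept
```

With every version past the TTL, the newest MIN_VERSIONS versions still survive, which is exactly the guarantee described above.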

1.3. Column Family Design

A column family is a grouping of columns, and the grouping basis is not fixed. Although in theory an HBase table can have multiple column families, HBase officially recommends creating no more than one per table. Testing shows that the write and read efficiency of a single column family is far higher than that of multiple families. In storage, each column family is persisted as its own StoreFile, and the multiple files corresponding to multiple families put greater pressure on the server during splits. Therefore, it is recommended that each table have a single column family.

A column family name should not be too long, because the family name is stored alongside every column when the data is persisted; a long family name wastes more storage space.
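The cost is easy to estimate: every cell physically stores its row key, family name, and qualifier. A rough per-cell key size (an illustrative sketch that ignores HBase's internal length prefixes; `cell_key_bytes` is a hypothetical name) can be written as:

```python
def cell_key_bytes(row_key: str, family: str, qualifier: str) -> int:
    """Approximate key bytes stored per cell: row key + family name +
    qualifier + 8-byte timestamp + 1-byte key type (illustrative)."""
    return len(row_key) + len(family) + len(qualifier) + 8 + 1

# A one-letter family versus a long one, over a million cells:
saving = (cell_key_bytes("row0001", "user_detail_info", "USER_ID")
          - cell_key_bytes("row0001", "d", "USER_ID")) * 1_000_000
```

Here shortening the family name from 16 characters to 1 saves roughly 15 MB per million cells, before compression.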

When you delete a column family, the columns and the column values under it are deleted as well.

A table must be created with at least one column family; after the table is created, further column families can be added.

Version counts apply per column family: if a table has multiple column families, each can be given a different version count. For example, column family A might allow at most 5 versions while column family B allows at most 3.

1.4. Column (Qualifier) Design

An obvious difference between HBase and traditional relational databases is that you do not declare columns when creating a table; they are created dynamically as data is written. Moreover, empty columns occupy no real storage space.

A column's content is encapsulated in a KeyValue object, from which several pieces of information can be extracted, as follows:

// Row key
String rowKey = Bytes.toString(kv.getRow());
// Column family
String family = Bytes.toString(kv.getFamily());
// Column name (qualifier)
String qualifier = Bytes.toString(kv.getQualifier());
// Column value
String value = Bytes.toString(kv.getValue());
// Version (timestamp)
long timestamp = kv.getTimestamp();

1.5. Version Design

If a column family of a table needs multiple versions, you must specify the maximum version count when creating the column family. Although HBase's nominal default is 3 versions, if you do not specify it explicitly at table-creation time you may still end up keeping only one version, because HBase assumes you do not want the column family's multi-version mechanism enabled.

You can specify a version number when writing data; if you do not, the default version number, namely the timestamp, is used.

When reading data, if no version is specified, only the latest version of the data is returned, not its version number.

1.6. HBase Naming Conventions

Namespace

  • Composed of English words and Arabic numerals; words must be capitalized, and the first character must be a letter, not a digit.
  • Joining multiple words with connectors (underscores) is not recommended; use a single word for simple semantics, or join the initial letters of several words for complex semantics.
  • Keep the length between 4 and 8 characters where possible.
  • A namespace can generally align with a project name, organization name, and so on.
  • Example built from a project name: DLQX (the initial letters of "power meteorology"), short and clear.
  • Long namespace names are not recommended, for example: User_info_manage, and so on.

Table name

  • Composed of English words, Arabic numerals, and connectors (_); words must be capitalized, the first character must be a letter, not a digit, and connectors may be used to join multiple words.
  • Keep the length between 8 and 16 characters where possible.
  • Use English words with clear meaning; avoid pinyin spellings of Chinese words or combinations of pinyin initials.
  • Table names that conform to the specification: User_info_manage, Weather_data, T_electric_gather, and so on.

Column family name

  • Composed of English words and Arabic numerals; words must be capitalized, and the first character must be a letter, not a digit.
  • Keep the length between 1 and 6 characters; long column family names consume more storage space.
  • Conforming column family names: D1, D2, data, and so on.
  • Deprecated column family names: User_info, D_1, and so on.

Column name

  • Composed of English words, Arabic numerals, and connectors (_); words must be capitalized, the first character must be a letter, not a digit, and connectors may be used to join multiple words.
  • Keep the length between 1 and 16 characters where possible.
  • Use English words with clear meaning; avoid pinyin spellings of Chinese words or combinations of pinyin initials.
  • Conforming column names: USER_ID, data_1, remark, and so on.
  • Deprecated column names: UserID, 1_data, and so on.

Author: Shangbing
Unit: Henan Electric Power Research Institute, Smart Grid
QQ: 52190634
Home: http://www.cnblogs.com/shangbingbing
Space: http://shangbingbing.qzone.qq.com

