HBase Learning Summary (5): HBase Table Design
I. How to start schema design
When designing a schema, consider the following questions:
(1) How many column families should the table have?
(2) What data goes into each column family?
(3) How many columns should each column family have?
(4) What should the column names be? (Column names do not have to be defined when the table is created, but you need to know them to read and write data.)
(5) What data should each cell store?
(6) How many versions should be stored for each cell?
(7) What should the row key structure be, and what information should it contain?
1. Problem Modeling
All data of a given column family is stored together physically on HDFS. This physical storage may consist of multiple HFiles, which compaction can ideally merge into a single HFile. All columns of a column family are stored together on disk, so columns with different access patterns can be placed in different column families to isolate them. This is why HBase is called a column-family-oriented store.
Define access patterns as early as possible in the schema design process so that they can inform and validate your design decisions.
To define the access patterns, a good first step is to define the questions the table needs to answer.
2. Defining requirements: more up-front work always pays off
A column qualifier can be treated as data, just like a value. This differs from relational systems, where column names are fixed and must be defined when the table is created.
HBase has no concept of cross-row transactions. Avoid designs that require transactional logic in client code, because they force you to maintain a complex client.
3. Modeling for even distribution of data and load
The speed of HBase operations depends on several factors, including:
(1) The number of KeyValue entries in the table (both the results of puts and the tombstone markers left by deletes).
(2) The number of blocks in an HFile.
(3) The average number of KeyValue entries in an HFile.
(4) The average number of columns per row.
Let e be the number of entries in the MemStore at any given time. Because the MemStore is implemented as a skip list, looking up a row has time complexity O(log e).
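To make the O(log e) bound concrete, here is a minimal sketch that counts the probes of a binary search over sorted keys. Binary search is only a stand-in chosen for brevity: it is not how the MemStore is implemented, but it has the same logarithmic cost as an expected skip-list lookup. The key format and the entry count are made up for illustration.

```python
import math

def lookups_needed(sorted_keys, target):
    """Binary search that counts probes (stand-in for a skip-list lookup)."""
    lo, hi, probes = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == target
    return found, probes

e = 100_000  # hypothetical number of entries held in memory
keys = [f"row{i:06d}" for i in range(e)]  # zero-padded, so already sorted
found, probes = lookups_needed(keys, "row012345")
bound = math.ceil(math.log2(e)) + 1  # logarithmic upper bound on probes
```

Growing e by a factor of ten adds only a handful of probes, which is why the in-memory part of a read is rarely the bottleneck.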
A row in a wide table contains many columns. Tall tables are a newer pattern. Every KeyValue object in an HFile stores the column family name, so using short column family names helps reduce disk and network I/O.
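The effect of the column family name on cell size can be estimated with a little arithmetic. The sketch below assumes the commonly described KeyValue layout (length fields, row, family, qualifier, timestamp, type, value). Treat the byte counts as an approximation, not an exact account of every HBase version. The row, family, and qualifier names are hypothetical.

```python
def keyvalue_size(row, family, qualifier, value):
    """Approximate serialized size in bytes of one KeyValue.

    Assumed layout: key length (4) + value length (4) + row length (2) + row
    + family length (1) + family + qualifier + timestamp (8) + type (1) + value.
    """
    key = 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1
    return 4 + 4 + key + len(value)

long_name = keyvalue_size(b"user123", b"user_profile_information", b"email", b"a@b.c")
short_name = keyvalue_size(b"user123", b"p", b"email", b"a@b.c")

# The family name is repeated in every cell, so the saving applies per cell.
saved_per_cell = long_name - short_name
```

Multiplied by billions of cells, the bytes saved per cell add up on disk and on the wire.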
In the HBase context, a hotspot is a region on which load is heavily concentrated. This is undesirable because the load is not spread across the whole cluster: the few servers hosting these regions take on most of the work and become the bottleneck for overall performance.
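The following sketch shows how a hotspot arises. It simulates region assignment with a sorted list of split points (the split points and the key values are hypothetical) and writes a batch of monotonically increasing keys.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical split points: region i serves keys in [split[i-1], split[i]).
split_points = ["2000", "4000", "6000", "8000"]

def region_for(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect_right(split_points, row_key)

# Monotonically increasing keys, for example values derived from the clock.
sequential_keys = [str(9000 + i) for i in range(100)]
load = Counter(region_for(k) for k in sequential_keys)
```

All one hundred writes land in the last region, while the other four regions sit idle.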
4. Targeted data access
In HBase tables, only the key (the Key part of a KeyValue object, made up of the row key, column qualifier, and timestamp) is indexed. The only way to access a specific row is through its row key.
Because the index also covers the column qualifier and timestamp, you can jump directly to the right column within a row without scanning all the columns before it.
There are two ways to retrieve data from a table: get and scan. Use get to fetch a single row; you must provide its row key. For a scan, if you know the start and stop keys, you can supply them to limit the number of rows the scanner reads.
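A minimal in-memory sketch of the two access paths follows. It is not the HBase client API. A dictionary plus a sorted key list stands in for a table, and the row keys and column names are hypothetical. The stop key is treated as exclusive, which matches the usual scan semantics.

```python
from bisect import bisect_left

# Stand-in for a table: rows are addressed by row key and kept in key order.
table = {
    "user#001": {"info:name": "Ann"},
    "user#002": {"info:name": "Ben"},
    "user#003": {"info:name": "Cal"},
    "user#010": {"info:name": "Dee"},
}
sorted_keys = sorted(table)

def get(row_key):
    """Fetch exactly one row; the row key is mandatory."""
    return table.get(row_key)

def scan(start_key=None, stop_key=None):
    """Yield rows with start_key <= row key < stop_key, in key order."""
    lo = 0 if start_key is None else bisect_left(sorted_keys, start_key)
    hi = len(sorted_keys) if stop_key is None else bisect_left(sorted_keys, stop_key)
    for key in sorted_keys[lo:hi]:
        yield key, table[key]

one_row = get("user#002")
bounded = [key for key, _ in scan("user#002", "user#010")]
unbounded = [key for key, _ in scan()]
```

The bounded scan touches two rows, while the unbounded scan walks the whole table.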
The more of the key you specify, the less data is read from disk or sent over the network. Specifying the row key returns only the required row, but the server still returns the entire row to the client. Specifying the column family further limits which part of the row is read: if the row spans multiple column families, only a subset of the HFiles has to be read. Specifying the column qualifier and timestamp as well reduces the number of columns returned to the client, which saves network I/O.
Data takes up the same storage space whether you put it in the cell value, the column qualifier, or the row key, but moving data from the cell into the row key may give better performance.
Key points:
(1) HBase tables are flexible, and you can store anything in them as byte arrays.
(2) Store all data with similar access patterns in the same column family.
(3) The index is built on the Key part of the KeyValue object. The Key consists of the row key, column qualifier, and timestamp, in that order.
(4) Tall tables may let you move toward O(1) operations, but you pay for it in atomicity.
(5) Denormalization is a sound approach to HBase schema design.
(6) Think about how to satisfy an access pattern in a single API call rather than several. HBase does not support cross-row transactions, so avoid having to maintain that kind of complex logic in client code.
(7) Hashed keys give you fixed-length keys and better data distribution, but you lose the benefits of ordering.
(8) A column qualifier can be used to store data, just like a cell.
(9) Because data can be placed in the column qualifier, its length affects the storage footprint. It also affects disk and network I/O when the data is accessed, so keep it concise.
(10) The length of the column family name affects the size of the data (the KeyValue objects) sent back to the client over the network, so keep it concise.
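Point (3) can be illustrated with a small sort. The sketch below orders keys by row key, then column qualifier, then timestamp. It additionally assumes that newer timestamps sort first, which is how HBase returns the latest version of a cell before older ones. The sample keys are made up.

```python
# Each key is (row key, column qualifier, timestamp); values are omitted.
keys = [
    ("row2", "name", 100),
    ("row1", "name", 100),
    ("row1", "email", 300),
    ("row1", "email", 200),
]

# Sort by row key, then qualifier, then timestamp descending (newest first).
ordered = sorted(keys, key=lambda k: (k[0], k[1], -k[2]))
```

Because all versions of one column of one row end up adjacent, a reader can seek to the first matching key and stop as soon as the key prefix changes.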
II. Denormalization
Normalization is a technique from the relational database world in which each kind of repeating information is put into a table of its own. It has two advantages: when an update or delete occurs, you do not have to worry about the complexity of updating every copy of the data; and storing a single copy instead of multiple copies reduces the storage footprint. At query time, the data is reassembled with JOIN clauses in SQL statements.
Denormalization is the opposite concept: data is duplicated and stored in more than one place. Because expensive JOIN clauses are no longer needed, querying becomes easier and faster.
From a performance point of view, normalization optimizes for writes, while denormalization optimizes for reads.
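The trade-off can be sketched with plain dictionaries. The users, the follow relationship, and the helper names below are hypothetical. The point is only that the normalized read needs a join-like lookup, while the denormalized write has to touch every copy.

```python
# Normalized: each fact is stored once; reading needs a join-like lookup.
users = {"u1": {"name": "Ann"}, "u2": {"name": "Ben"}}
follows_normalized = [("u1", "u2")]  # (follower, followed)

def followed_names_normalized(user_id):
    return [users[f]["name"] for u, f in follows_normalized if u == user_id]

# Denormalized: the followed user's name is copied into the follower's row.
follows_denormalized = {"u1": {"u2": "Ben"}}

def followed_names_denormalized(user_id):
    return list(follows_denormalized.get(user_id, {}).values())

def rename_user(user_id, new_name):
    """The cost of denormalization: a rename must update every copy."""
    users[user_id]["name"] = new_name
    for row in follows_denormalized.values():
        if user_id in row:
            row[user_id] = new_name
```

The denormalized read is a single row lookup, while the rename has to visit every row that holds a copy of the name.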
III. Heterogeneous data in the same table
Try to keep different access patterns separate.
IV. Row key design principles
When designing an HBase table, the row key is the single most important thing. Model the row key on the expected access patterns.
The row key determines the performance you get when accessing an HBase table. This follows from two facts: each region serves a range of rows defined by row key and is responsible for every row in that range; and HFiles store rows on disk in sorted order. An HFile is created when a region flushes the rows it holds in memory; those rows are already sorted and are written to disk in order. The sorted nature of HBase tables and the underlying storage format let you reason about performance based on how you design row keys and what you put in column qualifiers.
Relational databases can index multiple columns, but HBase indexes only the key. The only way to access data is through the row key. If you do not know the row key of the data you want, you have to scan a potentially large number of rows.
V. I/O considerations
The following tips help you optimize row key design for your access patterns.
1. Optimizing for writes
How should data be distributed across multiple regions?
(1) Hashing
If you are willing to give up the timestamp information in the row key, using a hash of the original data as the row key is one possible solution.
Hash algorithms have a nonzero probability of collision, and how you use the hash function also matters.
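A minimal sketch of a hashed row key follows. MD5 is used here only as a convenient illustration of a fixed-length digest. The field names and the separator are assumptions, not something this article prescribes.

```python
import hashlib

def hashed_row_key(user_id, timestamp):
    """Build a fixed-length row key from the MD5 digest of the original fields."""
    raw = f"{user_id}|{timestamp}".encode("utf-8")
    return hashlib.md5(raw).hexdigest()

k1 = hashed_row_key("user42", 1700000000)
k2 = hashed_row_key("user42", 1700000001)
```

Two events one second apart produce unrelated keys, so writes spread out. The same property means a scan over a time range is no longer possible, and the original fields must be stored in the row if they need to be read back.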
(2) Salting
Salting is another technique to keep in mind when thinking about how to compose row keys.
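Salting places a short prefix in front of the original key so that otherwise adjacent keys spread across regions while the original key stays readable. The sketch below derives the prefix from a CRC32 checksum of the key. The bucket count, the separator, and the use of CRC32 are assumptions made for illustration.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 4  # hypothetical; often chosen close to the number of servers

def salted_key(row_key):
    """Prefix the key with a bucket derived deterministically from the key."""
    bucket = zlib.crc32(row_key.encode("utf-8")) % NUM_BUCKETS
    return f"{bucket}|{row_key}"

# Sequential timestamps no longer sort next to each other once salted.
timestamps = [str(1700000000 + i) for i in range(1000)]
buckets = Counter(salted_key(ts).split("|", 1)[0] for ts in timestamps)

def ranges_to_scan(start, stop):
    """A range read over the original keys now needs one scan per bucket."""
    return [(f"{b}|{start}", f"{b}|{stop}") for b in range(NUM_BUCKETS)]
```

Because the salt is computed from the key itself, a single-row read can still rebuild the full key. The price is paid by range reads, which fan out into one scan per bucket.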
2. Optimizing for reads
The goal is to read as few HFile blocks as possible into memory to obtain the data set you are looking for. When related data is stored together, each HFile block read yields more useful information than when the data is scattered.
The structure of the row key is very important for read performance.
3. Cardinality and row key structure
An effective row key design considers not only what goes into the row key but also where each piece sits within it. The position of information in the row key is as important as the information you choose to include.
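To see why position matters, the sketch below builds two hypothetical layouts for the same events, one with the user first and one with the timestamp first. It then checks whether all of one user's rows form a single contiguous range in sorted order.

```python
# Two hypothetical layouts for the same (user, timestamp) events.
events = [("ann", 3), ("bob", 1), ("ann", 1), ("bob", 2), ("ann", 2)]

user_first = sorted(f"{user}#{ts:04d}" for user, ts in events)
time_first = sorted(f"{ts:04d}#{user}" for user, ts in events)

def contiguous_run(keys, predicate):
    """Return True if all keys matching predicate sit next to each other."""
    hits = [i for i, k in enumerate(keys) if predicate(k)]
    return hits == list(range(hits[0], hits[-1] + 1))

ann_user_first = contiguous_run(user_first, lambda k: k.startswith("ann#"))
ann_time_first = contiguous_run(time_first, lambda k: k.endswith("#ann"))
```

With the user first, reading one user's events is a single short scan. With the timestamp first, the same rows are interleaved with other users' rows, although scans over a time range become cheap instead.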
VI. From relational to nonrelational
There is no shortcut for mapping knowledge of relational databases onto HBase; they require different ways of thinking.
Relational databases and HBase are different systems with different design characteristics, and those differences affect how applications are designed.
1. Some basic concepts
Relational database modeling involves three main concepts:
A. Entities: map to tables.
B. Attributes: map to columns.
C. Relationships: map to foreign keys.
(1) Entities
In both relational databases and HBase, the container for an entity is a table, and each row in the table represents one instance of the entity. For example, each row in a users table represents a user.
(2) Attributes
To map attributes to HBase, you must distinguish at least two types of attributes:
A. Identifying attributes: these uniquely identify one instance (that is, one row) of an entity. In a relational table, they form the table's primary key. In HBase, they become part of the row key.
An entity is often identified by multiple attributes, which corresponds to the concept of compound keys in a relational database.
B. Non-identifying attributes: in HBase, these basically map to column qualifiers.
(3) Relationships
Logical relational models use two main kinds of relationship: one-to-many and many-to-many. In relational databases, the former is modeled directly as a foreign key and the latter as a junction table.
HBase has no built-in joins or constraints, and explicit relationships are rarely used.
2. Nested entities
HBase columns (also called column qualifiers) do not need to be predefined at design time; they can be anything. This lets HBase nest another entity inside the row of a parent or primary entity, which goes well beyond a merely flexible schema.
Nested entities are another tool for mapping from relational to nonrelational databases.
If the only way you reach a child entity is through its parent entity, and you want transaction-level protection across all the child entities of a parent, this technique is the right choice.
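A minimal sketch of a nested entity follows, using a dictionary as a stand-in for one row. The `orders` table, the `items` column family, and the SKU values are hypothetical. The helper mimics the fact that HBase applies a mutation to a single row atomically, which is where the transaction-level protection for the children comes from.

```python
# Stand-in for one row of a hypothetical "orders" table.
# Child entities (line items) are nested as columns of the parent row.
order_row = {"row_key": "order#1001", "items": {}}

def put_items(row, items):
    """Apply several child updates as one single-row mutation."""
    updated = dict(row["items"])
    updated.update(items)
    row["items"] = updated  # all changes become visible together

put_items(order_row, {"sku-1": "2", "sku-9": "1"})
children = sorted(order_row["items"].items())
```

Reading the parent row returns every child in one call, with no join and no second table.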
VII. Advanced column family configuration
1. Configurable block size
The HFile block size can be set at the column family level. The block index stores the start key of each HFile block, so the block size setting determines the size of the block index: the smaller the blocks, the larger the index and the more memory it occupies. In return, random lookups perform better because the blocks loaded into memory are smaller.
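The relationship between block size and index size is simple arithmetic. The sketch below assumes one index entry per block and a hypothetical 1 GiB HFile. The 64 KiB figure is, to my knowledge, the HBase default block size. The other two sizes are arbitrary comparison points.

```python
def index_entries(hfile_bytes, block_bytes):
    """Approximate number of block-index entries: one start key per block."""
    return -(-hfile_bytes // block_bytes)  # ceiling division

HFILE_SIZE = 1024 ** 3  # a hypothetical 1 GiB HFile

small_blocks = index_entries(HFILE_SIZE, 16 * 1024)
default_blocks = index_entries(HFILE_SIZE, 64 * 1024)
large_blocks = index_entries(HFILE_SIZE, 256 * 1024)
```

Quartering the block size quadruples the number of index entries held in memory, while each random read pulls in a quarter as much data.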
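2. Block cache
Data is put into the read cache by default, but workloads often gain no performance benefit from it, for example when a column family is only read by sequential scans. In such cases the block cache can be turned off for that column family.
3. Aggressive caching
You can mark selected column families so that their blocks get higher priority in the block cache (an LRU cache).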
4. Bloom filters
A Bloom filter provides a negative test on the data stored in each block. When a row is requested, the Bloom filter is checked first to determine whether the row is definitely not in the block.
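The negative test can be sketched in a few lines. This is a generic Bloom filter, not HBase's implementation. The bit-array size, the number of hash functions, and the use of SHA-256 are assumptions chosen for readability.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))

bloom = BloomFilter()
for row_key in ("row1", "row2", "row3"):
    bloom.add(row_key)
```

A `False` answer lets the reader skip the block entirely. A `True` answer may be a false positive, so the block still has to be read to confirm.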
5. Time to live (TTL)
HBase lets you set a TTL, in seconds, at the column family level. Data older than the specified TTL is deleted during the next major compaction. If a cell has multiple versions, the versions older than the TTL are deleted.
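The expiry rule can be sketched as a filter applied when files are rewritten. The cell structure, the millisecond timestamps, and the function names below are hypothetical. Only the rule itself (drop versions older than the TTL) comes from the text above.

```python
def expired(cell_timestamp_ms, now_ms, ttl_seconds):
    """A cell version is expired once it is older than the TTL."""
    return now_ms - cell_timestamp_ms > ttl_seconds * 1000

def major_compact(cells, now_ms, ttl_seconds):
    """Rewrite the cell list, dropping expired versions."""
    return [c for c in cells if not expired(c["ts"], now_ms, ttl_seconds)]

now = 1_700_000_000_000
cells = [
    {"row": "r1", "qualifier": "q", "ts": now - 5_000, "value": "new"},
    {"row": "r1", "qualifier": "q", "ts": now - 90_000, "value": "old"},
]
kept = major_compact(cells, now, ttl_seconds=60)
```

With a 60-second TTL, the version written 90 seconds ago is dropped and the one written 5 seconds ago survives.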
6. Compression
HFiles can be stored compressed on HDFS. HBase supports several compression codecs, including LZO, Snappy, and GZIP.
Note that data is compressed only on disk; it is not compressed in memory or during network transfer.
7. Cell versioning
By default, each HBase cell keeps three versions (this was the default in older releases; since HBase 0.96 the default is one). The number is configurable.
You can also specify the minimum number of versions for a column family.
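A minimal sketch of a version limit follows. It trims old versions at write time to keep the example short, whereas HBase removes surplus versions later, during compaction. The `max_versions` default of three mirrors the figure quoted above, and the function name is hypothetical.

```python
def put(cell_versions, timestamp, value, max_versions=3):
    """Add a version and keep only the newest max_versions entries."""
    cell_versions.append((timestamp, value))
    cell_versions.sort(key=lambda v: v[0], reverse=True)  # newest first
    del cell_versions[max_versions:]
    return cell_versions

versions = []
for ts, val in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
    put(versions, ts, val)
```

After four writes only the three newest versions remain, with the latest one first.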
VIII. Filtering data
Filters are also called push-down predicates. They let you push data-filtering criteria from the client down to the server.
Commonly used filters include:
1. Row filter
This is a built-in comparison filter that supports filtering data based on row keys.
2. Prefix filter
This is a special case of the row filter that matches on a prefix of the row key.
3. Qualifier filter
It is a comparison filter similar to the row filter, except that it matches the column qualifier rather than the row key. It uses the same comparison operators and comparator types as the row filter.
4. Value filter
It provides the same functionality as the row filter and the qualifier filter, but applies to cell values.
5. Timestamp filter
It gives finer-grained control over which versions are returned to the client.
6. Filter list
It is often useful to combine multiple filters.
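The idea behind these filters can be sketched with plain predicates. The functions below are loosely modeled on the HBase filter classes of similar names (`PrefixFilter`, `QualifierFilter`, `ValueFilter`, `TimestampsFilter`, `FilterList`), but they are simplified stand-ins and not the client API. The sample cells are made up.

```python
# Each cell is (row key, qualifier, timestamp, value).
cells = [
    ("user#001", "email", 10, "ann@example.com"),
    ("user#001", "name", 10, "Ann"),
    ("user#002", "email", 20, "ben@example.com"),
    ("order#001", "total", 30, "42"),
]

def prefix_filter(prefix):
    return lambda cell: cell[0].startswith(prefix)

def qualifier_filter(qualifier):
    return lambda cell: cell[1] == qualifier

def value_filter(substring):
    return lambda cell: substring in cell[3]

def timestamps_filter(allowed):
    return lambda cell: cell[2] in allowed

def filter_list(*filters):
    """Combine filters so that a cell must pass all of them."""
    return lambda cell: all(f(cell) for f in filters)

def server_side_scan(cells, cell_filter):
    """Apply the filter where the data lives; only matches are returned."""
    return [c for c in cells if cell_filter(c)]

emails = server_side_scan(
    cells, filter_list(prefix_filter("user#"), qualifier_filter("email"))
)
```

Only the two matching cells would cross the network. Filtering on the client would require shipping all four cells first.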
IX. Summary
Schema design starts with the questions to be answered, not with the relationships.
Schema design is never finished.
Data scale is a first-class consideration.
Every dimension is an opportunity to improve performance.