Reading Notes: HBase in Action, Part 2 Advanced Concepts (1): HBase Table Design

Source: Internet
Author: User

This chapter introduces HBase schema design patterns through a Twitter example. Broadly speaking, HBase schema design covers not only the options specified when creating a table, but also column families, column qualifiers, cell values, versions, and rowkeys.

Flexible schema & simple storage View

Schema design is closely tied to how data is stored and accessed. Let's first review the HBase data model, which has several key points:

  1. The indexed part consists of row key + column family + column qualifier + timestamp.
  2. Because HBase is schema-less and stores data by column, columns can be added dynamically without being defined at table creation. The column name is also stored in the HFile, so a column name can hold data just as a cell value does.

Step-by-Step practice

The starting point of schema design is modeling the problem to be solved. The following sections derive the relevant design principles by gradually improving the Twitter user relationship table.

The user relationship table, follows, needs to answer these questions:

  1. Which users does user A follow?
  2. Does user A follow user B?
  3. Who follows user A?

The initial version is designed as follows. The user is the rowkey, and the follows column family contains multiple columns. Each column name is a sequence number, and the stored value is the followed user.

 

The most obvious problems are:

  1. When a user follows someone new, the application needs extra logic to read the users already followed and increment the sequence number.
  2. Looking up a specific followed user is inefficient, because the whole row has to be read and searched.

The improvements are as follows:


Wide table vs narrow table

The previous design is a wide table: one row contains all the users that a given user follows. With a narrow table, the schema is as follows:

The biggest advantage of the narrow table design is that a lookup by rowkey (follower + followed) efficiently answers question 2: does user A follow user B? Question 1 (which users does user A follow?) becomes a scan, but given how HBase stores data by column underneath, the I/O reads the same amount of data (a get is internally implemented as a scan of a single row). The code snippet is as follows:

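// Scan all rows whose key starts with follower "A".
Scan s = new Scan();
s.addFamily(Bytes.toBytes("f"));
s.setStartRow(Bytes.toBytes("A"));
// The stop row is exclusive: use the prefix with its last byte incremented.
// ("A" + 1 is the string "A1" in Java, which would exclude keys such as "AB".)
s.setStopRow(Bytes.toBytes("B"));
ResultScanner results = followsTable.getScanner(s);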
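A prefix scan like this depends on picking an exclusive stop row that sorts after every key sharing the prefix. The following is a minimal sketch in plain Java, with no HBase dependency, of one way to derive that stop row and check which keys fall in the range. The class and method names (`PrefixScanRange`, `stopRowForPrefix`, `inRange`) are hypothetical and used only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PrefixScanRange {
    /**
     * Returns the smallest key that sorts after every key starting with
     * prefix, or null if no such key exists (prefix empty or all 0xFF).
     */
    static byte[] stopRowForPrefix(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                byte[] stop = Arrays.copyOf(prefix, i + 1); // drop trailing 0xFF bytes
                stop[i]++;                                   // bump the last remaining byte
                return stop;
            }
        }
        return null; // caller should scan to the end of the table
    }

    /** Unsigned lexicographic byte comparison, the order used for rowkeys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    /** True when start <= key < stop (a null stop means no upper bound). */
    static boolean inRange(byte[] key, byte[] start, byte[] stop) {
        return compare(key, start) >= 0 && (stop == null || compare(key, stop) < 0);
    }

    public static void main(String[] args) {
        byte[] start = "A".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowForPrefix(start);
        byte[] key = "AB".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(stop, StandardCharsets.UTF_8)); // prints B
        System.out.println(inRange(key, start, stop));                // prints true
    }
}
```

With the range ["A", "B") the key "AB" (follower A, followed B) is included, whereas a stop row of "A1" would exclude it because 'B' sorts after '1'.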

The biggest problem with the narrow table is that HBase guarantees atomicity only for operations on a single row. If a user follows several new users at once, the wide table can record them with one atomic put, but the narrow table requires multiple operations.

Note: To answer question 3 (who follows user A?), you can create a second table of followed users whose rowkey is followed + follower. The application is responsible for keeping the two tables consistent.

Rowkey Design

Hashing the rowkey helps distribute data evenly across regionservers. MD5 is commonly used to obtain a fixed-length key.
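As a rough illustration, the sketch below builds a fixed-length composite key for the narrow follows table by concatenating the MD5 digests of the follower and the followed user. It uses only the JDK; the class and method names (`HashedRowKey`, `followsRowKey`) and the choice to hash both parts are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashedRowKey {
    /** MD5 digest of a string: always 16 bytes, whatever the input length. */
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Rowkey = md5(follower) + md5(followed): always 32 bytes. */
    static byte[] followsRowKey(String follower, String followed) {
        byte[] a = md5(follower);
        byte[] b = md5(followed);
        byte[] key = new byte[a.length + b.length];
        System.arraycopy(a, 0, key, 0, a.length);
        System.arraycopy(b, 0, key, a.length, b.length);
        return key;
    }

    public static void main(String[] args) {
        // User names of very different lengths still yield 32-byte keys.
        System.out.println(followsRowKey("a", "b").length);
        System.out.println(followsRowKey("a_much_longer_user_name", "b").length);
    }
}
```

Because every key for the same follower shares the same 16-byte prefix, a scan over that prefix still returns all the users that follower follows.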


One tricky scenario is using a timestamp as the rowkey for time series data. Because rowkeys are ordered, new data is always inserted at the end of the table, so the last region of the table becomes a hotspot. The application needs to scan by time range, so it cannot simply hash the timestamp. In this case, you can consider "salting":

// floorMod keeps the salt non-negative even when the hash code is negative.
int salt = Math.floorMod(Long.hashCode(timestamp), <number of region servers>);
byte[] rowkey = Bytes.add(Bytes.toBytes(salt), Bytes.toBytes("|"), Bytes.toBytes(timestamp));

The generated rowkeys look like the following. Queries become a little more complex, because the application has to scan each salt prefix and merge the results.

0|timestamp1
0|timestamp5
0|timestamp6
1|timestamp2
1|timestamp9
2|timestamp4
2|timestamp8
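To make the extra work on the read side concrete, here is a small plain-Java sketch that simulates the sorted table with a `TreeMap`. It writes salted keys, then answers a time-range query by scanning each salt bucket and merging the results in the application. The bucket count, the zero-padded string encoding of the timestamp, and all names (`SaltedTimeSeries`, `rowKey`, `scanRange`) are assumptions for illustration, not the book's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

class SaltedTimeSeries {
    static final int BUCKETS = 3; // hypothetical; stands in for the number of region servers

    /** Zero-pad so that string order matches numeric order for non-negative timestamps. */
    static String pad(long ts) {
        return String.format("%013d", ts);
    }

    /** Rowkey = salt + "|" + timestamp, where the salt is derived from the timestamp. */
    static String rowKey(long ts) {
        int salt = Math.floorMod(Long.hashCode(ts), BUCKETS);
        return salt + "|" + pad(ts);
    }

    /** Returns the values with from <= ts < to, merged across all salt buckets. */
    static List<Long> scanRange(TreeMap<String, Long> table, long from, long to) {
        List<Long> merged = new ArrayList<>();
        for (int salt = 0; salt < BUCKETS; salt++) {
            String start = salt + "|" + pad(from); // inclusive
            String stop = salt + "|" + pad(to);    // exclusive
            merged.addAll(table.subMap(start, true, stop, false).values());
        }
        Collections.sort(merged); // buckets come back separately, so re-sort by time
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<String, Long> table = new TreeMap<>();
        for (long ts = 1; ts <= 9; ts++) {
            table.put(rowKey(ts), ts); // the stored value is just the timestamp here
        }
        System.out.println(scanRange(table, 3, 8)); // prints [3, 4, 5, 6, 7]
    }
}
```

One range query turns into one scan per bucket, which is the cost paid for spreading writes across regions.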

Denormalization

In the previous section, the rowkey contains the ID of the followed user, and the name of the followed user is stored in the column qualifier (CQ) rather than being looked up by joining the user table at query time. This is a degree of denormalization.

 

Because HBase columns are dynamic, a single HBase table can use nested columns to express a one-to-many relationship from a relational database (similar to the MongoDB document model).

Take Twitter as an example. With only the follows and twits tables, after a user logs in the application reads the users they follow from follows, then reads those users' twits from the twits table using the user IDs from the first step, and merges them to build the latest results for the user's home page. You can consider adding a redundant table that stores the twits displayed on each user's home page. Its rowkey is the logged-in user + inverted timestamp, and it stores the latest twits of the users that person follows. This table gives better read performance and also solves a common problem: a celebrity account followed by a very large number of users. Without the redundant table, the regionserver holding that account's twits becomes a hotspot, which may lead to a performance bottleneck.
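The inverted timestamp is what makes the newest twits sort first within a user's rows. A minimal sketch in plain Java, assuming the common Long.MAX_VALUE minus timestamp encoding and a zero-padded string key (the class name `TimelineRowKey`, the "|" separator, and the string encoding are illustrative assumptions):

```java
import java.util.TreeMap;

class TimelineRowKey {
    /**
     * Rowkey = user + "|" + (Long.MAX_VALUE - timestamp), zero-padded to 19 digits
     * so that lexicographic order matches numeric order.
     */
    static String rowKey(String userId, long tsMillis) {
        long inverted = Long.MAX_VALUE - tsMillis;
        return userId + "|" + String.format("%019d", inverted);
    }

    public static void main(String[] args) {
        // A TreeMap stands in for the sorted table.
        TreeMap<String, String> timeline = new TreeMap<>();
        timeline.put(rowKey("alice", 1000L), "oldest twit");
        timeline.put(rowKey("alice", 2000L), "middle twit");
        timeline.put(rowKey("alice", 3000L), "newest twit");
        // The first row for the user is the most recent one.
        System.out.println(timeline.firstEntry().getValue()); // prints newest twit
    }
}
```

A scan starting at the user's prefix then returns the home page in newest-first order and can stop after the first N rows.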


You can use a coprocessor to populate this table (coprocessors are similar to database triggers and are covered in a later chapter). Data retention is implemented with TTL (time to live; see the next section).

Column family configuration

HBase provides some configuration parameters that can be customized for each column family as needed when creating a table.

1. HFile block size: Not to be confused with the HDFS block size; the default is 64 KB. The block index stores the start key of each block, so the block size affects the index size. If your application is dominated by random lookups, you can choose a smaller block size. If it is dominated by sequential scans, you can use a larger block size.

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', BLOCKSIZE => '65536'}
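To get a feel for the trade-off, the following back-of-the-envelope sketch counts block index entries for a given file size and block size. It assumes one index entry per block and blocks that are exactly the configured size, which is a simplification; the class and method names are hypothetical.

```java
class BlockIndexEstimate {
    /** Number of blocks (and so index entries) needed to hold hfileBytes, rounded up. */
    static long indexEntries(long hfileBytes, long blockBytes) {
        return (hfileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        System.out.println(indexEntries(oneGiB, 64 * 1024)); // prints 16384
        System.out.println(indexEntries(oneGiB, 16 * 1024)); // prints 65536
    }
}
```

Under these assumptions, shrinking the block size from 64 KB to 16 KB quadruples the number of index entries, while each random read has a smaller block to load and search.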

2. Block cache: The block cache matters little for sequential scans. You can disable it for such column families and leave the memory to other tables or column families.

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', BLOCKCACHE => 'false'}

3. Aggressive caching: Gives certain column families a higher block cache priority, so that HBase keeps them in the LRU cache more aggressively.

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', IN_MEMORY => 'true'}

4. Bloom filters: The block index stores only the start key of each block, and the default block size is 64 KB. If each row in the table is small, a block holds many rows, so a lookup may spend a long time searching a block only to find that the data does not exist. A Bloom filter adds a negative test that quickly determines that the data is definitely not present.

hbase(main):007:0> create 'mytable', {NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}
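The negative test can be illustrated with a toy Bloom filter. This is a minimal sketch of the general technique in plain Java, not HBase's implementation; the sizes, the hash mixing, and the names (`TinyBloomFilter`, `mightContain`) are assumptions chosen for brevity.

```java
import java.util.BitSet;

class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    TinyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    /** i-th bit position for a key, derived from two simple hash values. */
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 15) * 31 + 17;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** false means definitely absent; true means possibly present. */
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(1024, 3);
        filter.add("row1|colA"); // a ROWCOL-style key: row plus column qualifier
        System.out.println(filter.mightContain("row1|colA")); // prints true
    }
}
```

A `false` answer lets a read skip the block entirely, while a `true` answer may be a false positive and still requires reading the block to confirm.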

5. TTL: Automatically removes expired data. The value is in seconds (18000 is 5 hours).

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', TTL => '18000'}

6. Compression: Snappy, released by Google, is a good choice.

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', COMPRESSION => 'SNAPPY'}

7. Cell versioning: The default is 3 versions in the HBase release the book covers (HBase 0.96 and later default to 1).

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 1}

Filter

Filters run on the regionserver, so the data is still loaded from disk into the regionserver. Filters therefore generally reduce network I/O rather than disk I/O (some filters, such as ColumnPrefixFilter, can also reduce the amount of data read from disk).

HBase provides the Filter interface, which lets you implement custom filters. The filter methods are called back in the following order:


HBase also has some built-in filters. A typical example is RowFilter.

public RowFilter(CompareOp rowCompareOp, WritableByteArrayComparable rowComparator)

CompareOp is the comparison operator enum, with values such as EQUAL, GREATER, and LESS. The comparator carries the specific comparison logic. Common examples include string comparison, regular expression matching, and binary comparison.
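The split between operator and comparator can be modeled in a few lines of plain Java. The sketch below is only a conceptual model: `Op`, `RowComparator`, `binary`, `regex`, and `accept` are hypothetical names and are not the HBase classes, whose internal comparison conventions may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class RowFilterSketch {
    enum Op { LESS, EQUAL, GREATER }

    /** Returns <0, 0 or >0 when the rowkey sorts before, equal to, or after the reference. */
    interface RowComparator {
        int compare(byte[] rowKey);
    }

    /** Byte-wise (unsigned) comparison against a fixed reference key. */
    static RowComparator binary(byte[] reference) {
        return rowKey -> {
            int n = Math.min(rowKey.length, reference.length);
            for (int i = 0; i < n; i++) {
                int diff = (rowKey[i] & 0xFF) - (reference[i] & 0xFF);
                if (diff != 0) return diff;
            }
            return rowKey.length - reference.length;
        };
    }

    /** Regex matching: 0 when the whole rowkey matches, 1 otherwise (only useful with EQUAL). */
    static RowComparator regex(String pattern) {
        Pattern p = Pattern.compile(pattern);
        return rowKey -> p.matcher(new String(rowKey, StandardCharsets.UTF_8)).matches() ? 0 : 1;
    }

    /** True when the row should be kept. */
    static boolean accept(Op op, RowComparator comparator, byte[] rowKey) {
        int c = comparator.compare(rowKey);
        switch (op) {
            case LESS:    return c < 0;
            case EQUAL:   return c == 0;
            case GREATER: return c > 0;
            default:      throw new IllegalArgumentException("unknown op: " + op);
        }
    }

    public static void main(String[] args) {
        byte[] row = "user-42".getBytes(StandardCharsets.UTF_8);
        System.out.println(accept(Op.EQUAL, regex("user-\\d+"), row)); // prints true
    }
}
```

Keeping the operator separate from the comparator means the same comparison logic can be reused for equality, range, and pattern checks.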
