Summary of HBase Secondary Indexing Schemes

Source: Internet
Author: User

Transferred from: http://blog.sina.com.cn/s/blog_4a1f59bf01018apd.html

See also, how to create a secondary index in HBase, with a worked example: http://www.aboutyun.com/thread-8857-1-1.html

Huawei secondary index (principles): http://my.oschina.net/u/923508/blog/413129

In HBase, the rowkeys of a table are sorted lexicographically, and the table is sharded into regions at split points defined on the rowkey. This global, distributed index is one of the main reasons for HBase's success. Figure 1 shows how the rowkey range of an HBase table is sharded across regions.

Figure 1: HBase rowkey-region diagram
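As a rough illustration of how sorted rowkeys and split points locate a region, here is a minimal sketch. It is not HBase code: the split points and the function name `region_for` are made up for the example, and it only assumes that each region owns a half-open key range with an inclusive start key.

```python
from bisect import bisect_right

# Hypothetical split points; they cut the key space into 4 regions:
# region 0: [ "", "g" )   region 1: [ "g", "n" )
# region 2: [ "n", "t" )  region 3: [ "t", +inf )
SPLIT_POINTS = [b"g", b"n", b"t"]


def region_for(rowkey: bytes, split_points=SPLIT_POINTS) -> int:
    """Return the index of the region whose key range contains rowkey.

    Rowkeys compare byte by byte (lexicographic order), so a binary
    search over the sorted split points is enough to find the region.
    """
    return bisect_right(split_points, rowkey)


for key in (b"apple", b"g", b"melon", b"zebra"):
    print(key, "-> region", region_for(key))
```

Because the lookup depends only on the rowkey, any query that cannot be expressed as a rowkey or rowkey-range lookup has to scan every region, which is the limitation discussed next.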


However, as more applications were built on HBase, it became clear that the global rowkey index alone no longer met their needs. Retrieving data only by rowkey is too restrictive for many applications, and people want SQL-like queries such as select * from table where col=val. Because HBase was positioned as large-table storage, such a query usually had to be run as a MapReduce computation through a system such as Hive or Pig. This not only wastes machine computing resources, but its high latency also puts applications off. As a result, secondary indexing became one of the most requested features in industry and in the community for the new HBase version (0.96).

A rough analysis of the current approaches shows that they fall into two categories:

1. Use HBase's coprocessor. A coprocessor is roughly an observer plus hooks inside HBase. It currently supports MasterObserver, RegionObserver, and WALObserver, so for table management and for data operations such as put, delete, and get you can find the corresponding pre*** and post*** hooks. This way, if you need a secondary index on a column, you can write the information to a separate index table whenever a put or delete happens. Figure 2 illustrates this. Whether to store the value inside the index table can be decided as needed: if the space overhead of the value is acceptable and lookups through the index are frequent, the value can be stored directly in the index table; otherwise it is better to avoid this.

Figure 2: Implementing secondary indexing with an HBase coprocessor
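The observer idea can be sketched without a cluster. The code below is an in-memory stand-in, not the HBase coprocessor API: `Table`, `IndexObserver`, `post_put`, `post_delete`, and the index rowkey layout (indexed value, a separator, then the main rowkey) are all assumptions made for illustration.

```python
class Table:
    """In-memory stand-in for a table: rowkey -> {column: value}."""

    def __init__(self):
        self.rows = {}
        self.observers = []  # hooks called after each mutation

    def put(self, rowkey, column, value):
        old = self.rows.get(rowkey, {}).get(column)
        self.rows.setdefault(rowkey, {})[column] = value
        for obs in self.observers:
            obs.post_put(rowkey, column, old, value)

    def delete(self, rowkey):
        old_row = self.rows.pop(rowkey, {})
        for obs in self.observers:
            obs.post_delete(rowkey, old_row)

    def scan_prefix(self, prefix):
        """Return the sorted rowkeys that start with prefix."""
        return sorted(k for k in self.rows if k.startswith(prefix))


class IndexObserver:
    """Keeps an index table in step with one column of the main table."""

    SEP = "\x00"  # assumed separator between indexed value and main rowkey

    def __init__(self, index_table, column):
        self.index = index_table
        self.column = column

    def index_key(self, value, rowkey):
        return f"{value}{self.SEP}{rowkey}"

    def post_put(self, rowkey, column, old, new):
        if column != self.column:
            return
        if old is not None and old != new:
            # The indexed value changed: drop the entry for the old value.
            self.index.delete(self.index_key(old, rowkey))
        self.index.put(self.index_key(new, rowkey), "ref", rowkey)

    def post_delete(self, rowkey, old_row):
        old = old_row.get(self.column)
        if old is not None:
            self.index.delete(self.index_key(old, rowkey))


def lookup(index_table, value):
    """Answer 'where col = value' with a prefix scan on the index table."""
    prefix = f"{value}{IndexObserver.SEP}"
    return [index_table.rows[k]["ref"] for k in index_table.scan_prefix(prefix)]


main, index = Table(), Table()
main.observers.append(IndexObserver(index, "col"))
main.put("r1", "col", "red")
main.put("r2", "col", "red")
print(lookup(index, "red"))
```

The point of the sketch is that the client only ever writes to the main table; index maintenance happens in the hook, on the server side in real HBase, which is also why it adds cost to every write.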

2. The client itself issues each put and delete twice, once against the main table and once against the index table. Source: http://hadoop-hbase.blogspot.com/2012/10/musings-on-secondary-indexes.html (the site is "outside the Wall")

The concrete approach can be summarized as follows:

    • Set the TTL (time to live) of the main table slightly smaller than that of the index table, so that main-table data expires a little earlier.
    • Do not store values in the index table, that is, drop the val column shown in Figure 2.
    • On a put, use the same local timestamp for all columns of the main-table operation: first write the update to the index table, then insert the main-table data with that timestamp.
    • On a delete, operate on the main table first, and then update the index table.
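The ordering rules above can be sketched as follows. This is an in-memory model, not the HBase client API: the dict-based tables, the function names, and the `fail_main` / `fail_index` switches (used only to simulate a crash between the two writes) are assumptions for illustration.

```python
import time


def client_put(main, index, rowkey, columns, fail_main=False):
    """Write the index entries first, then the main row, with one timestamp."""
    ts = int(time.time() * 1000)  # one local timestamp for the whole put
    for col, val in columns.items():
        # The index cell stores only the timestamp, not the row's data.
        index[(col, val, rowkey)] = ts
    if fail_main:  # hypothetical switch: simulate a failure between the writes
        raise IOError("simulated main-table failure")
    row = main.setdefault(rowkey, {})
    for col, val in columns.items():
        row[col] = (val, ts)
    return ts


def client_delete(main, index, rowkey, fail_index=False):
    """Delete from the main table first, then clean up the index entries."""
    row = main.pop(rowkey, {})
    if fail_index:  # hypothetical switch: simulate a failure between the writes
        raise IOError("simulated index-table failure")
    for col, (val, _ts) in row.items():
        index.pop((col, val, rowkey), None)


def index_covers_main(main, index):
    """True if every cell in the main table has a matching index entry."""
    return all(
        (col, val, rowkey) in index
        for rowkey, row in main.items()
        for col, (val, _ts) in row.items()
    )


main_table, index_table = {}, {}
client_put(main_table, index_table, "r1", {"col": "red"})
client_delete(main_table, index_table, "r1")
print(index_covers_main(main_table, index_table))
```

With this ordering, a failure at any point can leave extra entries in the index table but never a main-table row that the index cannot reach. Overwriting a column with a new value also leaves the old index entry behind in this sketch, which is the same kind of stale entry.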

Although this scheme guarantees neither atomicity nor consistency, it relies only on timestamps, with no locks and no server-side code, which gives it a great advantage among secondary indexing schemes. As for the errors that can occur midway, let's see whether they can be tolerated:

1) The put to the index table succeeds, and the put to the main table fails. Since the index table does not store values, a query still has to go to the main table, so this error amounts to using a stale index entry to look up the corresponding rowkey. It has no effect on the correctness of the results.

2) The delete on the main table succeeds, and the delete on the index table fails. The index table then contains at least as much as the main table (index >= main), and the actual result is still determined by checking the main table.
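Both failure cases are tolerable only because every read goes back to the main table. A minimal sketch of such a read path follows; the data layout and the name `query_by_value` are assumptions, and for simplicity it validates by comparing the current value rather than by timestamp.

```python
def query_by_value(main, index, column, value):
    """Answer 'where column = value' through the index, validated on main.

    index is a set of (column, value, rowkey) entries that may be stale;
    main maps rowkey -> {column: value} and is the source of truth.
    """
    hits = []
    for col, val, rowkey in sorted(index):
        if col != column or val != value:
            continue
        # Jump to the main table and keep the row only if it still matches.
        if main.get(rowkey, {}).get(column) == value:
            hits.append(rowkey)
    return hits


main_table = {"r1": {"col": "red"}, "r3": {"col": "blue"}}
index_table = {
    ("col", "red", "r1"),   # valid entry
    ("col", "red", "r2"),   # stale: r2 is not in the main table
    ("col", "red", "r3"),   # stale: r3 was later overwritten with "blue"
    ("col", "blue", "r3"),  # valid entry
}
print(query_by_value(main_table, index_table, "col", "red"))  # prints ['r1']
```

Stale entries cost an extra main-table lookup each, so they affect read performance but not the result.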

Which method is more practical in a production environment?

On this question, based on my own experience so far with production HBase clusters, and weighing the pros and cons of the two methods above, the design can go as follows.

1. The main table serves online business, so its performance must be guaranteed. Both the coprocessor and the client-side wrapper affect its performance, so under normal circumstances maintaining the index directly in the write path is not appropriate. If you do want to use approach two, I think you can tune the operations on the index table by dropping some of the safety guarantees, for example turning off writes to the HLog (the write-ahead log), which further reduces the latency of those operations.

2. Update the index table offline. In scenarios where a secondary index is really needed, the timeliness requirements are often low. Index updates can be written in real time to a Redis-like KV system and periodically flushed from the KV store to the index table in HBase. PS: Redis has the concept of DBs, which can be used to isolate data by time period, so that updates from a given period go to a given Redis DB. This way new updates can still be accepted while a MapReduce job imports an earlier period from Redis.
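The time-period isolation can be sketched with an in-memory buffer standing in for the Redis-like KV system. Nothing here is a Redis or HBase call: `BucketedBuffer`, its methods, and the use of a plain set as the index table are assumptions for illustration.

```python
from collections import defaultdict


class BucketedBuffer:
    """Buffers index updates in per-period buckets (stand-in for Redis DBs)."""

    def __init__(self, period_seconds):
        self.period = period_seconds
        self.buckets = defaultdict(list)  # bucket id -> [(op, index_key), ...]

    def bucket_of(self, ts):
        return int(ts // self.period)

    def record(self, ts, op, index_key):
        """Called in real time by the writer; op is 'put' or 'delete'."""
        self.buckets[self.bucket_of(ts)].append((op, index_key))

    def flush_closed(self, now, index_table):
        """Apply every bucket older than the current one, oldest first.

        The bucket for the current period is left alone, so writers can
        keep appending to it while older periods are being imported.
        """
        current = self.bucket_of(now)
        applied = 0
        for bucket_id in sorted(b for b in self.buckets if b < current):
            for op, key in self.buckets.pop(bucket_id):
                if op == "put":
                    index_table.add(key)
                else:
                    index_table.discard(key)
                applied += 1
        return applied


buffer = BucketedBuffer(period_seconds=60)
index_table = set()
buffer.record(10, "put", "red/r1")
buffer.record(70, "put", "red/r2")
print(buffer.flush_closed(now=75, index_table=index_table), sorted(index_table))
```

In a real deployment the flush step would be the periodic batch job, and the index table would lag the main table by up to one period plus the job's running time.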

PS: Secondary indexing for HBase remains an ongoing concern for both the community and production systems.
