Environment: SuSE Linux, 8 GB memory, 8 cores, 12 TB disk per node. One node runs the HBase master; the other 7 run HBase RegionServers. (The Hadoop setup itself is not discussed here.)
Scenario: we have a table with 700 million rows and another with 170 million rows. For each row of the large table we need to look up a field in the smaller table and write the result back. For convenience, call the 700-million-row table "table A" and the 170-million-row table "table B".

In the early version, the logic read table A one row at a time, built a single Get from one of its fields, and submitted it immediately to fetch the corresponding value from B; the result was then wrapped in a Put and submitted back to table A. Doing it this way defeats HBase completely: the row keys of consecutive Gets are not contiguous, nor even in the same range, so the server loads a different block into memory for almost every request and blocks are replaced far too frequently because the requested blocks are scattered. The effective performance is close to reading the disk directly. With this logic it took about 40 minutes to read and write roughly 100,000 (10W) rows, which was completely unacceptable for updating table A.

The fix was to buffer the read side: encapsulate the read operations by accumulating Gets into an ArrayList&lt;Get&gt; and submitting them as one batch once the list reaches a certain size. There is one complication: some keys cannot be found in B at all. For those, we must ask the database to allocate a value automatically and write it back immediately, because a later row may request the same key; this case has to be handled separately. To build the Get list, first test whether each key exists. Note: do not use table.get() and check whether the returned cell is null. Use table.exists(get) instead, which is evaluated server-side and is much more efficient. For keys that are not found, allocate a value right away and write it to the table.
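The effect of batching on round trips can be sketched as follows. This is a minimal, self-contained simulation of the pattern, not real HBase code: the RegionServer is stood in by a counter, and the flush() call stands in for table.get(List&lt;Get&gt;) (or table.put(List&lt;Put&gt;) on the write side). The batch size of 1,000 is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simulation of the batched-request pattern described above.
 * flush() stands in for one RPC to the server (table.get(List<Get>)
 * in real HBase client code), so the reduction in round trips is
 * visible without a cluster.
 */
public class BatchingSketch {
    static final int BATCH_SIZE = 1000;   // assumed flush threshold

    static int rpcCount;                  // simulated round trips to the server

    // Stand-in for table.get(List<Get>): one RPC per non-empty batch.
    static void flush(List<String> batch) {
        if (batch.isEmpty()) return;
        rpcCount++;
        batch.clear();
    }

    // Process 'rows' keys with buffering; returns the number of RPCs issued.
    static int processBatched(int rows) {
        rpcCount = 0;
        List<String> pending = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            pending.add("row-" + i);
            if (pending.size() >= BATCH_SIZE) flush(pending);
        }
        flush(pending);                   // submit the partial tail batch
        return rpcCount;
    }

    public static void main(String[] args) {
        // The first attempt issued one RPC per row: 100,000 round trips.
        // With batching, the same 100,000 rows take only 100 round trips.
        System.out.println("batched RPCs for 100000 rows: " + processBatched(100_000));
    }
}
```

The same buffer-then-flush shape applies to the Put side; only the flushed call changes.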
For keys that can be queried, add the Get (organized as described above) to the Get list and submit the batch once it reaches the threshold. After the values come back, wrap them into Puts on the basis of the rows being looped over; the Put list is sized like the Get list (no need to fix an exact number), and after each loop it is submitted directly with table.put(). Other parameter changes also improved write throughput considerably, but are not covered here.
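The separate handling of missing keys can be sketched like this. Again a self-contained simulation under stated assumptions: table B is stood in by a HashMap and the ID allocator by a counter; in real code the membership test would be table.exists(get) (evaluated server-side) and the write-back an immediate, unbatched Put. The key point is that the allocated value is written back right away, so a later row in the same run that asks for the same key finds it.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the "key missing from table B" path described above:
 * allocate a fresh value and write it back immediately (not batched),
 * because a later row may look up the same key. The HBase table and
 * ID allocator are stand-ins (HashMap and a counter).
 */
public class AllocateOnMiss {
    static final Map<String, Long> tableB = new HashMap<>();  // stand-in for table B
    static long nextId = 1;                                   // stand-in for the allocator

    static long lookupOrAllocate(String key) {
        Long v = tableB.get(key);       // real code: table.exists(get), then batched Get
        if (v != null) return v;        // found: handled via the batched path
        long allocated = nextId++;      // miss: ask the allocator for a new value
        tableB.put(key, allocated);     // write back immediately, outside the batch
        return allocated;
    }

    public static void main(String[] args) {
        long first = lookupOrAllocate("row-42");   // miss: allocates and writes back
        long second = lookupOrAllocate("row-42");  // hit: sees the value written above
        System.out.println(first == second);       // the two lookups agree
    }
}
```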
Processing speed (taken from a task running on one of the servers):

2011-12-30 17:10:03 start write lines: 1700000
2011-12-30 17:14:10 writed lines: 1700000
2011-12-30 17:14:11 start write lines: 1800000
2011-12-30 17:18:21 writed lines: 1800000
2011-12-30 17:18:22 start write lines: 1900000
2011-12-30 17:22:29 writed lines: 1900000
2011-12-30 17:22:29 start write lines: 2000000
2011-12-30 17:26:37 writed lines: 2000000
2011-12-30 17:26:37 start write lines: 2100000
In total, the combined query-and-write speed is about 100,000 (10W) rows every 4 minutes, i.e. roughly 400 rows/s, an order of magnitude faster than the original approach.
PS: friends working with Hadoop and HBase are welcome to contact me to exchange ideas.
Reference: http://www.cnblogs.com/chenjingjing/archive/2010/01/26/1656895.html