Optimizing HBase Query Operations: Dramatically Improving Read/Write Rates

Source: Internet
Author: User

Environment: SuSE, 8 GB memory, 8 cores, 12 TB disk. The HBase master occupies one machine; the other 7 run as HBase Region Servers. Note: Hadoop is not discussed here.
Scenario: We have a table with 700 million rows. For each row, we need to look up a field in a second table of 170 million rows and write it into the 700-million-row table. For convenience, the 700-million-row table is called table A and the 170-million-row table is called table B.

In the early days, our logic was to read one row from table A, build a Get from one of its fields, and submit it immediately to fetch the required field from table B. The result was then turned into a Put and submitted to table A.

This is very inefficient for HBase, because the keys of successive Gets are not necessarily contiguous, or even in the same range. The HBase server loads data into memory for each request, and because the requested blocks are scattered, the data in memory is replaced too frequently. Performance is likely to be close to reading the data directly from disk.

Processing data this way takes far too long: about 40 minutes to read and write roughly 100,000 rows. That is completely unacceptable for an update of table A.

The improvement was to wrap the read operations: the Gets are collected into an ArrayList<Get> and submitted together once the list reaches a certain size. There is one special case. Some data cannot be found by the query; for these we need to connect to the database, request an automatically allocated value, and submit it immediately, because later rows may request the same data. This case has to be handled separately.

To build the Get list, you first need to check whether the data exists. Note: do not use table.get to fetch a cell value and then test it for null. Use table.exists(get) instead; the check runs on the server side and is much more efficient.

For a value that the query does not find, request a value immediately and write it to the table. For a value that the query does find, add the Get to the Get list as described above and submit the list once it reaches a certain size. After the values have been retrieved, loop over the corresponding records and build the Puts. Their number matches the Get list, so there is no need to set a separate threshold; after the loop finishes, call table.put once with the whole list.

Other parameter changes have been written about widely elsewhere and are not covered here.
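The batched flow described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: in-memory maps stand in for tables A and B, and every class and method name (BatchedLookupSketch, LookupTable, allocate, process) is hypothetical. It also assumes that an allocated value is stored in table B and written to table A, which the text does not state explicitly. With the real HBase client, the corresponding calls would presumably be Table.exists(Get), Table.get(List<Get>), and Table.put(List<Put>).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: in-memory maps stand in for HBase tables A and B.
// Every class and method name here is hypothetical, not HBase API.
class BatchedLookupSketch {

    /** Stand-in for table B (lookup key -> field value). */
    static final class LookupTable {
        final Map<String, String> rows = new HashMap<>();
        int calls = 0;          // simulated client/server round trips
        private int nextId = 1;

        // Plays the role of an existence check: no cell value is returned.
        boolean exists(String key) {
            calls++;
            return rows.containsKey(key);
        }

        // Plays the role of submitting a whole list of Gets in one call.
        List<String> getAll(List<String> keys) {
            calls++;
            List<String> values = new ArrayList<>();
            for (String key : keys) {
                values.add(rows.get(key));
            }
            return values;
        }

        // Hypothetical allocator: assigns a value to a missing key and
        // stores it at once so that later rows can find it.
        String allocate(String key) {
            calls++;
            String value = "auto-" + nextId++;
            rows.put(key, value);
            return value;
        }
    }

    /**
     * Processes rows of table A, each given as {rowKey, lookupKey}, and
     * returns the values written back to table A, keyed by rowKey.
     */
    static Map<String, String> process(List<String[]> rowsOfA, LookupTable b, int batchSize) {
        Map<String, String> tableA = new LinkedHashMap<>();
        List<String> pendingRowKeys = new ArrayList<>();
        List<String> pendingLookupKeys = new ArrayList<>();

        for (String[] row : rowsOfA) {
            String rowKey = row[0];
            String lookupKey = row[1];
            if (!b.exists(lookupKey)) {
                // Missing in B: allocate a value and write it immediately.
                tableA.put(rowKey, b.allocate(lookupKey));
                continue;
            }
            pendingRowKeys.add(rowKey);
            pendingLookupKeys.add(lookupKey);
            if (pendingRowKeys.size() >= batchSize) {
                flush(pendingRowKeys, pendingLookupKeys, b, tableA);
            }
        }
        flush(pendingRowKeys, pendingLookupKeys, b, tableA); // leftover partial batch
        return tableA;
    }

    // One batched read from B, then one batched write to A.
    private static void flush(List<String> rowKeys, List<String> lookupKeys,
                              LookupTable b, Map<String, String> tableA) {
        if (rowKeys.isEmpty()) {
            return;
        }
        List<String> values = b.getAll(lookupKeys);
        Map<String, String> puts = new LinkedHashMap<>();
        for (int i = 0; i < rowKeys.size(); i++) {
            puts.put(rowKeys.get(i), values.get(i));
        }
        tableA.putAll(puts); // stands in for a single batched put
        rowKeys.clear();
        lookupKeys.clear();
    }

    public static void main(String[] args) {
        LookupTable b = new LookupTable();
        b.rows.put("k1", "v1");
        List<String[]> rowsOfA = new ArrayList<>();
        rowsOfA.add(new String[] {"r1", "k1"});
        rowsOfA.add(new String[] {"r2", "k2"}); // k2 is missing, so it is allocated
        System.out.println(process(rowsOfA, b, 100));
    }
}
```

The point of the design is that the number of value-fetching and writing calls grows with the number of batches rather than the number of rows, while only the lightweight existence check and the rare allocation remain per-row operations.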
Processing speed (taken from a task running on one of the servers):

2011-12-30 17:10:03 start write lines:1700000
2011-12-30 17:14:10 writed lines:1700000
2011-12-30 17:14:11 start write lines:1800000
2011-12-30 17:18:21 writed lines:1800000
2011-12-30 17:18:22 start write lines:1900000
2011-12-30 17:22:29 writed lines:1900000
2011-12-30 17:22:29 start write lines:2000000
2011-12-30 17:26:37 writed lines:2000000
2011-12-30 17:26:37 start write lines:2100000
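With this approach, querying and writing 100,000 rows takes about 4 minutes, which works out to roughly 400 rows per second for this single task. The original post quotes a rate of 4000/s, which would only follow from these logs if about ten such tasks ran in parallel. Either way, this is an order of magnitude faster than the previous approach, which needed about 40 minutes per 100,000 rows.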
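The per-task rate can be checked directly from the timestamps in the log. The sketch below assumes that each "start write" / "writed" pair covers 100,000 rows, since the line counter advances by 100,000 between pairs; the class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch: derive throughput from two log timestamps.
// Assumes each start/finish pair in the log covers 100,000 rows.
class ThroughputFromLog {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Rows per second between two timestamps in the log's format. */
    static double rowsPerSecond(String start, String end, long rows) {
        long seconds = Duration.between(
                LocalDateTime.parse(start, FMT),
                LocalDateTime.parse(end, FMT)).getSeconds();
        return (double) rows / seconds;
    }

    /** How many times faster the new duration is than the old one. */
    static double speedup(long oldSeconds, long newSeconds) {
        return (double) oldSeconds / newSeconds;
    }

    public static void main(String[] args) {
        // Batch ending at line 1800000: 17:14:11 to 17:18:21 is 250 seconds.
        double rate = rowsPerSecond("2011-12-30 17:14:11", "2011-12-30 17:18:21", 100_000);
        System.out.println("rows/s: " + rate);
        // Old approach: about 40 minutes per 100,000 rows.
        System.out.println("speedup: " + speedup(40 * 60, 250));
    }
}
```

For the batch that ends at line 1800000, this gives 400 rows per second and a speedup of 9.6 over the 40-minute baseline.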
PS: Anyone working with Hadoop or HBase is welcome to contact me to exchange ideas.

Reference http://www.cnblogs.com/chenjingjing/archive/2010/01/26/1656895.html
