Environment: SuSE Linux, 8 GB memory, 8 cores, 12 TB disk per node. One node runs the HBase master; the other 7 run HBase RegionServers. (The Hadoop setup itself is not discussed here.)
Scenario: we have a table with 700 million rows and another with 170 million rows. For each row of the large table we need to look up a field in the smaller table and write the result back. For convenience, call the 700-million-row table "table A" and the 170-million-row table "table B".

In the early version, the logic read table A one row at a time, built a single Get from one of its fields, and submitted it immediately to fetch the corresponding value from B; the result was then wrapped in a Put and submitted back to table A. Doing it this way defeats HBase completely: the row keys of consecutive Gets are not contiguous, nor even in the same range, so the server loads a different block into memory for almost every request and blocks are replaced far too frequently because the requested blocks are scattered. The effective performance is close to reading the disk directly. With this logic it took about 40 minutes to read and write roughly 100,000 (10W) rows, which was completely unacceptable for updating table A.

The fix was to buffer the read side: encapsulate the read operations by accumulating Gets into an ArrayList&lt;Get&gt; and submitting them as one batch once the list reaches a certain size. There is one complication: some keys cannot be found in B at all. For those, we must ask the database to allocate a value automatically and write it back immediately, because a later row may request the same key; this case has to be handled separately. To build the Get list, first test whether each key exists. Note: do not use table.get() and check whether the returned cell is null. Use table.exists(get) instead, which is evaluated server-side and is much more efficient. For keys that are not found, allocate a value right away and write it to the table.
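The effect of batching on round trips can be sketched as follows. This is a minimal, self-contained simulation of the pattern, not real HBase code: the RegionServer is stood in by a counter, and the flush() call stands in for table.get(List&lt;Get&gt;) (or table.put(List&lt;Put&gt;) on the write side). The batch size of 1,000 is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simulation of the batched-request pattern described above.
 * flush() stands in for one RPC to the server (table.get(List<Get>)
 * in real HBase client code), so the reduction in round trips is
 * visible without a cluster.
 */
public class BatchingSketch {
    static final int BATCH_SIZE = 1000;   // assumed flush threshold

    static int rpcCount;                  // simulated round trips to the server

    // Stand-in for table.get(List<Get>): one RPC per non-empty batch.
    static void flush(List<String> batch) {
        if (batch.isEmpty()) return;
        rpcCount++;
        batch.clear();
    }

    // Process 'rows' keys with buffering; returns the number of RPCs issued.
    static int processBatched(int rows) {
        rpcCount = 0;
        List<String> pending = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            pending.add("row-" + i);
            if (pending.size() >= BATCH_SIZE) flush(pending);
        }
        flush(pending);                   // submit the partial tail batch
        return rpcCount;
    }

    public static void main(String[] args) {
        // The first attempt issued one RPC per row: 100,000 round trips.
        // With batching, the same 100,000 rows take only 100 round trips.
        System.out.println("batched RPCs for 100000 rows: " + processBatched(100_000));
    }
}
```

The same buffer-then-flush shape applies to the Put side; only the flushed call changes.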
For keys that can be queried, add the Get (organized as described above) to the Get list and submit the batch once it reaches the threshold. After the values come back, wrap them into Puts on the basis of the rows being looped over; the Put list is sized like the Get list (no need to fix an exact number), and after each loop it is submitted directly with table.put(). Other parameter changes also improved write throughput considerably, but are not covered here.
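The separate handling of missing keys can be sketched like this. Again a self-contained simulation under stated assumptions: table B is stood in by a HashMap and the ID allocator by a counter; in real code the membership test would be table.exists(get) (evaluated server-side) and the write-back an immediate, unbatched Put. The key point is that the allocated value is written back right away, so a later row in the same run that asks for the same key finds it.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the "key missing from table B" path described above:
 * allocate a fresh value and write it back immediately (not batched),
 * because a later row may look up the same key. The HBase table and
 * ID allocator are stand-ins (HashMap and a counter).
 */
public class AllocateOnMiss {
    static final Map<String, Long> tableB = new HashMap<>();  // stand-in for table B
    static long nextId = 1;                                   // stand-in for the allocator

    static long lookupOrAllocate(String key) {
        Long v = tableB.get(key);       // real code: table.exists(get), then batched Get
        if (v != null) return v;        // found: handled via the batched path
        long allocated = nextId++;      // miss: ask the allocator for a new value
        tableB.put(key, allocated);     // write back immediately, outside the batch
        return allocated;
    }

    public static void main(String[] args) {
        long first = lookupOrAllocate("row-42");   // miss: allocates and writes back
        long second = lookupOrAllocate("row-42");  // hit: sees the value written above
        System.out.println(first == second);       // the two lookups agree
    }
}
```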
Processing speed (taken from a task running on one of the servers):

2011-12-30 17:10:03 start write lines: 1700000
2011-12-30 17:14:10 writed lines: 1700000
2011-12-30 17:14:11 start write lines: 1800000
2011-12-30 17:18:21 writed lines: 1800000
2011-12-30 17:18:22 start write lines: 1900000
2011-12-30 17:22:29 writed lines: 1900000
2011-12-30 17:22:29 start write lines: 2000000
2011-12-30 17:26:37 writed lines: 2000000
2011-12-30 17:26:37 start write lines: 2100000
In total, the combined query-and-write speed is about 100,000 (10W) rows every 4 minutes, i.e. roughly 400 rows/s, an order of magnitude faster than the original approach.
PS: friends working with Hadoop and HBase are welcome to contact me to exchange ideas.
Reference: http://www.cnblogs.com/chenjingjing/archive/2010/01/26/1656895.html