Environment: SuSE, 8 GB memory, 8 cores, 12 TB disk. The HBase master occupies one server, and the other seven servers act as HBase region servers. (Note: the Hadoop side of the deployment is not discussed here.)

The task: we have a table of 700 million rows, and for each row we need to look up a field in a 170 million-row table and write the result back into the 700 million-row table. For ease of description, the 700 million-row table is called table A and the 170 million-row table is called table B.

In the initial version, the logic was: read one row from table A, organize a Get from one of its fields and submit it immediately, fetch the value of the wanted field from table B, wrap it in a Put, and submit that back to table A. This cannot exploit HBase's efficiency at all, because the keys of successive Gets are not necessarily contiguous or even in the same range, and the server loads data into memory for every request. Since the requested blocks are scattered, data is constantly swapped in and out of memory, and performance is likely close to reading the disk directly. Processing with this logic takes far too long: about 40 minutes to read and write 100,000 rows, which is completely unacceptable for updating table A. (A minimal sketch of this row-at-a-time version appears at the end of the post.)

The next version batches the reads: rows are read and their Gets collected into an ArrayList<Get>, which is submitted once it reaches a certain size. One complication: some keys cannot be found in table B. For those, you have to connect to the database, request an automatically assigned value, and submit it immediately, because the same key may be requested again later; this needs separate handling while the Get list is being organized, so each key must be checked first. Note that you should not use table.get() to fetch the cell and test whether it is null; use table.exists(get) instead. The existence check runs on the server side and is much more efficient. Keys that cannot be found get a freshly assigned value written back immediately; keys that can be found are added to the Get list as described above and submitted once the batch is full. After the batched Get returns, loop over the results and wrap each one in a Put, so that the number of Puts matches the Get list, then write them all back directly with a single table.put(putList). (This batched version is also sketched below.) There are many further variations that tune other Put parameters for writing, which are not covered here.

Processing speed (taken from the task running on one of the servers):

2011-12-30 17:10:03 start write lines: 1700000
2011-12-30 17:14:10 writed lines: 1700000
2011-12-30 17:14:11 start write lines: 1800000
2011-12-30 17:18:21 writed lines: 1800000
2011-12-30 17:18:22 start write lines: 1900000
2011-12-30 17:22:29 writed lines: 1900000
2011-12-30 17:22:29 start write lines: 2000000
2011-12-30 17:26:37 writed lines: 2000000
2011-12-30 17:26:37 start write lines: 2100000

Each interval covers both the queries and the writes: roughly 100,000 rows every 4 minutes, i.e., about 400 rows/s, an order of magnitude faster than the previous approach.

PS: You are welcome to contact me about Hadoop and HBase.
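Appendix: for concreteness, a minimal sketch of the initial row-at-a-time logic, assuming the 0.90-era HBase client API that was current at the time. The table names "A" and "B" follow the post, while the cf/key/val column names are made up for illustration.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NaiveJoin {
        static final byte[] CF  = Bytes.toBytes("cf");   // hypothetical column family
        static final byte[] KEY = Bytes.toBytes("key");  // field in A that keys into B
        static final byte[] VAL = Bytes.toBytes("val");  // field copied from B into A

        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable tableA = new HTable(conf, "A");
            HTable tableB = new HTable(conf, "B");
            ResultScanner scanner = tableA.getScanner(new Scan());
            for (Result aRow : scanner) {
                // Organize a Get from one field of the A row and submit it at once.
                Get get = new Get(aRow.getValue(CF, KEY));
                get.addColumn(CF, VAL);
                Result bRow = tableB.get(get);   // one scattered single-row read per A row
                // Wrap the looked-up value in a Put and write it straight back to A.
                Put put = new Put(aRow.getRow());
                put.add(CF, VAL, bRow.getValue(CF, VAL));
                tableA.put(put);
            }
            scanner.close();
            tableA.close();
            tableB.close();
        }
    }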
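And a sketch of the batched version, under the same assumptions. The batch size of 1000 and the nextAssignedValue() helper are hypothetical stand-ins: the post does not say how large the Get list grows before submission or how the automatically assigned values are obtained.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedJoin {
        static final byte[] CF  = Bytes.toBytes("cf");
        static final byte[] KEY = Bytes.toBytes("key");
        static final byte[] VAL = Bytes.toBytes("val");
        static final int BATCH = 1000;  // assumed batch size, not from the post

        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable tableA = new HTable(conf, "A");
            HTable tableB = new HTable(conf, "B");
            List<Get> gets = new ArrayList<Get>();
            List<byte[]> aKeys = new ArrayList<byte[]>();  // A row key per buffered Get
            ResultScanner scanner = tableA.getScanner(new Scan());
            for (Result aRow : scanner) {
                Get get = new Get(aRow.getValue(CF, KEY));
                get.addColumn(CF, VAL);
                if (!tableB.exists(get)) {
                    // Missing key: request an assigned value and submit it immediately,
                    // because later rows may ask for the same key. Handled separately.
                    byte[] assigned = nextAssignedValue();
                    Put fill = new Put(get.getRow());
                    fill.add(CF, VAL, assigned);
                    tableB.put(fill);
                    Put aPut = new Put(aRow.getRow());  // value already known, update A now
                    aPut.add(CF, VAL, assigned);
                    tableA.put(aPut);
                } else {
                    gets.add(get);
                    aKeys.add(aRow.getRow());
                    if (gets.size() >= BATCH) flush(tableA, tableB, gets, aKeys);
                }
            }
            flush(tableA, tableB, gets, aKeys);  // drain the last partial batch
            scanner.close();
            tableA.close();
            tableB.close();
        }

        static void flush(HTable tableA, HTable tableB,
                          List<Get> gets, List<byte[]> aKeys) throws IOException {
            if (gets.isEmpty()) return;
            Result[] results = tableB.get(gets);  // one round trip for the whole batch
            List<Put> puts = new ArrayList<Put>(results.length);  // same count as the Gets
            for (int i = 0; i < results.length; i++) {
                Put put = new Put(aKeys.get(i));
                put.add(CF, VAL, results[i].getValue(CF, VAL));
                puts.add(put);
            }
            tableA.put(puts);  // single batched write back to table A
            gets.clear();
            aKeys.clear();
        }

        static byte[] nextAssignedValue() {
            // Hypothetical stand-in for "connect to the database and request an
            // automatically assigned value".
            return Bytes.toBytes(System.nanoTime());
        }
    }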
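Finally, the "other Put parameters" mentioned above are not spelled out in the post. As one example of that family of write-side knobs (an assumption, not the author's actual settings), the old client API lets you buffer Puts and skip the write-ahead log, trading durability for throughput:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteKnobs {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable tableA = new HTable(conf, "A");
            tableA.setAutoFlush(false);                  // buffer Puts on the client
            tableA.setWriteBufferSize(8 * 1024 * 1024);  // flush in ~8 MB chunks
            Put put = new Put(Bytes.toBytes("row1"));
            put.setWriteToWAL(false);  // skip the write-ahead log: faster, less durable
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("val"), Bytes.toBytes("v"));
            tableA.put(put);           // stays in the client buffer until it fills...
            tableA.flushCommits();     // ...or is flushed explicitly
            tableA.close();
        }
    }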
Reference: http://www.cnblogs.com/chenjingjing/archive/2010/01/26/1656895.html