After many revisions, this wrapper runs stably and is already used in production. Its internal threads communicate through the wait()/notify() mechanism, which is efficient. This article only covers multithreading on top of the HBase client at the Java level; the threading inside the HBase API itself is out of scope. The jar packages required by the HBase client are not listed here; they can be downloaded together with the source code at the end of the article. The demos below show the usage.
Features:
■ Using multithreading at the Java level maximizes read/write performance.
■ Greatly reduces the amount of application code, so you can focus on data analysis.
■ On average, performance is at least several orders of magnitude higher than with a single thread.
HScan
HScan has five public methods:
1. public HScan(String table, String[] columns, int threadNum)
table is the table name and columns lists the columns to scan. If columns is null, all columns are scanned; if it is an empty array, only the rowkey is scanned. threadNum sets the number of concurrent scanner threads.
2. public void addTask(String start, String end, int num). start and end are the first and last rowkey of the range, and num controls how many segments the range is split into. Splitting is mainly meant for rowkeys that begin with a timestamp and can therefore be computed as integers. If the rowkey is not within the integer range, set num to 1. This method can be called multiple times; each call adds num tasks to the task pool.
3. public void fuck(), an inherited method that starts the scan.
4. public HashMap<String, String> fetch() takes one result from the buffer. The keys in the result are the column names plus "rowKey". A return value of null means all rows have been read.
5. public void status(), an unsynchronized method that shows an approximate running status of the threads. The output is always redrawn on the current line and has four columns: the number of results returned so far; the number of rows in the buffer; the number of concurrent threads / threads in Thread.State.WAITING / threads in Thread.State.BLOCKED; and the number of tasks left in the task pool.
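The splitting that addTask performs on integer rowkeys can be pictured with the sketch below. RangeSplitter is a hypothetical helper written for this illustration and is not taken from the HScan source; it assumes half-open [start, end) segments, which matches how an HBase scan treats its stop row, but the real addTask may handle boundaries differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the HScan source: one way a numeric
// rowkey range could be cut into contiguous [start, end) segments.
class RangeSplitter {

    /** Splits [start, end) into at most num contiguous, non-overlapping segments. */
    static List<long[]> split(long start, long end, int num) {
        if (num < 1 || end < start) {
            throw new IllegalArgumentException("need num >= 1 and start <= end");
        }
        List<long[]> segments = new ArrayList<>();
        long span = end - start;
        long prev = start;
        for (int i = 1; i <= num; i++) {
            // Deriving every boundary from the total span spreads the remainder evenly.
            long next = start + span * i / num;
            if (next > prev) { // skip empty segments when span < num
                segments.add(new long[] {prev, next});
                prev = next;
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // 2013-01-01 00:00 UTC to 2013-01-02 00:00 UTC as epoch seconds, in 4 segments.
        for (long[] s : split(1356998400L, 1357084800L, 4)) {
            System.out.println(s[0] + " -> " + s[1]);
        }
    }
}
```

Each segment could then be passed on as its own scan task with String.valueOf(segment[0]) and String.valueOf(segment[1]) as start and end rowkey.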
Demo: scan all data in the gm_player_detail table for January 1-10, 2013.
The code is as follows:
    package joyport.hbase.gm;

    import java.util.HashMap;

    public class Test {
        private static int threadNum = 30;
        private static int taskNum = 1000;
        private static String htable = "gm_player_detail";

        public static void main(String[] args) throws Exception {
            // Two dates: explicit range. One date: from that date to the end of yesterday.
            // No arguments: yesterday only.
            int[] time = new int[2];
            if (args.length == 2) {
                time = Util.getTimeScale(args[0], args[1]);
            } else if (args.length == 1) {
                time = Util.getTimeScale(-1);
                time[0] = Util.getTimeScale(args[0], args[0])[0];
            } else {
                time = Util.getTimeScale(-1);
            }
            // null columns: scan all columns.
            HScan hScan = new HScan(htable, null, threadNum);
            hScan.addTask(String.valueOf(time[0]), String.valueOf(time[1]), taskNum);
            hScan.fuck();
            Test test = new Test();
            test.analyze(hScan);
        }

        public void analyze(HScan hScan) throws InterruptedException {
            HashMap<String, String> row = null;
            // fetch() returns null once all rows have been read.
            for (row = hScan.fetch(); row != null; row = hScan.fetch()) {
                hScan.status();
                // System.out.println(row);
            }
        }
    }
Util.getTimeScale(-1) returns the start and end timestamps of yesterday.
Util.getTimeScale(args[0], args[1]) returns the start and end timestamps for a date range given in yyyy-MM-dd format. Sub-second time arithmetic in Java can make a range spill across a day boundary; if this function is always used to compute the range, consecutive ranges have neither gaps nor overlaps.
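The Util class itself is not shown in this article. As a rough illustration of what such a helper might do, the sketch below computes day-aligned ranges with java.time. TimeScaleSketch, its explicit ZoneId parameter, the use of epoch seconds, and the choice of an exclusive end (start of the day after the end date) are all assumptions made for this example, not the author's implementation.

```java
import java.time.LocalDate;
import java.time.ZoneId;

// Hypothetical stand-in for Util.getTimeScale; the real Util is not shown in the article.
class TimeScaleSketch {

    /** Returns {start of startDate, start of the day after endDate} as epoch seconds. */
    static int[] getTimeScale(String startDate, String endDate, ZoneId zone) {
        LocalDate start = LocalDate.parse(startDate); // ISO format, yyyy-MM-dd
        LocalDate end = LocalDate.parse(endDate);
        long from = start.atStartOfDay(zone).toEpochSecond();
        // Working on calendar days instead of adding 86400 keeps ranges day-aligned.
        long to = end.plusDays(1).atStartOfDay(zone).toEpochSecond();
        return new int[] {Math.toIntExact(from), Math.toIntExact(to)};
    }

    /** Range covering the single day that is dayOffset days from today (-1 = yesterday). */
    static int[] getTimeScale(int dayOffset, ZoneId zone) {
        String day = LocalDate.now(zone).plusDays(dayOffset).toString();
        return getTimeScale(day, day, zone);
    }

    public static void main(String[] args) {
        int[] range = getTimeScale("2013-01-01", "2013-01-10", ZoneId.of("UTC"));
        System.out.println(range[0] + " " + range[1]); // prints "1356998400 1357862400"
    }
}
```

Because every boundary is the start of a calendar day, the end of one range equals the start of the next, so adjacent ranges neither overlap nor leave a gap.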
Compile and run
java joyport.hbase.gm.Test 2013-01-01 2013-01-10
HGet
The get operation is still very useful. For example, suppose you need the top n records of each day but each record has many fields. You can first compute the top n using only the fields the ranking needs, then fetch the full rows by rowkey with get. This performs much better than computing directly on the full rows. Because the rowkeys to get are held in memory, keep an eye on memory usage when issuing a large number of gets. The public methods are similar to those of HScan, so the code should be self-explanatory.
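One way to keep memory bounded during the first pass is to retain only the n best rowkeys in a small min-heap. The sketch below is an illustration under assumptions: TopNRowKeys and the numeric score are hypothetical names, and how the score is derived from a scanned row depends on your table.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper: keeps only the n highest-scoring rowkeys seen so far.
class TopNRowKeys {

    /** A scanned row reduced to what the ranking needs. */
    private static final class Entry {
        final String rowKey;
        final long score;

        Entry(String rowKey, long score) {
            this.rowKey = rowKey;
            this.score = score;
        }
    }

    private final int n;
    // Min-heap on score: the weakest retained entry sits at the head.
    private final PriorityQueue<Entry> heap =
            new PriorityQueue<>(Comparator.comparingLong((Entry e) -> e.score));

    TopNRowKeys(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        this.n = n;
    }

    /** Offers one row; memory use never exceeds n entries. */
    void offer(String rowKey, long score) {
        if (heap.size() < n) {
            heap.add(new Entry(rowKey, score));
        } else if (score > heap.peek().score) {
            heap.poll();
            heap.add(new Entry(rowKey, score));
        }
    }

    /** Rowkeys of the retained entries, best score first. */
    List<String> rowKeysDescending() {
        List<Entry> sorted = new ArrayList<>(heap);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.score).reversed());
        List<String> keys = new ArrayList<>();
        for (Entry e : sorted) {
            keys.add(e.rowKey);
        }
        return keys;
    }
}
```

The resulting rowkeys are the only ones that then need to be handed to the get stage.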
Demo: scan only the rowkeys, then fetch all columns for each rowkey with get.
The code is as follows:
    package joyport.hbase.gm;

    import java.util.HashMap;

    public class Test {
        private static int threadNum = 30;
        private static int taskNum = 1000;
        private static String htable = "gm_player_detail";

        public static void main(String[] args) throws Exception {
            int[] time = new int[2];
            if (args.length == 2) {
                time = Util.getTimeScale(args[0], args[1]);
            } else if (args.length == 1) {
                time = Util.getTimeScale(-1);
                time[0] = Util.getTimeScale(args[0], args[0])[0];
            } else {
                time = Util.getTimeScale(-1);
            }
            // Empty column array: scan rowkeys only.
            String[] cols = {};
            HScan hScan = new HScan(htable, cols, threadNum);
            hScan.addTask(String.valueOf(time[0]), String.valueOf(time[1]), taskNum);
            hScan.fuck();
            Test test = new Test();
            test.analyze(hScan);
        }

        public void analyze(HScan hScan) throws Exception {
            HashMap<String, String> row = null;
            HGet hGet = new HGet(htable, null, 30);
            // Collect every rowkey first; all get tasks must be added before starting.
            for (row = hScan.fetch(); row != null; row = hScan.fetch()) {
                hScan.status();
                // System.out.println(row);
                hGet.addTask(row.get("rowKey"));
            }
            hGet.fuck();
            for (row = hGet.fetch(); row != null; row = hGet.fetch()) {
                hScan.status();
                // System.out.println(row);
            }
        }
    }
HPut
HPut differs from HScan and HGet: a put can draw on multiple data sources of arbitrary type (text, database, HBase table, stream), so the data source threads are encapsulated as well. The number of data source threads is up to the user.
1. public HPut(String table, int threadNum). threadNum is the number of writer threads.
2. public void addTask(final Callable<HashMap<String, String>> task). task is a callback that returns one row per call. Each call to addTask starts a new data source thread; the thread ends when the callback returns null.
3. public void enableStatus(boolean enable). A status display like the one in HScan is needed here too, but showing it from the user program would be quite inaccurate, so it is shown inside the put operation. This method controls whether the put status is displayed; it is displayed by default.
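To make the data source contract concrete, here is a minimal sketch of a Callable that turns tab-separated text lines into rows and returns null when the input is exhausted. LineSource, the tab-separated layout, and the example column names are assumptions for illustration; the row layout (a "rowKey" entry plus one entry per column) is inferred from the way HScan.fetch() results are fed to HPut in the demo.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.concurrent.Callable;

// Hypothetical data source: each input line is "rowKey<TAB>value1<TAB>value2...".
class LineSource implements Callable<HashMap<String, String>> {
    private final Iterator<String> lines;
    private final String[] columns;

    LineSource(Iterator<String> lines, String... columns) {
        this.lines = lines;
        this.columns = columns;
    }

    @Override
    public HashMap<String, String> call() {
        if (!lines.hasNext()) {
            return null; // null signals that this data source is finished
        }
        String[] fields = lines.next().split("\t", -1);
        HashMap<String, String> row = new HashMap<>();
        row.put("rowKey", fields[0]);
        // Pair the remaining fields with the configured column names, in order.
        for (int i = 0; i < columns.length && i + 1 < fields.length; i++) {
            row.put(columns[i], fields[i + 1]);
        }
        return row;
    }

    public static void main(String[] args) {
        Iterator<String> input = Arrays.asList("k1\t10\t200", "k2\t11\t150").iterator();
        LineSource source = new LineSource(input, "cf:level", "cf:gold");
        for (HashMap<String, String> row = source.call(); row != null; row = source.call()) {
            System.out.println(row);
        }
    }
}
```

An instance like this would be passed to addTask; registering several instances would give several data source threads.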
Demo: only one data source is shown. Multiple data sources have not been tested (in theory they should work). The demo reads one day of data from gm_player_detail and inserts it into the hbase_test table.
The code is as follows:
    package joyport.hbase.gm;

    import java.util.HashMap;
    import java.util.concurrent.Callable;

    public class Test extends Thread implements Callable<HashMap<String, String>> {
        private int threadNum = 30;
        private int taskNum = 1000;
        private String htable = "gm_player_detail";
        private HScan hScan;

        public static void main(String[] args) throws Exception {
            int[] time = new int[2];
            if (args.length == 2) {
                time = Util.getTimeScale(args[0], args[1]);
            } else if (args.length == 1) {
                time = Util.getTimeScale(-1);
                time[0] = Util.getTimeScale(args[0], args[0])[0];
            } else {
                time = Util.getTimeScale(-1);
            }
            Test t1 = new Test(time[0], time[1]);
            // Write data to hbase_test
            HPut hPut = new HPut("hbase_test", 10);
            hPut.addTask(t1);
            hPut.fuck();
        }

        public Test(int startkey, int endkey) throws Exception {
            // null columns: scan all columns of the source table.
            String[] cols = null;
            hScan = new HScan(htable, cols, threadNum);
            hScan.addTask(String.valueOf(startkey), String.valueOf(endkey), taskNum);
            hScan.fuck();
        }

        // Data source callback: one scanned row per call, null when the scan is done.
        public HashMap<String, String> call() throws InterruptedException {
            return hScan.fetch();
        }
    }
HDelete
The delete operation only needs the rowkey, but the data structure of the buffer is fixed, so HashMap<String, String> is used here as well, with "rowKey" as the key. If row.isEmpty() is true, the row is skipped.
Demo: read all rows from hbase_test and delete them.
The code is as follows:
    package joyport.hbase.gm;

    import java.util.HashMap;
    import java.util.concurrent.Callable;

    public class Test implements Callable<HashMap<String, String>> {
        private static int threadNum = 30;
        private static String htable = "hbase_test";
        private static HScan hScan;

        public static void main(String[] args) throws Exception {
            // Empty column array: only rowkeys are needed for deletion.
            String[] cols = {};
            hScan = new HScan(htable, cols, threadNum);
            hScan.addTask("0", "2", 1);
            hScan.fuck();
            HDelete h = new HDelete(htable, 10);
            h.addTask(new Test());
            h.fuck();
        }

        // Data source callback: one scanned row per call, null when the scan is done.
        public HashMap<String, String> call() throws Exception {
            return hScan.fetch();
        }
    }
Notes:
■ The default buffer size is 10000. The number of threads in the calling program (not the concurrency set in the constructor) must not exceed this value; otherwise threads may block in wait() indefinitely.
■ The HScan task pool does not deduplicate: it detects neither overlapping start/end rowkey ranges between two tasks nor the same task added twice. The caller must make sure tasks do not overlap to avoid repeated scans. The same applies to HGet tasks.
■ HScan and HGet tasks must all be added before starting, because threadStart() closes the buffer as soon as it determines that all tasks are complete, and tasks added afterwards are lost.
■ These classes only handle multithreading at the Java level. The internal synchronization of the HBase client is outside their scope.
■ Config.java reads the config.conf file in the same directory; set the ZooKeeper quorum and port there.
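Since tasks are not deduplicated internally, overlapping ranges can be merged on the caller's side before they are added. The sketch below is one possible approach, not part of the library: TaskRanges is a hypothetical helper, ranges are treated as half-open [start, end), and rowkeys are assumed to be ASCII strings so that String.compareTo agrees with HBase's byte-wise ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical caller-side helper for removing overlap between scan tasks.
class TaskRanges {

    /** Merges overlapping, touching, or duplicate ranges; each range is {start, end}. */
    static List<String[]> merge(List<String[]> ranges) {
        List<String[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparing((String[] r) -> r[0]));
        List<String[]> merged = new ArrayList<>();
        for (String[] r : sorted) {
            String[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0].compareTo(last[1]) <= 0) {
                // Starts inside (or right at the end of) the previous range: extend it.
                if (r[1].compareTo(last[1]) > 0) {
                    last[1] = r[1];
                }
            } else {
                merged.add(new String[] {r[0], r[1]}); // copy so inputs stay untouched
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String[]> tasks = new ArrayList<>();
        tasks.add(new String[] {"100", "200"});
        tasks.add(new String[] {"150", "300"});
        tasks.add(new String[] {"400", "500"});
        for (String[] r : merge(tasks)) {
            System.out.println(r[0] + " -> " + r[1]);
        }
    }
}
```

Each merged range can then be added as a task exactly once.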
Possible improvements:
■ The element type of HBuffer is HashMap<String, String>, mapping column names to values. An alternative is to store only the values in an array: introduce an HRow class to replace HashMap<String, String> as one row of data, keep the column names as a static attribute of HRow so that only one copy exists, and let HRow maintain the mapping between column names and values. This avoids storing the same column names over and over and reduces GC pressure, which noticeably improves memory usage when large amounts of data are processed in memory. The CPU cost of maintaining the mapping in HRow is negligible.
■ The HBase read threads and the program's reading threads share one buffer, so producers and consumers use the same object monitor. In this situation notify() is not optimally efficient, and throughput may drop slightly when the speeds are uneven. There is no solution at present.
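The HRow idea from the first item could look roughly like the sketch below. It is an illustration only: the class and method names are invented here, and instead of a static attribute it uses a shared Schema object referenced by every row, a small variation that still stores the column names once but also allows different tables to have different column sets.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed HRow: rows store values only,
// column names live once in a shared Schema.
class HRowSketch {

    /** Column names and their positions, shared by all rows of one table. */
    static final class Schema {
        private final String[] names;
        private final Map<String, Integer> index = new HashMap<>();

        Schema(String... names) {
            this.names = names.clone();
            for (int i = 0; i < this.names.length; i++) {
                index.put(this.names[i], i);
            }
        }

        int indexOf(String column) {
            Integer i = index.get(column);
            return i == null ? -1 : i;
        }

        int size() {
            return names.length;
        }
    }

    /** One row: a value array plus a reference to the shared schema. */
    static final class HRow {
        private final Schema schema;
        private final String[] values;

        HRow(Schema schema) {
            this.schema = schema;
            this.values = new String[schema.size()];
        }

        void set(String column, String value) {
            int i = schema.indexOf(column);
            if (i < 0) {
                throw new IllegalArgumentException("unknown column: " + column);
            }
            values[i] = value;
        }

        /** Returns the value, or null if the column is unknown or unset. */
        String get(String column) {
            int i = schema.indexOf(column);
            return i < 0 ? null : values[i];
        }
    }
}
```

Per row, this keeps one array and one reference instead of a hash table with its own entry objects, which is where the memory and GC savings described above would come from.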