A GitHub project was recently completed: storm-hbase, which combines Twitter Storm and Apache HBase. It uses an HBase cluster as the data source for a Storm spout. It is currently only a preliminary implementation and will be improved further in the future.
The hbasespout continuously reads stream data from the HBase cluster based on the timestamp range [start_timestamp, stop_timestamp]:
- If start_timestamp = 0, hbasespout by default reads data starting from 3 minutes ago and sends it to the Storm cluster; otherwise, it reads data starting from the user-specified start_timestamp.
- If stop_timestamp = 0, hbasespout by default reads data up to the current time, then keeps reading new data as it arrives and sends it to the Storm cluster; otherwise, it reads up to the specified stop_timestamp and then stops.
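The two defaulting rules above can be sketched in a few lines of Java. This is only an illustration of the rules as described here, not code taken from the project: the class `TimestampRange`, its fields, and the `resolve` method are hypothetical names, and the current time is passed in explicitly so the logic is easy to check.

```java
import java.time.Instant;

/** Sketch of the [start_timestamp, stop_timestamp] defaults (all names are hypothetical). */
final class TimestampRange {
    /** Look-back applied when start_timestamp is 0: three minutes, in seconds. */
    static final int DEFAULT_LOOKBACK_SECONDS = 3 * 60;

    final int start;          // lower bound of the range, UNIX seconds
    final int stop;           // upper bound of the first scan, UNIX seconds
    final boolean unbounded;  // true when the spout should keep following new data

    private TimestampRange(int start, int stop, boolean unbounded) {
        this.start = start;
        this.stop = stop;
        this.unbounded = unbounded;
    }

    /** Applies the defaults: 0 means "3 minutes ago" for start and "no end" for stop. */
    static TimestampRange resolve(int startTimestamp, int stopTimestamp, int nowSeconds) {
        int start = (startTimestamp == 0)
                ? nowSeconds - DEFAULT_LOOKBACK_SECONDS
                : startTimestamp;
        boolean unbounded = (stopTimestamp == 0);
        // In unbounded mode the first scan ends at "now"; later scans would move forward.
        int stop = unbounded ? nowSeconds : stopTimestamp;
        return new TimestampRange(start, stop, unbounded);
    }

    public static void main(String[] args) {
        int now = (int) Instant.now().getEpochSecond();
        TimestampRange live = resolve(0, 0, now);                      // real-time mode
        TimestampRange replay = resolve(1400000000, 1400003600, now);  // fixed-window replay
        System.out.println("live:   start=" + live.start + " unbounded=" + live.unbounded);
        System.out.println("replay: start=" + replay.start + " stop=" + replay.stop);
    }
}
```

In this sketch a range with `unbounded` set to true corresponds to the continuous mode, and a range with both bounds given corresponds to a one-off read that ends at stop_timestamp.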
This design of [start_timestamp, stop_timestamp] accommodates different running modes:
- In the most typical case, start_timestamp = 0 and stop_timestamp = 0: hbasespout reads and sends the data from the last three minutes, then keeps scanning the HBase cluster for new data and sends it to the Storm cluster. This suits real-time computing scenarios.
- If a problem occurs, for example the Storm cluster restarts and the state of the computing task is lost, the spout may need to rewind and replay data. Specifying an explicit [start_timestamp, stop_timestamp] meets this requirement.
Storm-hbase aims to be general-purpose, so the Storm and HBase configuration is kept in separate files. The storm-hbase configuration options are in src/main/resources/storm.properties and src/main/resources/hbase.properties in the GitHub project. If the schema of your HBase table matches the assumptions listed below, storm-hbase can be used after simple configuration.
The current implementation of hbasespout is based on the following assumptions:
- The rowkey format of the hbase table is [shardingkey, timestamp,...];
- The sharding key occupies the 1st byte and identifies the data partition within the table. The number of partitions is generally less than 100, so a small one-byte value is sufficient to store it;
- The timestamp occupies the 2nd through 5th bytes and holds the timestamp of the data. It is a UNIX timestamp in seconds, stored as a 4-byte INT.
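To make the assumed layout concrete, the following sketch builds and parses the 5-byte rowkey prefix using only the JDK. The class `RowKeys` and its method names are hypothetical and are not taken from the project. Big-endian byte order is an assumption here; it matches what HBase's `Bytes.toBytes(int)` produces and keeps rowkeys within one shard sorted by time for non-negative timestamps.

```java
import java.nio.ByteBuffer;

/** Sketch of the assumed rowkey prefix: 1-byte sharding key + 4-byte UNIX timestamp. */
final class RowKeys {
    /** Length of the prefix: 1 byte for the sharding key plus 4 bytes for the timestamp. */
    static final int PREFIX_LENGTH = 1 + Integer.BYTES;

    /** Builds the prefix; ByteBuffer writes the int big-endian by default. */
    static byte[] prefix(byte shardingKey, int timestampSeconds) {
        return ByteBuffer.allocate(PREFIX_LENGTH)
                .put(shardingKey)
                .putInt(timestampSeconds)
                .array();
    }

    /** Reads the sharding key from the 1st byte of a rowkey. */
    static byte shardingKey(byte[] rowKey) {
        return rowKey[0];
    }

    /** Reads the timestamp (seconds) from the 2nd through 5th bytes of a rowkey. */
    static int timestamp(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey, 1, Integer.BYTES).getInt();
    }

    public static void main(String[] args) {
        byte[] key = prefix((byte) 7, 1400000000);
        System.out.println("length=" + key.length
                + " shard=" + shardingKey(key)
                + " timestamp=" + timestamp(key));
    }
}
```

Under these assumptions, a reader could scan one shard for a time window by using `prefix(shard, start_timestamp)` and `prefix(shard, stop_timestamp)` as the start and stop rows, repeating the scan for each sharding key.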
For more information about storm-hbase and its progress, see the project on GitHub: https://github.com/ypf412/storm-hbase