Introduction to the storm-hbase GitHub Project

Source: Internet
Author: User

A GitHub project was recently completed: storm-hbase, which combines Twitter Storm and Apache HBase. It uses an HBase cluster as the data source for a Storm spout. It is currently only a preliminary implementation and will be improved further.

The hbasespout continuously reads stream data from the HBase cluster based on the timestamp range [start_timestamp, stop_timestamp]:

    • If start_timestamp = 0, hbasespout by default starts reading from 3 minutes before the current time and sends that data to the Storm cluster; otherwise, it starts reading from the user-specified start_timestamp.
    • If stop_timestamp = 0, hbasespout by default reads up to the current time, then keeps reading new data as it arrives and sends it to the Storm cluster; otherwise, it reads up to the specified stop_timestamp and then stops.
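
The default handling of the two timestamps can be sketched in a few lines of Python. This is a minimal illustration, not code from the project: the function name `resolve_range`, the `follow` flag, and the use of UNIX seconds for both arguments are assumptions made for the example.

```python
import time

# "3 minutes ago" default described in the text, expressed in seconds.
DEFAULT_LOOKBACK_SECONDS = 3 * 60


def resolve_range(start_timestamp, stop_timestamp, now=None):
    """Return (start, stop, follow) in UNIX seconds.

    follow is True when stop_timestamp == 0, meaning the reader should
    keep going past `stop` as new data arrives.
    """
    if now is None:
        now = int(time.time())
    if start_timestamp == 0:
        start = now - DEFAULT_LOOKBACK_SECONDS
    else:
        start = start_timestamp
    follow = stop_timestamp == 0
    stop = now if follow else stop_timestamp
    return start, stop, follow
```

Passing `now` explicitly keeps the function deterministic, which makes the four combinations of zero and non-zero arguments easy to check.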

The [start_timestamp, stop_timestamp] design accommodates different running modes:

    • In the most typical case, start_timestamp = 0 and stop_timestamp = 0: hbasespout reads and sends the data from the last three minutes, then keeps scanning the HBase cluster for new data and sends it to the Storm cluster. This suits real-time computing scenarios.
    • If a problem occurs, for example the Storm cluster restarts and the state of the computing task is lost, you may need the spout to rewind and replay data. Specifying an explicit [start_timestamp, stop_timestamp] meets this requirement.
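
The two running modes can be pictured as one loop that walks forward through time in small windows. The sketch below is a hypothetical illustration, not the project's implementation: `scan_windows`, the `clock` callable, the `step` size, and the half-open window convention are all assumptions, and `start` is expected to be already resolved to a non-zero value.

```python
def scan_windows(start, stop, clock, step=60):
    """Yield consecutive half-open (lo, hi) windows in UNIX seconds.

    stop != 0: bounded replay; the generator ends once `stop` is reached.
    stop == 0: real-time mode; the generator never ends and follows clock().
    """
    lo = start
    while True:
        limit = stop if stop != 0 else clock()
        hi = min(lo + step, limit)
        if hi <= lo:
            if stop != 0:
                return      # replay mode: requested range is exhausted
            yield (lo, lo)  # real-time mode: nothing new yet, caller may sleep
            continue
        yield (lo, hi)
        lo = hi
```

In replay mode, `list(scan_windows(100, 250, clock=time.time))` covers the range in three windows and stops. In real-time mode the caller would scan each window, emit the rows, and pause briefly whenever an empty window comes back.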

storm-hbase aims to be as general as possible, so the Storm and HBase settings are extracted into configuration files. The storm-hbase configuration options are in src/main/resources/storm.properties and src/main/resources/hbase.properties in the GitHub project. If the schema of your HBase table resembles the one described below, storm-hbase can be used after simple configuration.

The current implementation of hbasespout is based on the following assumptions:

    • The rowkey format of the HBase table is [shardingkey, timestamp, ...];
    • shardingkey occupies the 1st byte and indicates the data partition number in the table, which is generally less than 100. Therefore, a short type is used for storage here;
    • timestamp occupies the 2nd to 5th bytes and holds the timestamp of the data. It is a UNIX timestamp in seconds, stored as an INT type.
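
The rowkey layout above can be sketched with Python's `struct` module. This is an illustration under stated assumptions rather than the project's code: the helper names are hypothetical, the sharding key is written as a single unsigned byte to match the byte positions given above, and big-endian order is assumed because it keeps the byte-wise ordering of keys consistent with the numeric ordering of timestamps.

```python
import struct

# Assumed layout: 1 unsigned byte for shardingkey, then a 4-byte big-endian
# unsigned integer UNIX timestamp in seconds. Remaining bytes are opaque.
_PREFIX = struct.Struct(">BI")


def make_rowkey(sharding_key, timestamp, suffix=b""):
    """Build a rowkey of the form [shardingkey, timestamp, ...]."""
    return _PREFIX.pack(sharding_key, timestamp) + suffix


def parse_rowkey(rowkey):
    """Split a rowkey into (shardingkey, timestamp, remaining bytes)."""
    sharding_key, timestamp = _PREFIX.unpack_from(rowkey)
    return sharding_key, timestamp, rowkey[_PREFIX.size:]


def scan_bounds(sharding_key, start_ts, stop_ts):
    """Return (start_row, stop_row) prefixes for one shard and time range."""
    return make_rowkey(sharding_key, start_ts), make_rowkey(sharding_key, stop_ts)
```

With this layout, every row of a shard whose timestamp falls in [start_ts, stop_ts) sorts between the two prefixes returned by `scan_bounds`, so a time range becomes one contiguous scan per shard.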

For more information about storm-hbase and its progress, see the project on GitHub: https://github.com/ypf412/storm-hbase
