Rainbird: a distributed real-time statistics system


Recently, Twitter developed a distributed real-time statistics system, Rainbird.


Usage

Rainbird can be used for real-time data statistics:

1. Count the number of clicks on each page and each domain name of a website.

2. Monitor internal systems (collect statistics on the running status of monitored servers).

3. Record maximum and minimum values.


Performance Requirements

As a distributed application serving large websites, Rainbird has to meet the following performance requirements:

1. High write performance: up to 100,000 writes per second (WPS).

2. High read performance: tens of thousands of reads per second (RPS).

3. High scalability for both reads and storage, able to grow to 100+ TB.

4. Low read latency: the vast majority of reads should complete within 100 ms.


System Components

Rainbird is a distributed real-time statistics system built on ZooKeeper, Cassandra, Scribe, and Thrift. The basic functions of these components are as follows:

1. ZooKeeper is a distributed coordination system that started as a Hadoop subproject. It is used to keep the components of a distributed system consistent with one another.

2. Cassandra is a NoSQL database that combines ideas from the Dynamo and Bigtable distributed storage systems. It stores both the data to be counted and the resulting statistics, and it serves the statistics to querying clients. (The distributed counter patch CASSANDRA-1072 is required.)

3. Scribe is Facebook's open-source distributed log collection system. It collects data from the various sources so that it can be stored in Cassandra.

4. Thrift is Facebook's open-source cross-language client/server communication framework. Developers can easily build client/server applications on top of it.


Overall Design

The design architecture of Rainbird is as follows:

ZooKeeper is responsible for coordination and fault tolerance among the components of the Rainbird system, and Cassandra is responsible for storing and counting all of the data.

Scribe is deployed on the front end to collect the data that needs to be counted, and it sends the collected data to the Rainbird aggregator in real time.

The Rainbird aggregator buffers the collected data (1 MB), pre-processes the buffered data, and then writes it to Cassandra in a single batch. This pre-processing plays the same role as a combiner in the MapReduce framework: a partial reduce is performed on the mapper side before the data is sent on.

Rainbird query accepts the user's query request, reads the statistics directly from Cassandra, and returns them to the client.
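The combiner-style pre-processing can be illustrated with a minimal Python sketch. This is not Rainbird's actual code: the class name `Aggregator`, the `flush_batch` callback (standing in for the batched write to Cassandra), and the event-count threshold (used here instead of a buffer size in bytes) are all assumptions made for illustration.

```python
from collections import Counter


class Aggregator:
    """Buffers counter increments in memory and flushes them as one batch."""

    def __init__(self, flush_batch, max_buffered_events=1000):
        # flush_batch is a hypothetical callback that stands in for the
        # batched write to Cassandra; it receives a dict of key -> increment.
        self.flush_batch = flush_batch
        self.max_buffered_events = max_buffered_events
        self.pending = Counter()
        self.buffered_events = 0

    def add(self, key, amount=1):
        # Combine increments for the same key before anything is written,
        # much like a MapReduce combiner does on the mapper side.
        self.pending[key] += amount
        self.buffered_events += 1
        if self.buffered_events >= self.max_buffered_events:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        batch = dict(self.pending)
        self.pending.clear()
        self.buffered_events = 0
        self.flush_batch(batch)


# Usage: four incoming events are collapsed into a single batch of two keys.
written = []
agg = Aggregator(flush_batch=written.append, max_buffered_events=4)
for key in ["com", "com", "org", "com"]:
    agg.add(key)
```

In this sketch the four events result in one write that carries two combined increments, which is the reason for pre-aggregating: the storage layer sees far fewer writes than there are raw events.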


Page URL statistics example

How can we use Rainbird to collect statistics on website page clicks?

As an example, take the following page of this blog: http://www.cnblogs.com/gpcuster/tag/Cassandra/

We can split this URL into the following four parts:

com

cnblogs

www

http://www.cnblogs.com/gpcuster/tag/Cassandra/

The split parts are then combined into the following keys:

com, cnblogs, www, http://www.cnblogs.com/gpcuster/tag/Cassandra/

com, cnblogs, www

com, cnblogs

com

Finally, write the data for each of these keys into Cassandra. This completes the counting process.
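The key construction described above can be sketched in Python as follows. The function name `hierarchical_keys` and the choice to represent each key as a tuple are assumptions made for illustration; the article does not say how Rainbird encodes its keys.

```python
from urllib.parse import urlsplit


def hierarchical_keys(url):
    """Return the keys for a URL, from most specific to least specific."""
    host = urlsplit(url).hostname      # "www.cnblogs.com"
    parts = host.split(".")[::-1]      # ["com", "cnblogs", "www"]
    levels = parts + [url]             # the full URL is the deepest level
    # Every prefix of the levels is one key, longest prefix first.
    return [tuple(levels[:n]) for n in range(len(levels), 0, -1)]


keys = hierarchical_keys("http://www.cnblogs.com/gpcuster/tag/Cassandra/")
for key in keys:
    print(", ".join(key))
```

Each page view would then increment the counter of every key in the list, so that totals are available at the page, host, domain, and top-level-domain levels.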

Queries work on the same keys.

To find out how many times pages under http://www.cnblogs.com were accessed, you only need to query the value of the key com, cnblogs, www in Cassandra.

To find out how many times any host under cnblogs.com (http://*.cnblogs.com) was accessed, you can perform a similar query, presumably on the key com, cnblogs.
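A query of this kind amounts to building the key for a host or domain and reading one counter. The sketch below uses a plain dictionary as a stand-in for Cassandra; the helper names `key_for_host` and `lookup` are hypothetical, the counter values are made-up sample data, and the use of the key com, cnblogs for the wildcard query is an assumption based on the key scheme above.

```python
def key_for_host(host):
    """Build the counter key for a host name or a domain suffix."""
    return tuple(reversed(host.lower().split(".")))


# Made-up sample counters standing in for values stored in Cassandra.
counters = {
    ("com",): 120,
    ("com", "cnblogs"): 45,
    ("com", "cnblogs", "www"): 30,
}


def lookup(host):
    # A key that was never written is treated as a count of zero.
    return counters.get(key_for_host(host), 0)


www_hits = lookup("www.cnblogs.com")   # all pages under www.cnblogs.com
domain_hits = lookup("cnblogs.com")    # every host under cnblogs.com
```

Because the totals for each level are maintained at write time, a query at any level is a single key lookup rather than a scan over individual page counters.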

 

More references

For more details, see: http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011

For more information about Cassandra, see: http://www.cnblogs.com/gpcuster/tag/Cassandra/

For more information about ZooKeeper, see: http://www.cnblogs.com/gpcuster/tag/ZooKeeper/
