Recently, Twitter developed a distributed real-time statistics system, Rainbird.
Usage
Rainbird can be used for real-time data statistics:
1. Counting the number of clicks on each page and domain name of a website
2. Monitoring internal systems (tracking the running status of monitored servers)
3. Recording maximum and minimum values
Performance Requirements
As a distributed application serving large websites, Rainbird must meet the following performance requirements:
1. High write performance, up to 100,000 writes per second (WPS)
2. High read performance, in the tens of thousands of reads per second (RPS)
3. High scalability for both reads and storage, expandable to 100+ TB
4. Low read latency: the vast majority of reads should complete within 100 ms
System Components
Rainbird is a distributed real-time statistics system built on ZooKeeper, Cassandra, Scribe, and Thrift. The basic functions of these components are as follows:
1. ZooKeeper: a distributed coordination system and a Hadoop subproject. It is used to keep the various components of the distributed system consistent.
2. Cassandra: a NoSQL database that combines ideas from the Dynamo and Bigtable distributed storage systems. It stores both the data to be counted and the resulting statistics, and serves client queries for those statistics. (The distributed counter patch CASSANDRA-1072 is required.)
3. Scribe: Facebook's open-source distributed log collection system. It collects data from the various sources for delivery to Cassandra.
4. Thrift: Facebook's open-source cross-language client/server (C/S) network communication framework, on which developers can easily build C/S applications.
Overall Design
The design architecture of Rainbird is as follows:
ZooKeeper is responsible for coordination and fault tolerance among the components of the entire Rainbird system, and Cassandra is responsible for storing all of the data and statistics.
Scribe is deployed on the front end to collect the data to be counted, and sends the collected data to the Rainbird aggregator in real time.
The Rainbird aggregator caches the collected data (1 MB), pre-processes the cached data, and then writes it to Cassandra in a single batch. The pre-processing plays a role similar to the combiner in the MapReduce framework: a partial reduce is performed on the mapper side before the data is passed on.
Rainbird query accepts the user's query request, reads the statistics directly from Cassandra, and returns them to the client.
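The combiner-like pre-processing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Rainbird's actual code: the class name BatchingAggregator, the write_batch callback (a stand-in for a batched counter write to Cassandra), and the way the buffered size is estimated from key lengths are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict


class BatchingAggregator:
    """Hypothetical in-memory sketch of a combiner-style aggregator."""

    def __init__(
        self,
        write_batch: Callable[[Dict[str, int]], None],
        max_buffered_bytes: int = 1024 * 1024,  # 1 MB, as in the article
    ) -> None:
        # write_batch is an assumed callback standing in for a batched
        # counter update against the storage layer.
        self._write_batch = write_batch
        self._max_buffered_bytes = max_buffered_bytes
        self._pending: Counter = Counter()
        self._buffered_bytes = 0

    def record(self, key: str, count: int = 1) -> None:
        # Combine events for the same key locally, so that many events
        # collapse into one increment per key.
        self._pending[key] += count
        # Rough estimate of how much raw event data has been buffered.
        self._buffered_bytes += len(key.encode("utf-8"))
        if self._buffered_bytes >= self._max_buffered_bytes:
            self.flush()

    def flush(self) -> None:
        # Write all pending increments in one batch, then reset the buffer.
        if self._pending:
            self._write_batch(dict(self._pending))
        self._pending.clear()
        self._buffered_bytes = 0
```

With this sketch, a thousand click events for the same key reach the storage layer as a single increment of 1000 rather than a thousand separate writes, which is the point of combining before writing.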
Page url statistics example
How can we use Rainbird to collect statistics on website page clicks?
Suppose the page being counted is this blog's article page: http://www.cnblogs.com/gpcuster/tag/Cassandra/
We can split this URL into the following four parts:
com
cnblogs
www
http://www.cnblogs.com/gpcuster/tag/Cassandra/
The split parts are then combined into the following keys:
com, cnblogs, www, http://www.cnblogs.com/gpcuster/tag/Cassandra/
com, cnblogs, www
com, cnblogs
com
Finally, write the data of these keys into Cassandra. This completes the entire statistical process.
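The splitting and combining steps above can be sketched in Python as follows. The function name url_to_keys and the plain "," separator are assumptions made for illustration; the article does not say how Rainbird actually encodes its keys.

```python
from typing import List
from urllib.parse import urlsplit


def url_to_keys(url: str, sep: str = ",") -> List[str]:
    """Build hierarchical counter keys for a URL, most specific first.

    Hypothetical helper: the key layout follows the article's example,
    but the separator and function name are illustrative assumptions.
    """
    # urlsplit(...).hostname is lowercased, e.g. "www.cnblogs.com".
    host = urlsplit(url).hostname or ""
    # Reverse the domain labels: ["com", "cnblogs", "www"].
    parts = [label for label in reversed(host.split(".")) if label]
    # One key per domain level: "com", "com,cnblogs", "com,cnblogs,www".
    prefixes = [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]
    # The most specific key also carries the full URL.
    keys = prefixes + [sep.join(parts + [url])]
    return list(reversed(keys))


for key in url_to_keys("http://www.cnblogs.com/gpcuster/tag/Cassandra/"):
    print(key)
```

Each page view then increments the counter for every key in the list, so the totals for the page, the host, the domain, and the top-level domain are all kept up to date by a single event.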
Querying the statistics for a page is just as simple.
To find out how many times http://www.cnblogs.com was accessed, you only need to query the value of the key com, cnblogs, www in Cassandra.
To find out how many times http://*.cnblogs.com was accessed, you can perform a similar query, this time on the key com, cnblogs.
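As a rough sketch of the query side, the snippet below turns a host or wildcard pattern into a reversed-domain key and looks it up. The names pattern_to_key and query_count, the "," separator, and the plain dictionary standing in for Cassandra's counters are all assumptions made for illustration, and the counts in the usage example are made up.

```python
from typing import Mapping


def pattern_to_key(pattern: str, sep: str = ",") -> str:
    """Map 'http://www.cnblogs.com' or '*.cnblogs.com' to a counter key.

    Hypothetical helper; the key format is an illustrative assumption.
    """
    # Drop the scheme (if any) and surrounding slashes.
    host = pattern.split("://", 1)[-1].strip("/")
    # Drop the wildcard label and normalize case.
    labels = [label.lower() for label in host.split(".") if label and label != "*"]
    # Reverse the labels so that the top-level domain comes first.
    return sep.join(reversed(labels))


def query_count(counters: Mapping[str, int], pattern: str) -> int:
    # A single key lookup answers the query; no scan over pages is needed.
    return counters.get(pattern_to_key(pattern), 0)


# Made-up counts in an in-memory stand-in for the counter store.
counters = {"com": 9, "com,cnblogs": 7, "com,cnblogs,www": 5}
print(query_count(counters, "http://www.cnblogs.com"))
print(query_count(counters, "http://*.cnblogs.com"))
```

Because every level of the domain hierarchy has its own pre-aggregated counter, a wildcard query costs the same as a query for a single host.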
More references
For more details, see: http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
For more information about Cassandra, see: http://www.cnblogs.com/gpcuster/tag/Cassandra/
For more information about ZooKeeper, see: http://www.cnblogs.com/gpcuster/tag/ZooKeeper/