Using HBase for data statistics

Source: Internet
Author: User
1. Data statistics requirements

An important application of data statistics on the Internet is website analytics, as provided by tools such as CNZZ webmaster statistics, Baidu statistics, Google Analytics, and quantum track statistics.

Website statistics tools typically provide the following functions:

1) Website traffic statistics: indicators such as PV (page views), UV (unique visitors), and IP (distinct source IP addresses). These indicators can be displayed as trend charts, for example over the last week or the last month.

2) IP source statistics: records the number of page views (PV) from each source IP address.

3) Access source analysis: records the ways in which visitors arrive at the website.

4) Search engine and keyword analysis: analyzes the changes and trends in PV generated by each specified search engine, and collects statistics on the traffic trends of visitors' search keywords over different periods.

5) Access region analysis: collects statistics on the trends of PV and UV by region over different time periods.

6) Recent visitor traffic: displays the current website access status in real time, including the access time, IP address, source URL, accessed URL, and source region.

From a statistical perspective, the requirements behind these business functions can be summarized as follows:

1) Calculating statistical indicators such as PV, UV, and IP comes down to aggregate operations such as sum and avg over a set of records.

2) Statistics are increasingly expected in real time. Visits occur anytime and anywhere, and their sources are diverse. For such requirements, heavy statistical computation is not needed; what matters is quickly showing users the data they care about after preprocessing.

3) Data statistics can be divided into two parts: statistics on real-time data, which dynamically show updates to site access data, and statistics on historical data, which are used, for example, for report analysis.
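As a rough illustration of point 1, the sketch below computes PV, UV, and IP over a handful of access records. The `Visit` record and its field names are assumptions made for this example, and the definitions used (PV as a count of records, UV and IP as distinct counts) are the common ones rather than anything a particular tool prescribes.

```python
from collections import namedtuple

# Hypothetical access-log record; the field names are assumptions for illustration.
Visit = namedtuple("Visit", ["timestamp", "visitor_id", "ip", "url"])


def traffic_stats(visits):
    """Return PV, UV and IP counts for an iterable of Visit records."""
    visits = list(visits)
    pv = len(visits)                          # PV: every page view counts
    uv = len({v.visitor_id for v in visits})  # UV: distinct visitors
    ip = len({v.ip for v in visits})          # IP: distinct source addresses
    return {"pv": pv, "uv": uv, "ip": ip}


visits = [
    Visit(1000, "alice", "10.0.0.1", "/index"),
    Visit(1005, "alice", "10.0.0.1", "/about"),
    Visit(1010, "bob", "10.0.0.2", "/index"),
    Visit(1020, "carol", "10.0.0.2", "/index"),
]
stats = traffic_stats(visits)
print(stats)
```

Every indicator here needs a pass over all matching records, which is why the cost grows with the amount of data behind each query.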

2. HBase implementation ideas

HBase is a distributed storage system that makes it easy to build a large-scale storage system on low-cost PCs to store massive amounts of data. This makes HBase a suitable storage system for a site statistics tool.

1) For statistics on real-time data, HBase provides low-latency read/write access and withstands highly concurrent requests. For statistics on historical data, HBase can be regarded as a huge key-value store that holds each website's historical access information for offline analysis and report generation.

2) Operations such as PV, UV, and IP that require cumulative calculation (sum/avg) must scan the relevant records in the HBase table and compute over them. Therefore, if the site being measured has a large data volume, HBase may not guarantee a fast response: the front end may wait a long time (1 second or more) for the final result. This issue is discussed in section 3.

3) HBase is well suited to real-time data display, such as site visitor flow information. As long as the row key is designed sensibly, fetching a single access record by key is very fast.
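To make the point about key design in item 3 more concrete, here is a minimal sketch that uses an in-memory sorted table as a stand-in for an HBase table. The key layout `<site>#<reversed timestamp>`, the `MAX_TS` bound, and the `SortedTable` class are all assumptions for illustration, not the layout the article's authors used. Reversing the timestamp is one common way to make the newest rows sort first.

```python
import bisect

MAX_TS = 10**13  # assumed upper bound for millisecond timestamps


def row_key(site_id, ts_millis):
    """Build a key '<site>#<reversed timestamp>' (hypothetical layout)."""
    return f"{site_id}#{MAX_TS - ts_millis:013d}"


class SortedTable:
    """In-memory stand-in for an HBase table: rows kept sorted by key."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def scan(self, start, stop, limit=None):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        keys = self._keys[lo:hi]
        if limit is not None:
            keys = keys[:limit]
        return [(k, self._rows[k]) for k in keys]


table = SortedTable()
for ts, ip in [(1000, "10.0.0.1"), (2000, "10.0.0.2"), (3000, "10.0.0.3")]:
    table.put(row_key("siteA", ts), {"ip": ip, "ts": ts})
table.put(row_key("siteB", 2500), {"ip": "10.0.0.9", "ts": 2500})

# '$' is the character right after '#', so this range covers every "siteA#" key.
latest = table.scan("siteA#", "siteA$", limit=2)
single = table.get(row_key("siteA", 1000))
print([v["ts"] for _, v in latest], single)
```

With this layout, "the most recent N visits for a site" becomes a short range scan from the site prefix, and a single record is one lookup by key.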

In a structure that uses HBase as the storage system, the HBase server refers to the HBase cluster, and application programs write to HBase on the receiving end and read from it on the query end.

From the perspective of applying HBase, there are two different directions:

1) In the first direction, HBase is treated as a reliable, highly available key-value storage system with huge capacity. It is used in a very simple way, as a black box that stores sparsely structured data according to a table structure designed in advance. Under this approach, if HBase cannot fully meet the business needs, the design or optimization is done at the application level.

2) In the second direction, because HBase is open source, HBase itself can be improved and extended to produce a stable, usable HBase version that meets the business needs.

3. How to solve the problem

For the problem of using HBase for cumulative calculation (sum/avg) mentioned in section 2, the following are several ideas and methods for solving it.

Based on the first direction:

1) The HBase server performs the aggregation, so the application's query end does not need HBase to return a large amount of data over the network, only the result computed on the server. This can meet the need for real-time responses.
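The difference between aggregating on the client and aggregating on the server can be sketched as follows. The `Region` class is a hypothetical in-memory stand-in for one server-side partition of a table, and it does not use any real HBase API. The sketch only compares how many items would have to cross the network in each case.

```python
class Region:
    """Stand-in for one server-side data partition (hypothetical, in-memory)."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, each with a numeric "pv" field

    def scan(self):
        # Client-side aggregation: every row is returned to the caller.
        return list(self.rows)

    def partial_sum(self, column):
        # Server-side aggregation: only a (sum, count) pair is returned.
        values = [r[column] for r in self.rows]
        return sum(values), len(values)


def client_side_avg(regions, column):
    """Fetch all rows, then aggregate. Returns (average, items transferred)."""
    rows = [r for region in regions for r in region.scan()]
    return sum(r[column] for r in rows) / len(rows), len(rows)


def server_side_avg(regions, column):
    """Merge per-region partial results. Returns (average, items transferred)."""
    partials = [region.partial_sum(column) for region in regions]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count, len(partials)


# Assumes at least one row overall, so the division is well defined.
regions = [
    Region([{"pv": 10}, {"pv": 20}, {"pv": 30}]),
    Region([{"pv": 40}]),
    Region([{"pv": 50}, {"pv": 60}]),
]
client_result = client_side_avg(regions, "pv")
server_result = server_side_avg(regions, "pv")
print(client_result, server_result)
```

Note that the partial results are merged as sums and counts rather than by averaging per-region averages, which would give a wrong answer when regions hold different numbers of rows.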

Based on the second direction:

1) When designing the HBase table, add a column with an empty value that is used only for statistics, which reduces the data transferred from the HBase server to the query end.

2) Application-side computing:

A) Receiving end (ingest): when designing the HBase tables, add a table dedicated to storing cumulative results such as PV and UV. Each time a new record arrives, first read the PV/UV value last recorded in that table, decide whether to add 1, and write the result back to the corresponding key. The query end can then obtain PV/UV directly with a single HBase get.

B) Query end: add a PV/UV cache at the query end. When the next query arrives, take the cached PV/UV value and add the number of new records found by scanning the HBase table. If the cache is refreshed often enough, the number of new records is small and the HBase query responds quickly.
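The two application-side ideas above can be sketched together. Everything here is an in-memory stand-in: the `counters` dict plays the role of the dedicated counter table, `log` plays the role of the detail table, and the key format `<site>#<day>#pv` and the `seen` set used to detect repeat visitors are assumptions made for this example.

```python
def record_visit(counters, seen, site, day, visitor_id):
    """Ingest side: update running PV/UV counters as each record arrives."""
    pv_key = f"{site}#{day}#pv"
    uv_key = f"{site}#{day}#uv"
    # Read the last value, add 1, write it back (PV grows on every visit).
    counters[pv_key] = counters.get(pv_key, 0) + 1
    # UV grows only the first time a visitor is seen for this site and day.
    if (site, day, visitor_id) not in seen:
        seen.add((site, day, visitor_id))
        counters[uv_key] = counters.get(uv_key, 0) + 1


class CachedPV:
    """Query side: cached PV plus a scan of only the records added since."""

    def __init__(self, log):
        self.log = log        # append-only list standing in for the detail table
        self.cached_pv = 0
        self.cached_upto = 0  # index of the first record not yet counted

    def query(self):
        new_rows = self.log[self.cached_upto:]  # "scan" only the new records
        self.cached_pv += len(new_rows)
        self.cached_upto = len(self.log)
        return self.cached_pv, len(new_rows)


counters, seen, log = {}, set(), []
pv_cache = CachedPV(log)
day = "2024-01-01"  # illustrative date

for visitor in ["alice", "bob", "alice"]:
    record_visit(counters, seen, "siteA", day, visitor)
    log.append(visitor)
first = pv_cache.query()

record_visit(counters, seen, "siteA", day, "carol")
log.append("carol")
second = pv_cache.query()
print(counters, first, second)
```

The read-then-write step in `record_visit` is not safe when several writers update the same counter concurrently. HBase offers an atomic increment operation for this situation, which a real ingest path would likely prefer over a separate get and put.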

4. Summary

This article summarizes some experience from using HBase in data statistics applications, along with some approaches we have tried for the problems encountered.
