TA (Tencent Analytics) is a free website analytics system for third-party webmasters. Webmasters praise it for the stability and timeliness of its data, and its real-time data updates, which arrive within seconds, are also recognized across the industry. This article explores TA's system architecture and implementation principles in depth, covering real-time data processing, data storage, and more.
Web analytics mainly refers to analyzing users' web browsing behavior, a site's clickstream data, and its operational data in order to monitor the site's operating status and provide a basis for optimizing it. A web analytics system has become an indispensable tool in a webmaster's day-to-day operations. The most popular systems in the industry include Google Analytics, CNZZ, and Baidu Statistics.
As a rising star among web analytics products, TA has developed its own strengths in areas such as community analysis, user profiling, and website tools, and its data update frequency, measured in seconds, leads the industry. TA enjoys a good reputation in the webmaster community for data stability, accuracy, and timeliness. As the traffic it serves keeps growing, the amount of data TA must process and compute has on average reached the terabyte level. Delivering updates within seconds at that volume while keeping the system available is not easy.
TA's real-time computing framework draws on the ideas behind several popular stream computing systems in the industry. We ran into a number of problems while building the system. Because real-time processing and real-time storage of massive data are fairly typical and general problems, we share TA's solutions here in the hope that they offer some inspiration.
Basic principles and system architecture
The basic principle of TA is to collect user access behavior data through a JavaScript snippet embedded in the webmaster's site and send it to the TA collection cluster. After receiving the data, the collection cluster filters and encodes it, formats it, and then distributes it onward. The data processing cluster computes over the data according to the business logic and "writes" the results into the data storage cluster, from which the result data is finally displayed to webmasters. The basic principle of TA is shown in the figure.
The TA back end is a complete stream processing system: the user behavior data collected by JavaScript flows into the back end like a steady river, is cleaned and computed, and then flows continuously out to the TA storage cluster for users to browse and query. TA's detailed architecture and core components are shown in the figure.
TA's back end is divided into an offline part and a real-time part. The real-time part handles the system's main computation and updates data within seconds. The offline part handles complex correlation analysis and cross-day calculations and updates data daily.
Http Access: mainly responsible for HTTP protocol parsing, data cleaning, and formatting.
ESC: Event Streaming Coder, mainly responsible for encoding the system's non-enumerable data types into integers and persisting the mapping.
ESP: Event Streaming Processor, mainly responsible for regrouping data by site and UID and computing site analysis metrics such as PV, UV, time on site, and bounce rate.
ESA: Event Streaming Aggregator, responsible for aggregating the data computed by ESP by site and writing it to Redis.
Center: the system's central node, responsible for system configuration and data routing management, and also for disaster recovery switchover.
Logserver: responsible for writing the data collected by Access to files as strings and uploading them to TDCP.
TDCP: Tencent distributed computing platform, responsible for offline data computation; the results are written to MySQL by script.
Real-time solution
Before introducing TA's real-time solution, let's first look at the business TA supports. TA currently processes terabytes of data per day for hundreds of thousands of websites. Even after the URLs are processed, hundreds of millions of URLs remain, and the system stores more than one billion keys. Processing this much business data efficiently and with low latency is the main challenge facing TA's real-time system. The main ideas of TA's solution can be summarized as fully binary data, fully in-memory computing, and NoSQL storage. The following sections discuss the two subsystems, real-time computing and real-time storage, in depth.
Real-time calculation
For the computing subsystem, we drew on open source projects such as Hadoop, S4, and Storm and tried to design a more general and extensible fully in-memory real-time event processing system (also known by the popular term streaming real-time event processing system). The typical input and output flow that the system is designed to support is roughly as shown in the figure.
The design of the real-time computing system centers on three points: data organization, the protocol, and the incremental calculation model.
Data organization. Everything is an int. To meet the memory and performance requirements of the computation, we convert all non-int data types to int. Enumerable data types are mapped to a unique int through configuration; non-enumerable data types are mapped to an approximately unique int using the MD5 algorithm. For example, a page URL is non-enumerable, so the preprocessor derives an int from it with MD5. The browser type string in the UserAgent is enumerable, so it is mapped to an int through preconfiguration. This approach saves a large amount of memory and improves the computing performance of the whole system.
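The two encoding paths described above can be sketched as follows. This is only a minimal illustration, not TA's implementation: the browser table, the function names, and the choice to keep the first 8 bytes of the MD5 digest as an unsigned 64-bit integer are all assumptions, since the article does not say how the digest is reduced to an int.

```python
import hashlib

# Hypothetical preconfigured mapping for an enumerable field (browser type).
BROWSER_IDS = {"Chrome": 1, "Firefox": 2, "Safari": 3}
UNKNOWN_ID = 0  # assumed fallback for values missing from the configuration


def encode_enumerable(value, table, default=UNKNOWN_ID):
    """Look up the int configured for an enumerable value."""
    return table.get(value, default)


def encode_non_enumerable(value):
    """Derive an approximately unique int from a non-enumerable string.

    Assumption: keep the first 8 bytes of the MD5 digest as an unsigned
    64-bit integer. Collisions are possible, hence "approximately unique".
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


if __name__ == "__main__":
    print(encode_enumerable("Firefox", BROWSER_IDS))
    print(encode_non_enumerable("http://www.qq.com/index.html"))
```

The same URL always maps to the same int, so downstream modules can group and count by integer keys without carrying the original strings.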
Protocol. At the protocol level, we first designed and implemented an extensible Event structure that supports a semi-automatic serialization and deserialization mechanism (modeled on msgpack's design) and a compact binary encoding (based on ZigZag encoding, following the Protobuf implementation). This Event structure performs quite well in high-performance streaming I/O (network transport and persistence). The real-time computing subsystem is designed so that it can be extended to support any Event implementation.
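For readers unfamiliar with the encoding mentioned above, here is a small sketch of ZigZag encoding combined with the base-128 varint used by Protobuf. It illustrates the general technique only; TA's actual Event wire format is not described in the article, and the function names are chosen for this example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(n):
    """Map a signed 64-bit int to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def varint_encode(u):
    """Encode an unsigned int as a base-128 varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_decode(data, pos=0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


if __name__ == "__main__":
    for n in (0, -1, 1, 300, -300):
        wire = varint_encode(zigzag_encode(n))
        value, _ = varint_decode(wire)
        print(n, wire.hex(), zigzag_decode(value))
```

Small values of either sign fit in one or two bytes, which is why this combination suits streams dominated by small integers.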
Incremental calculation model. The incremental calculation model defines the basic calculation process as the following three parts (as shown in the figure).
Processor: responsible for the computation of the specific business logic.
Data Holder: responsible for holding the incremental result data, as well as the intermediate state data that the computation depends on.
Emitter: responsible for periodically outputting the incremental results and then clearing them.
In terms of the process, it is divided into the following three steps (as shown in the figure).
Receive an Event and run the computation (Processor).
Save the results and the data that the computation depends on (Data Holder).
On a timer, output the results for the time slice and clear them (Emitter).
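The three roles can be sketched as below for a simple PV/UV count. The class layout, the event fields (`site`, `uid`), and the `sink` callback are assumptions made for illustration. In a running system a timer would call `emit()` once per time slice; here it is called by hand.

```python
from collections import defaultdict


class DataHolder:
    """Holds the increments for the current time slice plus intermediate state."""

    def __init__(self):
        self.pv = defaultdict(int)    # per-site PV increment
        self.uv = defaultdict(int)    # per-site UV increment
        self.seen = defaultdict(set)  # UIDs already counted (intermediate state)


class Processor:
    """Applies the business logic to each incoming event."""

    def __init__(self, holder):
        self.holder = holder

    def process(self, event):
        site, uid = event["site"], event["uid"]
        self.holder.pv[site] += 1
        if uid not in self.holder.seen[site]:
            self.holder.seen[site].add(uid)
            self.holder.uv[site] += 1


class Emitter:
    """Outputs the increments of a time slice, then clears them."""

    def __init__(self, holder, sink):
        self.holder = holder
        self.sink = sink  # hypothetical downstream callback

    def emit(self):
        sites = set(self.holder.pv) | set(self.holder.uv)
        batch = {
            s: {"pv": self.holder.pv.get(s, 0), "uv": self.holder.uv.get(s, 0)}
            for s in sites
        }
        self.sink(batch)
        self.holder.pv.clear()
        self.holder.uv.clear()  # `seen` is kept: later slices depend on it


if __name__ == "__main__":
    holder = DataHolder()
    processor = Processor(holder)
    emitter = Emitter(holder, print)
    for uid in (1, 1, 2):
        processor.process({"site": "www.qq.com", "uid": uid})
    emitter.emit()
```

Because each output carries only a delta, a downstream module can apply it with a plain increment and needs no knowledge of the upstream state.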
The incremental calculation model reduces the reliance on transactional state held by any single machine in the distributed system, which simplifies the implementation of the distributed computing system and improves overall system performance.
Real-time storage
In the TA system, the data stored in real time consists of statistics that are read through the web presentation layer. This kind of data has two typical characteristics.
Frequent updates (writes). The update frequency depends on the real-time system; each statistic can be updated as often as once per second.
Few reads. "Few" is relative to the update rate above.
In addition, the statistics fall into two categories according to the business logic.
Fixed data: mainly URLs, search keywords, and similar data. In theory this data only grows; existing data is never modified.
Dynamic data: mainly the frequently updated statistics. This data must be updated continually, for example the PV and UV statistics for the domain www.qq.com.
Given these characteristics of TA's real-time statistics, we chose NoSQL for our storage system, and selected LevelDB and Redis respectively to store the two different types of data.
Redis
Redis is the main component of TA's real-time storage. Because the TA system is already a fairly complete distributed cluster system, what we need from a storage component is "not clustering, but sharding." In other words, "heavy weapons" such as HBase and MongoDB are not a good fit for TA, while Redis, the "Swiss Army knife" of NoSQL databases, caught our attention with its outstanding performance. TA's result data also comes in a variety of types: the "dynamic data" includes hash-type data such as a site's PV, UV, VV, and IP counts, and set-type data such as user access trails. Redis's rich data structures handle this task well.
Another reason for choosing Redis is that it is simple enough and extensible. In practice, we found that the problems we encountered could be solved by extending Redis commands.
For example, to keep the ESA module stateless, the data stored in Redis is often not the final result but intermediate data that needs further computation. Take bounce rate (bounce rate = number of bounce sessions / total number of sessions): the front end has to query twice and then perform a calculation before it can show the final value to the user. Under high concurrency, this inevitably hurts the system's response time.
Following the principle of "move the computation, not the data," we extended Redis's sort and hmget commands to support the four arithmetic operations, which reduced the original two queries to one. Another purpose of adding arithmetic is to "trade computation for storage." For example, when a total is the sum of two types of data, only the two parts need to be stored, and the total is obtained by computing it.
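The effect of computing next to the data can be illustrated with a toy in-memory store. This is not the Redis API and not TA's extension: `FakeStore`, `hget_ratio`, and the field names are hypothetical, and the query counter simply stands in for network round trips.

```python
class FakeStore:
    """In-memory stand-in for a server holding hashes; counts queries."""

    def __init__(self, hashes):
        self.hashes = hashes
        self.queries = 0

    def hget(self, key, field):
        """Plain read of one field: one query per call."""
        self.queries += 1
        return self.hashes[key].get(field, 0)

    def hget_ratio(self, key, numerator, denominator):
        """Hypothetical extended read: divide two fields on the server side."""
        self.queries += 1
        h = self.hashes[key]
        den = h.get(denominator, 0)
        return h.get(numerator, 0) / den if den else 0.0


def bounce_rate_client_side(store, key):
    """Two queries, then the division happens in the front end."""
    bounces = store.hget(key, "bounce_sessions")
    total = store.hget(key, "total_sessions")
    return bounces / total if total else 0.0


if __name__ == "__main__":
    store = FakeStore({"site:1": {"bounce_sessions": 30, "total_sessions": 120}})
    print(bounce_rate_client_side(store, "site:1"), store.queries)
    print(store.hget_ratio("site:1", "bounce_sessions", "total_sessions"), store.queries)
```

The second path returns the same value with a single query, which is the saving the extended commands aim for.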
Besides reads, writes can be optimized in a similar way by merging operations. For example, TA would need to call a Redis command six times to write a URL's six metrics: PV, UV, VV, IP, dwell time, and bounce rate. In fact these six metrics are stored in the same Hash, so by extending Redis with an hmincrby command that updates all fields of a Hash in one call, the number of calls can be reduced to one. This also worked well in production: peak CPU utilization dropped by almost half, and the throughput of the upstream ESA module increased significantly.
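The write-side merge can be pictured with the following toy store. It is a plain Python stand-in, not Redis: the `hmincrby` method here only mimics the idea of the extended command described above, and the key and field names are made up for the example.

```python
class FakeHashStore:
    """In-memory stand-in for a server holding hashes; counts calls."""

    def __init__(self):
        self.hashes = {}
        self.calls = 0

    def hincrby(self, key, field, delta):
        """Increment one field: one call per metric."""
        self.calls += 1
        h = self.hashes.setdefault(key, {})
        h[field] = h.get(field, 0) + delta
        return h[field]

    def hmincrby(self, key, deltas):
        """Hypothetical merged increment: all fields of one hash in one call."""
        self.calls += 1
        h = self.hashes.setdefault(key, {})
        for field, delta in deltas.items():
            h[field] = h.get(field, 0) + delta
        return dict(h)


if __name__ == "__main__":
    deltas = {"pv": 5, "uv": 2, "vv": 3, "ip": 2, "dwell": 40, "bounce": 1}

    one_by_one = FakeHashStore()
    for field, delta in deltas.items():
        one_by_one.hincrby("url:42", field, delta)

    merged = FakeHashStore()
    merged.hmincrby("url:42", deltas)

    print(one_by_one.calls, merged.calls)
```

Both stores end up with identical contents; only the number of calls differs.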
LevelDB
LevelDB is a useful complement to Redis. Redis is an in-memory database, and memory is costlier than disk, so we chose the disk-based LevelDB as a supplement. Because LevelDB's write performance is good enough and its read performance far exceeds the current "online read-only" requirements, we use LevelDB to store the "fixed data."
In designing the data storage architecture, because real-time data services and online systems demand high reliability, we mainly adopted a dual-replica plus sharding design.
Dual replica. All data is written to at least two stores at the same time, which increases the availability of the online service.
Sharding.
Based on the domain name: all data is organized into shards by domain name; any domain name can be assigned to any shard; the data of a single domain name is in principle stored in one shard.
Dynamic adjustment (as shown in the figure): only the sharding strategy is adjusted and no data is moved; shard load is calculated from the amount of data.
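One way to picture domain-based sharding with an adjustable strategy is a routing table kept separately from the data. The sketch below is an assumption-laden illustration: `ShardRouter`, its methods, and the rule of placing new domains on the least-loaded shard are invented for this example, and it deliberately leaves out how data already written for a reassigned domain is handled, since the article does not describe that.

```python
class ShardRouter:
    """Routing table from domain name to shard, with volume-based load."""

    def __init__(self, shard_count):
        self.shard_count = shard_count
        self.routes = {}  # domain -> shard id (the sharding strategy)
        self.volume = {}  # domain -> observed amount of data

    def loads(self):
        """Load of each shard, computed from the data volume of its domains."""
        loads = [0] * self.shard_count
        for domain, shard in self.routes.items():
            loads[shard] += self.volume.get(domain, 0)
        return loads

    def least_loaded(self):
        loads = self.loads()
        return loads.index(min(loads))

    def shard_for(self, domain):
        """Return the shard of a domain, assigning new domains on first use."""
        if domain not in self.routes:
            self.routes[domain] = self.least_loaded()
        return self.routes[domain]

    def record(self, domain, size):
        """Account for `size` units of data written for a domain."""
        self.shard_for(domain)
        self.volume[domain] = self.volume.get(domain, 0) + size

    def reassign(self, domain, shard):
        """Adjust the strategy only: the routing entry changes, no data moves."""
        self.routes[domain] = shard


if __name__ == "__main__":
    router = ShardRouter(2)
    router.record("a.com", 100)
    router.record("b.com", 10)
    router.record("c.com", 5)
    router.record("b.com", 300)  # b.com becomes hot
    print(router.loads())
    router.reassign("c.com", router.least_loaded())
    print(router.loads())
```

Keeping the strategy in a small table means an adjustment is a metadata change that a central node can distribute, without touching the stores themselves.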
In addition, we mainly did three pieces of work for data queries against the shard cluster (as shown in the figure).
Redis Protocol Stack: a fairly complete Redis protocol stack that serves as the foundation for the upper-layer applications. The Redis protocol is used directly as the general protocol for serving external queries, so external users can query the data directly through any of the various Redis client implementations.
Query Rule Engine: a flexible query engine. Based on rules, it can intelligently query multiple Redis and LevelDB data sources and perform join-like operations across them. It also supports other heterogeneous data sources such as MySQL and HBase.
Query Compute Engine: a real-time query computation engine that computes in real time on top of the underlying query results. Its main purpose is to reduce the space that the data occupies in Redis.
Future outlook
Although TA already updates data in the back end within seconds, the display still works in the traditional static way. Going forward, TA will make more attempts at dynamically refreshing data, so that webmasters can learn the effect of their website marketing right away and feel the site's heartbeat at all times.
About the authors
Chu Dapeng: senior engineer at the Tencent data platform and technical lead of Tencent Analytics, with experience in website analytics, streaming data processing, massive data mining, and related fields.
Wang Qiying: senior engineer at the Tencent data platform, responsible for the architecture design and development of Tencent Analytics, with extensive experience in the design and architecture of high-concurrency, high-performance back-end systems.