As a leader in the push notification industry, our push SDK has reached a cumulative installation coverage of 10 billion (including overseas), serves more than 430,000 applications, and covers more than 1 billion unique devices (including overseas). A system at this scale generates enormous volumes of logs and data every day and faces many challenges in data processing.
First of all, data storage: we produce more than 10 TB of data per day, and the accumulated data is at the petabyte level. Second, as a push technology service provider, we receive a large number of data analysis and statistics requirements from customers and from departments across the company, such as message push statistics and data reports. Although part of the analysis work runs in offline mode, open source data processing systems are not always highly stable, so guaranteeing the high availability of the data analysis services is itself a challenge. In addition, the push business is not simply about delivering messages; it relies on data analysis to help customers deliver the right content, in the right scenario, to the right people, which requires the system to support data mining and to keep data fresh in real time. Finally, we need to respond quickly to new data analysis requirements. In summary, our big data system faces challenges in data storage, log transmission, log analysis and processing, large-scale task scheduling and management, high availability of analysis and processing services, massive multi-dimensional reporting, and fast response to ad-hoc analysis and data retrieval requests.
The evolution of the big data system
Facing these challenges, our big data system has improved continuously as the business has grown. Its evolution can be divided into three stages: first, statistical reporting, i.e. BI in the traditional sense; second, building out the big data infrastructure; and third, turning the system into tools, services, and products.
The first stage of the evolution of the big data system: statistical report calculation
In the early days there were no complex data processing requirements, so we selected a few high-performance machines and placed all the data on them for computation. PHP or shell scripts were simply run on those machines with multiple processes to complete the processing and statistics. The analysis focused on questions such as how many messages a customer pushed today and how many receipts a push task received, producing relatively simple reports.
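As a rough illustration of what those early scripts did, here is a minimal Python sketch that tallies pushes and receipts per customer from one day's log file. The tab-separated log format (customer_id, task_id, event_type) and the file name are assumptions for illustration, not the actual production format.

```python
# Minimal sketch of an early-stage statistics script. Assumed log format:
# one tab-separated line per event: customer_id <TAB> task_id <TAB> event_type,
# where event_type is "push" or "receipt". Not the actual production script.
from collections import Counter

def daily_report(log_path):
    pushes, receipts = Counter(), Counter()
    with open(log_path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip malformed lines
            customer_id, _task_id, event_type = fields[:3]
            if event_type == "push":
                pushes[customer_id] += 1
            elif event_type == "receipt":
                receipts[customer_id] += 1
    for customer_id in sorted(pushes):
        print(customer_id, pushes[customer_id], receipts.get(customer_id, 0))

if __name__ == "__main__":
    daily_report("push_events.2014-01-01.log")  # hypothetical file name
```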
This stage of the system had three characteristics: operations staff only needed to run scheduled scripts that transferred logs to a designated intermediate node; although users numbered in the billions, there was only a single type of log, so PHP and shell scripts were sufficient; and data was kept only for a short time (result sets were preserved long term, while intermediate data and raw data were retained briefly).
The second stage of the evolution of the big data system: big data infrastructure, offline batch processing
In 2014 we launched a smart push solution. As apps with very large user bases came on board, the number of users covered by the system exploded. Customers integrating the push system also raised many new requirements, for example richer statistical dimensions in reports, which meant more complex calculations on a data volume that had multiplied, and therefore greater computing pressure. Furthermore, the essence of smart push is deep data mining: the longer data is retained and the more dimensions it covers, the better.
Against this background we introduced the Hadoop ecosystem: HDFS basically solved the storage problem, Hive was used for data warehousing and offline analysis, and Mahout was used for machine learning. This completed the shift from a single-machine or multi-machine mode to a cluster. The overall process was similar to before; the difference was that after logs reached the transit node, the HDFS put command uploaded the data to HDFS and a Hive table partition was added, and the logs were then processed further and imported into the data warehouse. Finally, the warehouse data was mined to tag users, and the results were written to HBase, online ES, and so on. This formed the infrastructure of the offline batch processing system.
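A minimal sketch of that ingest step is given below, assuming a date-partitioned Hive table named push_log and a local log directory on the transit node; the table name, paths, and partition column are illustrative only.

```python
# Sketch of the daily ingest step on the transit node: upload logs to HDFS,
# then register the Hive partition so offline jobs can query it.
# Table name, paths, and partition column (dt) are assumptions for illustration.
import subprocess

def ingest(local_log, dt):
    hdfs_dir = f"/warehouse/push_log/dt={dt}"
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_log, hdfs_dir], check=True)
    add_partition = (
        f"ALTER TABLE push_log ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}') LOCATION '{hdfs_dir}'"
    )
    subprocess.run(["hive", "-e", add_partition], check=True)

if __name__ == "__main__":
    ingest("/data/transit/push_events.2014-06-01.log", "2014-06-01")
```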
The second stage of the evolution of the big data system: big data infrastructure, real-time processing
As the business continued to evolve, so did the requirements. Many statistical analysis tasks now had to be completed within T+0; for example, after a customer pushes a message, the product manager wants a report reflecting the push effect right away rather than at T+1. These needs raised the bar for real-time data processing. Many customers also asked to retrieve specific data or to view data related to certain tags, and such queries must respond quickly. We therefore adjusted the original architecture and introduced a pattern consisting mainly of offline processing, real-time processing, and data services (including retrieval).
At the top of the architecture, raw data is stored in HDFS and processed offline with Spark, MapReduce, and similar engines. Kafka was introduced to solve log collection: Flume collects the logs from each business node and writes them to the Kafka cluster, where they are processed at hourly or second-level granularity depending on the business. Finally, a copy of the result data is landed and synchronized to the business database or ES for use.
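The following is a minimal sketch of a second/minute-level counter on the Kafka stream. It assumes the kafka-python client, a topic named "push-events", and JSON events carrying customer_id and event_type fields; all of these names are illustrative.

```python
# Minimal sketch of a real-time counter consuming the log stream from Kafka.
# Topic name, broker addresses, and event fields are assumptions.
import json
from collections import Counter
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "push-events",
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    group_id="realtime-report",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

minute_counts = Counter()
for msg in consumer:
    event = msg.value
    if event.get("event_type") == "receipt":
        # Aggregate receipts per customer per minute; a real job would
        # periodically flush these counters to the business DB or ES.
        key = (event["customer_id"], msg.timestamp // 60000)
        minute_counts[key] += 1
```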
During the infrastructure phase we completed several pieces of work: adopting a Lambda architecture (batch layer, speed layer, serving layer); introducing the Hadoop stack (HDFS, Hive/MR, HBase, Mahout, etc.); implementing multi-dimensional retrieval with ES and a SolrCloud + HBase scheme; introducing and adapting Flume, Kafka, and Camus for log transmission; and introducing and optimizing Codis, a domestic open source Redis cluster solution.
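As a hedged sketch of the index-plus-store retrieval path mentioned above: the search engine answers the multi-dimensional tag query and returns row keys, and HBase returns the full profiles. The index name, table name, field names, and host addresses below are assumptions for illustration, not the production configuration.

```python
# Sketch of multi-dimensional retrieval: filter by tags in Elasticsearch,
# then batch-fetch full user profiles from HBase by row key.
from elasticsearch import Elasticsearch
import happybase

es = Elasticsearch(["http://es1:9200"])
hbase = happybase.Connection("hbase-master")
profiles = hbase.table("user_profile")

def search_by_tags(tags, size=100):
    query = {"query": {"bool": {"filter": [{"term": {"tags": t}} for t in tags]}}}
    resp = es.search(index="user_tags", body=query, size=size)
    row_keys = [hit["_id"] for hit in resp["hits"]["hits"]]
    # Batch-fetch the corresponding profiles from HBase.
    return dict(profiles.rows([k.encode("utf-8") for k in row_keys]))

result = search_by_tags(["sports", "android"])
```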
The third stage of the evolution of the big data system: tooling + service + product
During infrastructure construction, even with an overall framework in place, it was still not easy to respond quickly to the needs of the business teams. We therefore chose to provide tools to the business side and to add a service agent layer (the part highlighted in red in the architecture diagram): batch processing tasks are abstracted into task templates and configured in the agent layer, which the business teams then call. With only a small amount of secondary development they can use the cluster's computing services, which speeds up business development.
At this stage we completed the following architectural work: added job scheduling management by introducing Azkaban and customizing it (variable sharing, multi-cluster support, etc.); added a service agent layer by introducing DataService and a job proxy (opened to more product lines and used for decoupling); and added an application layer, building the corresponding tools and data-retrieval products on top of the service agent layer. A sketch of how a business team might submit a templated job through such a proxy is shown below.
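The sketch below shows one plausible shape of such a proxy call: log in to Azkaban's AJAX API and trigger a templated flow with per-call parameters. The endpoint and field names follow Azkaban's documented AJAX API, but the proxy wrapper, project and flow names, and parameters are illustrative assumptions, not the actual service.

```python
# Minimal sketch of a job-proxy call against Azkaban's AJAX API.
# Project/flow names, parameters, and the internal URL are assumptions.
import requests

AZKABAN_URL = "https://azkaban.internal:8443"  # assumed internal address

def submit_template(project, flow, params, user, password):
    # Authenticate and obtain a session id.
    login = requests.post(
        AZKABAN_URL,
        data={"action": "login", "username": user, "password": password},
        verify=False,
    ).json()
    session_id = login["session.id"]

    # Execute the flow, passing template parameters as flow overrides.
    payload = {"ajax": "executeFlow", "session.id": session_id,
               "project": project, "flow": flow}
    payload.update({f"flowOverride[{k}]": v for k, v in params.items()})
    return requests.get(f"{AZKABAN_URL}/executor", params=payload, verify=False).json()

# Example: a business team runs the daily report template for one customer.
submit_template("report_templates", "daily_push_report",
                {"dt": "2016-06-01", "customer_id": "c123"},
                user="proxy", password="***")
```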
Experience and lessons from the evolution of the big data system
First, exploring and understanding the data is prerequisite work before development. Before processing, you need to explore what dirty data exists and how it is distributed, and identify invalid data and default values.

Second, the data storage scheme should be chosen to match analysis and computing needs. An indexed file format such as CarbonData is worth considering.

Third, data standardization is the main way to make downstream processing more efficient. The vast majority of data should be standardized (basic cleaning, unification of internal IDs, addition of required attributes) before being supplied to downstream consumers. Real-time data, for example, should be standardized and then republished into Kafka for all other real-time systems to process; this avoids repeating routine cleaning and conversion across multiple business lines, and the unified ID makes the data easy to link with other data. A sketch of such a standardization step is given after this summary.

Fourth, tooling, services, and products improve overall efficiency. At the development level, MR and Spark APIs can be wrapped and adequate toolkits provided.

Finally, full-link monitoring of the big data system is very important. Batch processing monitoring mainly covers daily task running-time monitoring, data skew, daily result-set curves, abnormal data curves, and GC monitoring; stream processing monitoring covers fluctuations in raw data volume, consumption-rate monitoring and alerting, computing-node latency monitoring, and so on.
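A minimal sketch of the real-time standardization step described above: consume raw events from Kafka, clean them, map external IDs to the unified internal ID, add required attributes, and republish to a standardized topic. The topic names, field names, and the ID-mapping lookup are illustrative assumptions.

```python
# Sketch of a standardization job: raw topic in, standardized topic out.
# Topic names, event fields, and the ID-mapping function are assumptions.
import json
import time
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("raw-events", bootstrap_servers=["kafka1:9092"],
                         group_id="standardizer",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers=["kafka1:9092"],
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

def to_internal_id(external_id):
    # Hypothetical lookup against the unified ID mapping service or cache.
    return "gid_" + external_id

for msg in consumer:
    event = msg.value
    if not event.get("device_id"):
        continue  # drop invalid records during basic cleaning
    event["internal_id"] = to_internal_id(event["device_id"])
    event["etl_time"] = int(time.time())  # add required attributes
    producer.send("standardized-events", event)
```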