Application Statistics Platform Architecture Design: Intelligent, Predictive App Statistics

Author: Dong Lin, Senior Technical Director.
Recently, the intelligent big data service provider Getui launched its application statistics product, Geshu. In this article we look at the architecture design behind Geshu's real-time statistics and its integration with an AI data intelligence platform.
Many people may wonder why Getui, whose SDK has accumulated tens of billions of installations and which has focused on message push services for many years, is now building an application statistics product. After all, there are already plenty of similar products on the market. I think the answer comes down to timing, position, and people.
First, the timing. The Internet industry has entered its so-called "second half", or even "overtime": operations are becoming more refined, DAU and measurable results come first, and practitioners have gradually recognized the importance of using data, and good models, to optimize their products.
Second, the position. After years of accumulation, Getui has a solid data foundation; its infrastructure is also very mature, and it already provides a large number of data services in many vertical domains.
Third, the people. Our in-house R&D staff have accumulated a wealth of hands-on experience, and the company has built long-term, close relationships with external application developers and partners.
It is against this background that we launched the application statistics product Geshu.
The buzzword of the previous period was "growth hacking"; at this stage, simple user growth can no longer sustain development, and companies and product teams now focus on results. Compared with other statistics products, the soul of Geshu is operations: around core KPIs, it helps keep an application active and improve overall revenue.
Safe, accurate, and flexible data ensures that operations run efficiently; the platform hosting that data needs high concurrency, high availability, and strong real-time performance; and the SDK at the foundation must have a small enough footprint to integrate easily and run quickly. Layered like a pyramid from top to bottom, these are the building blocks of Geshu.
Four core capabilities for intelligent statistics
First, real-time multidimensional statistics, the basic function of any application statistics product. Stability and real-time performance are the two key points here; in terms of granularity, page-level statistics are the most useful for operations staff.
Second, data integration. With Getui's big data capabilities, we can provide a unique third-party perspective that helps an application benchmark itself and find its place in the industry.
Third, automatic modeling and prediction. This is a feature that sets Geshu apart. We want to help application developers truly experience the value of models through a complete, end-to-end solution, and to continuously optimize and improve the product through real data feedback.
Fourth, precision push. Getui is best known for its push service, and Geshu integrates in-app statistics with the push system to support more fine-grained operations.
Technical architecture: business domain + data domain
Geshu's overall architecture is divided into a business domain and a data domain. The data domain is further split into three layers: the data gateway layer, the data business layer, and the data platform layer.
The data gateway layer acts as the carrier between the business layer and the data layer; it includes the Kafka cluster and the API gateway, which keep data flowing between the two. The data business layer hosts development work for specific business lines; because this work is not shared across the platform, it sits in its own layer. Below it, Geshu is configured with several separate Hadoop clusters depending on the function, while core capabilities are packaged as public services for business developers to use.
The business domain consists of traditional microservices and their corresponding storage modules.
First, the data firewall between the two domains is very important: a two-level data firewall ensures that data is effectively isolated within the system.
Second, the data domain is layered. Geshu's three layers correspond to three different functional teams: the data gateway layer to data operations, the data business layer to the business-line R&D teams, and the data platform layer to the data division. This division of responsibilities effectively improves the development efficiency of business-line products.
Third, isolation of cluster resources. Clusters opened to different business lines need to be isolated from one another at the resource level; isolating GPU compute cluster resources is also essential.
Fourth, balancing real-time and offline. Whatever the product, development always needs to account for both real-time and offline scenarios.
Finally, data storage. The business lines, the data layer, and the platform layer each need their own data storage, and storage should be planned so that each type of data lives in the appropriate place.
Real-time multidimensional statistics: architecture analysis
The mobile API collects the data reported by the SDK and saves it as log files; the logs enter Kafka via Flume, are processed along both real-time and offline paths, and are finally exposed to the upper business systems through the data API.
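As a simplified illustration of the real-time path (not Getui's production code), the sketch below consumes SDK events from Kafka and maintains per-dimension counters in memory; the topic name, broker address, and event fields (app_id, page, device_id) are assumptions.

```python
# Minimal sketch: consume SDK events from Kafka and keep per-dimension counters.
import json
from collections import defaultdict

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "sdk-events",                              # hypothetical topic fed by Flume
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

page_views = defaultdict(int)      # (app_id, page) -> view count
active_devices = defaultdict(set)  # app_id -> device ids seen so far

for message in consumer:           # runs until interrupted
    event = message.value
    page_views[(event["app_id"], event["page"])] += 1
    active_devices[event["app_id"]].add(event["device_id"])
    # In a real system these counters would be flushed periodically to a
    # store (e.g. HBase or Redis) and exposed through the data API.
```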
For offline statistics, Geshu supports granularity down to the hour. We also monitor data flow throughout the whole pipeline, so that problems such as data loss or delay are detected and alerted on as soon as they occur.
Several key problems have to be solved here: user deduplication, unique page identification, processing strategies for multidimensional statistics, and making sure no data is lost at any step.
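To make the deduplication point concrete, here is a minimal sketch, assuming each event carries an app ID, a device ID, and a timestamp: a device is counted as active at most once per day. At Getui's scale the in-memory set would typically be replaced by an approximate structure such as a HyperLogLog or a bitmap.

```python
# Illustrative daily user deduplication; field names are assumptions.
from datetime import datetime, timezone

seen = set()        # (date, app_id, device_id) tuples already counted
daily_active = {}   # (date, app_id) -> deduplicated active-user count

def record_event(app_id: str, device_id: str, ts: float) -> None:
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    key = (day, app_id, device_id)
    if key in seen:
        return  # duplicate report from the same device on the same day
    seen.add(key)
    daily_active[(day, app_id)] = daily_active.get((day, app_id), 0) + 1

record_event("demo-app", "device-001", 1718000000.0)
record_event("demo-app", "device-001", 1718000300.0)  # deduplicated
print(daily_active)  # {('2024-06-10', 'demo-app'): 1}
```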
Data integration: providing multidimensional metrics
Getui has powerful big data capabilities that provide rich data dimensions for an application statistics product.
First, device fingerprinting. Mobile devices suffer from compatibility problems, and Getui solves this by assigning the app a unique device ID for each device.
Second, the third-party perspective, which provides neutral analysis data such as application retention, installs, uninstalls, and active users.
Third, user profiles. Whether it is a static label such as gender or age, or an interest tag such as a hobby, these can all be obtained from Getui's big data platform.
Automatic modeling prediction & model evaluation
A standard modeling workflow generally consists of the following steps: first, select a batch of positive and negative sample users; then complete missing features and reduce the dimensionality of uninformative ones; next, choose a suitable model and train it, which is a very CPU-intensive process; then comes target prediction, where we collect or complete all the features of the target users, feed the data into the model, and obtain prediction results; and finally, model evaluation. Once the model has been evaluated, the next round of adjustments begins, and the cycle repeats.
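To make these steps concrete, here is a toy, self-contained sketch of the same workflow on synthetic data using scikit-learn; it is an illustration only, not Geshu's actual features, model, or scale.

```python
# Toy end-to-end illustration: sample selection, feature completion,
# dimensionality reduction, training, prediction, and evaluation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 1) positive / negative sample users (synthetic stand-in)
X, y = make_classification(n_samples=5000, n_features=40,
                           weights=[0.8, 0.2], random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan  # missing features

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2) feature completion + dimensionality reduction, 3) model training
model = make_pipeline(SimpleImputer(strategy="mean"),
                      PCA(n_components=20),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 4) target prediction and 5) evaluation; in production the results feed
#    the next iteration of feature and model adjustments
pred = model.predict(X_test)
print("precision", precision_score(y_test, pred),
      "recall", recall_score(y_test, pred))
```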
Real-time performance is one of the important factors to consider in the modeling process. The most traditional approach is offline training with offline prediction: it performs well, but its drawback is that feedback comes too slowly, and by the time results arrive the window for executing an operations plan may already have closed. We therefore need to provide more real-time prediction capability: for example, right after a user installs the app or completes an action, the system obtains a prediction and can intervene immediately.
Finally, real-time training; in my personal view, this is a direction for future development.
For the modeling infrastructure we chose TensorFlow without hesitation; the current mainstream models can all be implemented on it. It has many advantages: it supports distributed deployment and concurrency, it is easy to integrate and extend, it supports cluster serving, and it can expose model services as APIs. It is therefore well suited to the technical architecture of a prediction service.
The offline modeling process works as follows: after data lands on HDFS, Azkaban schedules the tasks; the application statistics data is cleaned and collected, then merged with Getui's own big data to form a single data cube, which is fed into the TensorFlow cluster; the TF cluster trains the model according to the configured prediction event and finally outputs the result.
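As a rough illustration of the training-and-export step on the TF cluster (not the production configuration), the sketch below trains a small Keras binary classifier on synthetic data and writes a versioned SavedModel directory of the kind TensorFlow Serving loads; the feature width, layer sizes, and export path are assumptions.

```python
# Minimal Keras training-and-export sketch on synthetic data.
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 40).astype("float32")   # stand-in for the data cube
y = (X[:, 0] + X[:, 1] > 1.0).astype("float32")  # stand-in prediction event

model = tf.keras.Sequential([
    tf.keras.Input(shape=(40,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.fit(X, y, epochs=5, batch_size=64, verbose=0)

# Export a versioned SavedModel that a TensorFlow Serving cluster can load.
model.export("models/convert_prob/1")   # Keras 3; older tf.keras: model.save(...)
```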
The target prediction implementation is relatively simple: the model only needs to be loaded into the TensorFlow Serving cluster. The prediction results are then wrapped by the data API (DAPI) and exposed to the upper business layer.
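Once the SavedModel is loaded, prediction is an HTTP call against TensorFlow Serving's standard REST API (port 8501 by default). The host name, model name, and wrapper function below are illustrative assumptions about how a data API layer might encapsulate the call.

```python
# Sketch: call TensorFlow Serving's REST predict endpoint and unwrap the result.
import requests

def predict_conversion(features: list[float]) -> float:
    resp = requests.post(
        "http://tf-serving:8501/v1/models/convert_prob:predict",  # assumed host/model
        json={"instances": [features]},
        timeout=2,
    )
    resp.raise_for_status()
    # TF Serving returns {"predictions": [[p]]} for a single sigmoid output.
    return float(resp.json()["predictions"][0][0])

# The data API (DAPI) would expose something like this to business systems:
# probability = predict_conversion(user_feature_vector)
```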
Target prediction starts with feature completion. The challenge is to complete each new user's features as quickly as possible so that a prediction can be made in time.
The second part is the prediction result itself. The final output is a probability value, and we need to check whether that value falls within a reasonable range and whether the probability distribution matches our expectations. If it does not, we need to re-evaluate the model or treat the prediction as invalid.
The third part is the TensorFlow cluster. With containerized deployment, the prediction service can run in its own pod. Depending on the real-time requirements, Geshu can expose the API as an external service or provide real-time callbacks.
Model evaluation is a critical step in prediction; an incomplete evaluation system may render the final results unusable.
Precision and recall, the two basic indicators of prediction quality, deserve close attention. Because precision depends on the classification threshold, we also let developers adjust the threshold themselves.
Lift is another important indicator: it reflects how much improvement our predictions deliver over untargeted selection. Naturally, the larger the proportion of users selected, the lower the lift becomes, so for a specific application we need to choose a reasonable cutoff based on the scenario and the demand.
ROC and AUC serve as overall indicators for evaluating the model's performance across different thresholds. To improve the model's discriminating power, we naturally pursue a maximal AUC. Because AUC is a single quantitative value, it is well suited to continuous monitoring: we assess the model daily, and if the AUC falls short of expectations we can switch to another model in time.
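For reference, the sketch below computes these indicators on a synthetic daily test batch: precision and recall at an adjustable threshold, lift for a chosen targeting ratio, and ROC AUC. The data, threshold, and targeting ratio are placeholders, not Geshu's settings.

```python
# Sketch of the evaluation indicators on synthetic labels and scores.
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)                       # toy ground truth
y_prob = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0, 1)  # toy scores

threshold = 0.6                      # developers can tune this
y_pred = (y_prob >= threshold).astype(int)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

top_ratio = 0.2                      # target the top 20% highest-scoring users
cutoff = np.quantile(y_prob, 1 - top_ratio)
targeted = y_prob >= cutoff
lift = y_true[targeted].mean() / y_true.mean()   # hit rate vs. overall baseline

auc = roc_auc_score(y_true, y_prob)
print(f"precision={precision:.3f} recall={recall:.3f} lift={lift:.2f} auc={auc:.3f}")
```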
For monitoring, the first requirement is that test users are chosen sufficiently at random. Every day we select a batch of test users to verify the model's effectiveness, then evaluate precision, recall, and AUC. Besides this internal check, we also expose these indicators to developers. Cached historical prediction results can additionally assist the daily effect evaluation.
Precision push integration: enhancing real-world scenarios
Through Geshu, an application's event-tracking data and prediction results can be passed to the push system, so that in the push workflow developers can select target users directly in the form of an audience package, or download the audience package and upload it to advertising platforms such as Guangdiantong.
Geshu roadmap
Geshu was formally opened to the public in May; you can register and use it freely at http://www.getui.com/cn/geshu.html. The model prediction feature is currently in beta, and we hope to open this capability formally in Q4, helping everyone understand models, use models, and enjoy the value they bring.