With the rapid development of the Internet, big data and software quality have become ever more closely linked. From writing source code through continuous integration, testing and debugging, to release and operations, big data is present across the entire process, and each data association carries real value for discovering, measuring, and locating software quality issues. So how do you build a big-data-based quality platform from 0 to 1 and use big data to improve software quality?
Wanzhong, a technical expert from Alibaba's Youku Division, will share Youku's big data quality platform and its online quality closed-loop solution at the QCon Global Software Development Conference on April 20-22. Ahead of the conference, we had the opportunity to interview him and learn the technical story behind the construction of Youku's big data quality platform.
Interviewee
Wanzhong (Alibaba internal alias), technical expert in Alibaba's Youku Division. He joined Alibaba in 2014 and was responsible for developing the group's continuous integration platform CISE and AONE Labs, supporting testing tasks across all of the group's divisions. Through core work such as integrating the group's test tool plug-ins and Dockerizing middleware, he has accumulated a wealth of testing experience.
Since 2017, he has been fully responsible for platform construction in Youku's quality department, establishing a video quality assurance system based on big data. It combines real-time metrics, monitoring, grayscale release, alerting, fault location, and analysis into one complete quality assurance solution, which has become the unified standard for Youku's lines of business and related multimedia quality work at Alibaba.
Platform construction background
As Youku's technology stack has continued to integrate with Alibaba's, all clients instrument data ("buried points") following the group's conventions. But when it came to using that data, most teams were writing offline SQL, or relying on various horizontal service platforms within the group. From the perspective of Youku's business lines, there was no vertical big data platform supporting each line of business, which seriously hurt development efficiency and deprived the business of the strong support data should provide. Against this background, the team took on the task at a critical moment and began developing the big data platform.
"Pit" during the platform construction process
From a technical point of view, Youku's big data quality platform is divided into three parts: real-time, offline, and retrieval.
For the real-time framework, we chose Blink (Alibaba's fork of Flink) and Spark Streaming. Both frameworks handle real-time requirements very well; we use Blink for the ETL layer and Spark for the data computation part.
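As a rough illustration of the streaming computation layer (a minimal sketch, not Youku's actual jobs; it assumes PySpark is available and substitutes the built-in `rate` source for a real event stream):

```python
# Minimal PySpark Structured Streaming sketch: windowed counts over a stream.
# The "rate" source stands in for real playback events.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("metric-sketch").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count events per 1-minute window, the same shape as a per-minute metric.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```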
The offline part relies on the ODPS platform, which is quite powerful and friendly to newcomers: simple SQL can meet most business needs.
For the retrieval part, we mainly rely on the ELK stack, storing data in OTS (an HBase-like store) and Elasticsearch to serve queries over real-time and offline metric data, including the aggregate queries and full-text search mentioned above.
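As an illustration of the two query styles this enables, here is a minimal sketch using the Python Elasticsearch client; the index names, field names, and local endpoint are assumptions for the example:

```python
# Sketch: a metric aggregation plus a full-text search in Elasticsearch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Aggregate query: average stutter duration per OS platform.
agg = es.search(index="playback-metrics", body={
    "size": 0,
    "aggs": {
        "by_os": {
            "terms": {"field": "os.keyword"},
            "aggs": {"avg_stutter_ms": {"avg": {"field": "stutter_ms"}}},
        }
    },
})

# Full-text search: find raw log lines mentioning a CDN error.
hits = es.search(index="playback-logs", body={
    "query": {"match": {"message": "cdn timeout"}},
    "size": 10,
})
```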
In the process of building the platform, we hit a lot of "pits". The lessons we learned fall mainly into two areas:
1. Cost
Before development, two kinds of cost need to be considered: resource cost and labor cost.
Big data is particularly resource-intensive; if this is not controlled, the cost-effectiveness of the product drops sharply. In our experience with the Youku big data platform, resource planning must be strongly tied to the business. For example, when data needs pre-computation, you have to consider which dimensions to select, which are truly required, and which can be merged; this can save a great deal of storage. Likewise, in offline computation, properly abstracting intermediate tables reduces computational complexity and saves computing resources.
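To make the dimension-merging idea concrete, here is a small sketch (pandas stands in for the actual ODPS pipeline; all column names are illustrative):

```python
# Sketch of dimension pre-aggregation: collapse raw playback events into a
# daily intermediate table keyed only on the dimensions the business needs.
import pandas as pd

raw = pd.DataFrame({
    "date":       ["2019-04-01"] * 4,
    "os":         ["android", "android", "ios", "ios"],
    "province":   ["beijing", "beijing", "guangdong", "beijing"],
    "device_id":  ["a", "b", "c", "d"],   # high-cardinality, dropped below
    "stutter_ms": [120, 0, 300, 80],
})

# Keep only the dimensions needed for monitoring; dropping device_id shrinks
# the table from per-event to per-(date, os, province) granularity.
intermediate = (raw.groupby(["date", "os", "province"])
                   .agg(plays=("device_id", "count"),
                        avg_stutter_ms=("stutter_ms", "mean"))
                   .reset_index())
print(intermediate)
```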
As for labor cost, this becomes particularly obvious in the middle and late stages. As the platform grows, business-side demands keep flowing in, and every request requires a chain of development work: data ingestion, data computation, storage, back-end interface encapsulation, and front-end display. This forces us to specify data format standards, abstract the computation logic of each link, and support flexible configuration. With generalization as a premise, the platform engineers can focus on optimizing the pipeline architecture while business colleagues participate deeply, which greatly benefits the platform's iteration.
2. Blind parameter tuning
Routine parameter tuning is something every big data engineer goes through. For those developing a big data platform for the first time, our advice is: do not tune parameters blindly. First identify where the bottleneck is. Is it a network problem, a computing resource problem, or a data skew problem? Tuning with a clear target is far more efficient.
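A small sketch of what "look before you tune" can mean in practice, checking the key distribution for skew (pure Python, illustrative data):

```python
# Sketch: check for data skew before tuning. If a few keys dominate the
# distribution, repartitioning or salting those keys beats blind tuning.
from collections import Counter

def skew_report(keys, top=5):
    counts = Counter(keys)
    total = sum(counts.values())
    for key, n in counts.most_common(top):
        print(f"{key}: {n} records ({n / total:.1%} of total)")

# Illustrative sample: one hot key ("beijing") holds most of the data.
sample_keys = ["beijing"] * 9000 + ["shanghai"] * 500 + ["tianjin"] * 500
skew_report(sample_keys)
```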
Platform online quality assurance
The testing field has gone through several distinct stages: manual testing, automated testing, and continuous integration, each really pursuing higher quality and faster R&D. But with the rapid development of the mobile Internet, quality requirements are far higher than in the PC era, and testers' capabilities must improve accordingly: it is no longer enough to meet routine development and testing needs; testers must also pay attention to product effectiveness, online operations, and so on. In other words, the testing field will need versatile, cross-disciplinary talent.
We all know that today's mobile Internet products iterate very quickly, and testing has to cover all kinds of devices. From a conventional testing perspective, we should consider app startup time, page response time, page scrolling smoothness, crashes, stutter (playback lag), power consumption, and so on. The testing cost is very high, and much of the time we even fall back to manual testing for verification. So what can big data do for testing?
First, we instrument ("bury") the data the business cares about: functionality, performance, user experience, user behavior, and so on. This ensures that our test results are largely consistent with the real user experience, and frees us from most routine testing work such as UI, performance, and interface tests.
Second, we divide the data flow into three stages: offline, grayscale, and online, and use data from real devices to ensure quality, which indirectly relieves the problem of insufficient multi-device testing. Take Youku's playback stutter metric as an example: when the user sees the buffering spinner appear and then disappear during a video, that is one stutter. At that moment, the instrumentation records the stutter duration and reports it to the big data platform.
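A minimal sketch of what such instrumentation might look like (written in Python for illustration; a real client would implement this natively, and the event schema is an assumption):

```python
# Sketch of client-side stutter instrumentation: time the buffering spinner
# from appearance to disappearance and report one stutter event.
import time

class StutterTracker:
    def __init__(self, report):
        self.report = report          # callback that ships events upstream
        self._started_at = None

    def on_buffering_start(self):
        self._started_at = time.monotonic()

    def on_buffering_end(self, session_id):
        if self._started_at is None:
            return
        duration_ms = int((time.monotonic() - self._started_at) * 1000)
        self._started_at = None
        # One stutter event: spinner appeared, then disappeared.
        self.report({"event": "stutter",
                     "session_id": session_id,
                     "duration_ms": duration_ms})

tracker = StutterTracker(report=print)   # print stands in for the uploader
tracker.on_buffering_start()
time.sleep(0.1)                          # simulated buffering
tracker.on_buffering_end(session_id="abc123")
```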
On top of this metric, the big data platform can carry out all kinds of quality work, for example (a sketch of one such query follows the list):
- In one playback session, how many stutters occur, and what is the average stutter duration?
- How many stutters will cause a user to quit the app?
- Under which network conditions does stutter occur?
- Does the new version stutter more than the previous one?
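As referenced above, here is a sketch of how the first question could be answered from the reported events (pandas, with illustrative field names and data):

```python
# Sketch: answer "how many stutters per playback, and the average duration?"
# from reported stutter events.
import pandas as pd

events = pd.DataFrame({
    "session_id":  ["s1", "s1", "s2", "s3", "s3", "s3"],
    "duration_ms": [120, 340, 80, 200, 150, 90],
})

per_session = (events.groupby("session_id")
                     .agg(stutter_count=("duration_ms", "count"),
                          avg_duration_ms=("duration_ms", "mean")))
print(per_session)
print("mean stutters per playback:", per_session["stutter_count"].mean())
```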
Correspondingly, the big data quality platform's functions divide roughly into real-time metrics, monitoring and alerting, data analysis, fault location and troubleshooting, and grayscale verification.
Monitoring and alerting
Traditional monitoring of server performance metrics and call chains is relatively mature; in general, an anomaly can be traced back to its cause. But in the mobile Internet era, "quality" means not only online faults but also user experience. If a problem the user perceives is discovered late, or not at all, all other efforts are wasted.
So our focus is on client-side instrumentation: buried-point data related to the playback experience (stutter, playback success rate), performance metrics (startup time, crashes), key service response data (CDN node data), user behavior data (clicks, dwell time), and so on. These are classified and computed into abstract CUBEs, and we monitor the problems that show up in these signals to measure whether our quality is good or bad.
The big data quality platform involves multi-dimensional computation. For example, when playback success rate drops: is it on Android or iOS? Nationwide or in a specific province? Does it affect users on all carriers, or only Unicom users? This is a question of how we slice and drill down along dimensions. If the granularity is too coarse, we cannot pinpoint the problem; if it is too fine, it wastes enormous storage and computation.
This requires working with the business to define the monitoring dimensions needed, then splitting the error data stream out through ETL and landing it in the aggregation engines, Elasticsearch and Druid, to further refine dimensions and narrow the alert scope from the "big picture" down to the "small picture". For example, when Beijing Unicom saw a decline in playback success rate, aggregation showed that the erroring CDN IPs were highly concentrated, so the alert could be handed directly to the network service location system for handling. Monitoring also involves explorations in real-time performance, accuracy, and alert condition models, which we will discuss further in the QCon talk.
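A sketch of such a drill-down as a nested Elasticsearch aggregation (the index, field names, and the 0/1 `success` flag are assumptions for the example):

```python
# Sketch of a drill-down: playback success rate by OS, then province, then
# ISP, so an alert can narrow from "success rate dropped" to one slice.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(index="playback-metrics", body={
    "size": 0,
    "aggs": {
        "by_os": {
            "terms": {"field": "os.keyword"},
            "aggs": {
                "by_province": {
                    "terms": {"field": "province.keyword"},
                    "aggs": {
                        "by_isp": {
                            "terms": {"field": "isp.keyword"},
                            "aggs": {
                                # success is assumed to be a 0/1 flag,
                                # so its average is the success rate
                                "success_rate": {"avg": {"field": "success"}}
                            },
                        }
                    },
                }
            },
        }
    },
})
```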
Intelligent analysis
Major companies are all doing Trace-related work now, and Alibaba Youku's big data platform is no exception. On top of the original server-side log collection, we gather client buried-point logs, client remote-debug logs, service change operations, and third-party service logs (CDN, etc.). This unification of data helps once a problem is found: when the data is in hand and clearly tells us something is wrong, how should we analyze it?
First, if there is an error code, we can resolve issues one by one. But some problems are not caused by explicit errors. For example, one day we received a user complaint that video playback was extremely laggy, appearing out of nowhere. We checked the logs and found no errors at all. Finally, a careful colleague noticed that the user's network IP was in Beijing, but the CDN IP they were routed to was in Guangzhou. For this type of problem, we extract the two IP strings and match them by region analysis.
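A minimal sketch of that region-matching check (the IP-to-region table is a stub; a real system would use an IP geolocation database):

```python
# Sketch: flag sessions whose client IP and CDN IP resolve to different
# regions, the pattern behind the Beijing-vs-Guangzhou case above.
IP_REGION = {
    "1.2.3.4": "beijing",    # user IP (illustrative)
    "5.6.7.8": "guangzhou",  # CDN node IP (illustrative)
}

def region_mismatch(client_ip, cdn_ip):
    client_region = IP_REGION.get(client_ip)
    cdn_region = IP_REGION.get(cdn_ip)
    return (client_region is not None and cdn_region is not None
            and client_region != cdn_region)

print(region_mismatch("1.2.3.4", "5.6.7.8"))  # True: Beijing vs Guangzhou
```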
Second, we combine the data to build a location knowledge base, abstracting historical faults, online bugs, and bad cases into our diagnostic library.
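A toy sketch of what such a diagnostic library might look like as symptom-to-cause rules (the entries are illustrative, not Youku's actual rules):

```python
# Sketch of a diagnostic knowledge base: map symptom signatures distilled
# from historical faults to known root causes.
DIAGNOSTIC_RULES = [
    ({"error_code": "CDN_5XX", "ip_concentrated": True},
     "CDN node failure: hand off to network service location system"),
    ({"region_mismatch": True},
     "CDN scheduling error: user routed to a distant node"),
]

def diagnose(symptoms):
    for signature, cause in DIAGNOSTIC_RULES:
        if all(symptoms.get(k) == v for k, v in signature.items()):
            return cause
    return "unknown: escalate to manual analysis"

print(diagnose({"region_mismatch": True}))
```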
Third, something we are doing now: a knowledge base is built by people, which is essentially like supervised learning, but we want to be able to locate problems in an unsupervised way. For example, we run large promotional events, and sometimes we find that the user conversion rate from the first page to the second page triggers an alert (only 10%). We then retrieve the full-link data for this group of users (not just server-side logs) and run cluster analysis over their various features. Surprisingly often, most of the affected users cluster around common features: the problem may be caused by timeouts on the same server behind one service, or by page-loading adaptation issues on the same client device. So the future direction focuses on combining data with algorithms to mine greater value.
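A sketch of the clustering idea (scikit-learn KMeans over hand-made session features; everything here is illustrative):

```python
# Sketch: cluster affected sessions' features to surface a common cause.
import numpy as np
from sklearn.cluster import KMeans

# Each row: [server_id, device_model_id, timeout_flag] for one bad session.
features = np.array([
    [3, 1, 1], [3, 2, 1], [3, 5, 1], [3, 1, 1],   # timeouts on server 3
    [7, 4, 0], [8, 4, 0],                          # device model 4 issues
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
for cluster in set(labels):
    members = features[labels == cluster]
    print(f"cluster {cluster}: {len(members)} sessions, "
          f"feature means = {members.mean(axis=0).round(2)}")
```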