Data Analysis at Startups (II): Operational Data Systems


As the second article in this series, this piece explores operational data systems at the application layer. Operational data is almost always the starting point of an Internet startup's data work and the primary object of its early data services. The article focuses on what we did, what problems we ran into, and how we solved them and implemented the corresponding functionality.

Early Data Services

Soon after the product launched, backend developers began receiving private messages from operations colleagues: "Can you check how many users have registered, and where they came from?" After a few rounds of this, it became clear the process was too inefficient: developers had to set aside busy development tasks to run queries and statistics, while operations colleagues waited a long time for the data. So we negotiated a better approach and finally agreed: operations colleagues provide a template describing the data they need, backend developers export the data into Excel files according to that template, and operations can then filter and analyze the statistics on their own. This was the early data service; its structure is shown in the figure.


The approach is simple and clear: a backend developer writes a Python script based on the data template, pulls data from the business database, does some analysis and aggregation, writes the results to an Excel file, and then sends an email notifying the operations colleague to pick up the file. However, as requirements grew and became more fine-grained, and as data volumes increased, more and more problems surfaced. They are listed here; some are solved in this article, and solutions to the others will be presented in later articles.

    • Worker scripts multiplied and were scattered across many places, with a lot of duplicated labor and code; a single logic change meant editing many files.
    • Because the scripts accessed the database through an ORM, much of the code considered only the logic and ignored query efficiency, so some reports took more than 10 hours to produce (the performance problems of querying data in loops, and their optimization, are explained in a separate article).
    • Results were discarded and data was not shared; every worker ran its own logic to recompute it.
    • Workers relied on crontab for triggering, with no supervision; runs were often interrupted by dirty data, and we only found out after operations colleagues reported a missing report.
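The early pipeline described above can be sketched roughly as follows. This is a minimal stand-in, not the author's actual script: sqlite3 and csv substitute for the production MySQL database and Excel output, and all table and column names are illustrative.

```python
import csv
import io
import sqlite3

def export_registration_report(conn, out):
    """Pull daily registration counts from the business DB and write a
    report. Stand-in for the real scripts, which queried MySQL via an
    ORM and wrote Excel files; table/column names are illustrative."""
    rows = conn.execute(
        "SELECT date(created_at) AS day, channel, COUNT(*) AS users "
        "FROM users GROUP BY day, channel ORDER BY day, channel"
    ).fetchall()
    writer = csv.writer(out)
    writer.writerow(["day", "channel", "users"])
    writer.writerows(rows)
    return len(rows)

# Usage: an in-memory database standing in for the business DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (created_at TEXT, channel TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("2016-12-01", "app"), ("2016-12-01", "app"), ("2016-12-01", "web")],
)
buf = io.StringIO()
n_rows = export_registration_report(conn, buf)
```

Every such report re-implemented its own query and export logic, which is exactly the duplication the dashboard's common library later removed.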
Operational Data Dashboard

As the business developed, providing data services only in report form was no longer enough. On the one hand, management expected to see clear numbers first thing in the morning, to understand recent operational results and trends. On the other hand, although the reports provided detailed data, they still required manual filtering and statistics to yield results, so everyone who wanted the numbers had to redo that work, and not all business people are proficient in Excel.
So we began planning a dashboard system to provide data visualization as a Web service. But what exactly should the dashboard do? Since the product managers and designers were busy with the product itself, we had to work out the what and the how ourselves. Fortunately, I had used Baidu Tongji (Baidu's analytics service), whose statistics are laid out quite clearly; combining that with our company's business, a few ideas took shape:

    • Data content comprises two parts: core indicator data and chart analysis. The former is graph-based and should quickly show numbers and trends, such as a daily-increment chart of registrations; the latter uses a variety of charts to present results over a period of time, such as the top 10 brands users were interested in during October.
    • Data types include: C-side core indicators, B-side core indicators, core analyses, and thematic-activity indicators and analyses. The first two are indicator data for the consumer (C) side and business (B) side respectively; core analyses are comprehensive analyses such as conversion-rate analysis; thematic activities target specific large-scale operational campaigns.
    • Data dimensions include: time, city, and B-side brand. Time is the most basic and important dimension; the city dimension lets us analyze the state of each operating region; the B-side brand dimension mainly serves the B-side business.

Sorting all this out produced the mockup (simplified version), which basically covers the ideas above. It is not particularly beautiful, but it is for internal use; what matters is that the important data is displayed accurately and quickly.


Having figured out what to build, the next step was to bring the idea down to earth and think about how to build it.

Overall architecture

The overall architecture of the system, shown in the figure, is based mainly on the following considerations:

    • Front-end/back-end separation. The front end is only responsible for loading charts, requesting data, and displaying it, with no data-processing logic; the back end is responsible for generating the data and exposing REST APIs for the front end.
    • Offline and real-time computation coexist. To speed up data retrieval, curve-indicator data is computed offline, so historical data is ready for the front end to display; chart-analysis data is computed in real time, with speed depending on the amount of data in the selected time window, and is cached when necessary.
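The offline/real-time split above can be illustrated with a small sketch. This is not the author's actual API: the class and method names are illustrative, and a plain dict stands in for the offline store (MongoDB in the real system).

```python
import time

class DashboardData:
    """Sketch of the back-end split: curve indicators are served from an
    offline store of precomputed daily points, while chart analyses are
    computed per request and cached. All names here are illustrative."""

    def __init__(self, offline_store, cache_ttl=300):
        self.offline_store = offline_store  # MongoDB in the real system
        self.cache_ttl = cache_ttl
        self._cache = {}

    def curve_indicator(self, metric, start, end):
        # Offline path: just filter precomputed points; no heavy work.
        return [p for p in self.offline_store.get(metric, [])
                if start <= p["day"] <= end]

    def chart_analysis(self, name, compute, *args):
        # Real-time path: compute on demand, cache for cache_ttl seconds.
        key = (name, args)
        hit = self._cache.get(key)
        if hit is not None and time.time() - hit[0] < self.cache_ttl:
            return hit[1]
        result = compute(*args)
        self._cache[key] = (time.time(), result)
        return result
```

The REST layer would simply map each chart's URL onto one of these two paths.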


Front-end implementation

The front end of the dashboard system is not complicated; as mentioned, we did not invest much in styling and focused on displaying the data. The first task was to choose a chart library. I chose Baidu ECharts, which provides a rich set of chart types, excellent visualization, and very good mobile support; just as importantly, it has detailed examples and documentation. It turned out that ECharts is indeed powerful and satisfied all of our needs.
With the chart library chosen, the next question was how to gracefully load dozens of charts or more. This requires finding what the charts have in common (behaviors and attributes) and abstracting it. In general, displaying a chart with ECharts takes four steps (per the official documentation): first, include the ECharts JS file; second, declare a div as the chart's container; third, initialize an ECharts instance, bind it to the div element, and set the initial configuration options; fourth, load the chart's data and display it. So the behaviors break down into two kinds, initialization and data updates, and the attributes are mainly the initial configuration options and the data.
Based on this, I implemented front-end data loading with a "pattern + engine" approach. First, each chart is configured in JS as JSON; these configurations are the pattern. For example, the configuration below corresponds to one chart: elementId is the id of the chart's container, title is the chart title, names are the names of the chart's curves, url is the API that provides the data, and loader names the chart engine that should load it. A page's charts consist of a set of such configuration items.

{
    elementId: 'register_status_app_daily',
    title: 'App Registration Stats (Daily Increment)',
    names: ['Users'],
    url: '/api/dashboard/register_status_daily/',
    loader: 'line_loader'
}

When the page loads, a loader engine instance is created for each configuration item in the pattern; the loader initializes the chart and updates its data. Each loader corresponds to one ECharts chart type, because different chart types differ in how they are initialized and how their data is loaded. The program's class diagram is shown below.


Back-end implementation

As mentioned in the section on early data services, there was a lot of duplicated labor and code, so when implementing the dashboard back end I started by building a common library for data analysis, which accounted for a large share of the workload. This underlying library does not target any particular business need; it is responsible for three things: first, encapsulating data-source connection methods; second, encapsulating time-series generation, producing series at day, week, and month intervals; and third, encapsulating basic data query, cleaning, statistics, and analysis methods that produce formatted data, which is the most important part.
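For the second responsibility, a minimal sketch of time-series generation might look like this; the function name, signature, and interval labels are my own, not the library's actual API.

```python
from datetime import date, timedelta

def time_series(start, end, interval="day"):
    """Generate period start dates between start and end at day, week,
    or month granularity -- the kind of helper the common library
    encapsulates. Weeks snap to Monday, months to the 1st."""
    points = []
    cur = start
    if interval == "day":
        while cur <= end:
            points.append(cur)
            cur += timedelta(days=1)
    elif interval == "week":
        cur -= timedelta(days=cur.weekday())  # snap back to Monday
        while cur <= end:
            points.append(cur)
            cur += timedelta(weeks=1)
    elif interval == "month":
        cur = cur.replace(day=1)  # snap back to the 1st
        while cur <= end:
            points.append(cur)
            # advance to the first day of the next month
            cur = (cur.replace(day=28) + timedelta(days=4)).replace(day=1)
    else:
        raise ValueError("unknown interval: %s" % interval)
    return points
```

Every analyzer can then iterate over the same series instead of rolling its own date arithmetic.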
With the underlying common library in place, the code structure became much cleaner. On top of it we then built the analyzers. An analyzer fulfills a specific data-analysis need: each analyzer is responsible for producing one or more data indicators, and each graph/chart's data is owned by one analyzer. Offline and real-time computation invoke the corresponding analyzer to produce the data, triggered by the schedule and by Web requests respectively. The back end is therefore implemented in three layers, as shown.
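The analyzer layer described above might be sketched like this; the class and method names are illustrative, not the real code.

```python
class Analyzer:
    """Sketch of the analyzer layer: each analyzer owns one or more
    indicators and turns raw rows into the formatted data a graph/chart
    needs. Run by the schedule (offline) or a Web request (real time)."""

    def fetch(self, start, end):
        # Query raw rows via the underlying common library.
        raise NotImplementedError

    def format(self, rows):
        # One point per day, in the shape the front-end loaders expect.
        return [{"day": day, "value": value} for day, value in rows]

    def run(self, start, end):
        return self.format(self.fetch(start, end))

class DailyRegistrationAnalyzer(Analyzer):
    """Example analyzer for the daily registration indicator."""

    def __init__(self, source):
        self.source = source  # callable standing in for a DB query

    def fetch(self, start, end):
        return [(d, v) for d, v in self.source() if start <= d <= end]
```

The same analyzer can thus serve both the nightly offline job and an on-demand API call.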


Finally, a word about offline data. Offline computation is currently triggered by the schedule: at 00:00 each day, the previous day's data is generated according to the principle of "one data point per day for each indicator in each dimension"; the analyzers described above produce the formatted data, which is stored in MongoDB. Because the query patterns are simple, a compound index is enough to solve the efficiency problem. The data volume is currently around 5 million records and there are no performance problems; later we may consider migrating out some historical data, but that is a topic for another day.
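The storage principle, one document per (indicator, dimension, day) with that triple doubling as the compound index, can be illustrated with an in-memory stand-in for the MongoDB collection; all names here are mine, not the real schema.

```python
class OfflineStore:
    """In-memory stand-in for the MongoDB collection: one document per
    (indicator, dimension, day). That triple is the compound index in
    the real system, so re-running the nightly job stays idempotent."""

    def __init__(self):
        self._docs = {}

    def upsert(self, indicator, dimension, day, value):
        # A re-run of the 00:00 schedule overwrites instead of duplicating.
        self._docs[(indicator, dimension, day)] = {
            "indicator": indicator, "dimension": dimension,
            "day": day, "value": value,
        }

    def query(self, indicator, dimension, start, end):
        # Simple range scan over the compound key, sorted by day.
        return sorted(
            (doc for (i, dim, day), doc in self._docs.items()
             if i == indicator and dim == dimension and start <= day <= end),
            key=lambda doc: doc["day"],
        )
```

In MongoDB the equivalent would be a unique compound index on those three fields plus upsert writes.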

Data Reports

Once the dashboard was online, we began to consider gradually retiring the early data reporting service to reduce maintenance costs. The operations colleagues, however, wanted to keep some of the reports: although the dashboard provides many data indicators and analyses, some work requires more detailed data, such as settling payments with registered campus agents, or calling back newly registered users. After several rounds of sorting and negotiation, six data reports were retained. At the same time, businesses on the B side expected to be able to export their own data from the management console. Combining both needs, I built a new data reporting system.


The new data reporting system is divided into three stages: triggering, execution, and notification. Internal reports are still triggered by the schedule, which launches the corresponding worker process; reports for external users are triggered from the Web front end through a REST API, which adds the corresponding task to a Celery queue for execution. Execution is handled by a set of exporters: an exporter acquires the data, shapes it into a format suitable for writing to Excel, and writes the Excel file; the data-acquisition part relies on the underlying common library described earlier. Finally, email notifications are sent uniformly.
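An exporter's three responsibilities can be sketched as follows. This is illustrative, not the real code: the writer and notifier are injected so the flow stays visible, whereas the production system writes Excel files and sends e-mail.

```python
class Exporter:
    """Sketch of an exporter: fetch data, shape it into sheet rows,
    write the file, then notify. All names are illustrative."""

    def __init__(self, name, fetch, write_file, notify):
        self.name = name
        self.fetch = fetch            # data acquisition (common library)
        self.write_file = write_file  # an Excel writer in production
        self.notify = notify          # an e-mail sender in production

    def to_sheets(self, data):
        # Shape raw data into {sheet_name: rows}, ready for Excel.
        return {self.name: [list(row) for row in data]}

    def run(self):
        sheets = self.to_sheets(self.fetch())
        path = self.write_file(self.name, sheets)
        self.notify(self.name, path)
        return path
```

The same class then runs unchanged whether the schedule or a Celery task triggers it.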
Bearing in mind the frequent report-generation failures of the early data service, I handled exceptions in the new data reporting system in two ways:

    • Schedule-triggered tasks are monitored with Airflow (detailed in a subsequent article), manually triggered tasks are monitored by Celery, and email notifications are sent to developers when an exception occurs.
    • When an Excel file consists of multiple sheets and an exception occurs while generating one of them, there are two common strategies: discard the entire file, or keep the other sheets and continue generating the file. Here, internal reports use the second approach; external reports are held to a stricter standard and use the first.
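The two strategies from the second point can be sketched as a single flag on a workbook-building helper; this is a hypothetical function illustrating the idea, not the real code.

```python
def build_sheets(sheet_builders, strict):
    """sheet_builders maps sheet name -> callable returning that sheet's
    rows. strict=True discards the whole file on any failure (external
    reports); strict=False keeps the sheets that succeeded (internal
    reports). A hypothetical helper illustrating the two strategies."""
    sheets, failed = {}, []
    for name, build in sheet_builders.items():
        try:
            sheets[name] = build()
        except Exception as exc:
            if strict:
                # External reports: abandon the file and let the monitor
                # catch the exception and mail the developers.
                raise
            failed.append((name, str(exc)))
    return sheets, failed
```

For internal reports, the `failed` list can go into the notification mail so developers still hear about the bad sheet.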


The above describes the development and current state of our company's operational data systems. The dashboard and data reporting systems have both stabilized and now provide more than 90% of our operational data services. Of course, as data volumes grow and the business evolves, we will face new challenges.



(End of this article. Original address: http://blog.csdn.net/zwgdft/article/details/53467974)
Bruce, 2016/12/07


