"Case Background"
Software development has gone through three stages, from waterfall to agile to lean, and each R&D philosophy places different demands on quality. In the 1980s and 1990s, hardware performance grew along Moore's law, high-level languages appeared, personal computers became widespread, and demand for software exploded. Practice showed that complex, large-scale software projects failed at a high rate, which created a strong demand for software engineering theory, and the waterfall model stepped onto the stage. Under this model, the Test team's job was to deliver software that perfectly matched its specification, and the core problem was how to cover the test points efficiently and at low cost, so this was the era in which the various test design methods made considerable progress.
Then, with the rise and fall of the Internet in the 2000s, Internet companies needed to respond quickly to constantly changing user demands, and the core requirement of software engineering shifted: away from building strictly to the blueprint, waterfall style, and toward a balance between quality and efficiency. In this period the agile model prevailed; to achieve agility, and to practice continuous integration and automation, most Test teams threw themselves into automating their work, and a wide variety of tools and platforms emerged, a trend that continues today.
Lean builds on agile by closing the loop from product to user, the so-called "small, fast steps": put the product in front of users as early as possible, collect their data and opinions, and keep revising the product. At an overseas conference I heard a memorable line: "If you are not embarrassed by the first version of your product, you've launched too late." The lean model works especially well in the Internet industry, where users vote with their feet; Facebook famously develops its products in a similar way. Collecting feedback on new features and folding it back into product iteration is becoming ever more important for Internet products.
Having taken the long way around the background, let us get to the point. Besides analyzing user behavior, feedback can be obtained by actively collecting user complaints (customer service channels, in-product feedback forms) and by passively gathering feedback from outside sources; the latter is an important complement for driving product iteration. This share describes how Baidu built its feedback system and quality closed loop, and covers the key points and details of the architecture for collecting third-party public opinion. We hope it provides useful reference for companies and teams with similar needs.
"Solutions Overview"
The overall architecture is divided into three parts: data capture, data cleaning, and data output. The difficulty of data capture lies in coping with the format changes and blocking strategies of third-party data sources; keeping the data flowing on time in a changeable internal and external environment is the harder part of the engineering practice. Data cleaning is essentially about purifying the captured feedback data, and how to balance latency, precision/recall, and cost is the topic that needs discussion. The challenge of data output is to balance the relationship between a well-built piece of infrastructure and the business: we have seen many similar systems ultimately bogged down in a swamp of diverse business requirements, and we offer some ideas for avoiding that. See the overall architecture diagram.
First, Data Capture
Third-party public opinion data sources include Weibo, Baidu Tieba, the major app stores, news, search, and forums. The design mainly has to satisfy three requirements: 1) cope with format changes across many data sources at low maintenance cost; 2) respond to emergencies and ad-hoc needs with fast, large-scale backtracking; 3) adapt automatically to the blocking strategies of the data sources.
Baidu is first and foremost a search engine, so it has deep accumulated experience in data capture. Using the node-inspection features of most browsers, we can easily obtain the DOM path of any node in a web page. A single crawl worker takes that structure as configuration and quickly extracts the desired element information, so a unified program, driven only by parameters, can capture arbitrary content on any page. Each crawl worker is assigned a list of links to fetch, together with the corresponding configuration, by the "Scheduler"; workers run independently, without interfering with one another, and handle their own delays and exceptions. When a crawl worker does not return for a long time, or returns a failure, the scheduler retries the task in a new crawl worker assigned to a different IP segment, and keeps doing so until it finally succeeds or has failed too many times. Blocking policies of external data sources and upgrades of page structure often cause crawl failures; the system should detect such problems promptly and fall back to human intervention. Because the configuration items are standardized, intervention only requires confirming that the failing page can still be opened, relocating the changed element nodes, testing the configuration, and re-crawling. This work can be done by product staff, interns, or outsourced personnel with no engineering background, which greatly reduces maintenance cost. The above is a brief description of the crawl logic; it also answers the first requirement: through decoupled, isolated design, maintaining a large number of new and changed data sources is simple and fast.
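Below is a minimal sketch of the configuration-driven extraction idea, assuming an XPath-style configuration per data source; the source name, selectors, and field names are illustrative, not Baidu's actual configuration.

```python
# Config-driven extraction: each data source is described by a set of XPath
# selectors, so the same worker can crawl any page layout. Selectors below
# are illustrative assumptions.
import requests
from lxml import html

SOURCE_CONFIGS = {
    "weibo_search": {
        "item": "//div[@class='feed-item']",          # one node per feedback entry
        "fields": {
            "author":  ".//a[@class='name']/text()",
            "content": ".//p[@class='txt']/text()",
            "time":    ".//span[@class='time']/text()",
        },
    },
}

def crawl(url, source):
    """Fetch one page and extract items according to the source's configuration."""
    cfg = SOURCE_CONFIGS[source]
    tree = html.fromstring(requests.get(url, timeout=10).text)
    items = []
    for node in tree.xpath(cfg["item"]):
        record = {}
        for field, xpath in cfg["fields"].items():
            values = node.xpath(xpath)
            record[field] = values[0].strip() if values else None
        items.append(record)
    return items
```

Because the worker only reads a configuration, pointing it at a new or changed data source is a matter of editing selectors, which is exactly what lets non-engineers maintain it.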
A more noteworthy part of the design is how it handles two kinds of anomalies: bursts in throughput and blocking by data sources. Crawl tasks can pile up for many reasons: an important business line may ask us to backfill six months of data, a product campaign may cause a burst in data volume, or a system failure or a triggered block may delay a large number of crawl tasks. Here is how the "Scheduler" works. Within the capacity limits of the crawl system, the Scheduler allocates the system's resources, chiefly bandwidth, IP addresses, and accounts (some crawls require a logged-in state). When tasks are too dense to all be completed, it drains the high-priority queue first and then fills the remaining gaps with lower-priority tasks, and so on. For example, a high-priority task needs to crawl a data source but is limited by that source's crawl frequency while bandwidth is still left over; the surplus is then filled with the remaining tasks, so that crawl throughput stays as close to saturation as possible. In addition, the system supports expansion with new hardware resources and with temporarily blocked IP segments that have "rested" long enough to be reused.
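The following sketch illustrates the gap-filling idea under assumed per-source quotas; the task format, quota numbers, and bandwidth unit are hypothetical simplifications.

```python
# Priority scheduling with gap filling: tasks are drained in priority order,
# but each data source has its own per-window fetch quota, so leftover
# bandwidth is filled with lower-priority tasks from other sources.
import heapq
from collections import defaultdict

def schedule(tasks, quotas, bandwidth):
    """tasks: list of (priority, source, url); a lower priority value is more urgent."""
    heapq.heapify(tasks)
    used = defaultdict(int)          # requests issued per source in this window
    deferred, batch = [], []
    while tasks and bandwidth > 0:
        prio, source, url = heapq.heappop(tasks)
        if used[source] >= quotas.get(source, 0):
            deferred.append((prio, source, url))   # source is at its rate limit
            continue
        batch.append(url)
        used[source] += 1
        bandwidth -= 1
    for t in deferred:               # tasks that could not run go back for the next window
        heapq.heappush(tasks, t)
    return batch
```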
Next, the blocking strategies. The main reasons a site blocks an account or IP are to protect its content and to fend off attacks. A simple blocking policy is just counting: within one-minute, five-minute, fifteen-minute, and one-hour windows, if the same account or IP exceeds a certain number of accesses it is treated as non-human behavior, and the site responds with phone number verification or similar challenges; if the crawl system is not fault-tolerant and keeps retrying mechanically, it will be banned permanently and can only be unblocked after manual communication. More complex policies rely on offline analysis of some numeric signal, such as the variance of access intervals, the variance of access counts per unit time, or the distribution of accessed link types; statistical methods identify outliers, which are then randomly challenged with verification. For the former, we set up isolated test resources and gradually push a new data source toward its limit to reverse-engineer the policy's configuration, so that capture stays efficient; for the latter there is no good answer, but fortunately the vast majority of data sources do not use such elaborate measures. In short, our approach to blocking is: quarantine an unfamiliar data source for a period of time so it cannot affect the IP segments used by the main crawl; detect blocks promptly and stop crawling to avoid being blacklisted further; and prepare several sets of fallbacks in advance so the data arrives under all circumstances.
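As an illustration of the counting-style policies described above, here is a hedged sketch of how a crawler might keep each IP's request rate under assumed per-window limits; the numbers are made-up examples, not any site's real thresholds.

```python
# Sliding-window budget per IP: before issuing a request, check that the
# request counts in the 1-minute, 5-minute, 15-minute and 1-hour windows are
# all below the limits reverse-engineered for the data source.
import time
from collections import deque

WINDOWS = {60: 20, 300: 80, 900: 200, 3600: 600}   # window seconds -> max requests (illustrative)

class IpBudget:
    def __init__(self):
        self.history = deque()                       # timestamps of past requests

    def allow(self, now=None):
        now = now or time.time()
        # drop entries older than the largest window
        while self.history and now - self.history[0] > max(WINDOWS):
            self.history.popleft()
        for window, limit in WINDOWS.items():
            recent = sum(1 for t in self.history if now - t <= window)
            if recent >= limit:
                return False                         # issuing now would risk a block
        self.history.append(now)
        return True
```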
Finally, the fallback strategy. We reserve enough unused IP segments and accounts to swap in promptly when a block occurs. The ultimate fallback is crowdtesting: crawling from user-side clients and sending the data back. "Crowdtesting" lets participants share idle resources on their PC or phone and be paid for the resources used. A version of our crawler can run on any such user device, where its crawl frequency and access pattern are very hard for blocking policies to catch; the only drawback is that data obtained this way costs more, but as a fallback it is appropriate. Looking back at the original design, the separation of the crawl workers from the scheduler, each able to work independently, shows its value here.
Second, Data Cleaning
Data cleaning is divided into three stages: data filtering, data relevance, and sentiment tagging. There is also an opinion-aggregation part, but it is more complex and will not be expanded on here.
Data filtering is mainly about screening out marketing copy of all kinds. In particular, many products or campaigns are designed to encourage users to leave a sharing trace on Weibo and other media. Such traces are not natural user feedback, and in large volume they interfere with sentiment analysis. Converted into an algorithmic problem: given a large list of strings, find the substrings that appear more often than some threshold. The naive algorithm is clearly super-linear and does not scale, so our optimization is to sample, cluster, and extract templates, and then filter the full text with an O(N) pass. Another kind of filtering is much simpler: dropping very short filler posts such as "bump" or "like".
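A minimal sketch of the two-step filter follows, under the assumption that frequent fixed-length shingles mined from a sample are a good-enough stand-in for marketing templates; the production clustering is more involved, and the lengths and thresholds are illustrative.

```python
# Step 1: mine candidate templates from a sample (frequent shingles).
# Step 2: one linear pass over the full corpus, dropping anything that
# contains a template or is too short to be meaningful feedback.
from collections import Counter

def mine_templates(sample, shingle_len=12, min_count=20):
    counts = Counter()
    for text in sample:
        for i in range(len(text) - shingle_len + 1):
            counts[text[i:i + shingle_len]] += 1
    return {s for s, c in counts.items() if c >= min_count}

def filter_corpus(corpus, templates, min_len=5):
    kept = []
    for text in corpus:
        if len(text) < min_len:                 # "bump" / "like" style filler
            continue
        if any(t in text for t in templates):   # looks like shared marketing copy
            continue
        kept.append(text)
    return kept
```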
Data relevance is mainly about removing ambiguous data. For example, Baidu Nuomi (literally "glutinous rice"), an important strategic product of Baidu, collides with the snack "glutinous rice ball" as well as with the nickname "Nuomi" of a celebrity's baby, so mentions must be classified by context. The sentiment analysis part mainly involves recognizing irony and context; here are a few examples. "Can Baidu Maps be any more reliable?" is obviously negative, but the product name (Baidu Maps) plus the keyword may cause it to be identified as positive. "Baidu News reports a large-scale data center outage in East China" has nothing to do with the Baidu News product itself, but the keywords ("data center outage") may cause it to be mistaken for negative feedback. Word segmentation, text classification, and the rest are established fields of machine learning in their own right; we will not expand on the algorithms and strategies here, but focus instead on the engineering structure that supports data cleaning.
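The share does not disclose the actual models, so as a stand-in here is a hedged sketch of a relevance classifier (is a mention about the Baidu Nuomi product or about the snack?) using character n-gram TF-IDF with logistic regression, a common baseline for short feedback text.

```python
# A baseline relevance classifier: character n-grams avoid committing to a
# particular word segmenter, and logistic regression gives a calibrated score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_relevance_model(texts, labels):
    """texts: raw feedback strings; labels: 1 = about the product, 0 = unrelated."""
    model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model

# model = build_relevance_model(train_texts, train_labels)
# model.predict(["the glutinous rice dumplings taste great"])  # ideally: unrelated
```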
After capture, the raw data is stored and then processed automatically by the system; from the resulting intermediate data, samples are drawn at random for manual spot checks. Each data source has a precision/recall threshold for each product form. When the measured quality is above the threshold, the data goes straight to the next stage until it is consumed by users; when the quality is below it, another branch is triggered that initiates manual labeling of the full batch, and the manually labeled data is fed back into model training to improve the quality of the next round of cleaning. A business line only needs to set the minimum acceptable precision; everything from data generation to the model's self-optimizing training is automatic, which greatly reduces maintenance cost.
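A sketch of this threshold branch is shown below; the hooks for manual checking, publishing, full labeling, and retraining are placeholders injected by the surrounding pipeline, not real interfaces.

```python
# Self-optimizing branch: spot-check a random sample of each batch; if measured
# precision meets the product line's floor, publish the batch; otherwise send
# the whole batch for manual labeling and feed the labels back into training.
import random

def process_batch(batch, precision_floor, human_check, publish,
                  request_full_labeling, retrain, sample_size=100):
    if not batch:
        return
    sample = random.sample(batch, min(sample_size, len(batch)))
    precision = sum(1 for item in sample if human_check(item)) / len(sample)
    if precision >= precision_floor:
        publish(batch)                          # quality above the floor: consume directly
    else:
        labeled = request_full_labeling(batch)  # trigger full manual labeling
        retrain(labeled)                        # feed labels back into model training
        publish(labeled)
```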
Third, Data Output
After cleaning, the data enters the "feedback data warehouse." Logically, the warehouse is a multi-key, multi-value store; the main filter conditions are time, product line, data source, relevance, sentiment, and whether the data has been cleaned. With these conditions as keys, any combination of them retrieves feedback content in real time. Besides data from third-party sources, the warehouse also holds feedback embedded in the products themselves: by calling an API or embedding an SDK, a product's feedback flow can interact with users and proactively collect their feedback.
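A small sketch of the multi-key idea: every record carries the same filter keys, and any combination of them can act as a filter. The field names and values are illustrative, not the warehouse's real schema.

```python
# Each record exposes the same set of filter keys; a query is simply a set of
# key=value conditions that must all hold.
from datetime import datetime

record = {
    "time": datetime(2015, 8, 1, 12, 30),
    "product": "baidu_nuomi",
    "source": "weibo",
    "relevant": True,
    "sentiment": "negative",
    "cleaned": True,
    "content": "refund has been pending for a week",
}

def match(record, **conditions):
    """Return True if the record satisfies every given key=value condition."""
    return all(record.get(k) == v for k, v in conditions.items())

# match(record, product="baidu_nuomi", sentiment="negative")  -> True
```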
The data warehouse is placed after cleaning for the following design reasons. On the one hand, the system we build should remain an infrastructure or tool service rather than doing deep custom development for particular businesses, which requires a clear line between building technical capability and serving business requirements. Keeping the data in the warehouse fully structured makes secondary processing flexible, and that flexibility is the architectural answer to individual business needs. On the other hand, the warehouse's technology choices let us fully reuse the company's existing data and technical resources. For example, to estimate the loss caused by churn, one business wanted to mine the churn rate of users who had left negative feedback, which required joining against the product's own user behavior logs; the data warehouse is the most convenient tool for such requirements. And behind the warehouse sit mature systems for monitoring, operations, BI, and ad-hoc queries, all of which are easy to reuse.
On top of the warehouse we built a real-time index that incrementally publishes data marked as having completed cleaning. Downstream consumers call a standard API, specifying the fields to return, the time window, product lines, data sources, sentiment, tags, keywords, and so on, to retrieve feedback. Because the feedback text is held in an inverted index, a consumer can easily issue a search such as "Baidu Images" + "pornography" and get all relevant feedback from Weibo, Tieba, forums, and other channels within a given period, which is very convenient. On top of this convenient access, besides the standard platform we provide, many other platforms have been built as secondary developments on this data, forming a quality closed-loop ecosystem with feedback data at its core. Some of those scenarios are described in the next section.
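A hedged sketch of what a downstream query might look like; the endpoint, parameter names, and response shape are assumptions for illustration, not the real API.

```python
# Hypothetical consumer-side query against the feedback API: filter by product,
# keyword (matched against the inverted index), sources, sentiment, and time.
import requests

params = {
    "product": "baidu_images",
    "keyword": "pornography",          # full-text match against the inverted index
    "sources": "weibo,tieba,forum",
    "sentiment": "negative",
    "start": "2015-08-01",
    "end": "2015-08-07",
    "fields": "time,source,content",
}
resp = requests.get("http://feedback.example.internal/api/v1/query", params=params)
for item in resp.json()["items"]:
    print(item["time"], item["source"], item["content"])
```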
"Application Scenario Brief"
The third-party public opinion system mainly targets three kinds of scenarios: public opinion monitoring, competitor analysis, and problem recall. The three build on one another layer by layer; each is illustrated below.
Public opinion monitoring is the most intuitive application scenario. Take the search traffic hijacking that occurred repeatedly this summer: when the affected traffic is small, such problems are hard to detect on the server side, but when users' negative feedback about the issue rises noticeably, we can find the problem through third-party feedback data. Public opinion monitoring is also commonly used to understand the stage a product is in. For example, when Nuomi ran several discount campaigns as part of its RMB 20 billion push into the market, there was no historical data because the campaigns were one-off events, and public opinion feedback was a powerful measure of their effect.
Competitor analysis means identifying the competitive strengths and weaknesses of our products against the market's main rival products, and this is where third-party opinion collection has the advantage: after all, feedback channels embedded in our own product lines can never cover third-party products. For example, from the ratio of negative user feedback we found that, compared with Meituan, Baidu Nuomi's refund process drew a noticeably higher proportion of negative comments. Conclusions like these matter for product decisions.
Finally, continuing from the previous section, the data we produce supports the quality closed loop of many business lines. We often find users rapidly spreading certain bad search results; for example, searches for certain terms returned a large number of pornographic image results. Such problems are hard to catch in testing and monitoring, and feedback is a good weapon against them. Before such issues spread widely, our system can capture them in time and hand them to the product line for handling. The cycle from user diffusion to discovery to handling can be as short as hours. Baidu has full-time departments for risk control, marketing, and brand, and each business line also has product managers responsible for operations; our data provides strong technical support for their work.
Summary
The Test team's primary concern is quality. Most of the practice and sharing we hear focuses on helping teams improve quality during the development process, while the "product quality improvement" part is something most teams have not yet begun to touch. Collecting third-party public opinion, providing analysis, facilitating the corresponding iteration, and building a quality closed loop is a new direction for Test teams. We hope this share sparks more collisions of ideas and further thinking.
Baidu will soon publish a technical book, "How to Efficiently Develop a High-Quality Mobile App" (title to be determined). It will be Baidu's first such technical publication; the topics focus on the mobile Internet, and the content covers the whole process of app development, deployment, testing, distribution, monetization, monitoring, and data analysis, helping mobile app developers better understand Baidu's leading technologies, project experience, and in-house tools.
As the industry's leading one-stop testing service platform for mobile applications, Baidu MTC covers the entire life cycle of a mobile application, from development and testing to release and operation, providing solutions to the cost, technical, and efficiency problems developers face in mobile application development and testing. The articles will be published in the MTC Academy (http://mtc.baidu.com/academy/article) and syndicated to other technical forums, and they will be collected and formally published in the first half of this year; please look forward to it!
>> If you have any questions, please feel free to reach out to me.
Third-Party Public Opinion Collection and Quality Closed-Loop Construction