Baidu's big data forecasts, which had been advancing steadily on the strength of their highly accurate World Cup predictions, have hit a small obstacle: the box office forecast for "Golden Age", made by a product still in closed beta, deviated from the actual result. A long media report on the miss drew wide attention in the industry. I have been following the big data forecasting business for some time, and I have some opinions on Baidu's missed "Golden Age" forecast that I need to get off my chest.
First, view inaccurate big data forecasts calmly and objectively
In recent years the term "big data" has appeared frequently in the media, and industries and products related to big data are booming. This February, at the press conference for its "Opinions on Accelerating the Cultivation of Big Data Industry Clusters to Promote Industrial Transformation and Upgrading", the Zhongguancun Administrative Committee said that by 2016 the industries driven by big data in Zhongguancun would exceed 1 trillion yuan. That figure covers Zhongguancun alone; looking at the world as a whole, the future "money" in big data is very impressive. Although the concept has been heavily hyped, the products built on big data are still at an early, exploratory stage. Using big data to make predictions is one example. Baidu's earlier World Cup forecasts, Golden Week travel forecasts, and other products showed relatively high accuracy, but for "prediction" as such, misses are actually quite normal.
In the case of the "Golden Age" box office forecast, let's look at Baidu's official explanation first. Baidu's response to the media did not blame "engineers making mistakes" but pointed directly to the core reason: in the Chinese film market there is very little historical box office data for art films, so the "Golden Age" forecast used a general model rather than a model built specifically for art films, and the final result deviated.
With labels such as Xiao Hong, the Republic of China era, and art film, "Golden Age" is a relatively niche film whose audience is not the mainstream crowd. Data on such films is scarce, and there was no predictive model for this type of film to draw on. In forecasting "Golden Age", Baidu used a general-purpose film model, which led to a large deviation. To forecast accurately in the future, the best solution is clearly to model different types of film separately, and as I understand it the box office forecast, which is still in beta, is already being improved in this direction.
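The idea of modeling each film type separately, and falling back to a general model only where a type has too little history, can be sketched as follows. This is a minimal illustration and not Baidu's actual method: the function names, the `min_samples` threshold, and the input format are assumptions, and the "model" is nothing more than a mean ratio of final box office to some pre-release interest signal.

```python
from collections import defaultdict
from statistics import mean


def fit_ratios(history, min_samples=3):
    """Fit one box-office/signal ratio per genre plus a general fallback.

    history: list of (genre, interest_signal, box_office) tuples.
    All names and the min_samples threshold are hypothetical.
    """
    by_genre = defaultdict(list)
    for genre, signal, gross in history:
        by_genre[genre].append(gross / signal)

    # General model: pools every film regardless of genre.
    general = mean(r for ratios in by_genre.values() for r in ratios)

    # Genre models: only for genres with enough historical samples.
    per_genre = {
        g: mean(rs) for g, rs in by_genre.items() if len(rs) >= min_samples
    }
    return general, per_genre


def predict(genre, signal, general, per_genre):
    """Return (forecast, which_model) for one upcoming film."""
    if genre in per_genre:
        return signal * per_genre[genre], "genre"
    # Too little history for this genre: fall back to the general model,
    # which may be badly calibrated for a niche film.
    return signal * general, "general"
```

When a niche genre has only a handful of past films, `predict` silently uses the pooled ratio, which is dominated by mainstream titles. That is the kind of deviation the explanation above describes, and collecting enough samples for a separate genre ratio is what removes it.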
On the Baidu Forecast platform (trends.baidu.com), the box office forecast icon is greyed out because the product has not officially launched, whereas the forecasts for economic indicators, diseases, tourist attractions, and sports events are fully online and in use. Baidu's box office model needs further improvement, and more parameters need to be added to it, covering dimensions such as the film's attributes, its running time, its share of screenings, and the average ticket price per screening.
From another point of view, though, I think that even if Baidu's forecast misses again after the official launch, that is also quite normal. Nobody really owns a crystal ball. Big data prediction cannot determine that a particular thing will happen; it gives a probability, and all we can do is keep getting closer to that probability. The premise of prediction is to acknowledge that uncertainty exists, and uncertainty varies greatly between fields. The box office and the stock market are more susceptible to human influence and therefore highly uncertain, so forecasting them is harder than forecasting weather, tourism, traffic, or prices.
It is unreasonable to question big data forecasting itself, or box office forecasting itself, because of one failed "Golden Age" prediction. Baidu's fairly good forecasts during the World Cup and Golden Week have already proved the value of big data forecasting; the new field of box office forecasting simply needs more patience and optimization. So is box office forecasting really ineffective in China?
Second, the essence of prediction lies in accumulation and correction
The core points of the article "Why does big data not work in predicting the Golden Age box office?" are as follows: 1, China has accumulated too little box office data; 2, some man-made data interferes with the box office forecast; 3, the prediction model is at an early stage, with omitted variables and sample bias; 4, cinema managers' forecasts are the reliable ones, box office forecasting is meaningless, and it is too early to talk about big data in film forecasting.
Of these views, I agree only with the third. It is an objective fact, and Baidu itself admits that the box office model, still in beta, needs to be improved. But if you think about it, there is no perfect prediction model in the world, in any field. What happens in the next second is affected by many variables. Some can be taken into account in advance; others are hard to monitor even when they are taken into account. Omitted variables and sample bias are permanent problems of prediction. Only by constantly updating the variables, correcting the samples, and upgrading the model can a forecaster keep the prediction close enough to reality.
In the article "Which industries will big data forecasting change?", I summarized the logical basis of big data prediction: every unusual change is preceded by signs, and everything leaves traces to follow; if the pattern linking signs to changes can be found, the change can be predicted. Two things are critical to a forecast: the patterns derived from past data and experience, which map to the predictive model, and the "changes" that can be monitored in real time, which map to variables or real-time data. Big data prediction differs from traditional prediction in being more timely, using new data sources, predicting dynamically, and relying on patterns.
The negative attitude toward box office forecasting comes down first of all to data: film data is too scarce, online data is fragmented, and dirty data is a problem.
1, The "too little accumulated data" objection is unfounded.
It may be an objective fact that China has accumulated too little box office data. But the reason prediction needs a large amount of historical data is to find patterns. If we had 100 years of box office figures but none of the "variable" data that influenced those figures, it would not actually help in mining the patterns.
One example is Baidu's World Cup forecast. Baidu worked with third-party data companies to obtain and mine a large amount of historical data, took static factors such as teams, players, and venues into account, and at the same time introduced dynamic variables such as public opinion and the European odds index, and in the end achieved nearly accurate forecasts.
For box office forecasting, even if we had China's box office data from the 1980s and 1990s, without the related predictive data it would not help in finding box office patterns: there was no internet then, and the film market has since changed beyond recognition. What data does box office forecasting really need? No one can tell us the answer. It is not realistic to wait 10 years, until the data has fully accumulated, before talking about big data forecasting, because if we do not start today, nobody will know what data to collect or record. And who can say what difference 10 years of accumulation makes compared with 2?
The data-source advantage of big data prediction is that it can record data more comprehensively and more promptly, and can collect data that could not be collected in the past, such as user demand, public opinion, mood changes, travel patterns, ticket prices, and cinema scheduling. So instead of worrying about "the lack of traditional data", it is better to think about what data box office forecasting needs and how that data can help uncover the patterns.
2, Data fragmentation and dirty data are perennial problems.
Fragmented online data is a problem the entire internet has to face. No one holds the data of the whole network; aggregating all of it for prediction is an almost impossible task, and it is also unnecessary. If social-network data were essential for forecasting, then only Tencent in China would be able to make predictions, and that is not the case. The Ali Index has become a bellwether for e-commerce sales, and the Baidu Search Index is an important reference for all kinds of industries because it represents interest. The data each company holds is different in nature. Cooperation can bring in data along more dimensions and ultimately improve the reliability of predictions, but tearing down the data barriers outright is unrealistic.
Similarly, "dirty data" and "noise" are permanent features of the whole internet; even traditional sampling research inevitably runs into noisy samples and is disturbed by them. The answer is to filter out noisy data as far as possible, to keep correcting the model with the noise in mind, and to widen the error range of the predicted result. One can also assume that if some dirty data biases the result upward (for example, making the box office look better), other dirty data biases it downward as well.
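The approach just described, filtering out noisy data as far as possible and widening the error range of the forecast to reflect it, might look like the following sketch. The median/MAD rule, the threshold of 3, the `base_margin` value, and the way the range is widened are all assumptions chosen for illustration, not a description of any real forecasting system.

```python
from statistics import median


def filter_noise(values, threshold=3.0):
    """Drop points far from the median, measured in units of the
    median absolute deviation (MAD). Returns (kept, noise_share)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread to measure against; keep everything.
        return list(values), 0.0
    kept = [v for v in values if abs(v - med) / mad <= threshold]
    noise_share = 1 - len(kept) / len(values)
    return kept, noise_share


def forecast_with_range(values, base_margin=0.10):
    """Return (low, point, high) for a series of signal readings.

    Hypothetical rule: the more data had to be discarded as noise,
    the wider the reported error range becomes.
    """
    kept, noise_share = filter_noise(values)
    point = sum(kept) / len(kept)
    margin = base_margin + noise_share
    return point * (1 - margin), point, point * (1 + margin)
```

The design choice here is that filtering alone is not treated as sufficient: a series that needed heavy cleaning yields the same kind of point estimate but a deliberately less confident range.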
It cannot be ruled out that some people manipulate Baidu search data, and everyone knows about paid online commenters and Douban ratings. But the commercial data on Baidu that the article mentions is not dirty data: it is easy for Baidu to remove the effect of commercial advertising, and this data is very valuable for forecasting. Part of Google's box office forecasting model is based on ad-click data.
3, Cinema managers do not forecast the box office; they influence it.
A cinema manager really can predict how a film will do at his or her own cinema. Managers who control the scheduling can even directly influence and decide a film's local box office, and all cinema managers taken together end up having a huge impact on the overall box office. This is not a simple cause-and-effect relationship but a linkage: in predicting the box office, cinema managers also affect it.
Cinema managers can be compared to investors in the stock market. Investors have their own expectations for the price of the stocks they follow, and they reduce or add to their positions based on those expectations. The game played among all investors ultimately determines how stock prices move, but that does not make investors the best stock forecasters. Similar situations exist in tourism, traffic, housing prices, and other fields: participants act on their own predictions or on third-party predictions, and in doing so affect the outcome.
The point here is that participants and forecasters should not be lumped together, and that participants are a very important dynamic variable. A large part of the reason for the dismal "Golden Age" box office is that cinema managers lowered their expectations and cut its screenings. If Baidu cooperates with cinemas or cinema managers in the future, it can improve the accuracy of its forecasts: on the one hand by upgrading the model online, on the other by bringing cinema managers' scheduling into the range it monitors. With Baidu's data plus its big data forecasting engineers, box office forecasting of this kind is feasible.
The last thing I want to say is that dismissing big data box office forecasting because one film's prediction failed is highly debatable. Weather forecasts reached today's accuracy and granularity only through constant upgrading, and they are still wrong at times. When I was young I used to be angry and think that weather forecasts could not be trusted, but everyone knows that is not true. Box office forecasting is only just beginning and perhaps deserves more tolerance. In the long run, if box office forecasting products can reach a reasonable level of accuracy through constant optimization, they will be a very important reference for the entire film industry, giving investors, producers, and promoters more accurate data to guide promotion, plot design, and even casting, so that all parties can make more accurate and advantageous judgments.
Author: Weibo @ Internet Chiu, WeChat Supersofter