Tencent recently hosted a summer conference on big data. When it comes to big data, most of the hype is about its opportunities and power; for example, more and more people use Google's big data to study trends and support analysis and decision-making. One talk, however, offered a thought-provoking view from another angle: big data can also be a "big cheat".
A recent study of Google Flu Trends is a testament to this.
Before discussing Google Flu Trends, Google Trends and Google Correlate deserve a mention. Google Trends applies big data analysis to user searches to reveal trends in human activity: enter a query keyword, and it returns a time series of related search activity. Google Correlate works the other way around. Given a data series as input, it returns a set of queries whose patterns are similar to (correlated with) that series, which makes it something like the inverse function of Google Trends.
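The "inverse function" idea can be illustrated with a minimal sketch: given a target series, rank candidate queries by how strongly their search volume correlates with it. This is not Google Correlate's actual implementation; the function names and all the numbers below are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_queries(target, query_series):
    """Return (query, correlation) pairs, most similar pattern first."""
    scored = [(query, pearson(target, series))
              for query, series in query_series.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical weekly data: reported flu cases and search volumes per query.
flu_cases = [10, 12, 30, 80, 60, 25, 12]
search_volume = {
    "flu symptoms": [11, 14, 33, 85, 58, 27, 13],
    "beach holidays": [70, 65, 30, 10, 15, 40, 68],
    "tax forms": [20, 22, 19, 21, 20, 23, 21],
}

ranking = rank_queries(flu_cases, search_volume)
for query, r in ranking:
    print(f"{query}: {r:+.2f}")
```

A high rank only says that the two curves have a similar shape; it says nothing about why.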
Google Flu Trends is one of the earliest and best-known applications of Google Trends. Because many people who catch the flu go to Google to look up the illness and its medication, Google found a correlation between this kind of query and flu outbreaks. Google Flu Trends has made several successful flu predictions, including the 2011/12 US flu, the 2007/08 Swiss flu, the 2005/06 German flu, and the 2007/08 Belgian flu, and has even been timelier than the US Centers for Disease Control and Prevention.
This demonstrates the correlation between searches for "flu" and flu outbreaks.
Another example is "hangover". If you enter "hangover" in Google Trends, you will find that searches begin to rise on Saturday, peak on Sunday, and drop sharply on Monday. This pattern is similar to the one for the query "vodka", only one day behind.
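A one-day lag like this can be found by shifting one series against the other and checking which shift gives the highest correlation. The sketch below assumes two weeks of made-up daily search volumes (Monday to Sunday); `best_lag` is a hypothetical helper, not part of any Google tool.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def best_lag(leader, follower, max_lag):
    """Return (lag, correlation) for the shift at which follower best tracks leader."""
    best = (0, float("-inf"))
    for lag in range(max_lag + 1):
        # Compare leader[t] with follower[t + lag].
        a = leader[: len(leader) - lag]
        b = follower[lag:]
        r = pearson(a, b)
        if r > best[1]:
            best = (lag, r)
    return best

# Hypothetical daily volumes, Monday..Sunday, repeated for two weeks.
vodka = [10, 10, 12, 15, 40, 60, 20] * 2      # peaks on Saturday
hangover = [20, 10, 10, 12, 15, 40, 60] * 2   # peaks on Sunday

lag, r = best_lag(vodka, hangover, max_lag=3)
print(f"best lag: {lag} day(s), correlation {r:.2f}")
```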
But bigger data does not necessarily mean more accurate forecasts. It can even produce "false patterns" and "spurious correlations". For example, comparing US car sales for 2004-2012 with searches for "Indian restaurant" over the same period reveals a correlation between the two, which obviously has no sensible explanation.
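One common way such a correlation arises is that both series simply trend in the same direction over the period. The sketch below uses invented annual figures (not the real car sales or search data) to show that two unrelated upward trends correlate strongly in their levels, while their year-over-year changes do not. Differencing is used here as one standard check; the source does not say which method, if any, was applied to the original example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def differences(series):
    """Year-over-year changes, which removes a shared linear trend."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical annual values for 2004-2012; both happen to rise over time.
car_sales = [100, 104, 109, 111, 118, 121, 127, 130, 136]
restaurant_searches = [20, 23, 25, 27, 30, 34, 37, 38, 40]

level_r = pearson(car_sales, restaurant_searches)
change_r = pearson(differences(car_sales), differences(restaurant_searches))

print(f"correlation of levels:  {level_r:.2f}")
print(f"correlation of changes: {change_r:.2f}")
```

If the relationship were real, the year-to-year changes would be expected to move together as well, not just the overall levels.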
What causes spurious correlation?
First, correlation does not imply causation. Google Flu Trends, for example, does not always predict the trend correctly. On several occasions it overestimated the number of flu cases, including the 2011/12 US flu, the 2008/09 Swiss flu, the 2008/09 German flu, and the 2008/09 Belgian flu.
Researchers at University College London studied this. It turns out that people who search Google for "flu" fall into two categories: those who actually have the flu, and those who search by imitation (probably because media coverage has made them interested in the subject).
Clearly, the data from the first group are the useful ones. Their searches are internally motivated and independent of the outside world, so their search patterns should differ from those of people who search under outside influence. It is the "social" searches of the second group that distort the predictions of Google Flu Trends. The distortion arises precisely because Google Flu Trends treats the correlation between searches for "flu" and the flu itself as a causal link.
Another group of researchers, at Northeastern University and Harvard University, argue that the distortions of Google Flu Trends reflect a kind of "big data hubris" that has emerged in the big data era. This line of thinking holds that big data can completely replace traditional methods of data collection. Its biggest problem is that the vast majority of big data differs greatly from data obtained through rigorous scientific experiments and sampling designs. First, big does not necessarily mean complete.
In addition, changes to Google's search algorithm may also affect the results of Google Flu Trends. The reason is not hard to understand: Google's search is tweaked very frequently, with 890 improvements in the past year alone, many of them adjustments to the algorithm. Media coverage of a flu epidemic increases the number of flu-related searches, and Google in turn steps up its search recommendations. As a result, some people who do not have the flu become interested in it too, and their searches contaminate the data.
How can the data be cleaned? In the final analysis, it comes down to analyzing the data itself. In the case of Flu Trends, the researchers believe that the independent searches of people who actually have the flu follow a different pattern over time than social searches. Independent searches should rise sharply when the flu breaks out and decline slowly as it fades. Social searches, by contrast, are more symmetrical. The data suggest that the trend curve is indeed more symmetrical in the periods when Google Flu Trends overestimated.
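The difference in shape can be put into a rough number. The sketch below compares how long a single-peaked curve takes to climb from half its peak to the peak with how long it takes to fall back to half. A score near 0 means a symmetrical curve, and a clearly positive score means a sharp rise followed by a slow decline. The measure and the sample curves are illustrative assumptions, not the method used by the researchers.

```python
def asymmetry(series):
    """Score in [-1, 1]: > 0 means fast rise and slow decline, about 0 means symmetrical.

    Assumes a single-peaked curve. Rise and fall are measured in time steps
    between the peak and the outermost points at or above half the peak value.
    """
    peak_value = max(series)
    peak_index = series.index(peak_value)
    half = peak_value / 2
    above_half = [i for i, value in enumerate(series) if value >= half]
    rise = peak_index - above_half[0]    # steps from half-peak up to the peak
    fall = above_half[-1] - peak_index   # steps from the peak back down to half-peak
    if rise + fall == 0:
        return 0.0
    return (fall - rise) / (fall + rise)

# Hypothetical weekly search volumes.
independent_search = [2, 5, 20, 60, 100, 88, 75, 62, 50, 38, 25, 12, 5]
social_search = [5, 20, 45, 70, 100, 72, 44, 21, 6]

print("independent:", asymmetry(independent_search))
print("social:     ", asymmetry(social_search))
```

A season whose search curve scores close to 0 would, on this reasoning, be a candidate for closer inspection before its numbers are trusted.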
This shows that such traps must be watched for when analyzing big data. The flood of large datasets, and the spread of analysis results by statisticians, can magnify or contaminate the real data.
As the authors of "The Parable of Google Flu: Traps in Big Data Analysis" argue, the value of data lies not just in its size. Using innovative analytical methods to analyze the data is essential.
Of course, once data becomes truly big and the digital world maps onto the physical world, big data may be able to exert its full power and change the way we solve problems.
(Responsible editor: Mengyishan)