Big data is a vague and ambiguous term used to describe a large-scale phenomenon that is rapidly becoming the focus of entrepreneurs, scientists, governments and the media.
Big data is compelling.
Five years ago, a Google research team published a remarkable study in Nature, one of the world's most famous science journals. Without using any medical test results, the team was able to track the spread of flu across the United States, and to do it faster than the US Centers for Disease Control and Prevention (CDC). Google's tracking lagged the outbreak by only a day, whereas the CDC needed a week or more to assemble a map of flu trends. Google was faster because it tracked the spread of influenza by finding correlations and patterns between what people searched for online and whether they had the flu.
Google Flu Trends was fast, accurate, cheap and required no theoretical support. Google's engineers did not bother to develop a hypothesis about which search terms ("flu symptoms" or "pharmacies near me") might be linked to the disease itself. Instead, the team simply took the top 50 million search terms and let the algorithms find the patterns on their own.
Google Flu Trends quickly became the emblematic "big data" success story for the business, technology and scientific communities. Excited journalists then asked: what can science learn from Google?
As with many buzzwords, "big data" is a vague and ambiguous term, often wielded by people with something to sell. Some use it to refer to the sheer size of a data set, such as that held by the Large Hadron Collider's computers, which store roughly 15 petabytes a year, equivalent to about 15,000 years' worth of your favourite music.
"Big Data", which attracts the attention of many companies, can actually be called "found data", which takes place in Web search, credit card payments, cell phone sensing to the nearest phone signal platform. The Google flu trend is based on data that has been found, and that is what attracts us here. Such data sets can be even bigger than the LHC's data Facebook. It is worth noting that the collection of these data is actually very cheap relative to these large sizes. Random tiling of data points, collected for different purposes, and can be updated in real time. Modern society as our communication, leisure and business activities are transferred to the network, the network gradually migrated to the mobile network, living in a way unimaginable 10 years ago, recorded and quantified.
Proponents of big data draw four conclusions, each of which is reflected in the Google Flu Trends success story:
1. Data analysis produces uncannily accurate results;
2. Every single data point can be captured, making older statistical sampling techniques obsolete;
3. Looking for the causes behind the data is outdated, because statistical correlation tells us what we need to know;
4. Scientific or statistical models are no longer needed.
While big data holds out bright prospects for scientists, entrepreneurs and governments, these four claims are at best wildly optimistic oversimplifications, and if we ignore the lessons of the past, they are bound to disappoint us.
Why big data disappoints
Four years after the original Google Flu Trends paper appeared, a new issue of Nature reported bad news: Google Flu Trends had failed in the latest flu outbreak. For several winters the model had provided quick and accurate counts of flu cases, but at some point the theory-free, data-rich model lost its nose for where the flu was going. Google's model pointed to a severe outbreak, yet when the CDC's slow but reliable figures finally arrived, they showed that Google's estimates of the spread of flu-like illness had been exaggerated by almost a factor of two.
The problem was that Google did not know, and could not know, what linked the search terms to the spread of flu. Google's engineers were not trying to figure out what caused what; they were merely finding statistical patterns in the data. They cared about correlation rather than cause and effect. This is common in big data analysis.
Working out what causes what is hard (some would say close to impossible); working out which data are correlated with which is far cheaper and easier.
That is why Viktor Mayer-Schönberger and Kenneth Cukier write in their book Big Data that in big data analysis the search for causality is not being abandoned, but it is gradually stepping down as the cornerstone of data research.
An analysis that has no theoretical support and rests on correlation alone is inevitably fragile. If you do not understand what lies behind a correlation, you cannot know what might cause it to break down. One explanation for the failure of Google Flu Trends is that the news in December 2012 was full of sensational flu stories, which prompted healthy people to run flu-related searches. Another possibility is that Google's own search behaviour changed unpredictably: when people typed queries, the system began automatically suggesting diagnoses.
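This fragility is easy to see in a toy simulation. The sketch below uses entirely invented numbers and an invented linear relationship: it fits flu cases against search volume for winters in which searches are driven by illness, then applies that fit to a winter in which media coverage adds searches from healthy people. Knowing nothing about why people search, the model overshoots.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training winters (invented numbers): search volume tracks true flu cases
# plus noise, so a purely correlational fit looks excellent.
true_cases = rng.uniform(100, 1_000, size=50)
searches = 3.0 * true_cases + rng.normal(0, 50, size=50)

# Fit a simple linear model: predicted cases = slope * searches + intercept.
slope, intercept = np.polyfit(searches, true_cases, deg=1)

# A winter with heavy media coverage: healthy people search too, adding a
# large spike of searches unrelated to actual illness.
actual_cases = 400.0
observed_searches = 3.0 * actual_cases + 800.0  # exogenous media-driven boost

predicted_cases = slope * observed_searches + intercept
print(f"Actual cases: {actual_cases:.0f}, model predicts: {predicted_cases:.0f}")
# The fitted correlation has not changed; the mechanism generating the
# searches has, and the model has no way of knowing that.
```

The correlation the model learned was real, but the moment the reason people searched changed, the prediction went wrong, and nothing inside the model could signal why.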
Statisticians have spent the past 200 years working out what traps lie in wait when we try to understand the world through data. The data may be bigger and faster today, but we cannot pretend that the old traps have all been made safe. They have not gone away.
In 1936, the Republican Alfred Landon ran for president against the incumbent Franklin Delano Roosevelt, and the respected and prestigious magazine The Literary Digest took on the job of forecasting the result. It conducted a postal poll that aimed to reach 10 million people, close to a quarter of the actual electorate. The flood of replies was hard to imagine, and the magazine seemed to relish the scale of the task. At the end of August it reported that over the following week the first of the ten million ballots would begin to arrive, to be triple-checked, verified, five times cross-classified and totalled.
After counting a staggering 2.4 million ballots returned over two months, The Literary Digest finally published its findings: Landon would win the election by 55% to 41%, with a small share of the vote going to a third-party candidate.
The election produced a very different result: Roosevelt beat Landon by 61% to 37%. To add to The Literary Digest's misery, a much smaller survey conducted by the polling pioneer George Gallup came far closer to the final tally and correctly predicted a comfortable win for Roosevelt. Gallup understood something the magazine did not: when it comes to data, size is not everything.
Opinion polls are generally based on samples of the voting population. This means pollsters have to deal with two problems: sampling error and sampling bias.
Sampling error reflects the risk that a randomly chosen sample will, purely by chance, fail to reflect the true views of the population; the "margin of error" quoted alongside a poll expresses this risk. The larger the sample, the smaller the margin of error. A sample of about 1,000 respondents is enough for many survey purposes, and Gallup is reported to have polled around 3,000 respondents.
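A rough sense of how quickly sampling error shrinks comes from the standard margin-of-error formula, z * sqrt(p(1-p)/n). The sketch below is only an illustration of that formula, not a reconstruction of anything the pollsters actually computed.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample of size n.

    Uses the textbook formula z * sqrt(p * (1 - p) / n); p = 0.5 is the
    worst case and gives the widest interval.
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 3_000, 2_400_000):
    print(f"n = {n:>9,}: margin of error = +/-{margin_of_error(n):.2%}")

# n =     1,000: margin of error = +/-3.10%
# n =     3,000: margin of error = +/-1.79%
# n = 2,400,000: margin of error = +/-0.06%
```

By this reckoning, 2.4 million replies should pin the result down to within a tiny fraction of a percentage point, which makes what actually happened all the more puzzling.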
If a survey of 3,000 respondents could get the answer right, why did a sample of 2.4 million do so much worse?
The answer is that sampling error has a far more dangerous companion: sampling bias. Sampling error arises when a randomly chosen sample happens, by chance, not to reflect the underlying views of the population; sampling bias arises when the sample is not chosen at random in the first place. George Gallup went to great lengths to find an unbiased sample, because he knew that an unbiased sample mattered far more than a big one.
The Literary Digest, by contrast, in its pursuit of a bigger data set, overlooked the problem of sampling bias. On the one hand, it mailed its survey forms to lists of people drawn from automobile registrations and telephone directories, a sample that, at least in 1936, was disproportionately prosperous and so a poor reflection of public opinion as a whole. On the other hand, compounding the problem, Landon supporters proved more willing to mail back their answers. The combination of these two biases turned The Literary Digest's poll into a flop.
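The difference between the two approaches can be seen in a small simulation. Everything in the sketch below is invented, including the electorate size, the support shares and the made-up rates of appearing on a car-and-telephone mailing list; it illustrates only the mechanism, not the actual 1936 figures.

```python
import random

random.seed(1936)

# Hypothetical electorate: about 61% favour candidate A. Supporters of B are
# assumed (purely for illustration) to be much more likely to appear on the
# car-and-telephone mailing list used for the postal poll.
population = []
for _ in range(1_000_000):
    prefers_a = random.random() < 0.61
    on_mailing_list = random.random() < (0.15 if prefers_a else 0.45)
    population.append((prefers_a, on_mailing_list))

def share_for_a(sample):
    """Fraction of a sample that favours candidate A."""
    return sum(prefers_a for prefers_a, _ in sample) / len(sample)

# Small but unbiased: 3,000 voters chosen completely at random.
random_sample = random.sample(population, 3_000)

# Huge but biased: everyone who happens to be on the skewed mailing list.
mailing_list_sample = [voter for voter in population if voter[1]]

print(f"True support for A:             {share_for_a(population):.1%}")
print(f"3,000 random voters:            {share_for_a(random_sample):.1%}")
print(f"{len(mailing_list_sample):,} mailing-list voters: {share_for_a(mailing_list_sample):.1%}")
```

The tiny random sample lands within a couple of points of the truth, while the far larger but skewed sample misses by a wide margin, which is essentially what happened to The Literary Digest.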
Big data risks repeating The Literary Digest's mistake. Because found data sets are so messy, it can be very hard to work out what biases lurk inside them. And because they are so large, some data analysts seem to have concluded that the sampling problem is simply not worth worrying about.
Professor Viktor Mayer-Schönberger of the Oxford Internet Institute, co-author of Big Data, told me that his preferred definition of a big data set is one where "N = All": there is no need to sample, because we already have data on everyone. And when N = All there is indeed no sampling bias, because the sample includes everyone.
But is "N = All" really a good description of most found data sets? Probably not. "I doubt that anyone could ever have all the data," says Patrick Wolfe, a computer scientist and professor of statistics at University College London.
Twitter is one example. In principle it is possible to record and analyse every message on Twitter and use the results to gauge public opinion. (In practice, most researchers work with a subset of that enormous stream of data.) But even if we could read every tweet, Twitter users are not representative of the population as a whole.
Kaiser Fung, a data analyst and author of Numbersense, reminds us not to assume we have captured everything that matters: "N = All is often an assumption about the data, not a fact."
Big data thinking has yet to take shape
Faced with big data, we must keep asking what exactly we should be clear about before drawing conclusions from a large, messy collection of data.
Consider Street Bump, a smartphone app developed in Boston, which uses the phone's accelerometer to detect potholes, sparing city workers the need to patrol the streets looking for them. As Boston's citizens download the app and drive around, their phones automatically notify City Hall when road surfaces need repair. Solving the problem this way generates an enormous "data exhaust", and turning that exhaust into the solution seems almost magical, something unimaginable only a few years ago. The city proudly declares that "the data provide real-time information for the city, which can be used to fix problems and to plan long-term investment."
But what Street Bump actually produces is a map of potholes that systematically favours young, affluent neighbourhoods, where more people own smartphones. Street Bump offers "N = All" only in the sense that every pothole detected by a participating phone is recorded; that is not the same as recording every pothole in the road network. As Kate Crawford of Microsoft Research points out, found data contain systematic biases, and it takes very careful thought to detect and correct them. Big data sets can look comprehensive, but "N = All" is often a seductive illusion.
In a few cases, analysing a vast volume of data really does deliver miracles. David Spiegelhalter of the University of Cambridge points to Google Translate, which works by analysing hundreds of millions of documents that humans have already translated and looking for patterns it can copy. It is a classic example of what computer scientists call "machine learning": this ability to learn lets Google Translate achieve astonishing results without any grammatical rules being programmed in advance. Google Translate is about as close as we have to a theory-free, data-driven black box. It is "a remarkable achievement", Spiegelhalter says, one built on the judicious handling of big data.
But big data does not solve the problem that has obsessed statisticians and scientists for centuries: the problem of insight, of judging what is really going on, and of working out how to intervene to make a system better.
Getting answers of that kind from big data will require big strides in statistics.
"Now we seem to be back in the west," said Patrick Wolfe of University College London. "Smart people will toss and turn and try to use every tool to get a good value from the data, but we're a little impulsive now." ”
Statisticians are scrambling to develop new methods to unlock the secrets of big data. Such new methods are essential, but they will only work if they are grounded in the old statistical lessons of the past.
Look back at the four tenets of big data. If we simply ignore the false positives, it is easy to overrate the uncannily high accuracy. The claim that causality has "gradually stepped down as the cornerstone of data research" is harmless enough if we are making predictions in a stable environment, but not if the world is changing (as it does when flu spreads) or if we ourselves want to change it. And the idea that "because N = All, sampling bias does not matter" turns out to be untrue in most of the cases that count.
The age of big data has arrived, but big data thinking has yet to take shape. The challenge now is to solve new problems and find new answers, without repeating the old statistical mistakes on a grander scale.