This article originally appeared on the FT website under the title "Big Data: Are We Making a Big Mistake?" I came across it a bit late, but it is still worth sharing, because it discusses several issues I have been thinking about recently, and it does so thoughtfully. If you have never encountered big data before, it also works as a primer.
The piece offers opinions rather than proof, since raising doubts is always easier than proving a point. But with big data as hot as it is now, these dissenting views are worth collecting, if only as conversation material; they may come in handy one day.
"Big data" is a vague term for a large-scale phenomenon, and it has been thoroughly overheated by entrepreneurs, scientists, governments and the media.
Five years ago, a Google research team announced a remarkable achievement in Nature, one of the world's top science journals. The team could track the spread of influenza across the United States without relying on any medical examinations, and it could do so faster than the CDC. Google's tracking lagged reality by only about a day, whereas the CDC needed more than a week to compile large numbers of physicians' diagnoses into a picture of the epidemic. Google could count so quickly because it had noticed that when people develop flu symptoms, they tend to go online and search for related terms.
"Google Flu Trends" is not only quick, accurate, low-cost, but also does not use any theory. Google's engineers don't have to bother to assume that search keywords (such as "flu symptoms" or "pharmacies around Me") are associated with colds. They just have to take out 50 million of the hottest search words on their website and then let the algorithm do the selection.
The success of Google Flu Trends quickly became a symbol of the latest trend in business, technology and science. Excited reporters kept asking: what new technology has Google brought us?
Among the many buzzwords, "big data" is a vague term that crops up constantly in the mouths of marketers. Some people use it to emphasize the staggering scale of the data now available: the Large Hadron Collider produces 15 PB of data a year, equivalent to your favourite song played on repeat for 15,000 years.
However, in "Big Data", most companies are interested in so-called "real data", such as Web Search records, credit card consumption records, and communication records of mobile phones and nearby base stations. The Google flu trend is based on such realistic data, which is the kind of data discussed in this article. Such datasets are even larger than the collider's data (such as Facebook) and, more importantly, are relatively easy to collect, although they are large in size. They are often collected for different purposes and stacked together in a cluttered, and can be updated in real time. Our communications, entertainment, and business activities have been shifted to the Internet, and the Internet has entered our mobile phones, cars and even glasses. So our entire life can be recorded and digitized, which was unimaginable ten years ago.
Advocates of big data have made four exciting claims, each of them backed by the success of Google Flu Trends:
1. Data analysis can produce uncannily accurate results;
2. Because every single data point can be captured, the old statistical techniques of sampling are obsolete;
3. There is no need to look for the causes behind a phenomenon; it is enough to know what is statistically correlated with what;
4. Scientific and statistical models are no longer needed, because "theory is over". As Wired magazine put it in a 2008 article, "with enough data, the numbers speak for themselves".
Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, in the words of David Spiegelhalter, Winton Professor of the Public Understanding of Risk at the University of Cambridge, they are "complete nonsense".
Found data underpins the new internet economy, as companies such as Google, Facebook and Amazon seek ever more ways to understand our lives through the data we generate. And since Edward Snowden revealed the scale and scope of US government data surveillance, it is clear that the security services are equally obsessed with digging something out of our everyday data.
Consultants urge the data-naive to wake up quickly to the potential of big data. In a recent report, the McKinsey Global Institute calculated that if all health-related data, from clinical trials to medical insurance claims to smart shoes, were better integrated and analysed, the US healthcare system could save $300 billion a year, roughly $1,000 for every American.
But however promising big data looks to scientists, entrepreneurs and governments, it is doomed to disappoint us if we ignore some statistical lessons we have long known.
As Professor Spiegelhalter has put it: "There are a lot of small-data problems in big data. They don't disappear as the volume of data grows; they only get more pronounced."
Four years after the original Google Flu Trends paper appeared, a new issue of Nature reported bad news: Google Flu Trends had failed in the latest flu season. The tool had operated reliably for several winters, providing fast and accurate reports of flu outbreaks from massive data analysis with no need for theoretical models. But this time it lost the scent. Google's model pointed to a severe outbreak, yet when the CDC eventually assembled its slower data from physicians across the country, Google's estimates of the spread of flu-like illness turned out to be almost double the reality.
The root of the problem is that Google did not know, and could not know at the outset, what linked the search terms to the spread of flu. Google's engineers were not trying to understand the causes behind the correlations; they simply found statistical patterns in the data, and they cared more about the correlations themselves than about what produced them. This is common practice in big data analysis. Finding out what causes a given outcome is hard, perhaps impossible; finding a correlation between two things is much simpler and faster. As Viktor Mayer-Schönberger and Kenneth Cukier put it in their book Big Data: "Causality won't be discarded, but it is being knocked off its pedestal as the primary source of meaning."
But correlation analysis with no theory behind it is inevitably fragile. If you do not know what lies behind a correlation, you have no way of knowing when it will break down. One explanation for the failure of Google Flu Trends is that in December 2012 the media was full of scary stories about flu, and after seeing those reports even healthy people went online to search for flu-related terms. Another explanation is Google's own search algorithm: when people typed in symptoms, it began automatically suggesting diagnoses, which changed what users searched for and clicked on. It was as if the goalposts had been moved in the middle of the match, and the ball sailed into the wrong place.
Google will recalibrate Flu Trends with new data, and that is surely the right response. There are a hundred reasons to be excited about the ever-greater opportunities we have to collect and process data on a large scale. But we must learn enough from episodes like this to avoid repeating the same mistakes.
Statisticians have spent the past 200-odd years cataloguing the traps that lie in wait when we try to understand the world through data. The data may now be bigger, newer, faster and cheaper to collect, but we cannot pretend the traps have all been filled in. They are still there.
In 1936, the Republican Alfred Landon ran for the presidency against the incumbent, Franklin Delano Roosevelt. The Literary Digest, a prestigious magazine of the day, took on the task of forecasting the result. It ran a postal opinion poll of breathtaking ambition, planning to send out 10 million questionnaires, covering a quarter of the electorate. The deluge of returned mail was easy to foresee, but the Digest seemed to relish it. In late August it reported: "Next week, the first answers from these ten million will begin the incoming tide of marked ballots, to be triple-checked, verified, five times cross-classified and totalled."
In the end, the Digest received an astonishing 2.4 million returns in two months. When the count was complete, the magazine announced that Landon would win the election by a convincing 55 per cent to 41 per cent, with the remaining few per cent going to a third candidate.
The actual result was very different: Roosevelt won in a landslide, 61 per cent to 37 per cent. To add to the Literary Digest's embarrassment, George Gallup, a pioneer of opinion polling, had made a far more accurate forecast with a much smaller survey: he predicted a comfortable win for Roosevelt. Clearly, Mr Gallup knew something the Digest did not. When it comes to data, size isn't everything.
Opinion polls are based on samples drawn from the much larger population of voters. That means pollsters have to deal with two problems: sample error and sample bias.
Sample error is the risk that a randomly chosen sample does not, purely by chance, reflect the views of the whole population. The margin of error shrinks as the sample grows, and for most polls about 1,000 interviews is a large enough sample. Gallup is said to have conducted some 3,000 interviews in all.
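To see why roughly 1,000 respondents suffice, recall the standard textbook result (not from the article itself): for a simple random sample, the 95 per cent margin of error on an estimated proportion is about 1.96 × sqrt(p(1−p)/n), which shrinks only with the square root of the sample size. A small Python sketch:

```python
# The standard 95% margin of error for a polled proportion: 1.96 * sqrt(p*(1-p)/n).
# Worst case is p = 0.5; note how little the extra millions of responses buy you.
import math

def margin_of_error(n, p=0.5):
    return 1.96 * math.sqrt(p * (1 - p) / n)

for n in (100, 1_000, 3_000, 2_400_000):
    print(f"n = {n:>9,}: +/- {margin_of_error(n):.1%}")
# n =       100: +/- 9.8%
# n =     1,000: +/- 3.1%
# n =     3,000: +/- 1.8%
# n = 2,400,000: +/- 0.1%
```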
But if 3,000 interviews were good, wouldn't 2.4 million have been better? The answer is no. Sample error has a more dangerous companion: sample bias. Sample error means that a randomly chosen sample may, by chance, fail to represent everyone else; sample bias means that the sample was not randomly chosen in the first place. George Gallup went to great lengths to assemble an unbiased sample, because he knew that this mattered far more than piling on extra respondents.
The Literary Digest, in its pursuit of a bigger dataset, stumbled into a biased sample. It picked the people to mail questionnaires to from car registration records and telephone directories, and in 1936 that skewed the sample towards the prosperous. To compound the error, Landon supporters turned out to be more likely to send their answers back. The combination of these two biases doomed the Digest's poll. For every person Gallup interviewed, the Digest received 800 replies. It is genuinely embarrassing that so large and so precise a survey should in the end produce the wrong answer.
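The Literary Digest's failure is easy to reproduce in a toy simulation. The sketch below uses made-up response rates, not the Digest's real figures; it only illustrates how a sample of millions drawn from a biased frame can miss badly while 3,000 randomly chosen voters land close to the truth.

```python
# Toy simulation with invented numbers: a huge biased sample vs a small random one.
import numpy as np

rng = np.random.default_rng(1936)
N = 10_000_000                              # the electorate
roosevelt = rng.random(N) < 0.61            # true support: 61% for Roosevelt

# Biased frame: assume (for the demo) that Landon-leaning voters are far more
# likely to be on the mailing list and to send the questionnaire back.
reply_prob = np.where(roosevelt, 0.15, 0.35)
replied = rng.random(N) < reply_prob
big_biased_sample = roosevelt[replied]      # millions of returned forms

small_random_sample = roosevelt[rng.choice(N, 3_000, replace=False)]

print(f"biased sample of {big_biased_sample.size:,}: "
      f"Roosevelt at {big_biased_sample.mean():.1%}")
print(f"random sample of {small_random_sample.size:,}: "
      f"Roosevelt at {small_random_sample.mean():.1%}")
# The millions-strong biased sample lands near 40%; 3,000 random voters land near 61%.
```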
Today's frenzy for big data has echoes of the Literary Digest. Found datasets are so messy that it is hard to tell what biases lurk inside them, and because they are so large, some analysts seem to have decided that worrying about sampling is no longer necessary. In fact, the problem is as present as ever.
Professor Viktor Mayer-Schönberger of the Oxford Internet Institute, co-author of Big Data, told me that his favoured definition of a big dataset is "N = all": no sampling is needed, because we have data on the entire population. Just as election officials do not estimate the result from a handful of representative ballots but count every vote, when "N = all" there is no sampling bias, because the sample already includes everyone.
But does "N = all" really describe most of the found datasets we use? Probably not. "I don't believe anyone can really get all the data," says Patrick Wolfe, a professor of computer science and statistics at University College London.
Twitter is one example. In principle you could store and analyse every message on Twitter and use it to draw conclusions about the public mood (in practice, most researchers work with a subset of that vast "fire hose" of data). But even if we could read every tweet, Twitter users are not representative of the population as a whole. (According to the Pew Research Internet Project, US Twitter users in 2013 were disproportionately young, urban or suburban, and black.)
We have to work out which people and which things are missing from the data, especially when the data is a messy pile of found material. Kaiser Fung, a data analyst and the author of Numbersense, warns against simply assuming that we have everything that matters: "N = all is often an assumption rather than a fact about the data."
Boston has a smartphone app called Street Bump, which uses the phone's accelerometer to detect potholes, so that city workers no longer need to patrol the streets looking for them. Bostonians download the app, and as they drive around town their phones automatically record the jolts and tell City Hall which stretches of road need repair. It solves, elegantly and through technology, an information problem that would have seemed intractable a few years ago. The City of Boston proudly announced that "big data provides the city with real-time information that helps us solve problems and plan long-term investments".
Yet what Street Bump really produces is a map of the potholes encountered by the phones it is installed on. From the very start of the product's design, that map has been tilted towards younger and wealthier neighbourhoods, where more people own smartphones. Street Bump set out to offer "N = all" for pothole locations, but the "all" means every pothole that participating phones happen to record, not every pothole there is. As the Microsoft researcher Kate Crawford points out, found data comes with systematic biases that take careful thought to detect and correct. Big datasets may look all-encompassing, but "N = all" is often a seductive illusion.
In the real world, of course, once there is money to be made from an idea, few people worry about causation or sample bias. Companies around the world must have salivated on hearing the legendary success story of the US discount chain Target, reported by Charles Duhigg in the New York Times in 2012. Duhigg explained how Target gathers vast amounts of data about its customers and analyses it with great skill, to the point where its insight into customers seems almost uncanny.
Duhigg's most telling anecdote is this: a man stormed into a Target store near Minneapolis and complained to the manager that the company had recently been sending his teenage daughter coupons for baby clothes and maternity wear. The manager apologized profusely. Shortly afterwards, however, the manager received a call from the man, who apologized in turn: his daughter really was pregnant. Before her father had any idea, Target had worked it out by analysing her purchases of unscented wipes and magnesium supplements.
Is this statistical sorcery? Perhaps there is a more mundane explanation.
Kaiser Fung, who has spent years building similar tools for retailers and advertisers, thinks "there is a serious false-positive problem here". He is referring to the countless stories we never hear: all the cases in which women who were not pregnant also received coupons for baby products.
If you hear only Duhigg's version, it is easy to conclude that Target's algorithm is infallible, that everyone who receives coupons for baby clothes and wet wipes is pregnant and mistakes are all but impossible. But pregnant women may receive those coupons simply because Target sends them to almost everyone. Before trusting Target's mind-reading story, you should ask how high its hit rate really is.
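A back-of-the-envelope calculation shows why the question matters. Every number below is invented for illustration; none comes from Target.

```python
# Back-of-the-envelope sketch; all numbers are assumptions, none are Target's.
base_rate = 0.02        # assumed share of shoppers who are actually pregnant
sensitivity = 0.80      # assumed chance a pregnant shopper is sent the coupons
false_positive = 0.10   # assumed chance a non-pregnant shopper is sent them too

coupon_rate = base_rate * sensitivity + (1 - base_rate) * false_positive
precision = base_rate * sensitivity / coupon_rate

print(f"share of shoppers receiving baby coupons:    {coupon_rate:.1%}")
print(f"share of those recipients actually pregnant: {precision:.1%}")
# With these made-up numbers only about 1 in 7 recipients is pregnant --
# but the pregnant ones are the only stories anyone retells.
```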
In Charles Duhigg's account, Target deliberately mixes in coupons unrelated to your shopping history, such as offers on wine glasses, because pregnant women who realized how deeply the company's computers had probed into their privacy might otherwise be unnerved.
Fung has another explanation: Target does this not because it would look suspicious to send pregnant women a catalogue full of baby products, but because the company knows those catalogues will also land with plenty of women who are not pregnant at all.
None of this means data analysis is useless; on the contrary, it may be highly profitable. Even a small improvement in the accuracy of targeted mailings pays off. But making money does not mean the tool is omniscient and always right.
In 2005 an epidemiologist named John Ioannidis published a paper whose title says it all: "Why Most Published Research Findings Are False". One of the core ideas in the paper is what statisticians call the "multiple comparisons" problem.
When we spot a pattern in data, we have to ask whether it could have arisen by chance. If the pattern seems very unlikely to be random, we call it "statistically significant".
Multiple-comparison errors arise when a researcher looks at many possible patterns at once. Imagine a clinical trial in which some schoolchildren are given vitamins and others a placebo. How do we judge whether the vitamins work? It depends entirely on how we define "work". The researchers could look at the children's height, weight, tooth decay, classroom behaviour, test scores, even (with long enough follow-up) income or prison records at age 25. And then the combinations: do the vitamins work for children from poor families, or rich ones? For boys, or for girls? Run enough different tests of correlation and chance results will drown out the genuine discoveries.
There are ways of dealing with this, but with big data the problem is worse, because a large dataset offers vastly more comparisons to run than a small one. Without careful analysis, the ratio of genuine patterns to spurious ones, in effect the signal-to-noise ratio, quickly tends towards zero.
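The effect is easy to demonstrate. The sketch below (using NumPy and SciPy) runs thousands of significance tests on pure noise, where there is nothing to find, and still turns up hundreds of "statistically significant" results at the conventional p < 0.05 threshold.

```python
# Sketch of the multiple-comparisons trap: test enough pure-noise hypotheses at
# p < 0.05 and "significant" findings appear by chance alone (requires SciPy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_per_group = 10_000, 50

false_hits = 0
for _ in range(n_tests):
    vitamins = rng.normal(size=n_per_group)  # outcome in the vitamin group
    placebo = rng.normal(size=n_per_group)   # outcome in the placebo group
    # Both groups come from the same distribution: there is no real effect to find.
    _, p_value = stats.ttest_ind(vitamins, placebo)
    false_hits += p_value < 0.05

print(f"{false_hits} 'significant' results out of {n_tests} tests of nothing")
# Roughly 5% of the tests -- around 500 -- come up significant purely by chance.
```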
Worse still, the usual remedy for the multiple-comparisons problem is transparency: letting other researchers know which hypotheses were tested and which negative results went unpublished. Yet found data is rarely transparent. Amazon and Google, Facebook and Twitter, Target and Tesco are not about to share all their data with you or me.
There is no doubt that newer, bigger, cheaper datasets and powerful analytical tools will eventually pay off, and there are already genuine successes of big data analysis. Cambridge's David Spiegelhalter points to Google Translate, which works by analysing countless documents that humans have already translated and finding patterns it can copy. Google Translate is an application of what computer scientists call "machine learning", which can produce astonishing results without preprogrammed logic. It is currently the product that comes closest to the ideal of a "theory-free, purely data-driven algorithmic black box". In Spiegelhalter's words, it is "an amazing achievement", and one built on the clever handling of enormous amounts of data.
Yet big data does not solve the problems that have preoccupied statisticians and scientists for centuries: understanding causation, inferring what will happen next, and working out how to intervene in and optimize a system.
"We now have new sources of data, but nobody wants data; what people want are answers," says Professor David Hand of Imperial College London.
To use big data to produce those answers will require big strides in statistical method.
"Big data is like the Wild west of the United States," says Patrick Wolfe of UCL. It's cool that those who are nimble and ambitious will try their best to use every possible tool to get something valuable from the data. But we're still a little blind at the moment. ”
Statisticians are scrambling to develop new tools for big data. These new tools are of course important, but they can only succeed if they absorb rather than forget the essence of past statistics.
Finally, look back at the four articles of faith of big data. First, if we simply ignore the negative cases, as with Target's pregnancy-prediction algorithm, it is easy to overestimate an algorithm's accuracy. Second, if we are making predictions in a stable environment, it may be fine to pretend that causation no longer matters; in a changing world (as with the flu predictions), or when we want to change the world ourselves, it is a dangerous assumption. Third, the premise that "N = all", and that sampling bias therefore does not matter, fails to hold in the vast majority of real cases. Finally, when the spurious patterns in the data vastly outnumber the genuine discoveries, it is naive to believe that "with enough data, the numbers speak for themselves".
Big data has arrived, but big insights have not. The challenge now is to solve new problems and find new answers, without repeating the old statistical mistakes on a far grander scale than before.