"Big data" suddenly becomes ubiquitous, and it seems that everyone wants to collect, analyze and profit from the big data, while others boast or fear its great influence. Whether we're talking about using Google's massive search data to predict flu outbreaks or using phone records to predict terror, or using airline data to find the best time to buy a ticket, big data can help. The combination of modern computing technology and the vast numbers of digital times seems to solve any problem-crime, public health, change in terminology, the danger of dating, as long as we use that data.
That, at least, is what its proponents claim. "Over the next two decades," the journalist Patrick Tucker writes in The Naked Future, his recent book on big data, "we will be able to predict many areas of the future with an unprecedented degree of accuracy, including events long thought to be beyond the reach of human inference." Big data has never sounded so good.
Is big data really as good as it sounds? There is no doubt that it is a valuable tool, and in some areas it has had a crucial impact. Nearly every successful AI program of the past 20 years, from Google's search engine to IBM's Watson question-answering system, involves heavy data crunching. But precisely because big data has become so popular and so widely used, we need to see clearly what it can and cannot do.
Big data can tell us what, but it cannot tell us why
First, although big data is very good at detecting correlations, especially subtle correlations that might be missed in smaller datasets, it cannot tell us which correlations are meaningful. A big data analysis might reveal, for instance, that the United States murder rate from 2006 to 2011 was strongly correlated with the market share of Internet Explorer: both fell sharply. But it is hard to believe there is any causal relationship between the two. Likewise, the number of autism diagnoses from 1998 to 2007 correlates with sales of organic food (both rose rapidly), but the correlation by itself tells us nothing about whether diet has anything to do with autism.
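To make the point concrete, here is a minimal Python sketch. The numbers are invented for illustration, not real murder or browser-share figures; the point is only that any two series that happen to trend in the same direction will show a strong Pearson correlation, whether or not they have anything to do with each other.

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(2006, 2012)

# Two made-up series that both drift downward, plus a little noise.
# They share nothing except the downward trend.
murder_rate = 6.0 - 0.2 * (years - 2006) + rng.normal(0, 0.05, years.size)
ie_share = 60.0 - 5.0 * (years - 2006) + rng.normal(0, 1.0, years.size)

# Pearson correlation: np.corrcoef returns the 2x2 correlation matrix.
r = np.corrcoef(murder_rate, ie_share)[0, 1]
print(f"correlation: {r:.2f}")  # close to 1.0, yet plainly not causal
```

The near-perfect correlation is baked in by the shared trend; the data say nothing about why either series falls.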
Big data can only be an auxiliary tool
Second, big data can assist scientific inquiry, but it cannot fully replace it. Molecular biologists, for example, would like to infer the three-dimensional structure of proteins from the underlying DNA sequences, and some scientists are already using big data to attack the problem. But no scientist believes the problem can be solved by crunching data alone: however powerful the analysis, it still has to be grounded in an understanding of physics and biochemistry.
Tools based on big data are easily gamed
Third, many tools based on big data are easily gamed. Big data programs for grading student essays typically rely on measures such as sentence length and word sophistication, which do correlate reasonably well with the grades teachers give. But once students figure out how the program works, they start writing long sentences and using obscure words, rather than learning to express themselves clearly and build coherent paragraphs. Even Google's famous search engine, often held up as a big data success story, is not immune: people manipulate it with spammy pages and tricks that artificially push certain results to the top.
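As a rough illustration of why such surface proxies are gameable, here is a hypothetical grader built only on sentence length and word rarity. The feature weights and the common-word list are made up for this example and are not taken from any real grading product.

```python
import re

COMMON_WORDS = {"the", "a", "and", "is", "to", "of", "in", "it", "that", "was"}

def proxy_score(essay: str) -> float:
    """Score an essay from surface proxies only: average sentence length
    and the share of 'rare' words (anything outside a small common-word
    list). Hypothetical weights, for illustration only."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[a-z'-]+", essay.lower())
    if not sentences or not words:
        return 0.0
    avg_sentence_len = len(words) / len(sentences)
    rare_fraction = sum(w not in COMMON_WORDS for w in words) / len(words)
    return 0.5 * avg_sentence_len + 50.0 * rare_fraction

plain = "The plan is clear. It is cheap. It works."
padded = ("Notwithstanding manifold countervailing considerations, the "
          "aforementioned proposition remains perspicuous, economical and "
          "efficacious in every conceivable ramification.")
print(proxy_score(plain))   # modest score for the clear essay
print(proxy_score(padded))  # much higher score for the padded one
```

A student who pads sentences and swaps in obscure vocabulary beats the metric without writing anything better.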
It's risky to jump to conclusions with big data
Fourth, even when the results of a big data analysis are not deliberately gamed, they often turn out to be less reliable than they first appear. Google's flu predictions were once the poster child of big data. In 2009, Google announced, to considerable fanfare, that by analyzing flu-related searches it could track the spread of flu more accurately and faster than official bodies such as the Centers for Disease Control and Prevention. But a few years later Google's vaunted flu forecasts began to falter, and over the past two years they have been wrong more often than they have been right.
A recent article in the journal Science suggested that Google's flu predictions failed largely because Google's search engine is constantly updating itself, so data collected at one point in time may not be comparable with data collected at another. As the statistician Kaiser Fung (author of Numbers Rule Your World) has noted, big data collections that rely on scraping websites often mash together data gathered in different ways and for different purposes, sometimes with unhappy results. Drawing conclusions from that kind of sample is risky.
Clever applications of big data can amplify their own errors
The fifth thing to watch for is a kind of vicious circle, which arises because so much big data comes from the web. Whenever the source of information for a big data analysis is itself a product of big data, a feedback loop is likely. Translation programs such as Google Translate work by pulling parallel texts in different languages, for example the same Wikipedia entry in two languages, and identifying patterns of translation between them. This is a perfectly reasonable strategy, except that for many less common languages there is not much parallel text to draw on, and some Wikipedia entries may themselves have been written with Google Translate. In such cases, any errors in Google Translate infect Wikipedia, which is then fed back into Google Translate, so the errors are reinforced.
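A toy simulation can show how such a loop reinforces errors. Every number below (the error rate of human-written text, the share of the corpus that is machine output, the fresh errors each retrained model adds) is an assumption invented for illustration; the only point is the compounding.

```python
# Toy model of the echo chamber: each round, a translation system is
# retrained on a corpus in which some fraction was produced by the
# previous version of the system, so its errors are fed back to it.
human_error = 0.01       # assumed error rate of genuinely human-written text
machine_fraction = 0.4   # assumed share of the corpus written by the system itself
model_noise = 0.02       # assumed fresh errors each retrained model adds

error_rate = human_error + model_noise  # first-generation system
for generation in range(1, 6):
    # The next corpus mixes human text with text written by the current system.
    corpus_error = (1 - machine_fraction) * human_error + machine_fraction * error_rate
    # The retrained system inherits its corpus's errors and adds its own.
    error_rate = corpus_error + model_noise
    print(f"generation {generation}: error rate ~ {error_rate:.3f}")
```

Trained only on human text, the system in this toy model would stay at about 3 percent; feeding its own output back in pushes the error rate steadily higher before it settles at a worse level.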
Big data can produce big errors
The sixth worry is the risk of too many correlations. If you keep looking for correlations between pairs of variables, you will find some spurious ones purely by chance, even when there is no meaningful connection at all: test 100 pairs at the usual 5 percent significance threshold and you can expect roughly five of them to look "significant" by luck alone. Without careful checking, the sheer scale of big data magnifies these errors.
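That arithmetic is easy to verify by simulation. The sketch below generates variables of pure noise, tests every pair for correlation, and counts how many pairs come out "significant" at the 5 percent level; the particular sizes (50 variables, 100 observations) are arbitrary choices for the example.

```python
import itertools
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_vars, n_obs = 50, 100
data = rng.normal(size=(n_vars, n_obs))  # pure noise: no real relationships exist

false_hits = 0
pairs = list(itertools.combinations(range(n_vars), 2))
for i, j in pairs:
    r, p = pearsonr(data[i], data[j])  # correlation and its p-value
    if p < 0.05:
        false_hits += 1

print(f"{false_hits} of {len(pairs)} pairs look 'significant' "
      f"({false_hits / len(pairs):.1%}), though nothing real is there")
```

Roughly one pair in twenty clears the bar, exactly as the false-positive rate predicts, and adding more variables only produces more such coincidences in absolute terms.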
A scientific-sounding answer is not necessarily a correct one
Seventh, big data makes it easy to give a scientific-sounding answer to questions that cannot actually be answered precisely. In the past few months, for example, there have been two separate attempts to use Wikipedia-based data to rank people by historical importance or cultural contribution. One is a book, "Who's Bigger? Where Historical Figures Really Rank," by the computer scientist Steven Skiena and the engineer Charles Ward; the other is Pantheon, a project from the MIT Media Lab.
These rankings get some things right: Jesus, Lincoln and Shakespeare were indeed important figures. But both make some jarring mistakes. "Who's Bigger?" ranks Francis Scott Key as the 19th most important poet in history, far ahead of Jane Austen (78th) and George Eliot (380th). More seriously, both projects dress an essentially vague and subjective matter of appreciation in misleadingly precise numbers. Big data can reduce anything to a figure, but we should not be fooled by the scientific-looking veneer.
Big data does not work well on rare events
Finally, big data is at its best when analyzing things that are very common, but it often fails on things that are rare. Programs that use big data to process text, such as search engines and translation programs, typically rely on so-called trigrams: sequences of three consecutive words (such as "in a row"). Reliable statistics can be compiled for common trigrams precisely because they occur so often, but no existing dataset is large enough to cover all the trigrams people might use, because language is constantly being invented.
To take one example, a recent newspaper book review by Rob Lowe contained nine trigrams, such as "dumbed-down escapist fare," that had never before appeared in the text Google has indexed. Big data systems stumble on such novelties: run "dumbed-down escapist fare" through Google Translate into German and back into English and you get the nonsensical "scaled-flight fare," a far cry from what Mr. Lowe meant.
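For readers curious what a trigram model actually works with, here is a minimal sketch. The tiny "corpus" stands in for the billions of words a real system would index, and it shows why a freshly coined phrase simply has no statistics behind it.

```python
import re
from collections import Counter

# A stand-in corpus; a real system would index billions of words.
corpus = """three in a row is common enough that statistics can help
            but a phrase coined yesterday appears nowhere in a row of data"""

words = re.findall(r"[a-z'-]+", corpus.lower())
trigrams = Counter(zip(words, words[1:], words[2:]))  # sliding window of three words

print(trigrams[("in", "a", "row")])                   # common trigram: seen twice here
print(trigrams[("dumbed-down", "escapist", "fare")])  # novel phrase: count is zero
```

With zero observations, the model has nothing to go on, which is exactly the situation a genuinely new turn of phrase creates.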
Wait, we almost forgot one last problem: hype. Proponents of big data present it as a revolutionary advance. But even the success stories they cite, such as Google's flu predictions, are fairly small beer, useful though they may be. Compared with the great inventions of the 19th and 20th centuries, such as antibiotics, the automobile and the airplane, big data pales.
We need big data; there is no doubt about that. But we also need to be clear-eyed about what it is: an important resource for anyone who analyzes data, not some revolutionary new technology.