Everyone is talking about "big data" these days, but where does big data actually exist, and what does it actually affect? Swept along by the tide, the general public can hardly avoid being dazzled and unsettled. At precisely such a moment, the author believes it is especially important to keep a sense of awe and a sober mind, and to recognize the limitations of big data.
Right now, big data is everywhere
Big data may be one of the most eye-catching topics of the moment. It is used to analyze the "romance index" of different cities from sales of flowers and condoms, and to discover that people in Xinjiang, deep inland, unexpectedly rank first in bikini purchases. It is credited with contributions to energy saving and emission reduction. The German national team used big data technology to collect player information for the World Cup. From an enemy airfield's takeoff and landing signals, analysts can work out sortie batches, fighter models, and other details. And in "Her", which won Best Original Screenplay at the 86th Academy Awards, the protagonist and an AI system grow closer until they fall in love. Big data inspires endless reverie and promises seemingly boundless possibilities.
As Mr. Ma says, humanity has moved from the IT era to the DT (data technology) era. Cheping, data commissioner of Alibaba Group, highlights two important points in his book "Big Battle Data". First, big data completely eliminates "sample deviation" (sample bias): "A sample is different from big data. Big data believes in the full data set, not in samples; it is about analyzing, not sampling." Second, correlation analysis in the big data age can create scenarios that were previously unimaginable. In the extreme case, accumulated online data can form a person's "online personality", which influences or even controls that person's offline behavior.
Arrogance is a sin; keep a sense of awe
The prospects for big data are so bright that words fail me. Yet arrogance is a sin. The fruit of the tree of knowledge gave humans wisdom, but ever since leaving Eden, humans have been unable to escape the original sin of arrogance. From the Tower of Babel to "building the Kingdom of Heaven on earth", people who lose their sense of awe often bring great harm upon themselves. In the big data age, too, we should keep a sense of reverence and recognize the following three points.
First, sample bias always exists; big data does not transcend statistics
What is sample bias? The most striking example comes from World War II. In the simplified version, the RAF, suffering under ferocious German anti-aircraft fire, wanted to reduce its aircraft loss rate by adding armor. Because of weight limits, only part of the airframe could be reinforced, so they turned to a statistician. After carefully studying the bullet holes on the planes, the expert reached an unexpected conclusion: the armor should go where there were no bullet holes. When questioned, the statistician answered with a single sentence: "The planes that were hit there never came back." Statistics has always been a craft on which lives depend, and practicing it without real skill gets people killed.
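The statistician's reasoning can be reproduced with a small simulation. The sketch below is illustrative only: the section names, the per-hit loss probabilities, and the function names are assumptions invented for this example, not historical figures. It counts where hits land on all planes and compares that with the holes that can be observed on planes that returned.

```python
import random

# Hypothetical probability that a single hit in each section downs the plane.
# These numbers are made up for illustration, not taken from wartime records.
LOSS_PROB = {"engine": 0.8, "fuselage": 0.1, "wings": 0.1, "tail": 0.1}


def simulate(n_planes=10_000, hits_per_plane=3, seed=0):
    """Return (hits on all planes, hits visible on returned planes) per section."""
    rng = random.Random(seed)
    sections = list(LOSS_PROB)
    all_hits = dict.fromkeys(sections, 0)
    observed = dict.fromkeys(sections, 0)
    for _ in range(n_planes):
        # Hits land uniformly at random across the sections.
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every hit it took.
        returned = all(rng.random() >= LOSS_PROB[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if returned:
                observed[s] += 1
    return all_hits, observed


def shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


all_hits, observed = simulate()
true_share = shares(all_hits)      # where planes are actually hit
seen_share = shares(observed)      # what the ground crew can inspect

for section in LOSS_PROB:
    print(f"{section:9s} true={true_share[section]:.2f} seen={seen_share[section]:.2f}")
```

Under these assumptions, engine hits make up about a quarter of all hits but a much smaller fraction of the holes seen on returning planes, which is exactly why the unmarked area is the one that needs armor.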
Statistics is, in essence, a body of theory for inferring the whole from a part and for predicting the future. Its greatest weakness is that when the whole is inferred from a part, sample bias can invalidate the conclusion. So has the big data age really delivered us to a paradise free of sample bias? The answer is clearly no. At the level of observable facts, data and application scenarios remain severely fragmented even in the big data age. Take the Valentine's Day flower-to-condom ratio: for reasons everyone understands, many condom purchases happen offline, and the online side never sees that data. Because of limits in technology or in the business model itself, the data an online system can collect covers only part of the complete scenario, not the whole.
Consider the example of Xinjiang ranking first in bikini sales. Suppose the analyst does not realize that in Xinjiang bikinis are sold mainly online (sales through traditional offline channels are small or practically nonexistent), while in other provinces they are sold mostly offline (online sales account for only 8% to 10%). The analyst will then reach the wrong conclusion. Likewise, in Xinjiang, Taobao/Tmall online sales are broadly representative of real online sales. But in first-tier cities such as Beijing, Shanghai, and Guangzhou, Jingdong's online sales already rival those of Taobao/Tmall, so looking only at Alibaba's data seriously underestimates real sales.
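The bikini example comes down to simple coverage arithmetic, sketched below. All figures are hypothetical and chosen only to illustrate the effect: the unit counts, the 95% online share for Xinjiang, and the platform shares are assumptions, while the 9% online share for the other region is picked from the 8% to 10% range mentioned above.

```python
# Hypothetical figures for illustration only; none of these are real sales data.
regions = {
    "Xinjiang": {
        "observed_units": 9_000,   # units seen in one platform's data
        "online_share": 0.95,      # assumed: almost all sales are online
        "platform_share": 0.90,    # assumed: the platform dominates online sales
    },
    "First-tier city": {
        "observed_units": 5_000,
        "online_share": 0.09,      # within the 8%-10% range cited in the text
        "platform_share": 0.50,    # assumed: a competitor holds about half
    },
}


def estimated_total(observed_units, online_share, platform_share):
    """Scale observed units up by the fraction of the market the data covers."""
    coverage = online_share * platform_share
    return observed_units / coverage


# Ranking by what the platform sees versus by the coverage-adjusted estimate.
by_observed = max(regions, key=lambda r: regions[r]["observed_units"])
totals = {name: estimated_total(**fields) for name, fields in regions.items()}
by_estimated = max(totals, key=totals.get)

print("Top region by observed online units:", by_observed)
print("Top region by estimated total units:", by_estimated)
```

With these made-up inputs the ranking flips once coverage is accounted for. The point is not the specific numbers but that the platform's view is a sample whose coverage differs by region.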
In theoretical terms, the fragmentation of data and application scenarios is, in essence, sample bias. For technical or commercial reasons, the data collected in the big data age cannot cover every aspect of the application scenario; what is obtained is still a part, not the whole. Finally, from a philosophical point of view, sample bias would still exist even if technology advanced far enough to solve the fragmentation of data and scenarios, and even if a perfect business model allowed competitors to share data. The core reason is that although human beings may be able to understand all the laws of the objective world, the objective world itself is not static but in constant motion. Data from the past cannot fully reflect the laws governing the world's future development. Thinking in the manner of "Kezhouqiujian" (carving a mark on the boat to find a sword that fell overboard) does not work. Seen this way, the "Black Swan" event is at bottom a case of sample bias, one that neither advanced technology nor an ingenious business model can solve. So even in the big data age people should keep a sense of awe; in this era, technology really does wander at the edge of religion.
Second, big data conclusions are aggregate conclusions in the statistical sense and do not apply to individuals
Any analysis or conclusion based on statistics is about the whole. Asimov illustrated this perfectly in his "Foundation" series. Hari Seldon, studying the countless inhabitants of the Galaxy's 20 million planets, created psychohistory and successfully predicted the Galactic Empire's 30,000-year period of dark barbarism and the rise of the Second Galactic Empire. But the theory cannot be used to predict individuals, so it could not foresee the appearance of the mutant known as the Mule. Had it not been for the Second Foundation, the entire recovery plan would have spun out of control. "Out of Control" describes a similar phenomenon: the behavior of a school of deep-sea fish as a whole follows patterns that are easy to predict, while the behavior of an individual fish is erratic and unpredictable.
Taobao/Tmall's "a thousand faces for a thousand people" is an important experiment of the big data era. At its core, it uses big data to show each Taobao/Tmall customer personalized search results. The key details of the project are not known to outsiders, but reasonable inferences can be drawn from theory. First, the data Taobao/Tmall collects cannot be called "full data"; under present conditions, much of the core data relevant to customers' purchase interests cannot be collected. Second, even if the model's accuracy reaches 99%, on a billion-scale platform the remaining 1% still means on the order of ten million customers with a relatively poor experience. For these reasons, the degree of personalization in "a thousand faces for a thousand people" must be reasonably constrained; otherwise, the grander the ideal, the harsher the reality.
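Both halves of this argument can be checked with a few lines of code. The sketch below is a toy illustration under stated assumptions: the 30% purchase probability, the customer count, and the function names are hypothetical, and "billion-scale" is taken literally as one billion users. The first function shows that an aggregate rate is easy to predict while individual outcomes are not; the second does the 99% accuracy arithmetic.

```python
import random


def aggregate_vs_individual(n_customers=100_000, buy_prob=0.3, seed=1):
    """Compare prediction error for the whole population and for individuals.

    Assumes every customer buys independently with the same probability,
    which is a deliberate simplification.
    """
    rng = random.Random(seed)
    bought = [rng.random() < buy_prob for _ in range(n_customers)]

    # Aggregate prediction: "a fraction buy_prob of customers will buy".
    observed_rate = sum(bought) / n_customers
    aggregate_error = abs(observed_rate - buy_prob)

    # Individual prediction: with buy_prob below 0.5, the best single guess
    # for any one customer is "will not buy"; count how often that is right.
    individual_accuracy = bought.count(False) / n_customers
    return aggregate_error, individual_accuracy


def poorly_served(users, accuracy):
    """Number of users a model with the given accuracy gets wrong."""
    return round(users * (1 - accuracy))


aggregate_error, individual_accuracy = aggregate_vs_individual()
affected = poorly_served(1_000_000_000, 0.99)

print(f"aggregate error:     {aggregate_error:.4f}")
print(f"individual accuracy: {individual_accuracy:.2f}")
print(f"users mis-served at 99% accuracy: {affected:,}")
```

In this toy setup the population-level rate is predicted almost exactly, yet the best possible guess about any single customer is wrong roughly three times in ten. And 1% of one billion users is ten million people, which is why a high headline accuracy still leaves a very large group with a poor experience.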
Third, correlation is not causation; the pitfalls are as numerous as the opportunities
Correlation analysis is a powerful tool for data analysis, and it is also where problems creep in most easily. Correlation is not causation. Statistics show that when ice cream sales rise, the number of drownings rises sharply as well; the two are strongly positively correlated. So does eating ice cream cause people to drown? Clearly not. Hot weather simply increases both ice cream consumption and the time people spend in the water. A more telling example is a set of statistics showing a strong positive correlation between liquor prices and priests' incomes. Are the priests all living by the saying "wine and meat pass through the gut, while the Buddha stays in the heart"? No. The real reason is that inflation pushed up both liquor prices and priests' incomes.
In the big data age, confusing correlation with causation can cause problems far greater than in the past. Data is abundant and computing power is immense, so correlations that could never be found before can now be discovered. That is the exciting part of the big data age. At the same time, however, telling correlation from causation becomes much harder, and a wrong judgment can cause serious damage. Consider the credit scoring model and automated lending that Ali Small Loan is currently proud of. Suppose the correlations the current credit model relies on stop holding (just as liquor prices and priests' incomes would no longer be strongly correlated if inflation stayed stable for a long time). The true creditworthiness of the borrowers selected by the existing model would then carry great risk, with unimaginable consequences. This analysis is purely theoretical and does not point at any specific project, but as big data technology advances, correlation and causation will become harder and harder to tell apart, and the risk will grow accordingly.
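The ice cream example can be made concrete with synthetic data. In the sketch below, temperature drives both ice cream sales and drownings, and neither affects the other; every coefficient and noise level is an assumption chosen for illustration. It compares the raw correlation with the correlation that remains after the linear effect of temperature is removed from both series (a simple partial correlation).

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residuals(ys, xs):
    """Residuals of ys after a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(42)

# Hypothetical daily data: temperature is the common cause (the confounder).
temps = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temps]   # depends only on temps
drownings = [0.5 * t + rng.gauss(0, 2) for t in temps]   # depends only on temps

raw = pearson(ice_cream, drownings)
partial = pearson(residuals(ice_cream, temps), residuals(drownings, temps))

print(f"raw correlation:                   {raw:.2f}")
print(f"correlation controlling for temps: {partial:.2f}")
```

The raw correlation is strong even though the data were generated with no causal link between the two series, and it largely disappears once temperature is controlled for. In real big data settings the confounder is often unobserved, which is precisely what makes the distinction hard.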
The most understandable thing about the world is that it is incomprehensible. The most incomprehensible thing about the world is that it is understandable. In the big data age, we need to keep a sense of awe in our hearts. Arrogance is a sin.
(Responsible editor: Mengyishan)