This article grew out of my reading of, and reflections on, Viktor Mayer-Schönberger's The Big Data Era. Wave after wave of big data has been sweeping across China, and every industry is actively exploring how to integrate and grow with this technology, so we are fortunate to witness and take part in this era of technological change. The reason the big data era counts as a revolution is not only the technological progress behind it, but also the unprecedented impact it has had on how we think. Praise for the arrival of the big data era is everywhere; yet, from my own close observation, I have also noticed a few noisy and biased voices in the crowd. So I wanted to write this article to lay out my own understanding and exchange views with readers online.
In The Big Data Era, Viktor Mayer-Schönberger identifies three features of big data:
- More data: not random samples, but the whole data set.
- Messier data: not exactness, but messiness.
- Relationships in data: not causation, but correlation.
Below is a brief account of my understanding of each.
I. Data Should Be the Whole, Not a Sample
Big data refers to the entire data set, rather than a sample obtained by random sampling. Most people, however, instinctively assume that "big data" simply means an absolute volume larger than what we already have, and lose sight of the idea of the data as a whole; in fact, if all of the data relevant to a question amounts only to megabytes, then studying all of it is still a big data approach. This instinct is a carry-over from sampling-based statistical analysis in the small data era. The statisticians of that time showed that the accuracy of a sampling analysis improves greatly as the randomness of the sampling increases, but has little to do with increases in sample size; that is, once the number of samples reaches a certain level, each new individual contributes less and less information. This property made up for our inability, at the time, to obtain and process more data. Deep down, however, our desire for more data, and more accurate data, was never satisfied.
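The diminishing return from larger samples can be seen in a short simulation. The following is a minimal sketch of my own, not taken from the book; the synthetic population, sample sizes, and repeat count are all arbitrary assumptions, chosen only to show that the typical estimation error shrinks roughly as 1/sqrt(n).

```python
# A minimal sketch (my own, not from the book) of the diminishing returns of
# larger samples: the typical error of a sample mean shrinks roughly as
# 1 / sqrt(n), so each extra observation adds less and less precision.
# The population, sample sizes, and repeat count are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=170.0, scale=8.0, size=1_000_000)  # e.g. heights in cm
true_mean = population.mean()

for n in (100, 1_000, 10_000, 100_000):
    # Repeat the random sampling many times to estimate the typical error
    # (sampling with replacement here, purely to keep the sketch fast).
    errors = [
        abs(rng.choice(population, size=n, replace=True).mean() - true_mean)
        for _ in range(100)
    ]
    print(f"sample size {n:>7}: typical error of the estimate ~ {np.mean(errors):.3f} cm")
```

Going from 100 to 100,000 samples multiplies the cost by a thousand but only shrinks the typical error by a factor of roughly thirty, which is the trade-off that made sampling such an attractive shortcut.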
Statistical sampling, with less than a hundred years of history behind it, is regarded as one of the cornerstones of modern civilization, much like the theorems of geometry or the law of universal gravitation. But that status cannot hide what it really is: a shortcut adopted because the technology of the day could not collect, store, process, and analyze the data as a whole, and the shortcut carries inherent defects. 1. Because its accuracy depends on a randomness that is hard to achieve in practice, it is very difficult to examine subcategories of a question. 2. Information that was missed when the sample was collected can never be recovered afterwards.
This feature of big data tells us to pay attention to all of the data rather than settling for the broad picture described by the normal distribution. What matters in life is often hidden in the details, and sampling analysis simply cannot capture those details.
II. Accept Messy Data
When our field of view expands from the sample to the whole, the data involved will inevitably include entries that look wrong by the old standard of accuracy. What I want to point out is that, like anything else, an "error" exists for a reason, and an extreme pursuit of precision amounts to a deliberate evasion of the truth. For big data, that reason has two parts: the breadth of the data and the high frequency of collection. On breadth, Kelvin said that "to measure is to know", and knowing is a process of emerging from ignorance; that process should be continuous rather than a leap. The more constraints we impose, the finer our understanding becomes; but as our cognition deepens, we remove or modify some constraints so that the problem covers more cases, and inevitably some phenomena will contradict the earlier constraints. That is messiness. As for the high frequency of collection, it fills in unknown information that a small amount of data, gathered at long intervals, would have missed. In a word, whatever exists has its reasons for existing.
A simple algorithm running on big data is more effective than a complex algorithm running on small data.
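As an illustration of this claim, here is a sketch of my own rather than an example from the book: it fits a deliberately over-complex polynomial to a handful of noisy points and a deliberately simple one to many points, then compares both against the noise-free truth. The target function, polynomial degrees, sample sizes, and noise level are all arbitrary assumptions.

```python
# A minimal sketch (my own illustration, not from the book) of the claim that a
# simple algorithm with lots of data can beat a complex algorithm with little
# data. All numbers (degrees, sample sizes, noise level) are arbitrary choices.
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Noisy observations of an unknown 'true' relationship y = sin(x)."""
    x = rng.uniform(-2.0, 2.0, size=n)
    y = np.sin(x) + rng.normal(scale=0.3, size=n)
    return x, y

def test_rmse(coeffs):
    """Error of a fitted polynomial against the noise-free truth."""
    x_test = np.linspace(-2.0, 2.0, 1_000)
    return np.sqrt(np.mean((np.polyval(coeffs, x_test) - np.sin(x_test)) ** 2))

# "Complex algorithm, small data": a degree-15 polynomial fitted to 20 points.
x_small, y_small = make_data(20)
complex_fit = np.polyfit(x_small, y_small, deg=15)

# "Simple algorithm, big data": a degree-3 polynomial fitted to 20,000 points.
x_big, y_big = make_data(20_000)
simple_fit = np.polyfit(x_big, y_big, deg=3)

print(f"complex model, small data -> RMSE {test_rmse(complex_fit):.3f}")
print(f"simple model,  big data   -> RMSE {test_rmse(simple_fit):.3f}")
```

The over-complex model chases the noise in its few points, while the simple model, fed far more data, stays close to the underlying relationship.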
III. Correlation Beyond Causation
- Many things that used to be hard to infer through causal reasoning can now be predicted by finding correlations (a small sketch follows this list).
- Still, the pursuit of definite cause and effect will not disappear; big data prediction can instead serve as a street lamp for that pursuit. Hypotheses built purely from causal thinking about a problem are prone to errors of prejudice, whereas causal propositions suggested by correlations can set the direction for empirical and theoretical study. This could become a mode of progress for science and technology, with the two kinds of relationship complementing and reinforcing each other.
- This does raise a worry. Correlations help us reach causal relationships, but when technology develops this fast and knowing "what" is often enough, will we still ask "why"? Could this transition between eras open a fault line in theory, leading people to discard the importance of theory altogether?
- My answer to the question raised in the third point is no, because interpreting the results of any study still requires theoretical support.
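To make the first bullet concrete, here is a minimal sketch, entirely my own and not from the book: it ranks a few made-up signals by their correlation with a target and then predicts the target from the best-correlated one, with no causal model at all. All variable names and numbers are invented for illustration.

```python
# A minimal sketch (my own assumption, not an example from the book) of
# prediction by correlation: candidate signals are ranked purely by how
# strongly they correlate with the quantity we care about.
# "ice_cream_sales", "umbrella_sales", and "flu_searches" are hypothetical,
# synthetic stand-ins for real-world data streams.
import numpy as np

rng = np.random.default_rng(7)
n_weeks = 200

temperature = rng.normal(20.0, 8.0, size=n_weeks)           # target to predict
signals = {
    "ice_cream_sales": 5.0 * temperature + rng.normal(scale=20.0, size=n_weeks),
    "umbrella_sales": rng.normal(50.0, 10.0, size=n_weeks),  # unrelated noise
    "flu_searches": -2.0 * temperature + rng.normal(scale=30.0, size=n_weeks),
}

# Rank signals by the absolute Pearson correlation with the target.
for name, values in sorted(
    signals.items(),
    key=lambda kv: -abs(np.corrcoef(kv[1], temperature)[0, 1]),
):
    r = np.corrcoef(values, temperature)[0, 1]
    print(f"{name:>17}: r = {r:+.2f}")

# Use the most correlated signal for a simple linear prediction of the target.
best = max(signals, key=lambda k: abs(np.corrcoef(signals[k], temperature)[0, 1]))
slope, intercept = np.polyfit(signals[best], temperature, deg=1)
pred = slope * signals[best] + intercept
print(f"predicting with '{best}': RMSE = {np.sqrt(np.mean((pred - temperature) ** 2)):.2f}")
```

Nothing in the sketch asks why a signal tracks the target; it only asks how well it does, which is exactly the "what, not why" stance the bullet describes.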
Conclusion: I think Viktor Mayer-Schönberger captures the three features of big data very accurately, but some of the interpretations in this article are my own rough conjectures, and I hope for criticism, correction, and further discussion from everyone.