Introduction

In 1988, I was studying in the mathematics department of Zhejiang University, where Professor Fan Dayin taught probability theory. I once asked her: "The nationwide male-to-female ratio of newborns is 51.2:48.8. If the statistics of every province show the same ratio, do they contain more information?" Professor Fan answered: "If the events occur with the same probability, the provincial results carry no additional information." More than twenty years later, I realized that her answer is completely correct in theory, but in reality the amount of information is not the same.
Big data is a buzzword that has received widespread attention from industry around the world. An old academician once remarked that big data has produced no major achievements and only infringes on personal privacy. Indeed, successful industrial applications of big data are still few. I believe big data will bring great changes to industry, but at the same time, working on big data in industry is risky: most people may come away beaten. Understanding big data is not hard; what is hard is not being fooled by bizarre concepts. If you do not want to be fooled, you need to understand its essence.
Some say that the most essential feature of big data is sheer volume: the data should be at the PB or EB level. Why this particular magnitude? Below it, existing methods can store, transmit, and process the data effectively; above it, new theories, methods, and ideas are needed. The expansion of data scale has thus spawned new theory. From an application perspective, however, the threshold seems unnecessary: analysis of data below this scale is often not done well either. Data mining theory has been around for decades, yet successful cases are not many. Therefore, from a theorist's point of view it is reasonable to emphasize data volume, but from an engineer's point of view there is little need to.
From the application point of view, does the amount of data matter? Put differently: to study a problem, is there a difference between 10 data points, 100 data points, and 10,000 data points? In the past, the difference was not very great. For example, to do linear regression, one more sample than the number of independent variables is enough to fit a model, and a few times more is basically quite good. With neural network methods, the sample count needs to be roughly an order of magnitude larger. With these methods, data beyond that struggles to play a larger role.
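The point can be illustrated with a small simulation. The following is a minimal sketch on synthetic data (the variable count, coefficients, and noise level are illustrative assumptions, not taken from the article): it fits ordinary least squares with a sample barely larger than the number of variables, with a few times more, and with many more, and shows how quickly the coefficient estimates stabilize.

```python
# Minimal sketch (synthetic data): how sample size affects the stability of
# ordinary least-squares estimates. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
p = 5                                   # number of independent variables
true_coef = np.array([1.0, -2.0, 0.5, 3.0, 0.0])

def fit_ols(n):
    """Generate n noisy samples and return the estimated coefficients."""
    X = rng.normal(size=(n, p))
    y = X @ true_coef + rng.normal(scale=1.0, size=n)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

for n in (p + 1, 5 * p, 10_000):        # "one more than p", "a few times p", many
    estimates = np.array([fit_ols(n) for _ in range(200)])
    err = np.abs(estimates - true_coef).mean()
    print(f"n = {n:>6}: mean absolute error of coefficients = {err:.3f}")
```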
Is the extra data really useless, then? My feeling is: more data is not useless, but people do not know how to use it, and it is genuinely hard to use. This is not a matter of individual inability; it is universal. What is the secret?
Anyone who has studied probability or statistics knows that all of the mathematical theory is built on specific assumptions, for example, that disturbances follow a certain probability distribution, or that measurement errors in the independent variables can be ignored. In many cases we take it for granted that these conditions hold, and so people are accustomed to doing the analysis straight out of the textbook.
But in reality, the theory's assumptions often do not hold. When analyzing industrial processes or equipment, the distribution of the data is often very irregular, and assumptions of randomness frequently lead to erroneous analysis. Consider the demographic question at the beginning of this article: we assume the sex of a newborn occurs with a fixed probability. But that is only an assumption. In fact, the sex ratio at birth in the Chinese population has changed considerably over the years and differs from province to province.
If statistical research is to be carried out rigorously, the first thing to confirm is whether the random phenomenon actually occurs with a fixed frequency. Only when this condition is satisfied does the basic premise of 'probability' hold, and only then can the subsequent analysis yield reliable results.
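As a minimal sketch of such a check (all counts below are made up purely for illustration), one can test whether the observed proportion is the same across provinces before treating the sex of a newborn as a single fixed-probability event:

```python
# Minimal sketch (hypothetical counts): check whether the observed frequency is
# actually stable across groups, here three invented "provinces".
import numpy as np
from scipy.stats import chi2_contingency

# rows: provinces, columns: [boys, girls]; all numbers are illustrative
counts = np.array([
    [51_800, 48_200],
    [52_400, 47_600],
    [51_100, 48_900],
])

chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Proportions differ across provinces: a single fixed probability is doubtful.")
else:
    print("No evidence against a common proportion at this sample size.")
```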
So we need more data to validate these basic assumptions, and this alone greatly raises the required data volume. In addition, when the signal-to-noise ratio of the data is low, the demand for data grows sharply. In one study, the author found that analyzing the effect of a single element required 2,000 to 20,000 data points.
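A simple simulation makes the connection between signal-to-noise ratio and data volume concrete. The sketch below uses entirely synthetic numbers (the effect size, noise level, and detection threshold are illustrative assumptions, not the author's data) and estimates how often a weak effect is detected at different sample sizes:

```python
# Minimal sketch (synthetic simulation): low signal-to-noise ratio inflates the
# number of samples needed to detect a weak effect reliably.
import numpy as np

rng = np.random.default_rng(1)
effect = 0.05          # weak true effect of the variable of interest
noise_sd = 1.0         # large noise relative to the effect (low SNR)

def detection_rate(n, trials=300):
    """Fraction of trials in which the effect is detected at roughly 2 sigma."""
    hits = 0
    for _ in range(trials):
        x = rng.normal(size=n)
        y = effect * x + rng.normal(scale=noise_sd, size=n)
        slope = np.dot(x, y) / np.dot(x, x)        # simple OLS slope
        se = noise_sd / np.sqrt(np.dot(x, x))      # its standard error
        if abs(slope) > 2 * se:
            hits += 1
    return hits / trials

for n in (100, 1_000, 5_000, 20_000):
    print(f"n = {n:>6}: detection rate = {detection_rate(n):.2f}")
```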
Seen this way, the extra data becomes useful: with a large amount of data, we can safeguard the correctness of the analysis.
One might ask: do non-traditional methods such as neural networks not place such demands on the data? Indeed, the neural network approach does not state any requirement explicitly. But then who can guarantee the reliability of the results? In fact, using neural network methods carries an implicit requirement: the modeling data must be sufficient, and the distribution of future data must be unchanged. The requirement of an 'unchanged distribution' is very demanding: not only must the range and density of the data distribution stay the same, but so must the relationships between the variables and the distribution of the disturbances. In practice this requirement is hard to verify and hard even to state precisely. As a result, the reliability of the results is unclear, which is very unfavorable for practical application.
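One partial check of this implicit assumption is to compare, variable by variable, the data used for modeling with the data the model is later applied to. The following is a minimal sketch on hypothetical data (variable names, sample sizes, and thresholds are all invented for illustration); note that it only examines marginal distributions, not the relationships between variables or the disturbances:

```python
# Minimal sketch (hypothetical data): a cheap check of the implicit "distribution
# unchanged" assumption, comparing each input variable between modeling data and
# the data the model is later applied to.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
train = {"temperature": rng.normal(1500, 20, 5000),
         "carbon":      rng.normal(0.8, 0.05, 5000)}
future = {"temperature": rng.normal(1500, 20, 5000),
          "carbon":      rng.normal(0.9, 0.05, 5000)}   # carbon has drifted

for name in train:
    stat, p = ks_2samp(train[name], future[name])
    flag = "DRIFT?" if p < 0.01 else "ok"
    print(f"{name:12s} KS = {stat:.3f}  p = {p:.3g}  {flag}")
# this only checks marginal distributions; joint relationships need further checks
```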
Since more data is useful, can the bar for 'big data' be lowered a little? The author's view is this: if completing a specific analysis task requires a large amount of data and requires new ideas and methods, it can be regarded as falling within the scope of big data; there is no need to overemphasize the raw volume.
The analysis so far may seem a little theoretical, so let me make it concrete.
I have long been engaged in industrial data modeling. In this work, knowing how reliable the analytical results are is very important. Reliability and practical value are often two sides of the same coin: if a correct discovery creates great value, a mistaken conclusion will inevitably lead to significant losses. Therefore, the greater the value of an analysis result, the higher the demand on its reliability. This is precisely where the difficulty of data analysis lies.
This is why we would like to have more data: to achieve reliability.
With a large quantity of data covering a wide and well-distributed range, we can not only verify whether the data itself is reasonable, but also combine the data appropriately to meet specific analysis requirements and achieve specific analysis goals. At the same time, abundant data allows analysis results to be checked against one another, so that the correctness of a specific conclusion can be examined from multiple angles and in an all-round way. This is something small samples cannot offer, especially when the data errors are relatively large or the relevant factors are numerous.
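As a minimal illustration of such cross-checking (synthetic data, with all names and numbers invented), the same simple analysis can be repeated on disjoint subsets of a large dataset and the conclusions compared:

```python
# Minimal sketch (synthetic data): with plenty of data, the same analysis can be
# run on independent subsets and the conclusions cross-checked against each other,
# something a single small sample cannot offer.
import numpy as np

rng = np.random.default_rng(3)
n = 30_000
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(scale=2.0, size=n)     # noisy relationship

# split into 5 disjoint subsets and estimate the slope on each
for i, idx in enumerate(np.array_split(rng.permutation(n), 5)):
    slope = np.dot(x[idx], y[idx]) / np.dot(x[idx], x[idx])
    print(f"subset {i}: slope = {slope:.3f}")
# if the subsets disagree wildly, the "conclusion" from any one of them is suspect
```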
Here I also think of the other commonly cited features of big data: 'velocity', 'variety', and 'low value density'. From the application point of view, these characteristics do not seem particularly significant.
1. Fast generation. This increases the difficulty of the analysis; it brings little benefit to the application and is mainly of theoretical interest.
2. Low value density. This also increases the difficulty of the analysis, but for applications it is a phenomenon hardly worth emphasizing. In fact, in order to obtain reliable results, individual pieces of 'small data' are often the key to analyzing big data. Moreover, discovering 'black swan' small data is often an important purpose of studying big data.
3. So-called 'variety' refers to the large share of unstructured data. It too is a factor that increases the theoretical difficulty and has no positive effect on practical utility. In reality, the more widely the data is distributed, the more easily the reliability of a conclusion can be confirmed from different angles and perspectives. So I would rather interpret 'variety' as breadth of data distribution than as diversity of data formats.
From the application point of view, the author appreciates the concept of 'data science': the comprehensive use of data analysis, model computation, and domain knowledge to solve practical problems.
For engineers, the purpose of analyzing data is to solve problems. To achieve that purpose, we should adopt every beneficial method and collect every useful piece of evidence, and should not confine ourselves to one particular theory or method. We look forward to big data, but we also like small data: we like complete, truthful data. IBM has since revised the '4V' formulation, and in the author's opinion that is very reasonable.
To sum up, the author believes that when applying big data theory in the industrial field, we should not cling to a 'fundamentalist' reading of it. We focus on big data in order to create value, not to chase fashionable theories and fields. In this sense, when manufacturing enterprises study big data they should place particular emphasis on the word 'industrial', to distinguish their work from the currently popular, business-oriented big data theory.
'Data mining' theory has been around for decades, yet successful industrial applications are few. In my view, one important reason is the lack of a suitable theory for analyzing and processing industrial data. I believe that making good use of industrial data requires attention to three key points:
1. Reliability. Only reliable conclusions can be applied in industrial practice. In my view, reliability includes accuracy, the scope of applicability, and the knowability of that scope. In reality, absolute reliability does not exist; we can only pursue relative reliability. Relative reliability is supported by drawing on as many independent pieces of knowledge or analytical results as possible. To be reliable, one cannot be satisfied with mere 'correlation' but must strive for 'causality'; in this respect, industrial big data theory and commercial big data theory are at odds. At the same time, reliability requires that we make use of traditional statistical methods with a solid theoretical basis wherever possible, only we cannot apply these methods blindly: we must pay attention to validating and establishing the conditions under which they apply (see the sketch after this list).
2. Going beyond existing understanding. Newly discovered knowledge must go beyond what people already know, otherwise it has no value. In business activities, people's understanding is relatively vague, so big data research easily produces results that surpass it. In the industrial field, people's understanding of the physical objects is often very deep, and superficial research can hardly surpass human experience. Here, for new knowledge to go beyond human experience, it usually has to be built on accurate quantification. It is not appropriate to take 'finding knowledge that differs from experience' as the research goal: in the industrial field, conclusions that contradict the experts' understanding are usually simply wrong. The genuine exceptions are often caused by quantitative changes, and recognizing them presupposes conclusions built on precise quantification.
3. Embeddability. The application of big data must be embedded into an appropriate process. In general, merely being satisfied with discovering knowledge does not create value. In industrial applications, the common practice is to embed newly discovered knowledge into production and management processes, preferably with a model as the carrier, so as to advance the intelligentization of the process. As we all know, applications of commercial big data are generally combined with a new business model; in this respect, industrial big data and commercial big data are alike.
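On the validation mentioned in point 1, the following is a minimal sketch (synthetic data; the particular checks, thresholds, and variable names are illustrative assumptions, not a prescribed procedure) of examining whether a classical regression's assumptions roughly hold before trusting its conclusions:

```python
# Minimal sketch (synthetic data): do not apply classical methods blindly.
# After fitting ordinary least squares, examine the residuals for rough
# normality and for drift over time, two conditions the textbook analysis assumes.
import numpy as np
from scipy.stats import shapiro, pearsonr

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)             # replace with real process data

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

_, p_normal = shapiro(residuals)                        # roughly normal residuals?
r_trend, p_trend = pearsonr(np.arange(n), residuals)    # residual drift over time?
print(f"normality p = {p_normal:.3f}, time-trend r = {r_trend:.3f} (p = {p_trend:.3f})")
# if either check fails, the standard confidence statements for the fit are not trustworthy
```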