It has always been a lonely field, with no glamorous appearance and no dramatic story to entertain people. But riding the east wind of "big data," it has become hot. People in all walks of life now build sentences around "Big data tells you ...". Following the logic that "all companies are IT companies," it now seems that "everything is about big data."
While big data is being touted, it also draws criticism. Recently, a "big data, Anli" article summarized several criticisms from abroad of big data abuse. The original is appended below:
1. Insignificant significance: big data without theory is superficial. It sees only significant correlations, but without testing and without theory, such correlations are meaningless and may be spurious. The key point is that with so many data points, it is extremely easy to find a significant relationship between two vectors, yet it is hard to control for spurious relationships precisely because there is so much data. This is a dilemma. I once submitted a paper, and an anonymous reviewer said: the sample is very large, so of course significant correlations can be found, but I see no meaning in them.
2. Sampling method problems: the statistician Fankaisa summed up a phenomenon. The data collected by Google, Facebook, and other networks is often not homogeneous: it is collected at different times with different resources and then merged. As a result, many parts of a big data set were not collected in the same way, which violates the basic assumptions of statistical sampling. Online and offline data are also inconsistent. For example, the electronic edition of the Wall Street Post differs from the print edition, and users can customize the content.
3. Machine language instability: Google began using search keywords to predict flu outbreak areas. At first this was said to be more accurate than the CDC's forecasts, but later it became less and less accurate. Some people think this is because Google's search algorithm keeps being improved, so the automatically collected data is unstable. In addition, errors compound when the machine is misled. For example, Google Translate is built from real articles, but some "real" translations on the web were actually produced by Google Translate, so Google ends up basing its translations on these "real" articles.
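The first criticism above, that significant correlations turn up easily when there is a lot of data, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the vectors are independent Gaussian noise, the function names are made up for illustration, and the cutoff 1.96 / sqrt(n) is a common approximation of the two-sided 5% threshold for a correlation coefficient, not an exact test.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_pairs(num_vectors=40, n_points=100, seed=0):
    """Count pairs of unrelated random vectors that still look 'significant'."""
    rng = random.Random(seed)
    # Every vector is pure noise, so any correlation between two of them is spurious.
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(n_points)]
               for _ in range(num_vectors)]
    # Assumption: approximate two-sided 5% cutoff for |r| under independence.
    threshold = 1.96 / math.sqrt(n_points)
    total = 0
    significant = 0
    for i in range(num_vectors):
        for j in range(i + 1, num_vectors):
            total += 1
            if abs(pearson_r(vectors[i], vectors[j])) > threshold:
                significant += 1
    return significant, total


significant, total = count_spurious_pairs()
print(f"{significant} of {total} unrelated pairs pass the 5% cutoff")
```

Roughly one pair in twenty is expected to pass the cutoff by chance alone, so the more variables are compared, the more "findings" appear even though nothing real is there.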
When companies talk about big data, they often want to collect all the data and analyze it, which is also the ideal scenario for big data analysis. In many cases, however, companies are limited by technology and cost and still rely on sampling. In practice, stratified sampling requires weights, and the weight of a stratum is inversely proportional to the probability that the stratum is selected. A stratum with a high weight cannot be ignored in the analysis. The problem with big data is that it can only collect data with low weights.
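The weighting rule described above can be sketched as follows. The population, stratum names, and selection probabilities are invented for illustration: an "easy" stratum that is cheap to collect and a "hard" stratum that is rarely sampled. Each sampled value is weighted by the inverse of its stratum's selection probability, and the result is compared with the unweighted mean.

```python
import random


def weighted_mean(samples_by_stratum, selection_prob):
    """Mean where each value is weighted by 1 / P(its stratum is selected)."""
    numerator = 0.0
    denominator = 0.0
    for stratum, values in samples_by_stratum.items():
        weight = 1.0 / selection_prob[stratum]
        numerator += weight * sum(values)
        denominator += weight * len(values)
    return numerator / denominator


rng = random.Random(1)

# Hypothetical population: the hard-to-reach stratum is larger and looks different.
population = {
    "easy": [rng.gauss(10.0, 2.0) for _ in range(2000)],
    "hard": [rng.gauss(50.0, 5.0) for _ in range(8000)],
}
# Assumed selection probabilities: easy data is collected far more often.
selection_prob = {"easy": 0.5, "hard": 0.01}

# Draw the sample: each unit is kept with its stratum's probability.
sample = {
    name: [v for v in values if rng.random() < selection_prob[name]]
    for name, values in population.items()
}

all_values = [v for values in population.values() for v in values]
true_mean = sum(all_values) / len(all_values)

pooled = [v for values in sample.values() for v in values]
naive_mean = sum(pooled) / len(pooled)
corrected_mean = weighted_mean(sample, selection_prob)

print(f"true={true_mean:.1f} naive={naive_mean:.1f} weighted={corrected_mean:.1f}")
```

In this setup the unweighted mean is pulled toward the easy stratum, while the weighted estimate lands much closer to the population mean. The rarely sampled stratum carries the high weight, which is why it cannot be ignored.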
Real life works the same way: the objects that are easiest to study are often the least interesting. Psychology often recruits college students for experiments, so articles that use students as samples are now hard to publish. Big data, therefore, although large, is sometimes unimportant.
Coincidentally, the book "Black Swan" also says that what drives change mostly follows a Pareto distribution, not a bell-shaped distribution, which on the surface seems to agree with the view that "the bigger the data, the less important it is." In fact, though, this places higher demands on users of big data: how to pick out seemingly unrelated variables from a vast amount of data and link them together to draw conclusions.
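The contrast between a Pareto distribution and a bell-shaped one can be made concrete with a rough simulation. The parameters here are assumptions chosen for illustration (a normal distribution with mean 100 and standard deviation 15, and a Pareto shape of 1.16, often associated with the 80/20 rule). Nothing in the sketch comes from the book itself.

```python
import random


def top_share(values, fraction=0.01):
    """Share of the total contributed by the largest `fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


rng = random.Random(42)
n = 100_000

# Bell-shaped data: values cluster around the mean (clipped at zero).
bell = [max(0.0, rng.gauss(100.0, 15.0)) for _ in range(n)]
# Heavy-tailed data: a few observations are enormous compared with the rest.
heavy = [rng.paretovariate(1.16) for _ in range(n)]

bell_share = top_share(bell)
heavy_share = top_share(heavy)

print(f"top 1% share, bell-shaped: {bell_share:.1%}")
print(f"top 1% share, Pareto:      {heavy_share:.1%}")
```

Under the bell-shaped distribution the top 1% of observations contribute only slightly more than 1% of the total, while under the Pareto distribution a handful of observations dominate it. In the second case, collecting more ordinary observations adds little, and the rare, extreme ones matter most.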
Because its concept is loose and it lacks theory, big data is filled with too many illusions. Combining it with specific application scenarios to meet business needs is the right direction for big data technology in enterprises.