Big Data, the Big Amway

Source: Internet
Author: User
Keywords: big data, Amway

Recently an unhealthy trend has been sweeping the community: even undergraduates now dare to wave a hard drive holding a few gigabytes and claim the data on it can solve this or that hard problem. It reminds me of the old joke about the virgin whose hard drive is packed with adult films: the guy sure looks cool.

Big data is far less fashionable in the social sciences than in computing and engineering. Searching Google Scholar for articles combining "big data" with social-science keywords, I find 194 in 2011, 635 in 2012, and 1,820 in 2013: roughly exponential growth over those two years. One or two thousand articles a year on a topic is not much; by comparison, "social stratification" already has more than 16,800 in 2014, and the year is not over. Yet online the topic of big data is hyped to the skies, especially in a country like ours where everyone knows a little and nobody knows it well. It has the feel of an Amway product being marketed as the replacement for everything traditional.

For programmers the world suddenly has too much data lying around: what used to be dismissed as coal slag now looks like diamond, and the miners wave their hoes shouting "long live data mining!" Fair enough; as technology advances, corn can replace gasoline and coal slag can be made into jewelry. Big data is genuinely good for engineering. The trouble starts when the miners carry their coal slag over to the social sciences, market it as diamond, and claim it replaces statistics and sampling techniques. Physicists also have plenty to say about big data, but I don't know physics, so I will leave that aside.

Abroad, the abuse of big data in other fields has already drawn plenty of criticism. The main points, as I summarize them:

1. Significance without meaning: big data without theory is skin-deep. A significant correlation that is never tested against a theory is meaningless, and quite possibly spurious. The crux is a dilemma: with so many data points, it is extremely easy to find a statistically significant relationship between two variables, yet precisely because the data are so large, spurious relationships are hard to control for. I once submitted a paper and an anonymous reviewer wrote back: with a sample this large, of course you can find significant correlations, but I don't see what they mean.
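To make the dilemma concrete, here is a small sketch of my own (not from the original post): with a couple of hundred thousand observations, even a negligible true correlation of about 0.02 is wildly "significant".

```python
import math
import random

random.seed(0)

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# y depends on x only through a tiny effect: the true correlation is ~0.02.
n = 200_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.02 * a + random.gauss(0, 1) for a in x]

r = pearson_r(x, y)
t = r * math.sqrt((n - 2) / (1 - r ** 2))  # t-statistic for H0: rho = 0
print(f"r = {r:.4f}, t = {t:.1f}")
```

The t-statistic sails far past any conventional threshold (roughly 1.96 at the 5% level), even though a correlation of 0.02 explains essentially nothing. Significance scales with sample size; meaning does not.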

2. Sampling problems: a statistician (Kaiser Fung, I believe) has summarized a common phenomenon: the data that Google, Facebook, and other platforms collect are often not homogeneous. They are gathered at different times from different sources and then merged, so many parts of a big dataset were not collected in the same way, overturning the basic assumptions of statistical sampling. Online and offline data are also inconsistent: the electronic edition of the Wall Street Journal differs from the print edition, and users can customize its content.

3. Instability of machine-collected data: Google once used search keywords to predict where flu outbreaks would occur, and at first the forecasts were said to beat the CDC's, but they grew less and less accurate. One explanation is that Google's search algorithm is constantly being improved, so automatically collected data are unstable. Worse, a misled machine compounds its own errors: Google Translate learns from "real" articles found on the web, but some of those "real" translations were themselves produced by Google Translate, so Google ends up training its translations on its own output.

All of the above are tensions between man and machine: data must be collected under the guidance of theory, or fallacies follow. These problems can be avoided or mitigated, but they are enough to keep big data from establishing itself in the social sciences in the short term. Beyond that, I have a conjecture under which big data is simply impossible in the study of human behavior. Fields that study texts, or dead history, or linguistics, perhaps; but sociology, criminology, and anthropology will all find it hard.

Sampling research is clear on this point. Once you fix the desired precision (the critical value z_(α/2)), the variance s, and the response rate r, you can basically calculate how many people must be drawn from a population to be representative, and the total population size N has little effect in the end (the formula was given in the post's Figure 1, not reproduced here). At a 95% confidence level, a small town of 4,000 and a city of 100,000 can both be represented: draw about 360 people from the town, about 390 from the city. The sample does not grow hundreds of times just because the population is hundreds of times larger. So, first of all, big data is unnecessary: once the precision requirement is met, a small sample and big data perform identically, and when the precision requirement is not met, big data only makes the error larger.
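The standard textbook formula for estimating a proportion, with the usual finite-population correction, shows how weakly the required sample depends on population size. This is my own sketch (assuming p = 0.5 and a 5% margin of error, which lands near the post's 360/390 figures):

```python
import math

def sample_size(N, z=1.96, p=0.5, e=0.05):
    """Required sample size to estimate a proportion p within margin of
    error e at confidence level z, with finite-population correction."""
    n0 = (z ** 2) * p * (1 - p) / e ** 2      # infinite-population size, ~384
    return math.ceil(n0 / (1 + (n0 - 1) / N))  # shrink slightly for small N

print(sample_size(4_000))       # small town
print(sample_size(100_000))     # mid-sized city
print(sample_size(10_000_000))  # plateaus almost immediately
```

Going from a town of 4,000 to a city 25 times larger raises the required sample by only a few dozen people.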

And that is only the simplest case; real sampling usually requires stratification. Suppose Erlonghu has ten cornfields, some large, some small, and illegal sex trade goes on in some of them. To find the field where it happens, you divide the ten fields into two strata, those near the villagers and those far away, and give the latter a higher sampling probability. This is called stratified sampling, and in practice almost every large-scale survey is some variant of it.

With stratified sampling, the later statistical calculations must be weighted (the post's Figure 2, not reproduced here). Set aside the stratum sizes m and N for the moment: the weight w is inversely proportional to φ, the probability that the stratum is selected. A stratum with a high weight cannot be ignored in the analysis. The problem with big data is that it collects only the low-weight data:
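A minimal sketch of this inverse-probability weighting (the strata, probabilities, and response values below are invented for illustration):

```python
def weighted_mean(strata):
    """Mean over all sampled units, each weighted by w = 1/phi,
    where phi is its stratum's selection probability."""
    weighted_sum = 0.0
    weight_total = 0.0
    for _name, phi, values in strata:
        w = 1.0 / phi  # hard-to-reach strata get large weights
        for v in values:
            weighted_sum += w * v
            weight_total += w
    return weighted_sum / weight_total

strata = [
    # (stratum, selection probability phi, sampled responses)
    ("fields near the village", 0.50, [9, 8, 10]),       # easy to sample
    ("fields far from the village", 0.10, [2, 3, 2, 4]),  # undersampled
]

print(weighted_mean(strata))
```

The naive mean of all seven responses is about 5.4, dominated by the easy-to-reach stratum; the weighted mean is about 3.6. Data that are easy to hoover up in bulk carry the least weight in the estimate.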

We know the Pareto distribution has very broad application: everything from mate selection to the distribution of wealth can be expressed with it. A related power-law curve is Zipf's, P(r) = 1/(r·ln r), which links importance to frequency: in linguistics, the frequency with which a word is used is inversely proportional to its rank, so a word ranked 10,000th occurs with probability on the order of 1/10,000 (the logarithmic factor pushes it somewhat lower). Because such distributions are so pervasive, I have a weight-based conjecture: since a stratum's weight rises as its sampling probability falls, the harder a population is to sample, the greater its statistical importance. In reality the easiest objects to study are often the most boring. Psychologists are forever running experiments on college students, to the point that papers sampling college students have become hard to publish; whereas whoever goes to Erlonghu and hangs out with Brother Hao will produce research that remains important even if it is not very rigorous.
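Plugging numbers into the Zipf form above (a quick check of my own, using the formula as the post states it):

```python
import math

def zipf_p(r):
    """Zipf-style probability for rank r > 1: P(r) = 1 / (r * ln r)."""
    return 1.0 / (r * math.log(r))

# Frequency falls roughly inversely with rank; the log factor pushes
# the rank-10,000 word down to about 1/92,000 rather than 1/10,000.
for rank in (2, 100, 10_000):
    print(f"rank {rank:>6}: P = {zipf_p(rank):.2e}")
```

The steep drop-off is the point: the bulk of easy, frequent observations sits at the head, while the rare, hard-to-sample tail is where the weight concentrates.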

Which brings me to the second great weakness of big data: the bigger the data, the less important it tends to be. One researcher scrapes up a pile of middle-class people's attitudes toward violent crime; another spends two months in Cicero running with gang bosses. Whose conclusions do you think matter more? Not that the former is meaningless; the general population is necessary in any analysis. But big data can only reach the data it can reach, and without sampling techniques it will never be representative. Like Amway: the product may well be good, but the marketing is often too crude. If it wants to replace the tradition, it still has work to do.

(Feel free to repost with attribution; plagiarists beware, haha.)

Original link: http://www.douban.com/note/422224292/
