Uncovering the Mystery of "All the Data" (I)--Reading "The Big Data Age" (III)



In "The Big Data Age", Mayer-Schönberger tells us that the first major feature of the big data era is "not random samples, but all the data." In the previous chapter, by analyzing the simplest of information needs, "the number of McDonald's restaurants in Beijing", we showed that random sample analysis remains necessary even in the big data era, because in reality there is no complete dataset for every research question.



This article focuses on the so-called "all the data" and tries to uncover its mystery for everyone.






What is "all the data"?



In "The Big Data Age", "all the data" is a concept set in opposition to random samples. "First, analyze all the data related to something, rather than relying on a small sample of it," Mayer-Schönberger says. So "all the data" clearly means "all the relevant data".



If we want to know how many people in Beijing have eaten at McDonald's, "all the data" would have to be the McDonald's dining records of everyone in Beijing. Unfortunately, we know that such a dataset does not exist.



Now look at a case of "all the data" from "The Big Data Age": Albert-László Barabási and his colleagues wanted to study interactions between people. So they examined all the mobile phone records over a four-month period--anonymized, of course--from a wireless carrier that served one-fifth of the population of the United States. This was the first network analysis done at the level of all society, with data close to "sample = population". By looking at all the communication records of millions of people, we can produce insights that perhaps could not be produced by any other means.



So the "all the data" Uncle Mayer speaks of here is "four months of mobile phone records" provided by a carrier serving one-fifth of Americans. What does that mean? Plainly put, it is four months of call records from a single mobile company. What is puzzling is that, although these are merely four months of records for one-fifth of Americans, the book claims that "this was the first network analysis done at the level of all society, with data close to sample = population."



How exactly are "all society" and "one-fifth of Americans" linked? And how are "sample = population" and "all mobile phone records within four months" linked?



Moreover, if four months of data counts as "all the data", then does three months, or two months, not count?



The seemingly simple "all the data" is not as simple as Uncle Mayer makes it out to be.






The past and present of "all the data"



The "all the data" in the case above is essentially four months of call records in one mobile carrier's database. From the many data applications described in "The Big Data Age", it is clear that what Uncle Mayer calls "all the data" is in fact what we usually call database data.



"All" may just mean that all the records in the database are included.



Even before the Internet became popular, people had already begun to record and store data, thanks to computer and database technology. In particular, certain industries such as banking and telecommunications were the first to record customers' transactions completely, and these records constitute Uncle Mayer's so-called "all the data".



This is decidedly a story of the small data age. In other words, the so-called "all the data" is not a product of the big data era; it was already ubiquitous in the small data age.



Analyzing such "all the data" with basic statistical methods was likewise a common practice in the small data age.



As for the much-told story of beer and diapers selling together in a supermarket, its data source may not even qualify as "all the data", because the supermarket never insisted that every customer register before buying.



"All the data" is not the "all data" we imagine it to be, nor even the "all relevant data" Uncle Mayer imagines. It is still partial data: for example, it may contain only one company's customer data. And it is still sampled data: for example, the four-month window in the case above.



Who says a sample must be a random sample?
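
To make this concrete, here is a minimal Python sketch with invented numbers: it contrasts a random sample of a population with the non-random, self-selected sample that one company's database represents. The 30% base rate and the self-selection rule are assumptions for illustration only.

```python
import random

random.seed(42)

# Hypothetical population of 1,000,000 people; assume 30% have eaten
# at McDonald's (True = has eaten).
population = [random.random() < 0.30 for _ in range(1_000_000)]

# A random sample: every person has the same chance of being selected,
# so the estimate lands near the true 30%.
random_sample = random.sample(population, 1_000)
print("random-sample estimate:", sum(random_sample) / len(random_sample))

# One company's "all the data": customers are self-selected. Assume every
# eater appears in the database, but only 10% of non-eaters do.
database = [p for p in population if p or random.random() < 0.10]
print("database estimate:", sum(database) / len(database))
# The database estimate (~0.81) is far from the truth (0.30), and adding
# more database records does nothing to remove the bias.
```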






Analysis error with "all the data"



One of the main reasons Uncle Mayer objects to random samples is that analysis based on random sampling carries statistical error, deviating from the true situation. So with "all the data", are our analysis results necessarily free of error?



Suppose we really do have all the data on who eats McDonald's in Beijing. True, if such a complete dataset existed, calculating a single variable from it would carry no sampling error; in fact, such a calculation involves no statistical inference at all. But having spent so much effort to obtain "all the data", we would surely not stop at computing a few percentages or doing simple univariate analysis. We want to do more with it, such as predicting which customers will buy a Big Mac on their next visit. The analyst then hands us a list of customers and tells us: 75% of these customers are likely to buy a Big Mac next time.



75% likely? That means each of these customers still has a 25% chance of not buying a Big Mac next time. That is analysis error.



The fact is that, apart from calculations on a single variable (which are not statistical analysis, complete data or not), the results of any statistical analysis are probabilistic, and statistical error is always present.
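
As a minimal sketch with invented numbers, the following Python snippet illustrates the point: a complete purchase history pins down the rate of a single variable exactly, yet a forecast for the next visit remains a probability, with roughly a quarter of the flagged customers not buying.

```python
import random

random.seed(0)

# Hypothetical "all the data": one record per past visit,
# 1 = bought a Big Mac, 0 = did not. The 75% rate is invented.
history = [1 if random.random() < 0.75 else 0 for _ in range(1_000_000)]

# With the complete history, this single-variable rate has no sampling error.
rate = sum(history) / len(history)
print(f"observed purchase rate: {rate:.3f}")  # close to 0.750

# But the NEXT visit of each flagged customer is still a 75/25 gamble:
# about a quarter of them will not buy, however large the history is.
next_visits = [1 if random.random() < rate else 0 for _ in range(10_000)]
print(f"flagged customers who actually bought: "
      f"{sum(next_visits) / len(next_visits):.3f}")
```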



Yet "The Big Data Age" leaves readers with the impression that as long as you use all the data, you need not worry about error at all.






Sampling from "all the data"



According to "The Big Data Age", once we have all the data, we no longer need to sample. Is that true?



Interestingly, in the very case of "all the data" analysis that Uncle Mayer gives us, the researchers took only four months of data from the database. Why only four months? Did the carrier's database hold only four months of data?



Of course not!



The fact is that the researchers sampled four months of data from the carrier's database. So why, with "all the data" at their disposal, did they take only four months of it?



Because in data analysis, more data is definitely not always better. Even with unmatched computing speed, too much data wastes researchers' time and resources, and irrelevant data may even distort the results of the analysis. What's more, by Uncle Mayer's own account, four months of data was enough to obtain satisfactory research results.
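
For illustration, here is a hypothetical Python sketch of the kind of time-window sampling the researchers performed: from call records spanning years, keep only those that fall inside a four-month window. The field names and dates are assumptions, not the actual schema of the study.

```python
from datetime import datetime

def four_month_window(records, start):
    """Keep records whose timestamp falls in [start, start + 4 months).

    `start` should be the first day of a month, so the shifted date
    is always valid."""
    end_month = start.month + 4
    end = start.replace(year=start.year + (end_month - 1) // 12,
                        month=(end_month - 1) % 12 + 1)
    return [r for r in records if start <= r["timestamp"] < end]

# Hypothetical call records spanning several years:
records = [
    {"caller": "A", "callee": "B", "timestamp": datetime(2006, 3, 15)},
    {"caller": "B", "callee": "C", "timestamp": datetime(2007, 1, 2)},
    {"caller": "A", "callee": "C", "timestamp": datetime(2007, 2, 20)},
]

window = four_month_window(records, datetime(2007, 1, 1))
print(len(window), "of", len(records), "records fall in the window")
```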



It seems that even with "all the data" in hand, sampling is still necessary.






What is truly hard to understand about "The Big Data Age" is that it sets random samples in opposition to database data and treats this opposition as one of the most striking features of the big data era. And in order to dramatize that opposition, it crowns database data with the name "all the data", which is unscientific and even dangerous. Please watch for my further analysis of the so-called "all the data".




