This article continues the analysis of the whole data begun in the previous installments.
More sampling of the whole data
In the whole-data example that Uncle Meyer gave in the previous article, the analyst took only four months of data from the database for analysis. Why? Because the task was not to map the long-term relationship between every customer in the database, but to analyze interpersonal relationships within a certain period of time and understand how individuals with different relationships affect the whole community network. Proper sampling of period-limited data is therefore necessary.
Imagine that the researchers had used all of the data in the database: far more human relationships would have been drawn in, which could have distorted the results of the study. Using all the data indiscriminately is therefore not necessarily the best choice.
Here is an example of sampling analysis applied to whole data. I once built an application for analyzing search engine algorithms. The principle was to randomly sample keywords, crawl the result pages returned by the major (US) search engines, and analyze how various SEO techniques affected page rankings. Over a long period, the web-page database I crawled also became what Uncle Meyer would call whole data. Should I have used all of it for every analysis? Of course not. Search engines constantly change their ranking algorithms; if I had included pages ranked under an outdated algorithm in an analysis of current ranking factors, it would have backfired and made the results inaccurate.
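A minimal sketch of that kind of filtering, assuming a hypothetical crawl archive stored as a CSV with a crawled_at column, and an assumed date for the last known algorithm update:

```python
import pandas as pd

# Hypothetical crawl archive: every result page ever collected
# (the "whole data"), with the date each page was crawled.
pages = pd.read_csv("crawled_pages.csv", parse_dates=["crawled_at"])

# Assumed date of the most recent known ranking-algorithm update.
ALGO_UPDATE = pd.Timestamp("2013-06-01")

# Keep only pages ranked under the current algorithm; pages crawled
# earlier reflect outdated ranking factors and would distort the
# ranking-factor analysis.
current_pages = pages[pages["crawled_at"] >= ALGO_UPDATE]
```

The same date filter, applied before each analysis run, is all it takes to keep retired ranking behavior out of the results.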
The same applies to the airline fare forecasting example that Uncle Meyer raises several times. Airlines may change the mechanism by which they set ticket prices. If records generated under an outdated pricing mechanism are included in the fare forecast, they will interfere with the analysis and increase its error.
More data is not always better. Even whole data should be sampled as the analytical task requires. The reasons can vary, but proper sampling is one way to optimize both the analysis process and its results. And sampling is not limited to random sampling.
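Since sampling is not limited to random sampling, here is a minimal sketch contrasting a simple random sample with a stratified one, assuming a hypothetical customer table with a segment column:

```python
import pandas as pd

# Hypothetical customer table; "segment" is an assumed grouping column.
customers = pd.DataFrame({
    "customer_id": range(1000),
    "segment": ["A"] * 700 + ["B"] * 250 + ["C"] * 50,
})

# Simple random sample: 10% of all rows.
simple = customers.sample(frac=0.10, random_state=42)

# Stratified sample: 10% drawn from within each segment, so small
# segments are represented instead of being drowned out by large ones.
stratified = (
    customers.groupby("segment", group_keys=False)
             .sample(frac=0.10, random_state=42)
)
```

Which scheme is right depends, as always, on the analytical task.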
The traps of the whole data
The first trap is that the so-called whole data is, in most cases, not "all" at all. Consider the heavyweight internet companies most likely to possess all the data: Google, Baidu, Facebook, Taobao. Which of their databases can truly be called "all"?
Most of the traps are not caused by the name "all", but the name certainly deepens them.
An enterprise that owns a database is often inclined to confine all kinds of analysis to that database. There is an old saying: you reap what you sow, or plant melons and you get melons, plant beans and you get beans. The whole-data version of this trap is: if you planted melons, you cannot analyze beans.
For example, a news site may rely on salacious and violent stories to attract netizens to download its news app. Over time, its app users may all turn out to be "yellow shirts". If you then use this "whole data" to learn how to sell red shirts to them, you are bound to get it wrong.
Here is a simpler example. By analyzing the whole data, you may conclude that a certain product is your customers' favorite. But is that actually the case? Perhaps the product they truly like does not appear in your whole data at all, in which case no amount of analysis will reveal what your customers really want.
The outside world is wonderful, and you often need to jump outside the whole data to experience it.
Whole data and random samples
For some reason, in Uncle Meyer's worldview, whole data and random samples are opposed like Yang Bailao versus Huang Shiren (the poor peasant and the landlord of The White-Haired Girl), a class struggle that is absolutely irreconcilable.
But that is not the case. Even with whole data in hand, random-sample questionnaire surveys are still useful, and sometimes necessary.
Because so-called whole data almost never contains truly "all" the data, it cannot hold every piece of information we want to know, so we often need to supplement it. One source is to link it with other "whole data"; in the United States, for example, personal credit records can be linked through Social Security numbers. Another way is to select a sample from the "whole data" (randomly or by other methods), use a customer questionnaire to collect the information missing from the database, and then merge the survey responses back into the whole-data analysis.
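A minimal sketch of that second approach, with hypothetical file and column names (a customer table keyed by customer_id, and a survey file holding the answers collected from the sampled customers):

```python
import pandas as pd

# Hypothetical "whole data": every customer in the enterprise database.
customers = pd.read_csv("customers.csv")           # contains customer_id

# Step 1: draw a random sample of customers to survey.
survey_sample = customers.sample(n=500, random_state=7)

# ... the sampled customers answer a questionnaire ...

# Step 2: load the questionnaire responses (assumed file), keyed by
# the same customer_id as the database.
responses = pd.read_csv("survey_responses.csv")    # customer_id + answers

# Step 3: merge the survey information back into the whole data.
# A left join keeps every database record; the survey columns are
# filled in only for the customers who were sampled.
enriched = customers.merge(responses, on="customer_id", how="left")
```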
Uncle Meyer has probably never heard of such analysis, or he would not have set random samples against whole data so absolutely. Yet this kind of analysis was already widely used in the small-data age.
"Not random samples, but all data." This is the most famous age feature of the big Data age. I used three articles to analyze the random samples and the so-called whole data. Before the end of this article I will summarize:
1. The so-called whole data, in most cases, refers only to the data in an enterprise's database;
2. There may be no whole data in the world that can solve every kind of problem;
3. Random samples and so-called whole data are not locked in a life-and-death confrontation; they can coexist peacefully and even complement each other;
4. So-called whole data, and the analysis of it, was already widespread in the small-data age;
5. Random-sample analysis will continue to prove its value in the big-data age;
6. Even so-called whole data often needs to be sampled for the analysis to be effective;
7. When analyzing so-called whole data, remember that the world outside it may be more exciting.
In short, random samples and so-called whole data (in fact, database data) belong to two different categories of concept, and setting them in opposition is logically problematic. More importantly, neither random samples nor whole-data analysis should be taken as the emblem of an era.
Random samples do not define the small-data age, and so-called whole data does not define the big-data age.