In today's Internet society, almost everyone comes into contact with statistics: economic data, securities information, real estate investment feasibility reports, corporate financial reports, and Internet-related figures such as page clicks, web traffic, user statistics, and user trend analysis reports. Data analysis is affecting our lives in ways we never imagined. However, a great deal of statistical data is misused, for subjective and objective reasons alike, so it fails to describe facts and convey information and instead often misleads readers. Worse, more and more people fabricate or manipulate data to deceive those who know little about statistics, in order to serve their own hidden purposes.
The book "How to Lie with Statistics", which I read some time ago, suggests that when we encounter a statistic, five simple questions are enough to see through most dubious figures: Who says so? How does he know? What is missing? Did somebody change the subject? Does it make sense?
Says who?
We often see a chart used to support an argument, and we tend to focus on what the data means while ignoring where it came from and how recent it is. Phrases such as "an authoritative source" or "a certain authority" often obscure the real source of the information. Even when a chart does cite authoritative data, someone with ulterior motives may have excerpted only part of it: the data is authoritative and credible, but the conclusion was added by the person quoting it, and it may be the complete opposite of what the original data shows. In addition, when asking about the source, always ask when the data was collected. Data is highly time-sensitive, and using old data to explain a current phenomenon can also lead to wrong conclusions.
For example, the following two charts show surveys of picture-software usage conducted six months apart, and the changes between them are very large. If we were planning a new picture-software product, relying on one survey rather than the other could lead to a different product positioning.
So when we see a statistical chart, the first things to consider are where it comes from and when its data was collected. We should ask, "Who says so?" Then we should add a second question: "How does he know?"
How does he know?
The main point here is how the data was obtained: whether the sample is large enough, whether the sample is biased, and whether the people surveyed cover the whole user population.
Here are two results from a survey of player users about the Highlight function, one with a sample size of 100 and one with a sample size of 2,000. With such different sample sizes, the results can differ greatly.
A common problem in Internet product design is that when a design or feature is uncertain, we simply ask nearby colleagues for their opinion. But colleagues do not represent the entire user base, so the result is biased.
Likewise, when a new product is released, usability testing is often carried out. If the conclusion is that half of the users had trouble operating a feature, the problem may sound very serious. In fact, behind that 50% there may have been only two test users, one of whom ran into a problem.
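As a rough illustration of why sample size matters, the sketch below computes the approximate 95% margin of error for a surveyed proportion, using the standard normal approximation. The function name `margin_of_error` and the 50% proportion are assumptions for illustration, not figures from the surveys above, and the approximation is itself crude for very small samples, so the n = 2 figure is only indicative.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p observed in n responses."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical case: 50% of respondents report a problem.
for n in (2, 100, 2000):
    print(f"n={n}: 50% +/- {margin_of_error(0.5, n) * 100:.1f} points")
```

With two testers the margin is wider than the estimate itself, while it shrinks to roughly 10 points at 100 respondents and about 2 points at 2,000.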
Is there anything missing?
That is, check whether all the factors that affect the conclusion have been disclosed. For example, a survey shows that the average monthly salary of a company's employees is 20,000, and the survey covers every employee. An outsider takes one look and thinks, wow, this company pays well. What the undisclosed raw data would reveal is that the company has 100 employees, the general manager earns 1 million, and the remaining staff average 10,000. Take the mean, and the company's average monthly salary comes out at about 20,000.
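The salary example can be checked with a few lines of Python. This is a minimal sketch that assumes every employee other than the general manager earns exactly the 10,000 average; it compares the mean with the median, which is not distorted by one extreme value.

```python
import statistics

# Salaries from the example: one general manager and 99 other employees.
# Assumption: every other employee earns exactly the 10,000 average.
salaries = [1_000_000] + [10_000] * 99

mean_salary = statistics.mean(salaries)      # pulled up by a single extreme value
median_salary = statistics.median(salaries)  # what a typical employee earns

print(mean_salary, median_salary)
```

The mean is 19,900, which rounds to the advertised 20,000, while the median is 10,000. A report that gives only the mean leaves out exactly the information a reader needs.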
Another example: a competitor satisfaction survey finds that satisfaction with our own product is significantly higher than with competing products, and everyone is delighted. But the survey method was overlooked: the respondents were in fact the most frequent users of our own product, so the result was a foregone conclusion.
Satisfaction again: suppose a product satisfaction survey among users gives a score of 85 (on a 100-point scale). The product may seem quite good, but without a comparison with competitors there is no way to know what level 85 really represents. In reality, the users of competing products all report satisfaction above 90. Of the following two charts, one shows only our own product's satisfaction and the other also shows the competitors'; the impressions they give are very different.
Did somebody change the subject?
When reading statistics, check whether a concept has been swapped somewhere along the way from collecting the raw data to drawing conclusions. For example, the questionnaire asks about disposable income, but the conclusion talks about income; the question asks which products people use, but the conclusion states which products they use often. Or the survey covers only certain factors, yet the conclusion carries no qualifier, so readers take it as a description of the overall situation. Current domestic university rankings are like this: different agencies use different indicators and produce different rankings, but the published results do not mention which indicators were used, so they often mislead and confuse readers.
I was struck by what happened after the 2008 Olympic Games: each of the four major portal sites claimed to have ranked first during the Games, which left users baffled and the industry skeptical. The first reason is that the companies ranked themselves by different indicators, such as "user access", "Web traffic", "average time spent per user", "access speed", and "the number of interviews", so each of the four portals could claim first place in Olympic coverage. The second reason is that the cited data sources differ. Even when different companies cite the same research company, the figures do not match. An excerpt from one research company's explanation: "Sina and Sohu used data from two different surveys of ours. The two surveys differ in city coverage, method, and so on, and their results are not comparable at all. Sina published the results of our computer-assisted telephone interviews in 128 cities in China, while Sohu published the results of our survey in 5 major cities: Beijing, Shanghai, Guangzhou, Qingdao, and Nanjing. Internet penetration and people's online preferences in those 5 major cities differ from those in the 128 cities, so the two sets of results inevitably reflect different things." When ordinary netizens notice the word "first", will they also pay attention to the data behind it?
Another case is when the data is the same but the chart's baseline, scale, and so on differ, which changes the impression the chart gives. In the two charts below, for example, the one on the left suggests at first glance that the difference in time spent online between the 2 users is small, while the one on the right makes the difference look very large.
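The effect of the baseline can be put into numbers. In the sketch below, the helper `apparent_ratio` and the two users' hours online are hypothetical, chosen only to show how the same pair of values looks when the vertical axis starts at zero and when it starts just below the smaller value.

```python
def apparent_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn bar heights for values a and b when the axis starts at baseline."""
    return (b - baseline) / (a - baseline)

# Hypothetical daily hours online for two users.
user_a, user_b = 3.0, 3.5

print(apparent_ratio(user_a, user_b))                # axis starts at 0
print(apparent_ratio(user_a, user_b, baseline=2.5))  # axis starts at 2.5
```

With the axis starting at zero, the second bar is about 17% taller than the first. With the axis starting at 2.5, it is drawn twice as tall, although the data has not changed.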
Does this information make sense?
Many statistics can be seen to be wrong at a glance. For example, after the BT incident, a survey agency declared that in its random survey of 100 netizens, 87.53% supported the ban on BTChina. With 100 respondents, every percentage must be a whole number, so the figure cannot be right. Similarly, when users are segmented into categories, you can quickly judge whether the result makes sense by checking whether each category corresponds to a real group of people, or whether everyone around you can find the category they belong to.
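This kind of sanity check is easy to automate. The sketch below uses a hypothetical helper, `is_achievable`, to test whether a reported percentage could come from any whole number of respondents in a sample of the stated size.

```python
def is_achievable(percent: float, n: int, decimals: int = 2) -> bool:
    """Return True if some whole number of respondents out of n rounds to the given percentage."""
    tolerance = 10 ** -(decimals + 1)
    return any(
        abs(round(100 * k / n, decimals) - percent) < tolerance
        for k in range(n + 1)
    )

print(is_achievable(87.53, 100))  # False: no count out of 100 gives 87.53%
print(is_achievable(87.0, 100))   # True: 87 out of 100
```

A figure such as 87.53% only becomes possible with a much larger sample, which suggests that either the percentage or the stated sample size was reported incorrectly.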
Finally, here are two of the most common and most often misleading examples:
Many of you must have heard a teacher do this kind of calculation at school: the exam is 1 month away; subtract 8 hours of sleep a day, 10 days in total; subtract about 4 hours a day for meals, activities, and so on, 5 days in total; then subtract two days a week for weekends, 8 days in total; so only 7 days of study time remain. Hearing this makes you very nervous, but the time is not really that short. We were fooled by the teacher: the sleep and meal hours on the weekends are subtracted twice, and the remaining "7 days" are full 24-hour days of pure study time.
The second example: a product development project was planned to take 1 month. Later, because of some changes, the requirements planning time increased by 15%, the interface design time by 20%, the development time by 10%, and the testing time by 5%. Does the total time increase by 50%? No. The actual increase in total time is certainly less than 20%, because it is a weighted average of the individual increases.
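Both examples can be worked through in a few lines. The sketch below redoes the teacher's subtraction in hours, then computes the project's overall increase. The phase shares of the original one-month plan are hypothetical, since the article gives only the percentage increases.

```python
# Teacher's calculation over a 30-day month, in hours.
DAYS, WEEKEND_DAYS = 30, 8
SLEEP_H, MEALS_H = 8, 4

# The teacher subtracts sleep and meals for all 30 days AND whole weekend days,
# so the weekend sleep and meal hours are removed twice.
teacher_hours = DAYS * 24 - DAYS * SLEEP_H - DAYS * MEALS_H - WEEKEND_DAYS * 24
actual_hours = (DAYS - WEEKEND_DAYS) * (24 - SLEEP_H - MEALS_H)
print(teacher_hours, actual_hours)

# Project example: (share of original plan, increase) per phase.
# Assumption: the shares are hypothetical; only the increases come from the article.
phases = {
    "requirements": (0.2, 0.15),
    "interface design": (0.2, 0.20),
    "development": (0.4, 0.10),
    "testing": (0.2, 0.05),
}
total_increase = sum(share * increase for share, increase in phases.values())
print(f"{total_increase:.0%}")
```

The teacher's figure is 168 hours, that is, the "7 days" counted as round-the-clock study, whereas 264 waking hours remain on weekdays alone once the double counting is removed. With the assumed shares the project grows by 12%; whatever the shares are, the total cannot exceed the largest single increase of 20%.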
In this era of information explosion, statistics is a powerful tool for revealing the nature of data, but unfortunately statistics do not necessarily reveal the truth and can sometimes be an accomplice to illusion. When we are confronted with all kinds of statistics in our lives, we need to stay clear-headed and view them with some reservation. For "if a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts, he shall end in certainties."
(This article originated on the Tencent CDC Blog; please credit the source when reprinting.)