Big Data Misconceptions in the "Dongguan Migration" Map

Source: Internet
Author: User

On February 9, CCTV exposed the sex industry in Dongguan, and the report set off a wave of reactions. That night, a map from Baidu Migration, billed as big data analysis, went viral online. The map shows, simply and directly, the ten most popular destination cities for people leaving Dongguan in the eight hours before 10 p.m. on February 9. The original post never said so explicitly, but the netizens who forwarded it tacitly took it to be a map of sex workers and their clients fleeing the city.


Viewed as a process, this is a classic case of applying big data. The conclusion starts from data, and the data set is large. A big data analysis method is then applied (Baidu handles the specific model and algorithm for you), the result is presented with the most fashionable visualization, and finally the desired conclusion is read off the published results. Everything looks perfect.


As a big data example, it beats many textbook cases. The analytical logic also looks rigorous: studying what happens after the shock of the CCTV exposure is what professionals call intervention analysis. Opinions differ on what the shock might cause, and the study picked the outcome the public cared about most, namely where the fleeing clients went. The choice of method also seems sound: it uses the Baidu Migration visualization tool directly, and every step from data to conclusion is present.


So, in the big data era, does a seemingly rigorous and complete analysis process guarantee a correct result? The answer touches on an important property of big data and a common misconception about it: using big data does not necessarily make the conclusion right.


In fact, whether the data is big or small, the essence of data analysis is that the method must match the hypothesis, the model, and the data. Let us use this example to reconstruct what a real big data analysis process looks like.


First, what data lies behind this migration map, and do readers and "analysts" really know? According to Baidu, the data comes from its LBS (location-based services) open platform; digging deeper, it actually comes from mobile clients. Baidu's developer platform states clearly that it provides Android, Symbian, and IP positioning interfaces. Put simply, whenever someone uses a mobile device to call Baidu Maps, or another service built on Baidu Maps, Baidu records the request and later uses the data for analysis.


But what data actually feeds the migration map? Has Baidu told the public directly? It has not. With the interface data there are at least two ways to draw such a map. The first uses recorded location requests: each user's positions at different times form a trajectory, which defines a migration. The second uses the route-planning interface, which records the start and end points of an intended trip.


The advantage of the first approach is that the data volume is large and reflects actual movement; the disadvantage is that it is hard to tell a waypoint from the end of a journey. The advantage of the second approach is that the start and end points are unambiguous; the disadvantage is that the data volume is small and many planned trips never actually happen. Judging from the available information, Baidu most likely uses the first approach, but it has not published the details of its processing.


Digging further along these lines reveals many problems. For example, traveling from Wuhan to Dongguan basically requires passing through Xianning, so how the flows from Wuhan and from Xianning into Dongguan are counted needs a clear definition. Baidu surely has one, but the public does not know it. Judging from the heat map, Wuhan and Xianning are both among the top ten cities flowing into Dongguan.
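The counting question can be made concrete with a minimal sketch. The traces below are invented for illustration, and the three attribution rules (`inflow_by_trip_origin`, `inflow_by_last_hop`, `inflow_by_every_city`) are hypothetical names and definitions; nothing here reflects how Baidu actually processes its data.

```python
from collections import Counter

# Hypothetical traces: each list is the ordered sequence of cities
# in which one device issued a location request.
traces = [
    ["Wuhan", "Xianning", "Dongguan"],
    ["Wuhan", "Xianning", "Dongguan"],
    ["Xianning", "Dongguan"],
    ["Wuhan", "Dongguan"],  # no request was made while passing Xianning
]


def arrivals(traces, dest):
    """Keep only traces that end in `dest` and contain at least one earlier city."""
    return [t for t in traces if len(t) > 1 and t[-1] == dest]


def inflow_by_trip_origin(traces, dest):
    """Credit each arrival to the first city seen in the trace."""
    return Counter(t[0] for t in arrivals(traces, dest))


def inflow_by_last_hop(traces, dest):
    """Credit each arrival to the city seen immediately before the destination."""
    return Counter(t[-2] for t in arrivals(traces, dest))


def inflow_by_every_city(traces, dest):
    """Credit each arrival to every distinct city seen before the destination."""
    counts = Counter()
    for t in arrivals(traces, dest):
        counts.update(set(t[:-1]))
    return counts


for rule in (inflow_by_trip_origin, inflow_by_last_hop, inflow_by_every_city):
    print(rule.__name__, dict(rule(traces, "Dongguan")))
```

The same four traces rank Wuhan first under one rule, Xianning first under another, and tie the two under the third, which is why the definition matters before any ranking is interpreted.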


My point with this example is that most people think they understand the data in big data but actually do not. How strong a conclusion the data can support is therefore something the public cannot judge well as long as Baidu does not fully disclose the details.


How any single detail is handled can significantly affect the conclusion. In this simple example, Baidu's migration map does not provide enough information for users to carry out in-depth analysis; it only shows a rough picture of the trend. Strongly implying that anyone who uses big data will arrive at the correct conclusion is clearly wrong.


The discussion of data sources may have been somewhat complicated; what follows is simpler. The issue here is selective sampling. Whether or not the data sources above are fully understood, it is at least clear that the data used is only a sample. Put simply, it represents users who opened a Baidu LBS service on a mobile device; put less simply, it also depends on how Baidu defines and measures its counts. Any conclusion drawn with a statistical method is meant to apply to the whole population, but the inference is made from a sample, so how representative the sample is determines the quality of the conclusion.


Before the Dongguan migration story, this Baidu application was already well known, first of all because of the Spring Festival travel rush. There is a joke about the travel rush: a television station asked passengers on a train whether they had managed to buy a ticket, and concluded that everyone had. Everyone recognizes this as a joke, but it is really a case of selection bias. The Dongguan migration example has the same problem, yet we do not treat it as a joke.
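The train survey can be written out as a small sketch. The population size and the 60% success rate are made-up numbers chosen only to make the arithmetic visible; the field name `has_ticket` is likewise hypothetical.

```python
# Hypothetical population: 1000 people who want to travel,
# of whom the first 600 managed to buy a ticket.
population = [{"has_ticket": i < 600} for i in range(1000)]


def ticket_rate(people):
    """Share of the given people who hold a ticket."""
    return sum(p["has_ticket"] for p in people) / len(people)


# The survey is conducted on the train, so only ticket holders can be asked.
on_train = [p for p in population if p["has_ticket"]]

true_rate = ticket_rate(population)
surveyed_rate = ticket_rate(on_train)

print("true rate:", true_rate)
print("rate measured on the train:", surveyed_rate)
```

The measured rate is 100% regardless of the true rate, because the way respondents are selected already depends on the answer. Data collected only from people who opened a location service on a phone is filtered in an analogous way.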


Another, more serious problem also starts with a joke: the claim that the more milk people drink, the more likely they are to get cancer. It is a frightening conclusion, but if you collect data on milk consumption and cancer rates across regions, even a simple scatter plot will show a positive correlation.


Many readers will have spotted the problem already: a key factor is missing. Generally speaking, economically developed regions have relatively high milk consumption and, because of the pace of life and environmental pollution, relatively high cancer rates as well. The key factor is the level of regional economic development, not a direct relationship between milk consumption and cancer.
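A short simulation illustrates the missing-factor problem. All numbers are synthetic: a hypothetical `development` score drives both `milk` and `cancer`, and milk has no direct effect on cancer by construction. The sketch compares the raw correlation with the correlation left after the development score is regressed out of both variables.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """Residuals of a least-squares straight-line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_regions = 2000

# Synthetic regions: development is the common cause of both variables.
development = [rng.random() for _ in range(n_regions)]
milk = [10 + 50 * d + rng.gauss(0, 5) for d in development]
cancer = [100 + 80 * d + rng.gauss(0, 10) for d in development]

raw_r = pearson(milk, cancer)
partial_r = pearson(residuals(milk, development),
                    residuals(cancer, development))

print("raw correlation:", round(raw_r, 3))
print("correlation after controlling for development:", round(partial_r, 3))
```

In this synthetic setup the raw correlation is strongly positive while the controlled one is close to zero, which is the pattern the milk joke relies on. Real data would of course need a properly specified model rather than a single straight-line adjustment.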


Returning to the Dongguan migration example: Dongguan is a remarkable small city whose GDP ranks near the top, and the number of migrant workers it attracts each year is far from small. People involved in the sex industry make up only a small share. In terms of magnitude, the effect of the CCTV exposure on population movement may be no larger than random error.
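One simple way to check whether a change stands out from ordinary fluctuation is to compare it with the day-to-day spread of earlier observations. The sketch below uses invented daily outflow counts (`baseline` and `event_day` are hypothetical values, not Baidu data) and a plain z-score, which is only one of many possible checks.

```python
import statistics

# Hypothetical outflow counts for seven ordinary days, in arbitrary units.
baseline = [1000, 1080, 950, 1120, 990, 1060, 1010]
# Hypothetical count for the day of the event.
event_day = 1070

mean = statistics.mean(baseline)
spread = statistics.stdev(baseline)  # sample standard deviation

# How many standard deviations the event day lies from the usual level.
z = (event_day - mean) / spread

print("baseline mean:", mean)
print("z-score of event day:", round(z, 2))
if abs(z) < 2:
    print("within ordinary day-to-day variation")
else:
    print("unusually far from the baseline")
```

With these made-up numbers the event day sits well inside the normal range, even though it is above the mean. A ranking or a single snapshot, without a baseline to compare against, cannot show whether anything unusual happened.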


Finally, back to the data itself. Many readers looked at the top ten ranking but not at the share each city holds. For outbound destinations, the top three, Hong Kong, Ganzhou, and Chenzhou, have shares above one tenth, while the shares of the other cities are very small; Zhangzhou, in tenth place, has only 19 per thousand. Fussing over the other cities is meaningless.


Looking again at those top three cities: even when queried at the time of writing (11 p.m. on February 10), they were still the top three. The ranking on that date therefore does not show that the CCTV exposure had a significant effect on the time series. Looking at the inbound data for Hong Kong, Ganzhou, and Chenzhou, Dongguan is not in the top ten for any of them, so even if something unusual did happen in those three cities, Dongguan is not necessarily the cause.


In any case, the "Dongguan migration" example is certainly a good big data example. Its value lies not in the conclusion circulated online, but in how clearly it illustrates both a real big data analysis process and the way the public commonly misuses big data analysis. Big data analysis is not a panacea. Whatever the analysis, it should rest on scientific methods; otherwise it will badly mislead people, and the loss outweighs the gain.
