Avoiding thinking traps in big data analysis

Big data analysis dates back some 30 years. Back then, the prevailing view in the data analysis world was that the tools and algorithms could already analyze anything in depth; what was missing was the data. Data analysts would say: if you let me measure everything and trace every data point, from microscopic details such as minute-by-minute sales and each individual's resource consumption up to macro variables such as interest rate changes, I can tell you anything you want to know: the correlations between these variables, their trends, everything.

That view was long the mainstream position in the data analysis community. Today, the volume of data is no longer the problem; on the internet you can find almost any data you need. Want to know the relationship between sales of industrial cleaning equipment in Pennsylvania and equipment usage in the state's steel mills? No problem. Want to improve customer satisfaction? Run a clustering algorithm over the customer complaint data. A few mouse clicks, and vast amounts of data are at your fingertips.
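
As a taste of how little effort such an analysis takes today, here is a minimal sketch of clustering complaint text with scikit-learn. The input file name and the number of clusters are illustrative assumptions, not details from the article.

```python
# Minimal sketch: clustering customer complaints with TF-IDF + k-means.
# Assumes a plain-text file "complaints.txt" (one complaint per line)
# and k=5 clusters -- both illustrative choices, not from the article.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

with open("complaints.txt", encoding="utf-8") as f:
    complaints = [line.strip() for line in f if line.strip()]

# Turn free-text complaints into TF-IDF feature vectors.
vectors = TfidfVectorizer(stop_words="english").fit_transform(complaints)

# Group the complaints into 5 clusters.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(vectors)

for text, label in zip(complaints, labels):
    print(label, text[:60])
```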

The "Rashomon Gate" of large data

So the problem is no longer a shortage of data. Analysts can no longer say, "My analysis is fine, I just need more data." Today, the data is plentiful enough to feed any analytical method. Instead, analysts need to think about "which analytical approach is best" and "what the data can actually tell us."

This naturally raises another problem, perhaps the real problem with big data: with data this abundant, you can produce whatever analysis result you set out to find.

There is a saying: "There are two kinds of lies in this world: the first is called a lie, the second is called statistics." Our brains have an unparalleled ability to discover patterns, even where none exist.

A professor at the Darden School of Business runs an experiment like this in his class. He picks two students. One uses a random number generator to produce a sequence in which each number is a random integer between 1 and 10. The other writes a sequence of the same length by hand, choosing each integer from 1 to 10 as "randomly" as he can. A third student then shows the professor both sequences, and almost every time the professor correctly identifies which one is truly random and which was written by hand. The sequences with apparent runs or frequently repeated numbers are the random ones; the hand-written sequences avoid regularity and repetition as much as possible. Why? Because we subconsciously assume that anything showing regularity or repetition must have a cause behind it, that it cannot be random. So whenever we see something that looks even slightly like a pattern, we believe some non-random factor must be producing it.
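
The professor's trick is easy to reproduce. The following is a small sketch (my illustration, not the professor's actual procedure) that generates random sequences like the first student's and counts immediate repeats; genuine randomness produces them far more often than intuition suggests.

```python
# Illustrative sketch: generate sequences of random integers 1..10 and
# count how often adjacent numbers repeat -- genuine randomness produces
# such "patterns" far more often than people writing numbers by hand do.
import random

def adjacent_repeats(seq):
    """Count positions where a number immediately repeats."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a == b)

random.seed(0)
trials = 10_000
length = 20
repeat_counts = [
    adjacent_repeats([random.randint(1, 10) for _ in range(length)])
    for _ in range(trials)
]

# With 19 adjacent pairs and a 1/10 chance of a repeat per pair, we
# expect about 1.9 repeats per sequence, and roughly 86% of sequences
# contain at least one (1 - 0.9**19).
print("mean repeats:", sum(repeat_counts) / trials)
print("share with >= 1 repeat:", sum(c > 0 for c in repeat_counts) / trials)
```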

This subconscious reflex comes from our survival instincts in nature. When you see the grass move, you would rather believe a tiger is lurking there than attribute it to a "random" gust of wind, because sooner or later a tiger really does jump out.

"Small Experiment" to verify "big Data"

How can we avoid falling into this cognitive trap? One answer is the "small-scale experiment" advocated by Darden School of Business professor Jeanne Liedtka. The difference between "small-scale experiments" and "big data mining" is that small-scale experiments are designed specifically to verify the laws "discovered" with analytical tools (or imagined with their help). The key to designing a small-scale experiment is to use fresh examples to test the rule you have found. If the validation succeeds, the credibility of the rule or pattern increases.

Why "small scale"? Because, in the massive data plus the analysis tool, may let us discover innumerable laws and patterns, but to each law or the pattern verification will devote the resources (time and the money). By reducing the size of the experimental data, we can validate more possibilities faster and more effectively. This will also speed up the innovation process of the enterprise.

How to run a "small-scale experiment" depends on the situation. In general, the experiment draws on the same dataset used for the big data analysis: take out one subset and discover a pattern in it, then verify the pattern on a different subset. If the pattern also holds on the validation subset, collect new data with the original collection methods and verify it further.
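
A minimal sketch of this discover-then-verify loop, assuming the "pattern" is a correlation between two variables (the synthetic data, variable names, and threshold are illustrative assumptions, not details from the article):

```python
# Minimal sketch of "discover on one subset, verify on another."
# The dataset, the pattern (a correlation between two columns), and the
# threshold are all illustrative assumptions, not from the article.
import numpy as np

rng = np.random.default_rng(42)

# Stand-in dataset: 1,000 rows with two weakly related columns.
x = rng.normal(size=1_000)
y = 0.3 * x + rng.normal(size=1_000)

# Split into a "discovery" subset and a held-out "validation" subset.
idx = rng.permutation(len(x))
discover, validate = idx[:500], idx[500:]

r_discover = np.corrcoef(x[discover], y[discover])[0, 1]
r_validate = np.corrcoef(x[validate], y[validate])[0, 1]

print(f"correlation on discovery subset:  {r_discover:.3f}")
print(f"correlation on validation subset: {r_validate:.3f}")

# Only trust the pattern if it survives on data it was not discovered in.
if abs(r_validate) > 0.2:  # illustrative threshold
    print("Pattern holds on held-out data; worth collecting new data to confirm.")
else:
    print("Pattern did not replicate; likely a spurious discovery.")
```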

The insurance company Progressive and the credit card company Capital One are two firms that have used data analysis to build a competitive advantage. In their practice, they combined "big data" with "small experiments" in exactly this way: recognizing the danger of our innate ability to discover "non-existent" patterns, they used small-scale experiments to make their data mining both fast and reliable.

Massive data plus analytical tools have made data analysis a hot field, and many companies believe data analysts can "turn data into gold." But as the saying goes, "People see what they want to see." Now that we have huge volumes of data and tools that can "find any pattern," we must not forget the oldest method of all: the small-scale experiment. Otherwise, tens of millions of dollars of big data investment may uncover only the "laws" we imagined.
