Big Data Thinking

Source: Internet
Author: User
Tags lenovo

Three shifts in mindset for the big data age:

  • Analyze all the data, not a small sample of it
  • Accept messiness in the data rather than insisting on exactness
  • Focus on correlation between things, not causation

    1. Analyze all the data, not a sample

    Until recently, our ability to collect data was limited, so we relied on "random sampling analysis".

    For example, to find out how satisfied Chinese customers are with Lenovo notebooks, it is impossible to survey everyone who has bought one. The usual practice is to pick 1,000 people at random and let their satisfaction stand for everyone's.

    To make the results as accurate as possible, we design the questionnaire as carefully as possible and make the sample as random as possible.

    This is the "small data age" approach. When collecting all the data was impossible, random sampling achieved great success in many fields.

    However, there are three problems with random sampling:

      1. It depends on randomness, and true randomness is hard to achieve. For example, calling 1,000 households at random on landline phones still falls short, because it ignores the fact that young people mostly use mobile phones.
      2. It works well from a distance but blurs once you zoom in on a point. For example, 1,000 people drawn at random from the whole country can represent the country. But if the same result is used to judge satisfaction in Tibet, it lacks precision, because only a handful of those 1,000 people live there. In other words, the overall result cannot be applied to a subgroup.
      3. A sample can only answer the questions you designed in advance; it cannot answer questions that occur to you afterwards.
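
    The second problem can be made concrete with a rough calculation. The sketch below uses the textbook margin-of-error formula for a proportion from a simple random sample. The 0.3% regional share is a made-up figure for illustration, not a real population statistic, and the normal approximation is crude for very small samples.

```python
import math


def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion (simple random sample)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


national_sample = 1000
region_share = 0.003  # hypothetical: the region holds 0.3% of all customers

# Expected number of respondents from that region in the national sample.
regional_sample = max(1, round(national_sample * region_share))

national_margin = margin_of_error(national_sample)
regional_margin = margin_of_error(regional_sample)

print(f"national: n={national_sample}, margin = +/-{national_margin:.1%}")
print(f"regional: n={regional_sample}, margin = +/-{regional_margin:.1%}")
```

    Under these assumptions the national estimate is good to about 3 percentage points either way, while the regional slice of roughly three respondents has a margin of more than 50 points, which is too wide to say anything useful.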

    In the "big data age", sample = population.

    Today, we have the ability to gather comprehensive and complete data. Big data is based on having all the data, or at least as much of it as possible.

    2. Accept messiness, not exactness

    In the "small data" era, the most important thing was to reduce measurement error. Because so little information was collected, each record had to be as accurate as possible, otherwise small errors would be magnified. To achieve that precision, scientists had to keep improving their measuring tools. That is how modern science developed; as the physicist Lord Kelvin, after whom the SI unit of temperature is named, said: "To measure is to know." A good scientist had to be able to collect and manage data accurately.

    In the "big data" era, using all the data becomes possible, and that usually means tens of millions of records. Guaranteeing the accuracy of every single one is unthinkable, so messiness is unavoidable. However, when the amount of data is large enough, messiness does not necessarily lead to bad results. And by loosening our tolerance for error, we can gather more data and do more with it. For example:

    Suppose we want to measure the temperature of a vineyard. If there is only one thermometer, it must be accurate and must work all the time. But if there is a thermometer for every 100 vines, some individual readings will be wrong, yet all the readings combined can give a more accurate result.
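
    A small simulation illustrates the vineyard argument. It is a sketch under stated assumptions: each reading is the true temperature plus independent, zero-mean Gaussian noise, and the noise levels (0.5 degrees for one good thermometer, 2.0 degrees for each cheap one) are invented for illustration. If all the cheap sensors shared the same systematic bias, averaging would not remove it.

```python
import random
import statistics


def mean_abs_error(num_sensors, noise_std, true_temp=20.0, trials=2000, seed=0):
    """Average absolute error of the mean reading over many simulated trials."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    errors = []
    for _ in range(trials):
        readings = [rng.gauss(true_temp, noise_std) for _ in range(num_sensors)]
        errors.append(abs(statistics.fmean(readings) - true_temp))
    return statistics.fmean(errors)


# One accurate thermometer versus 100 sensors that are each four times noisier.
single_precise = mean_abs_error(num_sensors=1, noise_std=0.5)
many_cheap = mean_abs_error(num_sensors=100, noise_std=2.0)

print(f"1 precise sensor:  typical error {single_precise:.2f} degrees")
print(f"100 cheap sensors: typical error {many_cheap:.2f} degrees")
```

    Although each cheap sensor is individually worse, the noise in their average shrinks roughly with the square root of the number of sensors, so the combined reading ends up closer to the true value.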

    As a result, "big data" usually speaks in probabilities rather than with an air of certainty. The "big data" era requires us to re-examine the merits of precision. Because the volume of data is so large, we no longer expect every data point to be exact.

    In a library, every book is categorized. To find a book on the C language, for example, you first go to the "Engineering" category, then to the "Computer" subcategory, and then locate the book by its call number (something like 803.53x). This is the traditional method. It works when a library holds relatively few books, but what if there are 100 million? What about 1 billion? The data on the Internet far exceeds any library's holdings and easily runs into the billions of items. With a rigid classification, both the people doing the classifying and the people doing the searching would go crazy. That is why the Internet now makes wide use of "tags" to retrieve pictures, videos, music, and so on. Of course, people sometimes apply the wrong tag, which is painful for those accustomed to exactness, but accepting this "chaos" brings two benefits:

      1. Because there are far more tags than categories, we can reach more content.
      2. We can filter content by combining tags.

    For example, suppose we want to look up "Cynanchum paniculatum". The name has at least three identities: it is a Chinese herbal medicine, it is the name of the person after whom the herb is named, and it is the name of one of the protagonists of the game Paladin 3. Under a traditional classification, "Cynanchum paniculatum" would probably be filed under "Herbal medicine", although that depends on whoever does the classifying. Someone searching would then never learn of the other two identities, and someone looking for the person would not think to search under "Herbal medicine". With tags, however, entering "Cynanchum paniculatum" + "herbal" finds the herbal medicine, and entering "Cynanchum paniculatum" + "Paladin 3" finds the protagonist of the game.

    Therefore, although using tags instead of categories introduces a lot of inexact data, the sheer number of tags makes searching more convenient and the results better.
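
    Combining tags amounts to intersecting sets. The following is a minimal sketch of that idea, not a description of how any particular website implements it; the `TagIndex` class and the entry names are hypothetical.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index: each tag maps to the set of items carrying it."""

    def __init__(self):
        self._items_by_tag = defaultdict(set)

    def add(self, item, tags):
        """Attach any number of tags to an item; no single category is required."""
        for tag in tags:
            self._items_by_tag[tag.lower()].add(item)

    def search(self, *tags):
        """Return the items that carry every one of the given tags."""
        if not tags:
            return set()
        matches = [self._items_by_tag.get(tag.lower(), set()) for tag in tags]
        return set.intersection(*matches)


index = TagIndex()
index.add("herb entry", ["Cynanchum paniculatum", "herbal"])
index.add("person entry", ["Cynanchum paniculatum", "person"])
index.add("game character entry", ["Cynanchum paniculatum", "Paladin 3"])

print(index.search("Cynanchum paniculatum"))               # all three entries
print(index.search("Cynanchum paniculatum", "herbal"))     # only the herb
print(index.search("Cynanchum paniculatum", "Paladin 3"))  # only the game character
```

    Because an item can carry any number of tags, nobody has to decide on one "correct" category up front, and a wrongly tagged item only affects searches that use that particular tag.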

    3. Focus on correlation, not causation

    Knowing "what" is enough; there is no need to know "why". Let the data speak for itself. Consider an example:

    Wal-Mart is the world's largest retailer and holds an enormous amount of sales data. By analyzing it, Wal-Mart found that before a seasonal hurricane arrives, sales of Pop-Tarts rise along with sales of flashlights. So when a seasonal storm is on its way, Wal-Mart stocks Pop-Tarts next to the hurricane supplies to make shopping easier for customers.

    At this point someone will immediately ask, "Why do people want to buy Pop-Tarts when a hurricane is coming?"

    Asking "why" means you care about causality. But the cause may be complex and extremely hard to analyze, and even if it could be worked out, would the answer really matter? For Wal-Mart, it is enough to know that "when a hurricane is coming, put out the Pop-Tarts and get ready for strong sales." That is what focusing on correlation means.

    Hurricanes are correlated with Pop-Tarts sales, and acting on that makes money. Why? It does not matter.

    This is another shift in thinking that the big data era calls for: focus on correlation, not causation.
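
    Measuring a correlation requires no theory of why two things move together. The sketch below computes a Pearson correlation coefficient by hand; the weekly figures are invented purely for illustration and are not Wal-Mart data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly records: 1 = hurricane warning issued that week, 0 = none.
hurricane_warning = [0, 0, 1, 0, 1, 0, 0, 1]
# Hypothetical units of a snack sold in the same weeks.
snack_sales = [100, 110, 310, 95, 290, 105, 120, 330]

r = pearson(hurricane_warning, snack_sales)
print(f"correlation between warnings and snack sales: {r:.2f}")
```

    A coefficient close to 1 says that the two series rise and fall together, which is enough to decide where to place the stock. It says nothing about why customers behave this way.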

    Exploring "what" rather than "why" can help us understand the world better. Causality, however, is ingrained in our thinking, and we sometimes imagine causal relationships that do not exist, which leads to mistaken beliefs. For example:

    Parents often tell their children that going out on cold days without a hat and gloves will give them a cold. However, studies have shown that there is no direct link between catching a cold and what you wear. Likewise, if we eat at a restaurant and have a stomachache that night, we assume the restaurant's food was the problem. In fact, the cause is probably a handshake with someone, or not washing our hands before the meal.

    Correlation gives us a new perspective for analyzing problems: we do not need to explore why, and it persuades us that not exploring "why" can be reasonable.

    However, this is not to say that causality should be abandoned altogether; rather, we should be flexible and think about problems from the standpoint of correlation.

This article is a summary of the book "The Era of Big Data" by Viktor Mayer-Schönberger.
