Big data: From causal analysis to correlation analysis

Source: Internet
Author: User

This fall, Duke University's Fuqua School of Business began recruiting graduate students in big data business analytics, and the Xian Institute of Management will make overseas PhD graduates in big data analytics a priority in its new faculty hiring. Big data is no longer the preserve of computer science and statistics; its adoption by business schools shows that it is officially entering broad industry application.

"Most big data is irrelevant noise," says Nate Silver, statistician and author of the well-known book The Signal and the Noise. "Unless you have good techniques for filtering and processing the information, you will get into trouble." In other words, big data gives us a new way of looking at the world, but it is much like crude oil: without refining and application by business schools, it cannot be turned into modern industrial products such as gasoline, adhesives, aspirin, and lipstick. Our era's encounter with big data resembles the discovery of the Texas oil fields. Just as that discovery fueled the energy revolution of the industrial age, the broad use and consumption of big data in the information age will require full collaboration across disciplines and a change in thinking.

From causal analysis to correlation analysis

In the "pre-information age", business schools were limited in how they could analyze consumer behavior, market structure, competitive dynamics, organizational behavior, and supply chain management, because collecting data on consumers, employees, stock, factories and so on was time-consuming and costly. Even a giant like IBM, which had the capacity to feed the text of the People's Daily into a computer in an attempt to decipher the structure of the Chinese language (for example, to enable Chinese phonetic input or translation between Chinese and English), achieved breakthroughs in the 1990s but progressed slowly, and many problems remained in practice.

Google entered the field with a different approach, relying not on high-quality translations but on more data. The search giant collected translations from all kinds of corporate websites, European Union texts in every official language, and translated documents from its huge book-scanning project. Whereas IBM analyzed texts numbering in the millions, Google's data is measured in the billions. As a result, its translation quality surpasses IBM's, it covers 65 languages, and its quality is continuously optimized in the cloud. Google's messy big data beat IBM's small amount of clean data.

So how can messy big data be refined and put to use the way oil is? An important paradigm shift is the move from traditional causal analysis to correlation analysis. In traditional statistical analysis, a central concern is how reliable a claimed causal relationship is. Working with limited samples, scientists use professional statistical software to run hypothesis tests and make the test decision according to the p-value. The p-value is the probability of seeing a result at least as extreme as the one observed if there were in fact no real effect. A value of p < 0.05 is conventionally treated as statistically significant, and researchers often take it as support for a possible causal relationship between two variables.
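The hypothesis-testing step described above can be sketched with a permutation test, one simple way of estimating a p-value for a difference between two small samples. This is only an illustrative sketch: the function name `permutation_p_value` and the two sample lists are made up for the example and do not come from any study mentioned in this article.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means.

    The null hypothesis is that the group labels do not matter, so we
    shuffle the pooled values and count how often a random split gives
    a difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1

    # The +1 terms keep the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two small samples (invented numbers).
sample_a = [12.1, 11.8, 12.6, 13.0, 12.4, 12.9, 13.3, 12.7]
sample_b = [10.9, 11.2, 10.5, 11.6, 11.0, 10.8, 11.4, 11.1]

p = permutation_p_value(sample_a, sample_b)
print(f"estimated p-value: {p:.4f}")
print("significant at the 0.05 level" if p < 0.05 else "not significant")
```

A small p-value here only says that the difference is unlikely to be a product of chance under the null hypothesis. Whether the relationship is causal still depends on how the data were collected.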

But the advent of big data has changed the scientific community's customary pursuit of tested causal relationships. Big data is primarily about correlation, not causation, which fundamentally alters the traditional pattern of data mining. In February 2009, for example, Google researchers published a paper in Nature on predicting seasonal flu outbreaks that caused a sensation in the health care community. Google "trained" on the 50 million most frequently searched terms between 2003 and 2008, trying to find out whether certain search terms correlated with the geographic spread of flu recorded by the U.S. Centers for Disease Control and Prevention (CDC). The CDC tracks patient visits to hospitals and clinics across the country, but its information is typically released with a lag of one to two weeks, whereas Google's big data reveals trends in real time.

Google did not try to infer directly which query terms were the best indicators. Instead, to test the search terms, it processed a total of 450 million different mathematical models and compared their predictions with the actual flu cases recorded by the CDC in 2007 and 2008. The result was a combination of 45 search terms which, once fed into a mathematical model, showed a 97% correlation with the official data.
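The idea of scoring search terms purely by how well they track official figures can be sketched as follows. This is a toy illustration, not Google's actual method: the term names, the weekly counts, and the helper functions `pearson` and `rank_terms` are all invented for the example, and the "combined signal" is just an element-wise sum of the two best terms.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_terms(term_series, official):
    """Sort terms by their correlation with the official series, best first."""
    scores = {term: pearson(series, official)
              for term, series in term_series.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weekly figures: official flu rate (%) and search counts.
official_rate = [1.0, 1.5, 2.5, 4.0, 5.5, 4.5, 3.0, 1.8]
search_counts = {
    "flu symptoms":      [10, 14, 26, 41, 54, 46, 29, 19],
    "cough medicine":    [20, 22, 30, 38, 47, 44, 33, 25],
    "basketball scores": [30, 28, 31, 29, 30, 32, 29, 31],
    "art exhibition":    [15, 16, 14, 15, 17, 15, 16, 14],
}

ranked = rank_terms(search_counts, official_rate)
for term, r in ranked:
    print(f"{term:18s} r = {r:+.3f}")

# Combine the two best-correlated terms into one signal and score it again.
top_terms = [term for term, _ in ranked[:2]]
combined = [sum(values) for values in zip(*(search_counts[t] for t in top_terms))]
combined_r = pearson(combined, official_rate)
print(f"combined signal    r = {combined_r:+.3f}")
```

Nothing in this procedure asks why a term tracks the flu figures. A term is kept simply because its curve moves with the official one, which is the shift from causation to correlation that the article describes.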

The data is often imperfect: spelling mistakes and incomplete phrases are common. So why can Google make such accurate predictions? If we looked for a causal explanation, would it be that people searched because they felt unwell, because they heard someone sneeze, or because they read related news? Google does not consider such causal relationships. It works from correlation to predict the direction of an ongoing trend, because popular search terms change constantly, and the flap of a butterfly's wings in the outside world can send the search system into chaotic change.

Researchers at Warwick Business School in the UK, working with researchers in Boston University's physics department, have also used the Google Trends service to predict the ups and downs of the stock market. They tracked 98 search keywords with Google Trends. Some were related to investment behavior, such as "debt", "stocks", "portfolio", "unemployment", and "markets". Others were unrelated to investing, such as "lifestyle", "arts", "happiness", "war", "conflict", and "politics". They found that some terms, "debt" among them, were the leading keywords for predicting the stock market. The paper, titled "Quantifying Trading Behavior in Financial Markets Using Google Trends", was published in Scientific Reports, a journal from the publisher of Nature. Similarly, in 2010, researchers at Indiana University in the United States found that the mood of Twitter users helped predict the stock market. The "animal spirits" championed by Robert Shiller, a winner of this year's Nobel Prize in economics, can thus be used to predict asset prices through big data correlation tests.
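To make "predicting" concrete: a keyword is only useful for forecasting if this week's search volume correlates with next week's market move, not just with the current one. The sketch below computes such a lagged correlation. It is not the method of the paper cited above; the series `search_volume` and `returns`, and the function `lagged_correlation`, are hypothetical and exist only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def lagged_correlation(signal, target, lag):
    """Correlate signal[t] with target[t + lag] for a non-negative lag."""
    if lag < 0 or lag >= len(signal):
        raise ValueError("lag must be between 0 and len(signal) - 1")
    return pearson(signal[:len(signal) - lag], target[lag:])


# Hypothetical weekly data: search volume for one keyword and the
# percentage return of a stock index in the same week (invented numbers).
search_volume = [50, 52, 60, 58, 70, 65, 55, 54, 62, 75]
returns = [0.5, 0.4, 0.2, -0.8, -0.3, -1.5, -0.9, 0.6, 0.7, -0.4]

same_week = lagged_correlation(search_volume, returns, lag=0)
next_week = lagged_correlation(search_volume, returns, lag=1)
print(f"same-week correlation: {same_week:+.2f}")
print(f"next-week correlation: {next_week:+.2f}")
```

In this made-up data set a rise in searches tends to precede a weaker week for the index, which shows up as a strongly negative next-week correlation. With real data, such a pattern would need to hold out of sample before anyone traded on it.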

Of course, Google's algorithm is not infallible. Earlier this year, for example, Google Flu Trends indicated that 10% of Americans might have the flu, but data from the U.S. Centers for Disease Control and Prevention showed a peak of only about 6% (see chart). Studies found that this was because Google's algorithm did not fully account for new external factors. For example, increased media coverage of influenza and increased discussion of it on social media affected the service's data and statistics. The explosion of flu news changed what people searched for to a large degree. This calls to mind the classic "uncertainty principle" in physics. The physicist Bohr held that in quantum theory any observation of an atomic system changes the object being observed. Something similar applies to Google's algorithm: our own behavior may change under Google's observation. So just as a quantum system cannot be given a single definition, the trends in Google's predictions cannot be understood through ordinary causality.

Big data and Chinese philosophy

Now that big data occupies center stage in our information society, we need a new way of thinking to understand the world. The law of causality in the traditional view of knowledge faces a major challenge, and correlation frees our prediction of the future from our understanding of the past.

From the standpoint of the theory of knowledge, big data, like quantum mechanics, helps us approach the large-scale structure of the universe. Perhaps the concept of "qi yun" (the fortunes of qi) in classical Chinese philosophy makes it easier to understand the new world that big data reveals. Qian Mu explained it in his Popular Lectures on Chinese Thought: how does qi evolve into all things in the universe? Qi is dynamic, not still; it is always gathering and parting, uniting and dividing. "What gathers and unites is the yang of qi, called 'yang qi'. What divides and scatters is the yin of qi, called 'yin qi'." This interplay of yin and yang is what the Chinese call the Tao. All coming into being and passing away can be seen in the cycle of yin and yang, of growth and decline. In the industrial age, before big data, yin and yang could not offer the linear causal explanations of Western philosophy, and so tended to be associated with superstition and mysticism. With the rise of big data, humans have for the first time a direct tool for measuring the changes of yin and yang and predicting the rise and fall of qi yun. The doctrine of yin-yang and the five elements, with their mutual generation and mutual restraint, can play out directly in the successive iterations of Google's algorithms. If Shiller's theory of "animal spirits" really can predict the Austrian school's business cycle, then the yin-yang cycle revealed by big data may help people prepare for the next global economic crisis.

On a broader level, if every ordinary citizen is free to access big data analysis (rather than leaving it a government monopoly), a new way of thinking follows. Data is no longer the cold machine controlled by Big Brother in the world of 1984. Everyone can feed their individual factors into the system and influence its direction and decisions. All kinds of factors, such as risk, accident, love, indifference, and even error, can be reflected in the changes of yin and yang in big data. All kinds of human awareness and creation can be tested and explored more quickly through big data. For the first time, sparks of human inspiration can burst out through the many layers of big data. This will be a brave new world: human ingenuity can be fully discovered in big data!

For marketers, big data is an inexhaustible treasure. Every facet of human nature and every environmental influence, such as changes in weather and market sentiment, can show up in advertising analysis. User profiles will appear in real time, showing how to allocate and optimize media spending, how to design product attributes, and how to position precisely. This incredibly powerful tool will change many business decisions.

But can big data replace entrepreneurs? Products such as 360, Xiaomi, WeChat, and QQ may benefit from big-data-driven user profiles and rapid product iteration, but the entrepreneur's inspiration, courage to take risks, sensitivity to and feel for the market, and a little luck with timing become even more important. Every stage of extracting, applying, interpreting, and judging data poses a lasting challenge to human imagination.

Do all that is humanly possible and know the mandate of heaven: in this vast world, every rise and every fall lies within the qi yun of big data. Perhaps Duke's master's students in data analytics should also study some Chinese philosophy.
