A Scientific and Rational Look Behind the Mysterious Veil of Big Data

Source: Internet
Author: User
Keywords: big data, data mining, traditional

Several important viewpoints on big data

[Figure: Basic structure of the big data industry chain. Source: compiled from the Shanghai Institute of Scientific and Technological Information.]

The rise of the big data concept has attracted a good deal of controversy. Some call it "old wine in a new bottle," and others argue that the opportunity it represents has been exaggerated. Both reactions stem from not really understanding the nature of big data. Everything develops according to its own objective laws, and big data is no exception: it did not spring from a stone like the Monkey King. It has "biological parents" of its own, namely computer science and data science. The combination of the two, together with the deepening datafication of fields such as life science, geography and even the social sciences, has given big data its unusual "genes." Moreover, as the Internet industry matures and concepts such as the Internet of Things, cloud computing and data-driven innovation are put into practice, big data will be applied ever more widely, and its potential to bring about change will be limitless.

Several important judgments and viewpoints about big data follow.

Big data thinking originates from data mining and goes beyond it. Data mining can be described as a "close relative" of big data. Data mining uses computers to discover hidden knowledge and patterns in massive data sets. It is an interdisciplinary field that draws on computer science and statistics, and its core theories (artificial intelligence, machine learning and pattern recognition) made remarkable progress when applied to knowledge management in the 1990s. In essence, the "great shift in thinking" attributed to big data, along with some data-driven business intelligence (BI) innovations, is an extension of data mining theory. It may be more accurate to describe it as "the shift in thinking that data mining brought relative to mathematical statistics." For example, causality is a central concern of mathematical statistics; it rests on well-developed mathematical theory, and its representative tool is the regression model. Correlation is a central concern of data mining; it rests on powerful machine computation, and its representative tools are neural networks and decision tree algorithms. This makes it possible to obtain good analytical and predictive results without understanding the complex causal logic behind the data. However, data mining is typically oriented toward structured data, whereas big data also covers data collection, extraction, transformation and storage, and must deal with unstructured data.
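The contrast between a model with an assumed functional form (regression) and an algorithm that searches the data for structure (a decision tree) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data points are invented, `fit_line` and `fit_stump` are hypothetical helper names, and the "tree" is reduced to a single split. Neither fit establishes causation; both only summarize the data they are given.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, using the closed-form solution."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b  # (intercept, slope)


def fit_stump(xs, ys):
    """One-split regression tree: try every threshold, keep the lowest squared error."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]  # (threshold, mean below threshold, mean at or above it)


# Invented data with a step at x = 4: a straight line fits it poorly,
# while the search-based stump recovers the step without being told about it.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]

intercept, slope = fit_line(xs, ys)
threshold, left_mean, right_mean = fit_stump(xs, ys)
print(intercept, slope)
print(threshold, left_mean, right_mean)
```

The regression assumes the shape of the relationship in advance and estimates two parameters, while the stump assumes almost nothing and lets computation find the split.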

Big data breakthroughs come mainly from technological innovation, which shows up as adaptation to and application of the characteristics of Variety, Volume and Velocity. First, stored data has expanded from structured to semi-structured and unstructured forms such as web pages, documents, reports and multimedia, which has driven the creation and development of a set of specialized mining algorithms for unstructured data. Second, databases have expanded from relational to non-relational and distributed designs. A relational database organizes structured data into tables of rows and columns, much like an Excel spreadsheet; its drawbacks are limited storage capacity and poor scalability and support for diverse data, which the new non-relational, distributed databases can make up for. Third, data processing has expanded from static to real-time and interactive. New large-scale distributed parallel processing technologies can handle, in real time, the large volumes of interactive data generated by social media and Internet applications, and so respond effectively to the demands of diversity, complexity and timeliness.
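The difference between the two data models can be illustrated with Python's standard library. The sketch below is an assumption-laden toy: the `users` table, the sample records and the `find_with_field` helper are all invented, an in-memory SQLite database stands in for a relational system, and a list of JSON strings stands in for a document store. It shows only the data-model difference, not distribution or scale.

```python
import json
import sqlite3

# Relational: every row must fit the fixed columns declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ann", "Shanghai"), (2, "Bo", "Beijing")],
)
relational_rows = conn.execute("SELECT name, city FROM users ORDER BY id").fetchall()

# Document style: each record is a self-describing JSON document, so records
# may carry different fields (semi-structured data) without a schema change.
documents = [
    json.dumps({"id": 1, "name": "Ann", "city": "Shanghai"}),
    json.dumps({"id": 2, "name": "Bo", "tags": ["video", "blog"], "profile": {"lang": "zh"}}),
]


def find_with_field(docs, field):
    """Return the decoded documents that contain the given top-level field."""
    return [d for d in map(json.loads, docs) if field in d]


tagged = find_with_field(documents, "tags")
print(relational_rows)
print(tagged)
```

Adding a `tags` field to the relational table would require altering the schema for every row, whereas the document collection accepts the new field for one record only.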

Technological innovation has directly contributed to the realization of value. Thanks to the technologies above, data mining theory gained data volumes and processing power that grew geometrically, so that many ideas and methods that previously could not be verified can now be put into practice. For example, traditional business intelligence (BI) analysis includes a "centralization" step: large amounts of data must be extracted and centralized to form a complete data warehouse before analysis can begin, and this step is often the capacity bottleneck of the entire BI process. BI analysis based on distributed big data technology does not need to centralize the data, which greatly improves agility and the level of intelligence. This in turn has promoted major breakthroughs in machine learning, semantic processing and other fields, and directly contributed to the arrival of a batch of products such as the Mahout machine learning algorithm library and the Siri voice assistant.
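One way to see why analysis need not centralize the raw data is that many aggregates can be computed from small per-node summaries. The following Python sketch assumes three hypothetical partitions of invented sales figures; `partial_aggregate` and `combine` are illustrative names, not part of any particular BI product.

```python
def partial_aggregate(rows):
    """Runs where the data lives: reduce local rows to a small (sum, count) summary."""
    return sum(rows), len(rows)


def combine(partials):
    """Runs on the coordinator: merge the summaries into a global mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Three hypothetical nodes, each holding its own slice of the data.
partitions = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]

# Distributed path: only one (sum, count) pair per node crosses the network.
distributed_mean = combine([partial_aggregate(p) for p in partitions])

# Centralized path: every raw row is first copied to one place.
all_rows = [x for p in partitions for x in p]
centralized_mean = sum(all_rows) / len(all_rows)

print(distributed_mean, centralized_mean)
```

Both paths give the same answer, but the distributed one moves a fixed-size summary per node instead of every row, which is the bottleneck the centralization step creates.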

The potential for realizing value lies mainly in open-data strategy and the data-driven paradigm. At the strategic level, data processing is shifting from closed, disconnected and static to open, massive and real-time. This shift has led new patterns and models such as communities, crowdsourcing and grids to flourish, and on that basis it will encourage the rise of movements for organizations to open their data and share it publicly. At the level of research paradigm, scientific research is moving from deduction-driven to data-driven. Research-intensive fields such as genetics and health have begun to expand into data research, and many traditional disciplines such as history and literature have begun to experiment with data analysis techniques. These major changes have yet to reach scale, however, and the main beneficiaries of big data at its current technological level remain the Internet industry and the various Internet-based business models. Until the penetration of information infrastructure, social openness and the integration of networked intelligent interaction technologies reach a certain level, the application of big data will remain limited, and it cannot become "omnipotent" across society.

Internet companies are the drivers and direct beneficiaries of the value currently being realized from big data. Because the development of the Internet played an important role in the rise of the big data concept, many well-known Internet companies have mastered core data technologies and launched key products and services. Google, for example, developed the "three cores" of big data: a file system (Google File System), a processing algorithm (MapReduce) and a distributed database (BigTable), creating the mainstream framework and paradigm for data development worldwide. Yahoo, building on Google's algorithmic ideas, improved the open-source Hadoop framework and opened it to a wide range of companies and entrepreneurs, fostering a growing industrial ecosystem. Companies such as Amazon and Facebook have developed various functional tools on top of this framework and apply data to consumer products to improve the user experience. IBM and other traditional IT companies focus more on downstream applications in the industry chain, providing system solutions to industry customers. These companies not only derive considerable revenue from new technology products and services, but also benefit from the data resources they hold.
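The MapReduce idea mentioned above can be sketched as a single-process word count in Python. This is a teaching sketch only: it is not Google's or Hadoop's API, the function names are invented, and a real framework would run the map and reduce phases in parallel across many machines with the shuffle handled by the system.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Chain the three phases over a collection of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big cloud", "data data"])
print(counts)
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, both phases can be spread across machines, which is what makes the model suit very large data sets.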

Big data will help clarify the value of cloud computing. In the years just after the concept of cloud computing was put forward, many enterprises and users doubted its practical value. With the emergence of big data, the value of cloud computing has once again drawn public attention. Cloud computing helps solve the problem that big data cannot otherwise be captured, managed and processed; it gives big data the storage and computing capabilities it needs, producing results faster and more intelligently. It is foreseeable that cloud computing will become the most active arena for big data application and analysis. By the same token, big data gives cloud computing's large-scale, distributed computing power room to be put to use, solves problems that traditional computers cannot, and further clarifies the value of cloud computing.

Beware of big data supremacy. An important claim made by big data advocates is that, because it works on the full data set, big data analysis will surpass traditional mathematical statistics in accuracy, and causality will be replaced by correlation. The reality is not so optimistic. On the one hand, traditional mathematical statistics, after 400 years of development, is not outdated and still plays an important role in economic and social life. Sampling, for example, is an old and mature statistical method: if the goal is clear and the method sound, its conclusions are correct in most cases and no worse than those drawn from the full data set. Objectively speaking, the value of full data shows up mainly where some basic assumptions of traditional statistics may fail. The Internet's "long tail" phenomenon, for instance, means that the normal distribution and the Pareto principle no longer apply in certain areas, and there one must rely on the full data set to find the pattern. On the other hand, the "noise" that comes with full data sometimes hurts accuracy. For example, Google Flu Trends, once hailed as a model big data case, has recently fallen into a trough, with an error rate of more than 90 percent and a failure to predict major outbreaks such as A(H1N1). Its core logic is that the number of people searching for "flu" correlates with the number of people who actually have the disease. In fact, 80 to 90 percent of the people who go to the hospital for the flu do not actually have it, so there is still a large gap between surface-level web search behavior and reliable sources of information, and a "denoising" process is needed. Many experts believe that, for the moment, correlation is not enough to replace causation and can only complement it.
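The point about sampling can be checked with a small simulation. The Python sketch below makes several assumptions that are not in the original article: the "full" data set is 100,000 synthetic values drawn from a normal distribution with mean 100 and standard deviation 15, the sample is a simple random sample of 1,000 records, and the seed is fixed only for repeatability.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "full" data set: 100,000 synthetic measurements.
population = [random.gauss(100, 15) for _ in range(100_000)]

# A simple random sample of 1 percent of the records.
sample = random.sample(population, 1_000)

full_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
gap = abs(full_mean - sample_mean)

print(f"full: {full_mean:.2f}  sample: {sample_mean:.2f}  gap: {gap:.2f}")
```

For well-behaved data like this, the sample estimate lands close to the full-data value at a fraction of the cost. The same experiment on heavily skewed, long-tailed data would likely need a far larger sample to be reliable, which is consistent with the case for full data made above.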
