On October 29, 2012, the final award ceremony of the "China Cloud Mobile Internet Innovation Grand Prix" and its accompanying Innovation and Entrepreneurship Forum were held at Beihang University. At the meeting, Huai Jinpeng, co-chairman of the China Cloud Industry Alliance, academician of the Chinese Academy of Sciences, and president of Beihang University, delivered the keynote report. He said the big data era faces three major challenges: 1. software and data processing capability; 2. resource and shared management; 3. trustworthy data processing capability. An on-site transcript follows.
Huai Jinpeng: co-chairman of the China Cloud Industry Alliance, academician of the Chinese Academy of Sciences, president of Beihang University
Huai Jinpeng, academician of the Chinese Academy of Sciences: Respected guests, greetings to everyone, especially Mr. Lurkey, who came from the United States to take part in this forum, and our distinguished guests Mr. Benefit People and Robin Li. I will speak quickly today and leave more time to Mr. Lurkey, because he has to catch a plane this evening; that way he will have more time to introduce the exciting developments in this field and his thinking about its future.
Internet technology, as we all know, pursues the goal of being stronger, faster, and higher. In the past, microelectronics created many new opportunities: Moore's law and advances in communications technology gave us excellent transmission "pipes," and computing and storage capabilities kept strengthening supercomputers and storage systems. But now, because of new forms of development and barriers within the technology itself, relying entirely on the traditional mode of growth has run into serious limitations. The most important change is that the Internet has become a fundamental part of our infrastructure.
A recent book calls this the fifth technological revolution, or the third industrial revolution. Without exception, Internet applications have merged with social life and become one of the most important areas of development. From the current development and actual operation of large enterprises and IT vendors, we can see that data has indeed become an important infrastructure for strategy and economic development. This also benefits from the rapid development of information technology, which has led to the new work and new exploration around data and services that we are now focusing on. The overall volume of data has grown enormously. Looking at global data growth, roughly 90% of today's digital content was produced only recently, a huge change compared with ten, twenty, or sixty years ago. Facing such a large data space, we meet new challenges. For example, in 2007 Facebook used a data warehouse to store about 15 terabytes of data, but by 2010 the data being compressed every day was more than four times that former total, more than a data warehouse could hold, while commercial parallel databases rarely exceed 100 nodes. Yahoo's Hadoop cluster now has more than 4,000 nodes, and Facebook's warehouse has more than 2,700 nodes. Large-scale data applications also appear in scientific computing and medical data. This means that a great deal of data is now real time, and it is beginning to affect our work, our lives, and even our economy.
Therefore, some people say we have moved from the era of the capital economy into the era of the digital economy. In particular, we have seen the virtual world, the physical world, and human society become linked more closely than ever before. Some scholars point out that storage and processing capacity, which roughly doubles every 18 months, has begun to lag behind the growth of data, and this has become the biggest bottleneck our information society now faces. Within this bottleneck, past data, mainly commercial data, was deterministic; today's data is uncertain, and much of it arrives in real time. In terms of data processing capability, we have been exploring for the last decade: grid computing for scientific computation, peer-to-peer computing at the edge, and, in recent years, the very popular smart Earth, smart cities, and the Internet of Things.
In recent years, real-time, large-scale cloud computing has become a concentrated area of exploration: whether it will be the main way to handle massive content in the future is still an open question. From whichever angle, cloud computing addresses the problem openly: how to raise the capability for intelligent processing of massive data. But the same technical problems arise: first, data management capability; second, processing capability; third, highly reliable and secure services. It is precisely the limits of these three capabilities, and the space they leave for development, that bring new opportunities for data processing, because data is now closely tied to the economy and society.
As we said before, the three past paradigms of scientific research, from experiment to theoretical analysis to computation, have been the basic means of scientific research and major discoveries. Now another paradigm has emerged, the so-called data-intensive paradigm, which is already influencing research and production; it is not too late to say that a fourth paradigm is supporting new scientific development. On the application side, cloud computing and the mobile Internet hope to build integrated systems in the virtual world of the Internet, such as a cloud computing or virtual computing environment, so that all resources and data, including traditional data, can be shared by people and used to create new knowledge, forming a more effective integrated environment and development space. What is big data? Many people now describe it along four dimensions: large volume, many types, fast change, and low value density; unlike in the past, when the value density of hand-crafted data was much higher than what we see today.
Such low value density, which common sense says can still create enormous value, is a far greater challenge than the ordinary analysis of unrelated data. At the same time, the update speed is very fast. Commercial data has a retention period and is time-sensitive; today's data, the web pages and news we see, the constantly duplicated data, the health and education data being updated in huge volumes, may be unimportant at any single moment, but with long accumulation and cross-sectional combination a new spatio-temporal view of the data appears, and the creativity this value brings us is, I think, unprecedented for big data. Let me give an example I have used before: a cafeteria serves 2,000 people, and suddenly there are 200,000 people; how do you meet basic needs and then gradually improve? With so many more people to feed, the simplest way to keep everyone going is cabbage stewed with tofu: boil the water, add the tofu, add the cabbage, and before long you find you have formed a new Ford-style production line, with process management and production-line management; a data production line is taking shape. This new form is appearing in different specialized areas: all kinds of vertical platforms, and unified horizontal platforms that integrate common processing models, are being created. This creation is in fact the cloud computing model, which emphasizes a new service and application model centered on the data center, establishing a new, mutually beneficial relationship between developers and operators. The goal is not the high performance of past commercial and scientific data processing, but a new performance-to-price ratio: not the highest quality, but able to be processed; not perfectly accurate, but basically usable. So the new problem of low value density combined with growing data volume marks the stage where data develops at scale, and this stage is also the dream of computer people, the so-called "everything through computation." We used to build data models of the physical world through simulation, support development with high-performance computers, and build intelligence into our equipment through embedded systems, what we call wearable computers and embedded systems; and the Internet gives us better ability to communicate. This is the line of thought that a Turing Award winner spoke about; following his thinking, it can be summarized through these three characteristics.
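As a rough illustration of the "not perfectly accurate, but basically usable" idea, here is a minimal sketch, my own example rather than anything from the talk, of estimating an aggregate over a large data set by random sampling instead of an exact scan.

import random

def approximate_mean(values, sample_size=1000, seed=42):
    """Estimate the mean of a large collection by sampling.

    Trades exactness for speed: the answer is approximate, but
    usually "basically usable", which is the trade-off described
    for low-value-density big data.
    """
    rng = random.Random(seed)
    if len(values) <= sample_size:
        sample = values
    else:
        sample = rng.sample(values, sample_size)
    return sum(sample) / len(sample)

if __name__ == "__main__":
    # Ten million synthetic "measurements"; an exact mean would scan them all.
    data = [i % 97 for i in range(10_000_000)]
    print("approximate mean:", approximate_mean(data))
    print("exact mean:      ", sum(data) / len(data))

The design choice is the one the talk hints at: accept a bounded loss of accuracy in exchange for answers within an effective amount of time.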
What did past business computing and scientific computing bring? Scientific computing solved the Turing machine and algorithms, laying the theoretical foundation of computer science. Business computing realized the management of processes, with workflow as one representative. What social computing, computing in the big data setting, will bring is not yet clear. The scientific-computing era drove the strong development of operating systems, which manage the resources below. Business computing drove the development of databases. For large scientific data, for big data, what is the core problem? It is not yet clear.
Therefore, because of this kind of social computing, our current mathematical models, software, and system capabilities may all change in completely different ways. So let me say how I understand it.
The first big challenge is software and data processing capability. Because of the complexity and scale of software, Internet-based applications, and the uncertainty of data, the mathematical logic of software studied in a closed world is still valid, but the world has become more open and dynamic. For example, for data models and processing, when the input is massive data, how do we produce output and still find answers to our questions? Past algorithm theory asked whether something is computable, that is, whether a computer can process it at all, and how well. Now, with traditional measures of computational complexity, we cannot even look at all of the big data. So how to find approximate algorithms, and approximation within an effective amount of time, is the new scientific problem posed by data at this new scale.

Why can't a traditional commercial database do this? First, it is licensed and charged for, and the price is extremely high; maintaining an open-source database can cost even more than buying a license. Managing traditional data costs on the order of 10,000 dollars per terabyte, while a Hadoop system costs on the order of 500 dollars per terabyte. What accounts for most of the difference? The traditional database follows the scale-up model: performance is improved by continually expanding the CPU, storage, and so on of a single system, the traditional parallel computing model. Big data, by contrast, is dispersed across the Internet, and computing and service capability is added in a distributed, dynamic, low-cost way. This approach is itself a new challenge: for software, what model can adapt to this kind of growth? We know that Hadoop and graph processing offer basic programming models that go far beyond our past programming languages and beyond how we used to design systems. This new way raises new questions about minimizing latency and keeping task operations as simple as possible. At the same time new characteristics appear: because the system consists of distributed scale-out nodes, scalability is used to improve productivity and throughput, and the system is kept running through new approaches to fault tolerance and reliability. An Internet-scale system no longer follows the "shortest board" (weakest link) principle in which every node must be top-grade, so fault tolerance has changed in new ways.

In this field we see models changing software. At the same time, in data science, manual analysis of commercial data, the basic approach of the past, looks more and more pale in the face of large scientific data: if past data handling was manual, agricultural-society work, we have now entered the industrialized society of data. The basic mathematical and physical tools of this industrialized stage are statistical physics, experimental physics, and the stochastic processes we studied in the past. Establishing new processing methods, building on computing and on mathematical statistics and algebraic systems under limited conditions, is becoming ever more important. As a result, the tools we use to deal with this kind of data have changed a great deal.
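To make the "simple programming model" point concrete, here is a minimal sketch, my own illustration rather than anything from the talk, of the MapReduce style that Hadoop popularized: the programmer writes only a map function and a reduce function, and the framework handles distribution, shuffling, and fault tolerance across the scale-out nodes.

from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) for every word in one input record."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    would do between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all counts for one word."""
    return (key, sum(values))

if __name__ == "__main__":
    documents = ["big data needs simple models",
                 "simple models scale out on big clusters"]
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    print(counts)  # e.g. {'big': 2, 'data': 1, 'simple': 2, ...}

In a real Hadoop deployment the same two small functions would be distributed over thousands of nodes; the simplicity of the model is what makes that distribution and fault recovery tractable.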
Yesterday Mr. Lurkey and I also discussed this: in many important enterprises, statistical science and experimental methods have become the most important means, for discovering new drugs, for analyzing people's habits and reading behavior, for business models; a large amount of unified analysis emerges from here. As I have mentioned before, in the past we used 500,000 sentences to learn spelling correction, speech, or the understanding of text and sentences; now we use 500,000, then 5 million, then tens of billions of sentence groups. Approaches that once fell short now work as a large-scale, new, industrialized, data-based processing capability. This calls for a new theory of data science and poses new challenges to algorithms, computational methods, and new search engines. It is a big opportunity for academia. File systems, the data Internet, and search approached from different angles, from the details to the whole, from the local to the whole system, all bring new opportunities. But this also brings a problem: although the density is low and the value per item is low, data quality remains a persistent problem, and how to ensure data quality, this new notion of quality, is different from past data processing.
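As one hedged illustration of what routine data-quality work can look like in practice (my own minimal example, not something described in the talk), the sketch below drops duplicate and obviously malformed records before any statistical analysis runs, the kind of cleaning that becomes unavoidable when value density is low.

def clean_records(records):
    """Very small data-quality pass: drop exact duplicates and
    records that are missing required fields or have bad values."""
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec.get("user_id"), rec.get("timestamp"))
        if None in key:
            continue                      # missing required field
        if not isinstance(rec.get("value"), (int, float)):
            continue                      # malformed measurement
        if key in seen:
            continue                      # exact duplicate
        seen.add(key)
        cleaned.append(rec)
    return cleaned

if __name__ == "__main__":
    raw = [
        {"user_id": 1, "timestamp": 100, "value": 3.5},
        {"user_id": 1, "timestamp": 100, "value": 3.5},    # duplicate
        {"user_id": 2, "timestamp": 101, "value": "bad"},  # malformed
        {"user_id": None, "timestamp": 102, "value": 1.0}  # missing field
    ]
    print(clean_records(raw))  # only the first record survives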
So for the new big data, software and data processing capability becomes the most important challenge, and also a future means of development for research in other disciplines. The second challenge concerns resource and shared management: so many resources must be coordinated to continuously support the new requirements of the scale-out model, and there are many problems in how to treat storage and data as managed public resources serving different kinds of applications. As you know, the environment of a site, or of some particular setting, affects the ability of the system to survive and to scale.
This capability affects more than ordinary applications: energy and data management matter because the value at stake is so high, so energy consumption has become an important issue. The most important question here is whether future resource management will be more unified and systematic, or whether each vertical domain will keep its own management system; the so-called unified operating system has become the most contested issue. How to manage data and manage resources well becomes the key content. Solving it in this way may create a new path for the Internet: the emergence of data and service operators. Users are the creators of data, service software provides all kinds of services, and everything imaginable or digitizable can be offered as a service. Data and service operators may therefore become, like telecom operators, important and rapidly developing players. The emergence of this model may also matter greatly for the Internet and for the development of the mobile Internet, solving the diverse and growing problems of processing and storage.
The third issue is the trustworthiness of data processing: integrating cloud security monitoring and system recovery, and further developing high-reliability capability. With the development of technology, security and trust problems accompany every major system application, but here they are truly important. Not only is there a great deal of low-value data; privacy data is also important. In the big data era, I think the distribution, heterogeneity, and rapid dynamic change of data, together with the quality of what individuals hold, mean that computability, manageability, and trustworthiness together form the three new classes of problems of the big data era, and we need new means to address these three typical scientific questions.
My understanding is that over so many years, for decades, software development has placed data processing at the center of the computer; everything born here has data processing at its core. But today we have gone beyond the simple data of the past. In the 1980s software became a commodity; the 1990s brought the second revolution, simple, basic, and important information services. Now there is a new development in which data itself creates value, not merely an application or the accumulation of information. So being data-centric will give us opportunities. Looking at past IT development, even though an application may last for some time, the window between a technology breakthrough and its new application carrier is not very long. Therefore theoretical and technical innovation and sustainable development will bring us opportunities. At the same time, innovation in application models is even more important, especially innovation in IT. In fact, one principle that keeps being verified is that simplicity wins: Hadoop is a simple programming model, and keeping it concise is what makes it most effective in our IT field.
Therefore, in this field, young students and young people: your minds have not yet been fenced in, and there is a great deal of space in which to create, so this is where the greatest opportunities for development lie. Thank you!