"Zhongguancun Large Data Industry alliance" launched the "Big Data 100 Points" forum, at night 9 o'clock, in the "Zhongguancun Large Data Industry alliance" micro-letter group for 100 minutes of communication and discussion.
Bai: Today's speaker is researcher Cheng from the Chinese Academy of Sciences. Welcome!
Bai: He is deputy chief engineer, researcher, and doctoral supervisor at the Institute of Computing Technology, Chinese Academy of Sciences, and director of its Key Laboratory of Network Data Science and Technology.
He is the team leader and academic lead for Internet high-performance software and algorithm theory, network search, and network information security. He leads teams working on national cyberspace security, Internet high-performance software, the basic theory and algorithms of network search and mining, and the development of related application systems. He has presided over more than ten major national information security special projects, as well as major tasks under the National Basic Research Program (973), the National High Technology Research and Development Program (863), and the Chinese Academy of Sciences Knowledge Innovation Program.
Cheng: First of all, thank you, Teacher Bai, for presiding and for creating such a good environment in which we can brainstorm and share ideas. Second, the introductory talks given the day before yesterday by Teacher Bai and Professor Xionghui were novel and their views illuminating; I benefited a great deal.
Cheng: Last week, Guodong asked me to share our thinking from the perspective of the domestic academic research community. Frankly, my ability falls a bit short of the task, so I can only talk about it informally.
Cheng: As you know, since 2012, when we began organizing the Xiangshan Science Conference Big Data Forum and established the Big Data Expert Committee of the China Computer Federation (CCF), we have been advocating and calling for joint efforts to build a healthy ecological environment for China's big data industry. Many of the big names in today's group are direct advocates and actors in this effort.
Cheng: Over the past year or more, we have organized the China Big Data Technology Conference, the CCF Big Data Conference, and various application summits and academic forums, large and small. Drawing on that, and on the big-data research and the information analysis and Internet data analysis application development practice of our CAS Key Laboratory of Network Data Science and Technology, I will share some of my own thinking.
Cheng: Today's introductory discussion covers three big chunks: re-understanding big data; big data analysis technology and the engine platform systems that support it; and the basic problems in building a healthy big data ecological environment.
Bai: Engine platform system, singular or plural?
Cheng: It should be singular, hehe
@Reitao: Cheng, you have been given a big platform and the industry's microphone.
Cheng: 1. Re-understanding big data.
Big data is a broad concept, and opinions on it differ. The most commonly used definitions resemble the one in Wikipedia: "Big data refers to data sets so large that they cannot be captured, managed, processed, and organized within a reasonable time by current mainstream software tools into information that helps enterprises make better decisions." An obvious limitation of such definitions is that they describe the characteristics of big data only from the perspective of computer processing.
We know that understanding of a subject often begins with classification. When Darwin put forward the theory of evolution, his original motivation was to organize the animals and plants observed all over the world into a system; as that classification system was refined, a new worldview and epistemology eventually took shape. We are now concerned with problems in related fields such as network big data, financial big data, and scientific big data. Much as at the beginning of the European Renaissance, we observe phenomena from different fields and excavate their value, and eventually we may discover the essence, form a new "data epistemology", and produce value effects of an essential kind.
I personally think that "big data" embodies, above all, a kind of cognition and way of thinking; in essence it is very close to the "wisdom of great synthesis" advocated by Mr. Qian Xuesen. Qian rendered this as "wisdom in cyberspace", emphasizing "gathering the many into one, and turning what is gathered into wisdom." The four V's that characterize big data reflect a vast quantity of "scattered gold and broken jade": the pieces are correlated and interact with one another, but viewed in isolation they are fragmentary and their value is not obvious. So having data is not the same as having value, still less wisdom; the key to wisdom lies in the "gathering". All the facts, experience, and information contained in big data are the objects and content of this gathering. The raw data collected are often without apparent logic and cannot necessarily be interpreted directly with the science and technology we currently command; the data from every side must be integrated in order to dig out great value unknown to our predecessors. Each kind of data source has its limitations and one-sidedness; the essence and laws of things are hidden in the correlations among the various raw data. Only by integrating and fusing raw data from all sides can we reflect the full picture of things. To carry out big data research and applications, therefore, big data must be treated not only as a resource and a tool but as a strategy, a cognition, and a culture, and we should vigorously promote and establish a "data methodology" and "data values".
Of course, we should not only look up at the road ahead but also keep our feet on the ground. So while the concept of big data is flying high, we must seize the opportunity to dig out value, yet also think about its essence and not lose our direction in the noise.
Bai: For wisdom to become value, is there a link from "gathering" to "dispersing"?
Cheng: Yes, "set" generates Wisdom, "wisdom" disperses support for a wider "value"
@Dapan: Ideas about data ultimately have to be realized by systems.
@wuyj: Is there a point where quantitative increase turns into qualitative change?
Cheng: @wuyj Yes, from quantitative to qualitative change, and from "reductionism" to "systems theory"; the principle is similar.
Cheng: From the industry's point of view, current big data systems show three distinct features, which are related to the ten trends we released at the end of 2013.
(1) Efficient, deep analysis of big data requires dedicated systems
Against the background of rapidly growing data and applications, in order to reduce costs and achieve better energy efficiency, big data systems need to gradually break away from traditional general-purpose technology stacks and move toward specialized architectures and processing techniques. Domestically, the three Internet giants Baidu, Alibaba, and Tencent have tried this and achieved very good results. As we all know, Baidu's typical big data application is Chinese-language search, Alibaba's is data services based on analysis of transaction logs, and Tencent's is image data storage and real-time advertising recommendation based on user behavior. Baidu set up a dedicated big data unit at the end of last year to dig deeply into the value of big data; Alibaba has integrated big data technologies from different business units to provide unified services for its data products; Tencent's data platform division is consolidating the company's data into a unified management platform. In technology, Alibaba is the most closely connected with the open source community, Tencent's big data work is now moving closer to open source, and Baidu prefers in-house development and was the first to put customized software and hardware into practice. What they have in common technically is that they no longer rely on the traditional IOE stack (IBM, Oracle, EMC) but instead build on open source systems (such as Hadoop) to develop large-scale, high-throughput, low-cost, highly scalable specialized systems for their typical applications.
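To make the "building specialized systems on open source such as Hadoop" point concrete, here is a minimal word-count sketch in the Hadoop Streaming style. The file name and the simulated pipeline are illustrative assumptions, not something the speakers describe.

```python
#!/usr/bin/env python
# Minimal word-count sketch in the Hadoop Streaming style: the mapper emits
# "word<TAB>1" pairs and the reducer sums the counts for each word.
# Simulate the MapReduce job locally (file name is hypothetical):
#   cat input.txt | python wc.py map | sort | python wc.py reduce
import sys

def mapper():
    # Emit one "word \t 1" record per word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

def reducer():
    # Streaming delivers input sorted by key, so counts per word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```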
"2" large data processing architecture diversified mode coexist
Apache Hadoop, the open source counterpart of Google's GFS and MapReduce, has been widely adopted by Internet companies since 2008 and has become the de facto standard in big data processing. But Spark, which emerged as a dark horse in 2013, ended the myth of a single dominant big data technology. Because applications differ, the Hadoop software stack cannot satisfy all requirements; on the basis of full compatibility with Hadoop, Spark greatly improves system performance by doing more of the processing in memory. In addition, the emergence of Scribe, Flume, Kafka, Storm, Drill, Impala, Tez/Stinger, Presto, Spark/Shark, and so on is not meant to replace Hadoop but to expand the ecosystem of big data technology and push it toward a healthy and complete state. In the future, there will be more, better, and more specialized software systems at the non-volatile storage layer, the network communication layer, the volatile (in-memory) storage layer, and the computation framework layer.
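As one illustration of the "more processing in memory" point, here is a minimal PySpark sketch, assuming a local Spark installation; the HDFS path is hypothetical. Caching the computed RDD in memory lets the second query reuse it instead of re-reading and re-shuffling data from disk, which is where Spark gains over disk-based MapReduce.

```python
# Minimal PySpark sketch (assumes Spark is installed; input path is made up).
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

lines = sc.textFile("hdfs:///logs/access.log")          # hypothetical input
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

counts.cache()                                           # keep the RDD in memory

print(counts.count())                                    # 1st action: compute and cache
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))     # 2nd action: served from memory

sc.stop()
```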
"3" real-time computing gradually received attention from the industry
Google launched Dremel in 2010, leading the industry toward real-time computing. Real-time computing addresses the performance limitations of batch-oriented MapReduce and can be divided into two modes: stream computing and interactive analysis. In the big data context, stream computing originated in the real-time collection of server logs: Facebook's open source Scribe is a distributed log collection system, and Apache Flume is a similar system. Apache Kafka is a distributed messaging system characterized by high throughput and fault tolerance. Storm is a fault-tolerant distributed real-time computation system that can reliably process streaming data as it arrives, with single-machine performance reaching millions of records per second; Storm can integrate Apache Kafka as its queuing system. As a complement to batch computation, the goal of interactive analysis is to shorten the processing time of PB-scale data to the level of seconds. Apache Drill is an open source implementation of Dremel; it has seen some use but is not yet mature. The Cloudera-led Impala also follows the Dremel design while drawing on MPP (massively parallel processing) ideas, and is now close to practical use. Hortonworks leads the development of Tez/Stinger: Tez is a DAG computation framework running on YARN (the Hadoop 2.0 resource management framework), and Stinger is the next generation of Hive. At the end of 2013, Facebook open-sourced Presto, a distributed SQL query engine that can interactively analyze more than 250 PB of data, with performance up to 10 times that of Hive. The similar Shark is a SQL execution engine on Spark; benefiting from column storage and Spark's in-memory processing, Shark claims performance up to 100 times that of Hive.
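As a small illustration of the stream-computing mode (Kafka feeding a Storm-like consumer), here is a sketch using the third-party kafka-python client, assumed to be installed; the broker address, topic name, and log format are illustrative assumptions only.

```python
# Minimal stream-computing sketch: consume log records as they arrive from
# Kafka and maintain a rolling per-URL count, reporting periodically.
from collections import Counter
from kafka import KafkaConsumer

consumer = KafkaConsumer("server-logs", bootstrap_servers="localhost:9092")
counts = Counter()
seen = 0

for msg in consumer:                               # processes records as they arrive
    url = msg.value.decode("utf-8").split()[0]     # assume the first field is a URL
    counts[url] += 1
    seen += 1
    if seen % 10000 == 0:                          # periodic rolling report
        print(counts.most_common(5))
```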
Bai: Real-time computing, streaming data, complex event processing
Cheng: Those are rather technical details. The three features of big data systems are "engine specialization", "platform diversification", and "real-time computing".
@Dapan: Big data processing technology will not stabilize in the next few years; there are bound to be a variety of technologies and systems.
Bai: Another way of saying "not yet in a stable period" is "in a period of opportunity".
@A person with conscience: Is the algorithm the problem?
Cheng: Yes. That is the "one size fits all" dilemma we were just talking about, and it is also an excellent opportunity in the big data age.
Cheng: At this moment, open source is accelerating the pace of big data, and big data will in turn change the mode of open source.
Cheng: Algorithms divorced from big data support systems will become purely a "toy" game; of course, the basic theory of algorithms is an exception.
@Dapan: The advent of the big data age has shifted the value of IT toward the "data", draining value away from the "system". "Open-sourcing the system" will be easier than "opening up the data". The close link between open source and big data is inevitable.