The worldwide information revolution has confronted almost every industry with big data problems. To meet the broad challenges posed by big data and to seize this unprecedented opportunity, industry, the research community at home and abroad, and even governments are actively drawing up strategic plans. In March 2012 the United States government announced a 200-million-dollar Big Data Research and Development Initiative, elevating the study of big data to a matter of national will, and the European Union has introduced similar initiatives. Against this background, the China Computer Federation (CCF) established the CCF Big Data Expert Committee in October 2012 to explore the scientific and technical issues of big data, promote the construction and development of the discipline, and build a platform for academic exchange, technical cooperation, and data sharing among big data practitioners.
Through the study of these hot issues, the CCF Big Data Expert Committee hopes to fully understand the key challenges and unique value of big data and to better guide the investment directions of academia, industry, and users. Through joint effort, we may achieve real breakthroughs on these problems and together obtain the great value of big data.
Hot issues in big data:
1. Data science and the disciplinary boundaries of big data
This problem combines two questions: the basic connotation of big data, and the scientific problems of data. The former concerns the definition and basic structure of big data. So far neither industry nor academia has agreed on a scientific definition of big data, and its intension and extension also lack a clear explanation. What key features distinguish big data from other data? IBM proposed the 3Vs: volume (large scale), variety (many modalities), and velocity (high speed). A fourth V, value, has also been proposed, indicating that big data, while of high overall value, has low value density. Whether big data implies complete data also needs further discussion and clarification. Finally, a formal, structured description method for dynamic, high-dimensional, complex big data is needed before corresponding data processing technology can be developed. The latter question concerns the correlations and differences between data, the physical world, and human society, and asks whether a data science independent of any application field exists. If it does, what is its disciplinary classification system? There is already a consensus that the complexity of big data comes mainly from the complex links among data. In addition, new learning theories and cognitive theories should form an important part of data science.
2. Basic models and paradigms of data computation
Many salient characteristics of big data render traditional methods of data analysis, data mining, and data processing inapplicable. Facing big data, we therefore need basic models of data-intensive computing and new computing paradigms, and we need to propose efficient methods of data computation and study basic theory such as the computational complexity of data. Because the volume of data is so large, and some data is inherently distributed and hard to gather for centralized processing, computation over big data must shift from a centralized, top-down mode to a decentralized, bottom-up, self-organizing mode. Moreover, big data will give rise to data-based intelligence, and we may need to look for "data volume + simple logic" methods to solve complex problems.
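The "data volume + simple logic" idea can be made concrete with a toy sketch: next-word prediction by raw bigram counting, with no linguistic rules at all. The tiny corpus below is an illustrative stand-in for a web-scale one; at scale, this kind of trivial counting often outperforms elaborate hand-built logic.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for a web-scale text collection.
corpus = "the cat sat on the mat the cat ate the fish".split()

# "Simple logic": just count how often each word follows another.
bigrams = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigrams[prev][cur] += 1

def predict_next(word):
    """Return the most frequently observed successor of `word`, or None."""
    followers = bigrams.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))  # 'cat' follows 'the' twice; 'mat' and 'fish' once each
```

With more data, the counts sharpen and the predictions improve, even though the "logic" never gets any smarter.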
3. Big data characteristics and data states
This problem combines three candidate questions: the complexity of big data along the relational, spatial, and temporal dimensions. Big data is typically generated from a large number of sources and often mixes images, video, audio, data streams, text, web pages, and other formats, so it is multimodal. Complex relationships exist among multimodal data, and these heterogeneous relationships are sometimes dynamic and reciprocal, which makes the correlation models complex. The spatial dimension concerns the generation, perception, and collection of big data in the ternary world of humans, machines, and things, as well as its transmission, movement, storage, and computation at different granularities; the theoretical and technical challenges of analyzing and processing big data whose spatial distribution and density are unbalanced also need study. The temporal dimension concerns the life cycle, states, and characteristics of big data over time, exploring stream analysis, incremental learning, and online recommendation over big data, as well as the timeliness requirements of offline versus online processing.
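Incremental processing along the time dimension can be illustrated with a minimal sketch: Welford's online algorithm maintains the running mean and variance of a stream in O(1) memory, so an unbounded stream never has to be stored or revisited. The class and values below are illustrative, not from the original text.

```python
class RunningStats:
    """Online mean/variance via Welford's algorithm: one pass, O(1) memory."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 6.0, 8.0]:  # stand-in for an unbounded sensor stream
    stats.update(value)

print(stats.mean, stats.variance)  # 5.0 5.0
```

The same one-pass pattern underlies stream analysis and incremental learning generally: the model is updated per arriving item rather than recomputed over the full history.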
4. Data transformation and value extraction of big data
This problem consists mainly of two questions: how to make big data smaller, and how to extract the value of big data. The former is to clean data without changing its basic attributes and to reduce its scale without loss of value. It therefore requires studying data transformation methods such as sampling, deduplication, filtering, compression, indexing, and metadata extraction, which make big data smaller and can be regarded as its "physical change". The latter can be regarded as the "chemical reaction" of big data: exploratory analysis and visualization come into play, and human-computer interactive analysis can bring human intelligence into the process, fermenting and refining the value of data through collective intelligence, social computing, and cognitive computing, realizing the leap from data analysis to data value judgment and data manufacturing.
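One of the "physical change" methods named above, sampling, can be sketched minimally with reservoir sampling, which keeps a uniform random sample of fixed size k from a stream of unknown total length in O(k) memory. The function and parameters below are illustrative.

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Uniform sample of k items from a stream of unknown length (O(k) memory)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # keep `item` with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=10)
print(len(sample))  # always exactly 10, regardless of stream length
```

The million-element stream is reduced to ten representatives without ever holding it in memory at once, which is the essence of making big data smaller.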
5. Security and privacy of big data
Wherever there is data, there are security and privacy issues. As data grows, big data faces significant risks and threats and must comply with a growing body of rules, and traditional data protection methods cannot meet these requirements. Big data security and privacy protection therefore pose a series of challenges, including: the ethics of big data computing; big data cryptography; secure computation in distributed programming frameworks; the trustworthiness of remote data computation; the security of data storage and log management; data mining and analysis under privacy and commercial-interest protection; mandatory access control and secure communication; multi-granularity access control; and the credibility of data sources and data channels.
6. The challenge of big data to IT architecture
This problem is a new reading of the hot issue "big data systems". Big data places very stringent demands on systems, whether storage, transmission, or computing, and existing data center technology can hardly meet them. For example, storage capacity is growing far more slowly than data, so designing the most reasonable tiered storage architecture has become key to information systems. Distributed storage architectures need not only scale-up but also scale-out scalability, so a revolution in the overall IT architecture is imperative. In addition, the big data platform (computing, transmission, and storage platforms, among others) is a bottleneck in the big data technology chain; the high-speed transmission of big data in particular calls for revolutionary new technology.
7. Big data applications and the industry chain
Most members of the Big Data Expert Committee agree that research and application of big data must be combined with domain knowledge. Especially in the initial stage of big data research, computer scientists and engineers must consult openly with experts in all fields to truly understand and become familiar with the characteristics of data in each domain. Different domain environments and application needs call for different methods of acquiring, analyzing, and feeding back big data. We therefore need to expand research on data characteristics and business characteristics, analyze the classification of big data applications and their technical requirements, and build a full-lifecycle application model running from requirements analysis and business modeling through data modeling, collection, and feedback to data analysis. In fact, different application environments and goals represent different value orientations, which greatly affect the value density of big data.
8. The ecosystem of big data
Big data, the "new oil" of the 21st century, is a valuable strategic resource, and its sharing and management are part of its ecosystem. For sharing and management, ownership is the foundation; this is not only a technical problem but also a matter of legal principle. The rights and interests in data need to be clearly identified and protected so that the interests of all parties are safeguarded and the problem of data sharing can be solved. Many obstacles may arise along the way, including concerns about law and trust, the need to protect competitiveness, and storage locations and formats that hinder data access and transmission. The ecosystem also involves the intersection of politics, economics, society, law, and science. Because big data will profoundly affect national governance models, enterprises' decision-making, organization, and business processes, and individuals' lifestyles, these modes of influence deserve deep study.
Big data development trends for 2014
1. Data as a resource
This candidate trend received the most attention from committee members. Treating data as a resource means that big data becomes an important strategic resource at the level of enterprises, society, and nations. In 2014 big data will become a new strategic high ground that everyone scrambles for; it will continue to be an asset of institutions and a powerful weapon for enhancing the competitiveness of institutions and companies.
2. Big data privacy issues
Big data poses a major challenge to privacy. Existing privacy laws and technical means are ill-suited to the big data environment, and personal privacy is increasingly difficult to protect; paid privacy services may emerge, and data "masking" will become popular. Standards and regulations on big data privacy are expected to be enacted in 2014.
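The data "masking" mentioned above can be sketched minimally: direct identifiers are pseudonymized with a salted hash and quasi-identifiers are coarsened, so records remain analyzable without exposing the individual. The field names, salt handling, and coarsening rules below are illustrative assumptions, not a prescribed scheme.

```python
import hashlib

SALT = "demo-salt"  # illustrative only; real deployments manage salts as secrets

def mask_record(record):
    """Pseudonymize the name, bucket the age, and truncate the phone number."""
    masked = dict(record)
    masked["name"] = hashlib.sha256((SALT + record["name"]).encode()).hexdigest()[:12]
    masked["age"] = f"{record['age'] // 10 * 10}s"   # 37 -> "30s"
    masked["phone"] = record["phone"][:3] + "****"   # keep only the prefix
    return masked

print(mask_record({"name": "Alice", "age": 37, "phone": "1385550000"}))
```

Coarsened fields like the age bucket still support aggregate analysis, which is the trade-off masking aims for.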
3. Deep integration of big data and cloud computing
Big data processing is inseparable from cloud computing, which provides elastic, scalable infrastructure and efficient patterns of data service for big data, while big data in turn provides new business value for cloud computing; starting in 2013, the two technologies have been entering a period of ever closer integration. More generally, emerging computing paradigms such as cloud computing, the Internet of Things, and mobile networks are both places where big data is generated and fields that need big data analysis methods.
4. Intelligence based on massive data (knowledge)
In 2014 there will be more intelligent achievements based on massive data (knowledge), possibly even an artificial brain; at the least, problems such as question answering in Chinese may be completely solved. Since almost any question a person can think of has been asked and answered before, even without reasoning and logic the accumulated experience of predecessors can serve the function of a brain, and answers may even be inferred directly from big data.
5. Revolutionary methods of big data analysis
In 2014 revolutionary new methods of big data analysis will appear. Like the computer and the Internet, big data may amount to a new technological revolution. Data mining, machine learning, and artificial intelligence based on big data may change many algorithms and basic theories developed for small data and small worlds, likely producing theoretical breakthroughs.
6. Big data security
The security of big data is worrying, and protecting it grows ever more important. As big data accumulates, the requirements for the physical security of data storage rise, which in turn demands more of data replication and disaster-recovery mechanisms. Since 2013, the Internet and digital life have made it easier for criminals to obtain information about people and harder to track and prevent their methods; more sophisticated scams may appear in which, in effect, big data has sold you out.
7. The rise of data science
In 2014 data science will emerge as a new discipline related to big data, with dedicated doctoral, master's, and even undergraduate degree programs in data science, along with a large number of data science publications.
8. Data sharing alliances
Data sharing alliances will gradually become a core part of the industry in 2014. Data is the foundation: with earlier support from the Ministry of Science and Technology, data sharing platforms have been established in a number of fields, including meteorology, earthquakes, forestry, agriculture, oceans, population and health, and Earth system science. Data sharing will then extend to the enterprise level.
9. New big data occupations
Big data will spawn new occupations in 2014, such as data analysts and data scientists. Experienced data analysis talent will become a scarce resource, and data-driven job opportunities will grow explosively. Some of the best scientists in the big data field are already turning to stocks, futures, and even gambling (where they can see more than anyone else).
10. Bigger data
Today's big data will not be big enough in the future. In 2014 big data will receive more attention, research, development, and application; as a result, its characteristic Vs (large volume, high velocity, many modalities, and low value density) will become even more extreme. In particular, the value density of big data will keep falling: as data grows, removing noise such as garbage data and then mining and extracting valuable information becomes ever more difficult.