"Don't be misled by big data, large data, which is more about data mining than it is about big numbers," he said. "In the tenth session of the National Information Technology Expert Forum, the Chinese Academy of Engineering, China Internet Association Director Hequan pointed out that large data needs more emphasis on data mining, and the existing technology applications, talent, security and privacy issues, the most critical is to have a national large data strategy, Make it an effective way to change the mode of economic growth.
The value of big data: like panning for gold from sand
The NBA has recorded players' on-court performance since the 1980s. After more than 30 years of accumulation, the data can identify each player's weaknesses on the court, making it easier for coaches to devise targeted tactics. At present, half of the 30 NBA clubs have hired data analysts; the teams with analysts have an average winning rate of 59.3%, while those without average 40.7%. This is the value of big data.
Where does the value of big data come from? Hequan believes that data volume and value are not necessarily proportional. For example, recording a person's physical data every minute is useful for knowing that person's health condition, but recording it every millisecond produces 60,000 times as much data without a corresponding increase in value. The value of data lies in the representativeness of the sample: recording one person's health data every minute may be less statistically meaningful than recording 60 people once an hour. Big data is usually low in value density; much of it may be duplicated, and ignoring part of it does not affect the mining result. In this sense, extracting the value of big data is like panning for gold from sand or fishing a needle out of the sea.
Microsoft Research found that 90% of Facebook's Hadoop jobs have input datasets smaller than 100GB, and Yahoo's average is 12.5GB. Beijing's transit smart card records 40 million bus swipes and 10 million subway rides a day; the accumulated data is only at the megabyte-to-gigabyte level per day and well under the terabyte class over a year -- modest data at best, yet mining it reveals how Beijing residents travel by public transport, which is valuable enough for optimizing the city's bus routes. Hequan therefore points out that small data deserves attention too: data mining is worthwhile even when the data is nowhere near the terabyte scale.
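As a rough back-of-envelope illustration of why this counts as small data, the sketch below multiplies the cited swipe counts by an assumed per-record size; the talk does not specify how each swipe is encoded, so the absolute figures are only indicative.

```python
# Back-of-envelope estimate of Beijing transit smart-card data volume.
# The per-record size is an assumption for illustration only; the talk
# does not say how each swipe record is encoded or aggregated.

BUS_SWIPES_PER_DAY = 40_000_000      # figure cited in the talk
SUBWAY_RIDES_PER_DAY = 10_000_000    # figure cited in the talk
BYTES_PER_RECORD = 20                # assumed: card ID, timestamp, station/line

daily_bytes = (BUS_SWIPES_PER_DAY + SUBWAY_RIDES_PER_DAY) * BYTES_PER_RECORD
yearly_bytes = daily_bytes * 365

print(f"per day : {daily_bytes / 1e9:.1f} GB")    # ~1 GB per day
print(f"per year: {yearly_bytes / 1e12:.2f} TB")  # well under a terabyte for the year
```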
Not all the data on the web is trustworthy.
Google's flu index, launched in 2008, tracked the H1N1 flu outbreak about two weeks ahead of the US CDC, with a correlation of roughly 0.9 with the CDC's own data. But in January 2013 the flu epidemic in New York was severe, the government declared a public health emergency, and the extensive media coverage changed what Google's users searched for; as a result the flu index produced a false positive, running well above the CDC's statistics. Inspired by Google's flu index, the University of Rochester in New York used Twitter data to try to predict the flu's effect on individuals 8 days in advance, with an accuracy of up to 90%. However, Twitter's users are mostly young people, whereas seasonal flu mainly strikes the elderly and children, whose resistance is weaker, so judging the flu from tweets alone is one-sided.
Hequan believes that not all data on the network is trustworthy. The authenticity of online data is mixed: false messages spread quickly on microblogs, while messages circulating in WeChat circles are hard to monitor, which challenges the management of information content. People used to think that "a picture proves the truth", but images can be doctored, misattributed, or placed out of time and context, or the photo itself may be genuine while the accompanying text is fabricated; such things have become commonplace. In view of the rule that a rumor forwarded more than 500 times can incur liability, some websites automatically cap the forwarding of every post, true or false, at 499; from the standpoint of gauging public opinion, this artificial truncation destroys the authenticity of the data.
He also pointed out that data collected by sensors is not always credible either: readings that deviate significantly from the sensor's own history, or from what other similar sensors report, should be discarded. Since data contains duplicates and is not all useful, filtering is required, which both simplifies storage and improves credibility. Heterogeneous, multi-source data can also be used to cross-check credibility, and collecting multi-source heterogeneous data helps in understanding the data. For example, an urban traffic monitoring system can track traffic flow in real time, but combined with government data and network data it may also reveal the causes of congestion.
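A minimal sketch of the kind of filtering described here, assuming a simple median-based plausibility check against the sensor's own history and its peer sensors; the threshold and record format are illustrative, not from the talk.

```python
# Illustrative sensor-data filtering: discard readings that deviate sharply
# from the sensor's own recent history or from comparable peer sensors,
# and drop exact duplicates. Threshold and data layout are assumptions.

from statistics import median

def filter_readings(readings, peer_readings, history, max_dev=3.0):
    """Keep only plausible, non-duplicate (timestamp, value) readings."""
    accepted, seen = [], set()
    reference = history + peer_readings          # multi-source reference values
    ref_median = median(reference) if reference else None

    for ts, value in readings:
        if (ts, value) in seen:                  # duplicate record -> skip
            continue
        seen.add((ts, value))
        if ref_median is not None and abs(value - ref_median) > max_dev:
            continue                             # implausible outlier -> discard
        accepted.append((ts, value))
        history.append(value)
    return accepted

clean = filter_readings([(1, 21.2), (1, 21.2), (2, 85.0)],
                        peer_readings=[20.8, 21.5],
                        history=[21.0, 21.1])
# keeps (1, 21.2) once; drops the duplicate and the implausible 85.0 reading
```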
Attention should be paid to data mining and utilization
The mining of big data deepens the application of information technology and gives rise to new applications and new industries; big data raises the level of intelligence in management and decision-making. Hequan proposed focusing on data mining and utilization: not only big data, but small and medium-sized data is also worth mining.
A larger volume of data is not necessarily harder to handle; it simply requires more servers, or faster ones. Hequan believes the main challenge of big data is real-time processing of rapidly changing data. Static data can be brought to the program for processing, but for dynamic data the program has to be brought to the data and run within the stream. The greater challenge of big data is its variety, especially unstructured data. Structured data can be handled with relational database technology, while unstructured data calls for NoSQL. A virtual storage platform for structured data can adopt dynamic tiering, automatically moving frequently used data to the highest tier according to how often it is accessed. For unstructured data a content archiving platform can be used; to integrate structured and unstructured data into a single dynamic archive architecture, a set of software and metadata rules is designed, and the data is labeled along different dimensions so that fuzzy queries become possible.
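As an illustration of the frequency-based tiering mentioned above, here is a minimal sketch, not the actual platform described in the talk, in which items accessed often enough are promoted from a cheap "cold" tier to a fast "hot" tier; the promotion threshold and the sample data are hypothetical.

```python
# Illustrative frequency-based tiering: frequently read items are promoted
# to the "hot" tier; rarely used items stay in the cheaper "cold" tier.

class TieredStore:
    def __init__(self, promote_after=3):
        self.hot, self.cold, self.hits = {}, {}, {}
        self.promote_after = promote_after       # assumed promotion threshold

    def put(self, key, value):
        self.cold[key] = value                   # new data starts in the cold tier

    def get(self, key):
        if key in self.hot:
            return self.hot[key]
        value = self.cold[key]
        self.hits[key] = self.hits.get(key, 0) + 1
        if self.hits[key] >= self.promote_after: # frequently used -> promote
            self.hot[key] = self.cold.pop(key)
        return value

store = TieredStore()
store.put("route-300", {"avg_daily_swipes": 120_000})
for _ in range(3):
    store.get("route-300")                       # the third access promotes it
assert "route-300" in store.hot
```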
Hequan pointed out that "big" is only one of the characteristics of large data, large data include ABC three elements: Large analysis (analytic), high bandwidth (bandwidth), large content. Real-time is the challenge of large data mining, but unstructured is the main challenge of large data mining. At present, the major international data processing technology is mainly structural data, it is said that more than 80% of large data is unstructured data. In 2012, Stanford University, in collaboration with Google, established a deep learning network to learn from tens of millions of video frames from YouTube, learn to recognize a cat's face for 10 days, and then look for a cat from 20,000 photos that are not seen, with an accuracy of only 15.8%. It can be seen that the mining technology of non-structural data has a considerable distance to practical. At home and abroad there are many successful examples of large data applications, but basically still structural data, the mining of structured data is a key point for large data applications.
The challenge of big data is ultimately strategic.
Hequan pointed out that the current problems with big data mainly involve technology and applications, talent, security and privacy, and development strategy.
The first is the problem of technology and applications. China has the largest population in the world, but the new data it generated in 2010 was only 250PB, just 60% of Japan's and 7% of North America's. Some of our departments and agencies hold large amounts of data but hoard it, preferring not to share it even with other government bodies, which leads to incomplete information or duplicated investment. China's stored data reached 364EB in 2012; 55% of it (200EB) requires some degree of protection, yet less than half of that (44%, or 96EB) is currently protected. There are still many gaps between China and developed countries in autonomous, controllable data analysis technologies and products. Domestic enterprises have a weak foundation in databases, data warehousing, and business intelligence and analysis software, and in big data in particular they lag far behind advanced foreign enterprises.
Second is the talent problem. Five hundred years ago Leonardo da Vinci could be a painter, musician, engineer and scientist at once; a hundred years ago a doctor could grasp every branch of medicine; today even a junior doctor must know about 10,000 diseases and syndromes, 3,000 drugs and 1,100 test methods. It is estimated that a practicing doctor would need to study 21 hours a day just to keep up with developments in the field. Gartner predicts that big data will create 4.4 million new IT jobs worldwide, along with a large number of jobs outside IT; McKinsey expects that by 2018 the United States will face a shortage of 140,000 to 190,000 people capable of in-depth data analysis, plus 1.5 million managers who understand data requirements and can apply the technology. In China, innovative talent that understands and can apply big data is a scarce resource.
Then there is the issue of security and privacy. Using big data first requires that government data be open in principle. The mining and use of big data must be grounded in law: China needs to enact an information protection law and an information disclosure law as soon as possible, encouraging society-oriented data mining that serves the public while guarding against invasions of privacy, and promoting data sharing while preventing data abuse. The hidden risks in security and privacy protection remain considerable; the storage and application of important data should not rely too heavily on big data analysis technologies and platforms, and the risk of information leakage must be taken seriously.
Finally, there is the question of development strategy. Informatization should shift its emphasis from hardware to software and pay more attention to the use of data; not only big data but also small and medium-sized data is worth mining. A national big data development strategy is needed. Big data is a strongly application-driven service whose standards and industrial structure have not yet taken shape, which is an opportunity for leapfrog development; but we should not follow the herd and build big data centers or pursue "data real estate" everywhere while the goals remain unclear. Instead, big data should be developed strategically and made an effective way to transform the mode of economic growth.