In the internet age, big data is a hot topic: many people talk about processing data, but few can say what big data really is, let alone how to use it to mine real business value. How should big data be defined? What are its characteristics? This article aims to clarify the concept of big data, illustrate its applications, and explore its future development.
Q1: Is big data just commercial hype?
The industry defines big data by four "V"s: large volume, great variety, high velocity, and high veracity. But this definition does not capture the essence of big data. Judged only by these dimensions, big data would indeed be hype, because they are merely surface phenomena.
The essence of big data should be how it brings businesses a better business model, and the success of a big data application depends on decision makers posing good business questions together with the business models associated with them. These questions can be very simple, but behind each one there must be a series of related business models.
For example: how can data from a smartphone application store be used to improve the precision of app recommendations? The store's big data can yield data tables with millions of dimensions, from which a reliable and accurate recommendation model can be built, greatly improving the user experience.
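The article does not specify which recommendation technique the store would use; as a minimal illustrative sketch, the following toy item-to-item co-occurrence recommender (with invented install histories and hypothetical app names) shows the basic idea of scoring unseen apps by how often they are installed alongside a user's existing apps:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical install histories from an app-store log (illustrative only).
installs = {
    "user_a": {"maps", "weather", "news"},
    "user_b": {"maps", "weather", "chat"},
    "user_c": {"weather", "news", "chat"},
    "user_d": {"maps", "chat"},
}

# Count how often each pair of apps appears in the same user's install set.
co_counts = defaultdict(int)
for apps in installs.values():
    for a, b in combinations(sorted(apps), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(user_apps, k=2):
    """Score apps the user lacks by co-installation with apps they have."""
    scores = defaultdict(int)
    for owned in user_apps:
        for (a, b), n in co_counts.items():
            if a == owned and b not in user_apps:
                scores[b] += n
    return [app for app, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

print(recommend({"maps"}))
```

A production system would of course work over millions of dimensions and users rather than four toy records, but the principle of mining correlations from behavioural data is the same.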
The key to a successful big data application, then, is whether there is a clear commercial (or scientific) purpose, and defining the business model is its prerequisite.
Q2: Is more data always more useful?
First, if the goal of collecting big data is to build a predictive model, then the training data must contain the required information. The problem is that it is not known beforehand which features are important, so one needs to integrate as much data as possible and let the machine find out.
But why not simply ask domain experts? It turns out that although experts can solve problems themselves, most of them do not know how they solve them. That is why, in big data applications, the experts' role lies more in helping to connect and aggregate as much data as possible.
In addition, to build a good predictive model, the amount of training data must be sufficient. If the historical data falls below a certain scale, a phenomenon called "overfitting" appears: the hypothesis is made overly complex in order to fit the data perfectly. For example, if a clothing brand designs according to the figure of one particular model, the clothes will probably come out too slim for most other consumers to wear. The same overfitting phenomenon occurs when building a predictive model on big data.
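The overfitting effect described above can be sketched numerically. In this hedged toy example (the task, numbers, and models are all invented), a model that simply memorizes five noisy training points fits them perfectly but generalizes worse than a simple model from the right family, just as clothes tailored to one person fit that person perfectly and others poorly:

```python
import random

random.seed(0)

# Toy regression task: y = 2x plus noise; a stand-in for "designing
# clothes around one model's measurements".
def make_data(n):
    return [(x, 2 * x + random.gauss(0, 1.0))
            for x in [random.uniform(0, 10) for _ in range(n)]]

train, test = make_data(5), make_data(200)

def nn_predict(x):
    """Memorize the training set: return the nearest training point's y."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def linear_predict(x):
    """Least-squares slope through the origin (the 'true' model family)."""
    slope = sum(px * py for px, py in train) / sum(px * px for px, _ in train)
    return slope * x

def mse(pred, data):
    return sum((pred(x) - y) ** 2 for x, y in data) / len(data)

# Memorization: zero training error, but poor error on unseen data.
print("1-NN   train/test MSE:", mse(nn_predict, train), mse(nn_predict, test))
print("linear train/test MSE:", mse(linear_predict, train), mse(linear_predict, test))
```

With only five training points the memorizing model's test error is far above its (zero) training error; with enough data the gap between the two models shrinks, which is exactly why scale matters.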
So, the larger the data, the longer the model must take to train? The answer is no. Research shows that, under certain conditions, as the data grows the training time actually needed becomes shorter. Why? Imagine a student learning a concept. With only a few exercises available, the student must turn each one over and over and extend it in every direction, so learning is slow. With many varied exercises, going through each one just once is enough to handle most future situations. As a result, a student with more exercises can reach the same level in less time.
Q3: Will AI surpass the human brain?
After decades of exploration, we can now believe that machine intelligence can only come from learning on big data, and that big data in turn can only come from the interaction between humans and machines. For these interactions to generate enough data, they must provide services that are genuinely useful to humans.
What kind of data is most abundant today? First of all, the kinds most easily recorded: voice, images, text, and so on. Is it possible to capture human brain activity directly and use it to enrich machine intelligence? Today's technology, such as MRI brain imaging, is not yet accurate enough. Systems trained by machine learning can exceed the human brain in narrow areas (IBM Watson, for example), but in versatility, AI at this stage is still far from the human brain.
Could a robot with artificial intelligence become the enemy of mankind in the near future? Possibly, but only on one premise: the leader of those robots would have to be human.
Q4: How to solve the user privacy problem?
Privacy concerns long predate big data, but privacy became a household issue only after big data turned hot. From the Snowden revelations to Apple's data uploads, with ever more media coverage, public concern over privacy keeps growing. The biggest paradox of privacy is that, on the one hand, users want their data tightly sealed, while on the other hand, to find anything useful in it the data has to be opened up and uploaded.
At present, the data privacy problem involves three considerations: 1. technology; 2. user benefit; 3. the degree of social acceptance.
Technically, the previous approach was to move data from the terminal to the computing end (such as a computing centre) and then send the results back to the terminal. This approach inevitably raises privacy issues: once the data leaves the user's personal terminal, there is no guarantee of who will gain access to it, and its privacy is no longer protected.
To protect privacy, a newer model is to "move the computation to the data": use the powerful computing capability of the terminal itself to produce a result locally (such as a predictive model, a local model), and then integrate that local model into a general model. This undoubtedly introduces more computation and complexity, and it is now a frontier research area. The approach is like someone who wants to trade stocks but does not want others to know his intentions: he only reads stock information online, combines it with needs known only to himself, and makes his own buy or sell decisions. As long as everyone is smart enough and has enough computing power, such a system protects everyone's privacy to the fullest extent.
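The "compute with the data in place" idea can be sketched in a few lines. This is a deliberately minimal, federated-averaging-style illustration (the data, terminal layout, and model family are all invented): each terminal fits a tiny model on its own private data, and only model parameters, never raw records, are sent to be merged:

```python
# Minimal sketch of moving computation to the data: each terminal fits a
# local model; the server sees only parameters, never the raw points.

def fit_slope(points):
    """Least-squares slope through the origin on one terminal's local data."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

# Each list is private to one device; here y is roughly 3x everywhere.
terminals = [
    [(1, 3.1), (2, 5.9), (3, 9.2)],
    [(1, 2.8), (4, 12.3)],
    [(2, 6.1), (5, 14.8), (6, 18.2)],
]

local_models = [fit_slope(data) for data in terminals]

# The server integrates the three slopes into a general model by averaging.
global_model = sum(local_models) / len(local_models)
print(round(global_model, 2))  # the merged slope lands close to 3
```

Real systems weight the merge by data volume and iterate over many rounds, but the privacy property is the same: the raw data never leaves the terminal.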
Another approach is still to transfer data to the computing centre, but to transform it before transmission so that the key private information stays hidden during transmission and computation, making it impossible to recover the original sensitive data (such as the user's gender or address) while still ensuring the authenticity and usability of the results. A harder question, in fact, is this: however the original data is hidden and encrypted, a shadow of doubt always remains in the user's mind, and because of that shadow the user will never fully trust a purely technical privacy-protection scheme. It is foreseeable that in the future, how a product resolves the privacy problem will become an important basis on which users choose products.
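The transform-before-transmission idea can also be sketched. The following is a simplified, differential-privacy-flavoured illustration (the salt, noise scale, and records are invented, and it is nowhere near production-grade): identifiers are replaced with one-way hashes and numeric attributes are perturbed before leaving the device, yet aggregates computed at the centre remain usable:

```python
import hashlib
import random

random.seed(42)

def pseudonymize(user_id, salt="local-secret"):
    """Replace a direct identifier with an irreversible salted hash."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:12]

def randomized(value, scale=5.0):
    """Add zero-mean noise so a single record reveals little on its own."""
    return value + random.uniform(-scale, scale)

# 500 perturbed age reports leave the devices; the true ages never do.
ages = [23, 35, 41, 29, 52, 38, 45, 31, 27, 49] * 50
reports = [randomized(a) for a in ages]

true_mean = sum(ages) / len(ages)
noisy_mean = sum(reports) / len(reports)

# Individual reports are distorted, but the aggregate stays close to truth.
print(pseudonymize("alice"))
print(round(true_mean, 1), round(noisy_mean, 1))
```

The per-record noise is large relative to any one value, but it averages out across many records, which is exactly the trade-off the paragraph describes: sensitive details are hidden while the usability of the result is preserved.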
Yet big data has already arrived at people's side. Everyone in today's society is in fact a user of big data and, at the same time, is constantly exposing their own privacy. Users keep free e-mail accounts even though they know the service providers mine their messages, and they put questions to search engines even though every query is recorded. Why, then, are users still so happy to use big data services? The answer lies in the ratio of user benefit to the cost of privacy exposure: users will accept the service and share their data if what they gain exceeds the value of the personal data they disclose. The key to the privacy problem, therefore, is how to let the system and its users find a balance within this contradiction.
Finally, as technology develops, social acceptance of data sharing will change; what one generation finds unacceptable may be no big problem for the next. Facebook is an example: its real-name system lets people visit other people's pages and see a great deal of information, which at first raised no small doubt, yet in the end young people warmly embraced the new technology and joined in droves.
Q5: Operators' pipelines vs. internet companies' user data?
The relationship between internet companies and telecom operators can be understood through an analogy. The various vehicles on the road are like internet companies; the goods, passengers, and transport systems they carry are like the internet's data and applications; and the highways the vehicles run on are like the pipelines the operators provide. The internet side cares more about the passengers and goods and how to deliver them safely to their destinations; the operator cares more about keeping the road clear. Seen this way, internet data is about the passengers and goods, while operator data is about traffic flow and road congestion. In other words, internet data is end-user data, and operator data is data about data.
What is "data about data"? Take a photo as an example: the pixels are the data, while the photo's size, type, and the time and place the file was created are data about the data, that is, metadata.
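The distinction can be made concrete with a toy in-memory photo record (the fields and values here are invented for illustration): the pixels are the data itself, and the descriptive fields alongside them are the metadata, which operator-style questions can read without ever touching the content:

```python
# The pixels are the data; the fields describing them are the metadata.
photo = {
    "pixels": [[0, 255], [128, 64]],  # the data itself
    "metadata": {
        "width": 2, "height": 2,
        "format": "grayscale",
        "taken_at": "2016-05-01T09:30:00",
        "location": "Hong Kong",
        "size_bytes": 4,
    },
}

def describe(p):
    """Answer a question using only metadata, never the pixel content."""
    m = p["metadata"]
    return f'{m["width"]}x{m["height"]} {m["format"]} photo taken in {m["location"]}'

print(describe(photo))
```

This mirrors the operator's position: it can reason about sizes, times, and flows without seeing what the "passengers and goods" actually are.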
Metadata is of great significance in the telecommunications industry. But the premise is that resources are always limited: the width of the pipe is finite. So what does an operator want to know? Continuing the analogy of cars and roads:
Do you want to open a fast lane for certain important regular customers? Then first you must identify who those important regulars are. Only by knowing which group they belong to and what characterizes them can you reach them effectively.
Do you want to know which important vehicle companies are being courted by rival highway companies and are considering a switch? Then you must analyse those companies' pain points.
Do you want to know which road sections need special maintenance and where to station permanent maintenance vehicles? Then you must analyse which sections are most susceptible to damage.
These demands on data analysis rise as operating technology progresses. In the 5G scenario, operators need to provide denser, faster, more personalized telecommunication services, which requires knowing users' usage patterns, their pain points, and where the service is weak. The high-end service that follows is not delivered by countless waiters standing wherever users might appear, but by one smart waiter arriving just when the user needs it. Future network technologies such as software-defined networking (SDN) also need big data support: SDN's "brain" becomes smarter and wiser through deep mining of the changes recorded in the network's big data.
Q6: How does big data relate to cloud computing and the Internet of Things?
If the whole IT industry is seen as a tree, the Internet of Things is its leaves and branches. If the sensor network captures information about people, such as what a user buys online, or a person's movement behaviour and motivations, then it has great commercial value, and the demand for such data will grow dramatically. Human psychology is the most complex thing in the world: the behaviour and actions corresponding to a given motivation and awareness vary enormously, and the relationships between people consist of infinitely many dimensions; big data is the superposition of those dimensions. The data include not only language, text, motion, and visual data, but also the relationships between people. All data related to human activity are the most worth collecting, and the demand for them will exist permanently.
So the data about people is the big data. The Internet of Things is most valuable only when the "things" it senses are connected to people; otherwise the data it transmits is extremely limited in both complexity and commercial value.
As for the relationship between big data and cloud computing: besides being "big", successful big data applications have three essential requirements: being online in real time, describing events completely, and producing differentiated effects. Cloud computing allows these three prerequisites to be met.
First, cloud computing lets people use storage and computation anytime, anywhere, so that large amounts of data can be collected and analysed in time. The app cloud services on mobile phones are one example. By lowering storage and computation costs, cloud computing provides the real-time online capability that lets more people use cloud services, and the big data snowball grows as they do.
Another benefit of cloud computing is that it enables large-scale data consolidation. The world today is not yet ready for big data applications, because vast amounts of data are scattered in different places, stored in different formats, and owned by different people. In a cloud computing environment, many large-scale data integration problems will be solved. Once the data is brought together and the threshold of integration drops sharply, big data will, like fusion in nuclear physics, have a multiplier effect.
Q7: Do we need an expert if we have big data?
In the big data age, part of the experts' role can indeed be replaced by big data applications. Take recommending financial products: an expert needs to recommend specific products to specific customers, namely those who are most likely to accept the recommendation and who have strong influence on other customers, so that after accepting the product they will spread its information to friends and family. This important marketing job used to be done by a professional marketing manager. In big data applications, however, large-scale recommendation models built on integrated analysis of big data achieve results more than twenty times better than those of marketing experts.
This example illustrates two things. First, in the traditional business world, big data can indeed replace and surpass part of the human role: in past practice a market expert could distinguish perhaps a dozen data dimensions, while a data mining model can handle tens of thousands or even tens of millions of dimensions. Second, obtaining such good results takes a great deal of preparatory work: building the data platform, integrating different data sources, establishing the analysis and prediction model, and using the model to analyse new data and make decisions.
The researchers who do this work have three salient abilities:
One, a strong command of data management systems and the ability to program quickly;
Two, the ability to communicate with business experts, understand the business objectives and constraints, and analyse the data accordingly;
Three, the ability to connect models and predictions to business decisions. People with these three abilities are what we call data scientists.
So, with big data, experts are still needed, but their role and focus in decision making have changed: experts can no longer succeed alone, but must work together with big data systems to complete complex tasks. Big data has taken over much of the data analysis, but the value of experts' domain knowledge and experience remains irreplaceable. Building a data analysis model requires understanding the business and its objectives, which still demands the experts' research and contribution; after all, a layman cannot lead the experts.
Q8: What is big data best at?
The development of big data, like that of any other technology, must go through a maturing process of "initial rise, inflated expectations, deep disappointment, rational reflection, successful application". History is full of once-lauded technologies that vanished along the way. The technologies that succeed are the ones that withstand rational scrutiny and the test of time, and find their most appropriate foothold in practice.
At present, one role of big data may not yet have been widely noticed: big data can connect a large number of isolated data islands, widening its coverage and making data-driven business snowball ever larger. In this way the business keeps acquiring new data, and users keep receiving new services.
Extrapolating from today's successful application areas, most applications focus on storing and retrieving past events, connecting and aggregating different data sources, and producing summary statistics. One important role of such aggregation is to correlate individual events recorded in different data sources, discovering the truth of events in real time through these connections. With such data one can ask: when one event occurs, what other events occur with it? How can past data be used to predict future events? How can certain behaviours be automatically suggested so as to induce certain events to occur, or to prevent them? And so on.
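The first question above, "when one event occurs, what other events occur with it?", can be sketched as a simple co-occurrence analysis. The session logs and event names below are invented for illustration; real systems would correlate events across many separate data sources:

```python
from collections import defaultdict

# Hypothetical per-session event logs aggregated from different sources.
sessions = [
    ["search", "click_ad", "purchase"],
    ["search", "click_ad"],
    ["search", "purchase"],
    ["browse", "click_ad", "purchase"],
    ["browse"],
]

# Count, for each event, how often each other event occurs alongside it.
follows = defaultdict(int)
occurs = defaultdict(int)
for events in sessions:
    seen = set(events)
    for a in seen:
        occurs[a] += 1
        for b in seen - {a}:
            follows[(a, b)] += 1

def co_rate(a, b):
    """Estimate P(b occurs in a session | a occurs in that session)."""
    return follows[(a, b)] / occurs[a]

print(co_rate("click_ad", "purchase"))  # 2 of the 3 ad-click sessions purchase
```

Conditional rates like this are the raw material for the prediction and inducement questions in the paragraph: once the correlations are known, one can act on the antecedent event to influence the consequent.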
Q9: What can't big Data do?
Big data is obviously not omnipotent, so what can big data do?
It cannot substitute for an effective business model. A big data application cannot exist without a business model: how does the big data business bring value to users, how are data growth and business growth kept in step, and so on. The business model clearly cannot be mined out of the big data itself; it must be determined by experienced experts.
It cannot be integrated without decisive leadership. In most companies today, data exists as isolated islands. Combining these data is not only a technical task but one with a strong management component: different departments of the same company often compete, and data is a department's asset. Aggregating several different data sources would be valuable, but the companies that actually manage it are those whose leadership pushes assertively for the success of data integration. This is why some of the most forward-looking companies have a dedicated department responsible for the company's data business.
It cannot be mined without a goal. Among beginners in big data there is a common misconception: once we have enough data, we can search it aimlessly for knowledge. That illusion is unscientific. Data mining needs constraints and goals; otherwise it is a vain search for a needle in a haystack. Kepler's success, for example, rested on the prior assumption of elliptical orbits with the sun at a focus.
It cannot do without experts. As mentioned above, big data in different applications requires different expertise for guidance, and the degree of expert participation needed differs across fields. Google's lab has an example in which a large volume of picture and video data let computers automatically recognize cat faces. But such deep learning is hard to generalize to other big data fields, because one prerequisite for its success is that the domain itself has a very intuitive hierarchical structure, like the composition of an image. If a field's data lacks such a hierarchy, it is hard to find the rules automatically in the same way, and such structure has to be defined by data scientists.
It cannot be modelled once for lifelong benefit. A good model needs constant updating; it requires lifelong machine learning to keep improving. In Obama's campaign, for example, scientists built a voter model to predict where voters might stand, and the model was updated every week with the latest data.
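The weekly-refresh idea can be sketched with a minimal incremental estimator. This is an invented illustration in the spirit of that campaign model, not its actual method: the estimate is updated batch by batch as each new "week" of data arrives, rather than being fitted once and frozen:

```python
class OnlineMean:
    """Running estimate of a support rate, updated batch by batch."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, batch):
        # Incremental mean update: no need to keep or refit old data.
        for x in batch:
            self.n += 1
            self.mean += (x - self.mean) / self.n
        return self.mean

model = OnlineMean()
week1 = [1, 0, 1, 1, 0]      # 1 = leaning towards the candidate (invented)
week2 = [1, 1, 1, 0, 1, 1]   # opinion shifts as the campaign evolves

print(model.update(week1))   # estimate after week 1
print(model.update(week2))   # refreshed estimate after week 2
```

Real lifelong-learning systems update far richer models than a running mean, but the design choice is the same: the model is a living object that absorbs each new batch instead of a one-off artefact.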
It is not good at global optimization. The main processing method under big data is "divide and conquer": split the big data into small pieces, process them piece by piece, and then merge the results, possibly through many rounds. The underlying assumption is that the merged result of the pieces equals the result of a global computation. But many problems cannot be decomposed this way. In the game of Go, for instance, the purpose of each stone may relate to the whole-board strategy, so divide and conquer does not work.
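Both halves of this point can be shown in a few lines. In the sketch below (data invented), a sum merges perfectly because addition is associative, while the median does not: the median of chunk medians is generally not the global median, which is exactly the kind of global computation that resists divide and conquer:

```python
from functools import reduce

data = [1, 2, 100, 3, 4, 5, 6, 7, 8]
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]

# Divide and conquer works for sums: per-chunk results merge exactly.
partials = [sum(c) for c in chunks]           # "map": solve each piece
total = reduce(lambda a, b: a + b, partials)  # "reduce": merge the pieces
assert total == sum(data)                     # chunked == global

def median(xs):
    s = sorted(xs)
    n = len(s)
    return (s[n // 2] + s[(n - 1) // 2]) / 2

# It fails for the median: merging chunk medians gives the wrong answer.
chunk_medians = [median(c) for c in chunks]
print(median(data), median(chunk_medians))  # global vs chunk-merged result
```

Whether a problem decomposes cleanly comes down to whether its combine step preserves the global answer; sums, counts, and maxima do, while medians, rankings, and whole-board Go strategy do not.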
It cannot assign semantics without annotation. At present, data acquires meaning only through labelling. A recommendation system, for example, does not work well without user feedback, and forcing meaning onto the data through existing psychological models is ineffective. In general, discovering knowledge from data requires large amounts of annotation, and such labelled data is usually available only in applications that interact directly with users. Obtaining large amounts of labelled data therefore requires not only a platform carrying a useful application, but also an economic model in which the users and the big data system benefit each other.
It cannot rely on biased data. The data must fully reflect all aspects of the situation involved; if the data is biased, it is hard to judge the future effectively.
It cannot guarantee that the data contains the needed information. When key features are missing from the data, big data cannot correct the deviation between the data and reality, especially for data related to human psychology and behaviour. The catch is that, before the study, even the experts do not know which features are the key ones. Stock prices, for example, are affected by "black swan" events, so big data cannot predict the probability of such critical events. It is like a pipeline: garbage in, garbage out. This is why the actual box office of some films contradicts the predictions obtained from online evaluation data.
It cannot guarantee noise reduction. In big data, noisy data often appears in the form of seemingly meaningful patterns, deceiving the knowledge-mining system. In this way, big data can even amplify the noise.
Q10: What are the technical trends in the post-big data age?
The change brought by big data is only one step in computing's transformation of humanity as a whole. Since the 1950s the computer has been driving a subtle revolution in human history, whose fundamental hallmark is the digitization of human society and behaviour and the seamless fusion of two worlds, the physical and the virtual. In this revolution, one traditional human industry after another is being replaced by digital industry: from the financial system to e-commerce, from robotics to driverless cars...
The big data transformation, like other major transformations in human history, must pass through the primitive accumulation of resources (that is, of big data), the differentiation of business and social services, and finally the standardization of industry and society in the virtual world, which will settle the allocation of data resources. In the last such transformation, the Industrial Revolution of the 18th century, this historical process took more than a hundred years; in the present one, it will unfold far more quickly.
The corollary is that the next generation of technology triggered by big data is likely to be an even larger shift towards digitization, turning many of the physical world's traditional industries fully or partly into digital ones. The shift will also make many fields reappear in a different form, changing whole industries from the top of the "food chain" downwards. When that day comes, will prestigious professions such as doctor, scientist, and teacher become "workers" who collect data and interpret the results that big data delivers? Or will they become partners of artificial intelligence robots driven by big data?
Qiang Yang / Professor, Department of Computer Science and Engineering, Hong Kong University of Science and Technology; Huawei Noah's Ark Laboratory (2012-2014)