Titanium Media note: Big data is hot. The term is now used in every industry, and lately it has shown clear signs of overheating. Is big data a marketing term or a methodology? The author, Lao Li, is a senior employee of a big data service provider whose projects involve big data analysis for different industries. In his view, you first need a basic understanding of big data, namely that "a lot of data is not necessarily valuable." In addition, data statistics is not the same as big data; what separates the two is artificial intelligence. This is a long read:
In the past two years, "big data" has been used in every industry, and lately it has shown clear signs of overheating. From CCTV's Spring Festival travel coverage to Chen Yao's exclamation over Weibo data, from big data during the NPC and CPPCC sessions to the high-collar and low-collar sweaters of the "professor" in the drama "Stars", "big data" has been pushed to unprecedented heights, and it has turned from a sophisticated research direction into a household marketing term.
I am not qualified to represent academia, nor to judge who is right and who is wrong. I can only draw on my work experience and describe big data as I see it:
What is big data?
Baidu Encyclopedia defines big data as follows: big data, or massive data, refers to data whose scale is so large that current mainstream software tools cannot capture, manage, process, and organize it within a reasonable time into information that supports better business decisions.
Gartner defines "big data" as a high-volume, high-growth, and diverse information asset that requires new processing models in order to deliver stronger decision-making power, insight, and process optimization capabilities.
Personally, I think Gartner's definition is more appropriate. "New processing model" is the key phrase: it is one of the most important features that distinguishes big data, as I understand it, from traditional statistical analysis. This "new processing model" has two layers of meaning:
1. Because the data is massive, more efficient storage and processing technology is needed, and Hadoop has become the symbol of the big data age;
2. If you think big data is equivalent to Hadoop, you are wrong. Hadoop is only a necessary condition for the big data age. A clear mark of big data is that data mining and artificial intelligence are tightly integrated. This is one of the most obvious differences between big data as I understand it and many so-called "big data" projects. I will illustrate it with a case later.
Apart from the "new processing model", I believe there is another major difference: statistical analysis is the vertical classification and summary of existing data, whereas big data processes the existing mass of data in order to make predictions and recommendations about data that has not yet been produced. Data statistics describes what has already happened; big data is often used to predict or recommend things that have not happened yet.
How are predictions and recommendations implemented?
At present, the main recommendation algorithms fall into two categories: behavior-based and content-based. Of course, depending on the field and on what is being predicted or recommended, there are more than ten specific algorithms, which this article does not go into.
Behavior-based analysis, as the name suggests, analyzes the "traces" users leave on the Internet and the mobile Internet, that is, browsing, clicks, favorites, purchases, and repeat purchases, in order to predict what they will choose to buy in the future and to produce recommendations. Behavior-based analysis is a form of collective intelligence: it draws on the combined behavioral preferences of a group of users. Users influence one another, which is closer to how users behave in the real world.
Figure 1, behavior-based recommendation funnel algorithm for e-commerce
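To make the behavior-based idea concrete, here is a minimal sketch, not the author's funnel algorithm. It assumes that each behavior type can be given a weight (the values in `WEIGHTS` are hypothetical) and that users who share strongly weighted items have similar tastes; the function names and the sample events are illustrative only.

```python
from collections import defaultdict

# Hypothetical weights: stronger signals count more (assumed values).
WEIGHTS = {"browse": 1, "click": 2, "favorite": 3, "purchase": 5, "repeat_purchase": 8}

def build_scores(events):
    """Turn (user, item, behavior) events into per-user item scores."""
    scores = defaultdict(lambda: defaultdict(float))
    for user, item, behavior in events:
        scores[user][item] += WEIGHTS[behavior]
    return scores

def recommend(scores, target, top_n=3):
    """Rank items the target has not touched, using similar users' scores."""
    mine = scores[target]
    candidates = defaultdict(float)
    for other, theirs in scores.items():
        if other == target:
            continue
        # Similarity: overlap of scores on the items both users touched.
        sim = sum(min(mine[i], theirs[i]) for i in mine if i in theirs)
        if sim == 0:
            continue
        for item, score in theirs.items():
            if item not in mine:
                candidates[item] += sim * score
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(candidates, key=lambda i: (-candidates[i], i))[:top_n]

# Made-up sample data for illustration.
events = [
    ("ann", "phone", "purchase"), ("ann", "case", "click"),
    ("bob", "phone", "purchase"), ("bob", "charger", "purchase"),
    ("cat", "book", "browse"), ("cat", "lamp", "purchase"),
]
print(recommend(build_scores(events), "ann"))
```

Here "ann" and "bob" both bought a phone, so bob's charger is suggested to ann, while "cat" shares nothing with ann and contributes no candidates.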
Content-based analysis examines text, pictures, audio, video, and other information in order to arrive at predictions and recommendations. It matches the "genes" of the content against the user's preferences. The most representative example is Pandora's music recommendation, in which every song in the library is tagged by more than 400 experts and links are then built between individuals and music to complete the recommendation. Content-based analysis concerns only the individual user and does not depend on relationships between users.
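The matching of content "genes" to a user's preferences can be sketched as a tag-overlap comparison. This is only an illustration under stated assumptions: it is not Pandora's actual method, the song names and tags are made up, and Jaccard similarity is one simple choice of similarity measure among many.

```python
def jaccard(a, b):
    """Share of tags two tag sets have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_by_content(liked, catalog, top_n=2):
    """Rank songs the user has not liked yet by tag overlap with their profile."""
    # The user's profile is the union of tags of the songs they liked.
    profile = set()
    for song in liked:
        profile |= catalog[song]
    ranked = sorted(
        ((jaccard(profile, tags), song)
         for song, tags in catalog.items() if song not in liked),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [song for score, song in ranked[:top_n] if score > 0]

# Hypothetical catalog: each song carries a set of descriptive tags.
catalog = {
    "song_a": {"acoustic", "female_vocal", "slow"},
    "song_b": {"acoustic", "slow", "piano"},
    "song_c": {"electronic", "fast", "synth"},
    "song_d": {"acoustic", "female_vocal", "fast"},
}
print(recommend_by_content(["song_a"], catalog))
```

Note that no other user's data appears anywhere in the computation, which is the point the paragraph above makes about content-based analysis.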
What does big data really do?
Raising this question now may make people laugh. It seems everyone knows that big data can do this and can do that, until in the end even we find the claims ridiculous. Big data is either "demonized" or turned into "entertainment". It has come to seem at once near to us and far from us, and it starts to feel unreal.
So let me draw on my experience to say what problems big data solves. Simply put, big data helps us solve problems of decision and problems of choice.
The weather forecast is one of the oldest and most well-known predictions. You can depend on the forecast to decide what to wear tomorrow, whether to take an umbrella, etc.
In the past two years, big data has been applied to film and television production: based on an analysis of audience preferences, it is used to predict and design the plots audiences like, to find the actors audiences favor for particular roles, and even to predict the box office. All of these predictions rest on data that is run through some model to reach conclusions close to reality. To some extent they give decision makers a basis for their choices, as with "House of Cards" and "Stars".
Big data also plays an important role in solving people's problems of "choice". Do not laugh: whatever your age, sex, or educational background, people today face an unprecedented problem of choice. In academic terms, the problem is caused by the "long tail effect"; put more plainly, it comes from the contradiction between the growing number of things we can choose from and our own limited ability to process them.
Advances in technology make people lazier, that is, our own ability to process things declines, whether subjectively or objectively. Yet the number of things to choose from keeps growing: from the bewildering range of goods in e-commerce to the songs in massive music libraries, from potential partners on dating and marriage websites to traffic signal control.
Artificial intelligence built on big data is a means that lets people "become lazy". Based on your historical behavior, it infers your likely preferences and even your needs, and recommends the best results to you. That is big data: she is your attentive housekeeper, or the close friend who knows you best.
One of the most classic cases is Wal-Mart's "beer and diapers" study: Wal-Mart found that customers often bought diapers and beer together. Diapers and beer are two naturally unrelated categories of goods, and personal experience alone would never suggest a link between them. It was later found that the pattern came from a social phenomenon. In many young American families, when the diapers run out, the wife stays home with the child and the husband goes to the supermarket to buy diapers. After picking up the diapers, he usually buys some beer as well.
The example shows that data often lets you discover phenomena that seem irrational and illogical but that exist and recur.
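A pattern like "diapers and beer" is usually quantified with association-rule metrics. The sketch below computes support, confidence, and lift for a rule over a handful of made-up shopping baskets; the baskets are invented for illustration and are not Wal-Mart data, and the function names are hypothetical.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(baskets, set(antecedent) | set(consequent))
    s_a = support(baskets, antecedent)
    s_c = support(baskets, consequent)
    if s_a == 0 or s_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = s_both / s_a   # P(consequent | antecedent)
    lift = confidence / s_c     # above 1 means bought together more than chance
    return s_both, confidence, lift

# Invented sample baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]
s, c, l = rule_metrics(baskets, ["diapers"], ["beer"])
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

A lift above 1 says the two items appear together more often than they would if purchases were independent, which is the kind of signal that makes a counterintuitive pairing worth investigating.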
Take another example: everyone knows about Beijing's traffic jams, especially the morning and evening rush hours, which no longer need predicting. But using historical traffic data and a mathematical model to compute the best traffic signal management system for the whole of Beijing does belong to the category of big data.
Figure 2, taxi daily distribution map
That is the biggest difference, in my eyes, between big data and ordinary statistical analysis: data statistics can help you find the disease, while big data can not only help you find it but also help you treat it.
Big data is by no means a "gimmick". When we built reading recommendations for a telecom operator's reading base, its metrics improved greatly. And the improvement was not a matter of percentages but of several times! (Per-user traffic increased 4 times, and the ability to activate silent users improved 6.5 times.) This is the charm of big data.
Big data isn't everything
Big data clearly cannot do everything, and that is exactly what makes her real. In some fields, for a variety of reasons, big data delivers less value than expected. There are two main causes: one is the quality or quantity of the data itself, and the other is an unsuitable algorithm.
Do not assume that a huge amount of data is bound to be valuable. In past work we often found that much of the data obtained from the data-source party was useless; only 10%-20% of the data produced any value. This reminds me of Mary Meeker's metaphor: "the work of big data is like looking for a needle in a haystack."
Moreover, in most fields the business itself is still at an early stage and has very little data. Cold start and sparsity are challenges that big data faces in many fields.
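The two terms can be illustrated with a small sketch. Sparsity is measured here as the share of user-item pairs with no recorded interaction, and the cold-start case is handled with a plain popularity fallback. The fallback is a common baseline and an assumption of this sketch, not something the article prescribes; the data and function names are made up.

```python
from collections import Counter

def sparsity(interactions, n_users, n_items):
    """Share of the user-item matrix that has no recorded interaction."""
    observed = len(set(interactions))
    return 1 - observed / (n_users * n_items)

def recommend_popular(user, interactions, top_n=2):
    """Baseline for users with little or no history: most popular unseen items."""
    seen = {item for u, item in interactions if u == user}
    popularity = Counter(item for _, item in interactions)
    ranked = sorted(popularity, key=lambda i: (-popularity[i], i))
    return [item for item in ranked if item not in seen][:top_n]

# Invented data: 4 users and 5 items exist, but only 4 interactions are recorded.
interactions = [("u1", "a"), ("u2", "a"), ("u2", "b"), ("u3", "c")]
print(sparsity(interactions, n_users=4, n_items=5))
print(recommend_popular("u4", interactions))  # "u4" has no history at all
```

With so few observed pairs, a personalized model has almost nothing to learn from, and a brand-new user such as "u4" can only be given what is popular overall.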
On the other hand, there is no universal algorithm for different fields and different projects; each specific problem needs its own solution. In practice we have found that the right approach differs not only between fields (for example, article recommendation versus product recommendation) but even between businesses within the same field (both are e-commerce, but of different kinds, such as maternity and baby goods versus clothing or luxury goods).
Cross-use of data
The two biggest problems of big data mentioned above, the lack of data at cold start and the sparsity of data in the early stage of a business, are not incurable. Linking data across sources, which the industry keeps talking about, is the way out for both.
For some emerging fields, a lack of data is inevitable. Yet precisely because these fields lack data support, they have an even greater need for a strong decision support system to guide the business, so as to take fewer detours and maximize returns.
Projects in the mobile Internet field are especially representative. Although the mobile Internet has developed at high speed over the past two or three years, its accumulation in every respect still cannot compare with that of the Internet. Data has little value or meaning until people have formed stable usage habits.
But if a person's Internet data and mobile Internet data can be linked, we already know that person's preferences and other information, and can give more effective guidance and help to the mobile Internet business.
Figure 3, linking Internet and mobile Internet data
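The linking step can be sketched as merging per-user preference tags from the two sources. The sketch assumes that both sources share a common user identifier, which in practice is often the hard part; the identifiers, tags, and function name are hypothetical.

```python
def link_profiles(web, mobile):
    """Union each user's interest tags across the web and mobile sources."""
    return {
        user_id: set(web.get(user_id, ())) | set(mobile.get(user_id, ()))
        for user_id in sorted(web.keys() | mobile.keys())
    }

# Invented sample data keyed by an assumed shared user ID.
web = {"u1": {"novels", "travel"}, "u2": {"sports"}}
mobile = {"u1": {"music"}, "u3": {"games"}}

linked = link_profiles(web, mobile)
print(linked["u2"])  # no mobile history yet, but web preferences are known
```

A user such as "u2", who is a cold-start case on the mobile side, still arrives with preferences carried over from the web side.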
Of course, linking data is by no means limited to the Internet and the mobile Internet. The data from each source often portrays a different side of a person. As Professor Barabási says in his book "Bursts", if the data is adequate, 93% of human behavior is predictable and regular.
Only by integrating data from these different sources can more meaningful information be mined.
Nowadays many people in the industry do "data statistics and analysis" under the banner of big data, which leads many laypeople into a misconception. Data statistics is not the same as big data. But whether it is data statistics or big data, the goal is to make our work more effective and our decisions more rational and accurate. Taking data seriously is itself a sign of a mature enterprise.
The rapid rise of the mobile Internet makes data more diverse and richer. Its mobility, fragmentation, privacy, and always-available nature fill in the data from after the user leaves the desktop, so that together with existing Internet data it can sketch out a person's whole day, the data of everyday life.
As data becomes richer and more complete, and as data from different channels is linked and cross-used, the room for imagination that big data offers will certainly grow.