Experience has taught us that fully exploiting the advantages of scale can bring greater analytical value. But big data is not a universal hammer, and not every problem is a nail that can be solved with one.
Many people assume that with big data, bigger is always better. The question "is bigger better?" is often approached from different philosophical angles, which I summarize as follows:
Belief: Larger volumes, higher velocities, and richer varieties of data always yield more insight; this is taken to be the core value of big data analytics. If we fail to discover those insights, it is only because we are not trying hard enough, are not flexible enough, or are not using the right tools and solutions.
Idol: The sheer volume of data has value in itself, regardless of whether we can extract any special insight from it. Judging data's usefulness solely by the specific business applications it currently supports conflicts with what data scientists need today, namely keeping data in a data lake to support future exploration.
Burden: Sheer data volume is neither inherently good nor bad, but it undeniably strains the storage and processing capacity of existing databases, making new platforms such as Hadoop a necessity. If we cannot keep pace with this data growth, core business workloads will be forced onto new platforms.
Opportunity: In my view, this is the right way to approach big data. As data reaches new scales, flows faster, and arrives from more sources in more formats, this stance focuses on gaining unprecedented insights more efficiently. It does not treat big data as a belief or an idol, because it recognizes that smaller data sets can still yield many different insights. Nor does it treat scale as a burden, but as a challenge that can be addressed effectively with new database platforms, tools, and practices.
In 2013 I discussed the core use cases of big data on my blog, but I only covered the "opportunity" part of the equation. I later realized that the core value of the "big" in big data lies in its ability to reveal added context along with added content. As you analyze data to explore its full meaning, the more context the better. Likewise, as you try to identify all the variables, relationships, and patterns in your problem domain in order to find a better solution, the more content the better. In short, more content combined with more context usually means more data.
Another value of big data is that it can correct the errors that arise from small data. Observers have noted that, for data scientists, a smaller training set means greater exposure to several modeling risks. First, a small data set may lead users to overlook critical predictive variables. Second, an unrepresentative sample makes it much more likely that the model will be biased. Third, users may find spurious relationships that would be exposed as such if they had more complete data revealing the relationships that actually hold.
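To make that last risk concrete, here is a minimal sketch (all values invented): two variables generated independently of each other can still look correlated in a small sample, while the apparent relationship collapses once far more data is available.

```python
# Minimal sketch (invented values): a spurious correlation in a small sample.
import numpy as np

rng = np.random.default_rng(0)

def sample_correlation(n):
    # x and y are generated independently, so any correlation we observe is pure noise.
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    return np.corrcoef(x, y)[0, 1]

print("n = 20:      r =", round(sample_correlation(20), 3))      # can easily look "real"
print("n = 100000:  r =", round(sample_correlation(100_000), 3)) # collapses toward zero
```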
Scale is very important
It is generally agreed that, for some data types and use cases, scale does more to surface new insights than for others.
I recently came across an article on predictive modeling, "Big Data: Is Bigger Really Better?", which examines a specific category of data: sparse, fine-grained behavioral data. For this kind of data, greater scale usually does improve predictive performance. The article's authors, Junqué de Fortuny, Martens, and Provost, note: "An important characteristic of such datasets is that they are usually quite sparse: for any given instance, most of the features have no value, or the value is simply not observed."
Most notably, the authors back their argument with a wealth of cited research, and this kind of data sits at the core of many big data applications focused on customer analytics. Social media behavior data, web browsing behavior data, mobile behavior data, ad response behavior data, and natural language behavior data all fall into this category.
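To show what such sparsity looks like in practice, here is a minimal sketch; the user counts, behavior indices, and dimensions are invented. Each user exhibits only a handful of the thousands of possible behaviors, so the user-by-behavior matrix is almost entirely empty and is stored in sparse form.

```python
# Illustrative sketch of sparse, fine-grained behavioral data (invented example).
# Each row is a user; each column is a possible behavior (a page visited, an ad clicked, ...).
# Any one user exhibits only a few of the many possible behaviors, so most cells are zero.
from scipy.sparse import csr_matrix

n_users, n_behaviors = 5, 10_000            # in real applications both can be in the millions
rows = [0, 0, 1, 2, 2, 2, 4]                # user indices
cols = [17, 4203, 9999, 17, 88, 512, 4203]  # behavior indices that were actually observed
vals = [1, 1, 1, 1, 1, 1, 1]                # binary "did it at least once" indicators

X = csr_matrix((vals, (rows, cols)), shape=(n_users, n_behaviors))
print(X.nnz, "observed behaviors out of", n_users * n_behaviors, "possible cells")
```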
"In fact, the data used for predictive analysis are very similar for most predictive analytics business applications, such as directional marketing, credit scoring, and loss management in the financial and telecommunications sectors," the authors say. The characteristics of these products are focused on the individual's background, geographical and psychological characteristics, as well as some specific behaviors, such as preemptive behavior, which are summarized by statistics. ”
The key reason why "larger behavioral datasets tend to be better" is simple. The authors argue that "without a lot of data, some significant behaviors may never be observed at all," because in a sparse dataset any one recorded individual exhibits only a limited number of behaviors. Only when you look across the whole population are you likely to observe each specific type of behavior, or each particular context, at least once. With less data, fewer individuals and fewer behavioral features are observed, and much gets missed.
A predictive model depends on the richness of the behavioral data it is trained on. To make predictions more accurate on future cases, larger data is usually better.
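The sketch below illustrates that pattern on synthetic data; the data generator, feature counts, and the choice of a Bernoulli naive Bayes classifier are my own assumptions, not the authors' setup. With sparse behavioral features, held-out performance typically keeps climbing as the number of training instances grows.

```python
# Illustrative sketch (synthetic data, invented parameters): with sparse behavioral
# features, test performance tends to keep improving as training instances grow.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n_features = 2_000

# Only a small subset of the possible behaviors actually predicts the outcome.
true_w = np.zeros(n_features)
signal_idx = rng.choice(n_features, size=50, replace=False)
true_w[signal_idx] = rng.normal(0.0, 2.0, size=50)

def make_data(n):
    # ~0.5% of cells are non-zero: each "user" shows only ~10 of the 2,000 behaviors.
    X = sparse_random(n, n_features, density=0.005,
                      random_state=int(rng.integers(10**9))).tocsr()
    X.data[:] = 1.0                                # binary "behavior observed" flags
    p = 1.0 / (1.0 + np.exp(-(X @ true_w - 0.5)))  # hidden true response probability
    y = (rng.random(n) < p).astype(int)
    return X, y

X_test, y_test = make_data(20_000)
for n_train in (500, 5_000, 50_000):
    X_tr, y_tr = make_data(n_train)
    model = BernoulliNB().fit(X_tr, y_tr)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"training instances: {n_train:>6}  ->  test AUC: {auc:.3f}")
```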
When bigger becomes a blur
However, the article's authors also describe scenarios in which the bigger-is-better assumption does not hold, and the predictive value of each specific behavioral feature has to be weighed instead. In those cases, trade-offs form the basis of a behavioral predictive model.
Every behavioral feature added to a predictive model should be related strongly enough to the target being predicted that it improves the model's learning and predictive power by more than it widens the variance, that is, the overfitting and prediction error that a larger feature set usually brings. As the authors put it, "a large number of irrelevant features only increases the chance of variance and overfitting, without a corresponding increase in the chance of learning a better model."
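The sketch below illustrates that effect on synthetic data; the sample sizes, feature counts, and the use of a lightly regularized logistic regression are my own assumptions, not taken from the article. With a fixed, modest training set, padding the feature matrix with irrelevant features tends to erode held-out performance.

```python
# Illustrative sketch (synthetic data, invented parameters): irrelevant features
# added to a fixed, modest training set mostly add variance and overfitting.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n_train, n_test, n_signal = 300, 5_000, 10
true_w = rng.normal(size=n_signal)                 # only these features matter

def make_split(n_noise):
    n = n_train + n_test
    X_signal = rng.normal(size=(n, n_signal))
    X_noise = rng.normal(size=(n, n_noise))        # features unrelated to the target
    X = np.hstack([X_signal, X_noise])
    p = 1.0 / (1.0 + np.exp(-(X_signal @ true_w)))
    y = (rng.random(n) < p).astype(int)
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])

for n_noise in (0, 100, 1_000):
    (X_tr, y_tr), (X_te, y_te) = make_split(n_noise)
    # Weak regularization so the variance effect stays visible.
    model = LogisticRegression(C=1e6, max_iter=5_000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{n_noise:>5} irrelevant features  ->  test AUC: {auc:.3f}")
```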
Clearly, when "big" gets in the way of predictive insight, bigger is not better. Users do not want their big data analytics efforts to fall victim to ever-expanding data. Data scientists, for their part, must understand when to adjust the scale of the data and the model to fit the analytical task at hand.