In machine learning, are more data always better than better algorithms?



No. There are times when more data helps, and there are times when it doesn't.

Probably one of the most famous quotes defending the power of data is that of Google's Research Director Peter Norvig, claiming that "We don't have better algorithms. We just have more data." This quote is usually linked to the article "The Unreasonable Effectiveness of Data," co-authored by Norvig himself (you should probably be able to find the PDF on the Web, although the original is behind the IEEE paywall). The last nail in the coffin of better models came when Norvig was misquoted as saying that "All models are wrong, and you don't need them anyway" (read the author's own clarifications on how he was misquoted).



The effect that Norvig et al. were referring to in their article had already been captured years before in the famous paper by Microsoft researchers Banko and Brill [2001], "Scaling to Very Very Large Corpora for Natural Language Disambiguation". In that paper, the authors included the plot below.


The plot shows that, for the given problem, very different algorithms perform virtually the same. However, adding more examples (words) to the training set monotonically increases the accuracy of the model.

So, case closed, you might think. Well... not so fast. The reality is that both Norvig's assertions and Banko and Brill's paper are right... in a context. But they are now and again misquoted in contexts that are completely different from the original ones. In order to understand why, we need to get slightly technical. (I don't plan on giving a full machine learning tutorial in this post. If you don't understand what I explain below, read my answer to "How do I learn machine learning?")

Variance or Bias?

The basic idea is that there are two possible (and almost opposite) reasons a model might not perform well.

In the first case, we might have a model that is too complicated for the amount of data we have. This situation, known as high variance, leads to model overfitting. We know that we are facing a high variance issue when the training error is much lower than the test error. High variance problems can be addressed by reducing the number of features, and... yes, by increasing the number of data points. So, what kind of models were Banko & Brill, and Norvig, dealing with? Yes, you got it right: high variance. In both cases, the authors were working on language models in which roughly every word in the vocabulary makes a feature. These are models with many features as compared to the training examples. Therefore, they are likely to overfit. And, yes, in this case adding more examples will help.
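To make the high variance diagnosis concrete, here is a minimal, hedged sketch on synthetic data (NumPy only, nothing to do with the actual Banko & Brill language models): a flexible polynomial model overfits a tiny training set, showing a large train/test gap, and that gap collapses once we add more examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_and_score(n_train, degree=12, noise=0.3, n_test=2000):
    """Fit a degree-`degree` polynomial to n_train noisy samples of
    sin(2*pi*x) and return (train_mse, test_mse)."""
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = np.sin(2 * np.pi * x_tr) + noise * rng.standard_normal(n_train)
    x_te = rng.uniform(0, 1, n_test)
    y_te = np.sin(2 * np.pi * x_te) + noise * rng.standard_normal(n_test)
    # Polynomial.fit rescales x internally, keeping the fit numerically stable
    poly = np.polynomial.Polynomial.fit(x_tr, y_tr, degree)
    return np.mean((poly(x_tr) - y_tr) ** 2), np.mean((poly(x_te) - y_te) ** 2)

train_small, test_small = fit_and_score(n_train=20)    # high variance regime
train_large, test_large = fit_and_score(n_train=2000)  # more data to the rescue

print(f"n=20:   train={train_small:.3f}  test={test_small:.3f}")
print(f"n=2000: train={train_large:.3f}  test={test_large:.3f}")
```

With 20 points, the training error is far below the test error (the overfitting signature); with 2000 points, the two errors converge toward the noise floor.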

But, in the opposite case, we might have a model that is too simple to explain the data we have. In that case, known as high bias, adding more data will not help. See below a plot of a real production system at Netflix and its performance as we add more training examples.
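The high bias regime is just as easy to reproduce in a toy setting (synthetic data again, not the Netflix system): fit a straight line to clearly nonlinear data, then multiply the training set by 100. The test error barely moves, because it is dominated by the model's bias, not by variance.

```python
import numpy as np

rng = np.random.default_rng(1)

def linear_fit_test_mse(n_train, noise=0.3, n_test=2000):
    """Fit y = a*x + b by least squares and return the test MSE."""
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = np.sin(2 * np.pi * x_tr) + noise * rng.standard_normal(n_train)
    x_te = rng.uniform(0, 1, n_test)
    y_te = np.sin(2 * np.pi * x_te) + noise * rng.standard_normal(n_test)
    A = np.column_stack([x_tr, np.ones_like(x_tr)])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    pred = coef[0] * x_te + coef[1]
    return np.mean((pred - y_te) ** 2)

mse_small = linear_fit_test_mse(50)    # underfits
mse_large = linear_fit_test_mse(5000)  # 100x the data... still underfits

print(f"n=50:   test MSE = {mse_small:.3f}")
print(f"n=5000: test MSE = {mse_large:.3f}")
```

Both errors stay well above the noise floor, and the extra data buys essentially nothing: the model simply cannot represent the sine curve.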


So, no, more data does not always help. As we have just seen, there can be many cases in which adding more examples to our training set will not improve the model's performance.

More features to the rescue

If you are with me so far, and you have done your homework in understanding high variance and high bias problems, you might be thinking that I have deliberately left something out of the discussion. Yes, high bias models will not benefit from more training examples, but they might very well benefit from more features. So, in the end, it is all about adding "more" data, right? Well, again, it depends.
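As a hedged toy illustration of that point (nothing to do with the Netflix data): a linear model on a single raw feature underfits a quadratic signal, but adding one engineered feature, x squared, closes the gap without a single extra example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic quadratic signal with a little noise
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x + 3.0 * x**2 + 0.1 * rng.standard_normal(n)

def lstsq_mse(features, target):
    """Least-squares fit on the given feature matrix; return in-sample MSE."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return np.mean((features @ coef - target) ** 2)

plain = np.column_stack([np.ones(n), x])          # intercept + x
richer = np.column_stack([np.ones(n), x, x**2])   # same rows, one more feature

mse_plain = lstsq_mse(plain, y)
mse_richer = lstsq_mse(richer, y)
print(f"without x^2: {mse_plain:.3f}   with x^2: {mse_richer:.3f}")
```

Same examples, one extra column: the residual error drops from the unexplained quadratic term down to roughly the noise level. That is "more data" in the feature direction rather than the example direction.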

Let's take the Netflix Prize, for example. Pretty early on in the game, there was a blog post by serial entrepreneur and Stanford professor Anand Rajaraman commenting on the use of extra features to solve the problem. The post explains how a team of students got an improvement in prediction accuracy by adding content features from IMDB.



In retrospect, it is easy to criticize the post for making a gross over-generalization from a single data point. Even more, the follow-up post references SVD as one of the "complex" algorithms not worth trying because it limits the ability to scale up to larger numbers of features. Clearly, Anand's students did not win the Netflix Prize, and they probably now realize that SVD did have a major role in the winning entry.

As a matter of fact, many teams showed later that adding content features from IMDB or the like to an optimized algorithm had little to no improvement. Some members of the Gravity team, one of the top contenders for the Prize, published a detailed paper in which they showed how those content-based features would add no improvement to a highly optimized collaborative filtering matrix factorization approach. The paper was entitled "Recommending New Movies: Even a Few Ratings Are More Valuable Than Metadata".


To be fair, the title of the paper is also an over-generalization. Content-based features (or different features in general) might be able to improve accuracy in many cases. But you get my point again: more data does not always help.

Better Data != More Data (added this section in response to a comment)

It is important to point out that, in my opinion, better data is always better. There is no arguing against that. So any effort you can direct towards "improving" your data is always well invested. The issue is that better data does not mean more data. As a matter of fact, sometimes it might mean less!

Think of data cleansing or outlier removal as one trivial illustration of my point. But there are many other examples that are more subtle. For example, I have seen people invest a lot of effort in implementing distributed matrix factorization when the truth is that they could probably have gotten by with sampling their data and obtained very similar results. In fact, doing some form of smart sampling on your population the right way (e.g. using stratified sampling) can get to better results than if you used the whole unfiltered data set.
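A minimal sketch of the stratified-sampling point, on a hypothetical synthetic population (the strata and their means are invented for illustration): when strata differ a lot, a stratified sample of the same total size estimates the population mean with far less variance than simple random sampling.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population: three equally sized strata with very different means
strata = [rng.normal(mu, 1.0, 10_000) for mu in (0.0, 5.0, 20.0)]
population = np.concatenate(strata)

def simple_estimate(sample_size):
    """Mean of a simple random sample from the whole population."""
    return rng.choice(population, sample_size, replace=False).mean()

def stratified_estimate(sample_size):
    """Mean of equal-sized simple random samples, one per stratum."""
    per = sample_size // len(strata)
    return np.mean([rng.choice(s, per, replace=False).mean() for s in strata])

# Repeat each sampling scheme many times and compare estimator spread
simple = [simple_estimate(90) for _ in range(500)]
stratified = [stratified_estimate(90) for _ in range(500)]

print(f"std of simple estimator:     {np.std(simple):.3f}")
print(f"std of stratified estimator: {np.std(stratified):.3f}")
```

Both estimators use only 90 of the 30,000 points, yet the stratified one is dramatically more stable, because it removes the between-strata component of the sampling variance. Less (but smarter) data beating "all the data" is exactly the effect described above.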

The End of the Scientific Method?

Of course, whenever there is a heated debate about a possible paradigm change, there are people like Malcolm Gladwell or Chris Anderson who make a living out of heating it even more (don't get me wrong, I am a fan of both, and have read most of their books). In this case, Anderson picked up on some of Norvig's comments and misquoted them in an article entitled "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete".


The article gives several examples of how the abundance of data helps people and companies take decisions without even having to understand the meaning of the data itself. As Norvig himself points out in his rebuttal, Anderson has a few points right, but goes above and beyond to try to make them. The result is a set of false statements, starting with the title: the data deluge does not make the scientific method obsolete. I would argue it is rather the other way around.

Data without a sound approach = Noise

So, am I trying to say that the Big Data revolution is only hype? No. Having more data, both in terms of more examples and more features, is a blessing. The availability of data enables more and better insights and applications. More data indeed enables better approaches. More than that, it requires better approaches.

In summary, we should dismiss simplistic voices that proclaim the uselessness of theory or models, or the triumph of data over these. As much as data is needed, so are good models and the theory that explains them. But, overall, what we need are good approaches that help us understand how to interpret data, models, and the limitations of both in order to produce the best possible output.

In other words, data is important. But data without a sound approach becomes noise.

(Note: This answer is based on a post that I previously published on my blog: "More data or better models?")

