Deep Learning and the Triumph of Empiricism


by Zachary Chase Lipton, July 2015

Deep learning is now the standard-bearer for many tasks in supervised machine learning. It could also be argued that deep learning has yielded the most practically useful algorithms in unsupervised machine learning in a few decades. The excitement stemming from these advances has provoked a flurry of research and sensational headlines from journalists. While I am wary of the hype, I too find the technology exciting, and recently joined the party, issuing a 30-page critical review on recurrent neural networks (RNNs) for sequence learning.

But not everyone in the machine learning community is fawning over the deepness. In fact, for many who fought to resuscitate artificial intelligence, grounding it in the language of mathematics and protecting it with theoretical guarantees, deep learning represents a fad. Worse, to some it might seem to be a regression.1

In this article, I'll try to offer a high-level and even-handed analysis of the usefulness of theoretical guarantees and why they might not always be as practically useful as they are intellectually rewarding. More to the point, I'll offer arguments to explain why, after so many years of increasingly statistically sound machine learning, many of today's best performing algorithms offer no theoretical guarantees.

Guarantee what?

A guarantee is a statement that can be made with mathematical certainty about the behavior, performance, or complexity of an algorithm. All else being equal, we would love to say that, given sufficient time, our algorithm A can find a classifier h from some class of models {h1, h2, ...} that performs no worse than h*, where h* is the best classifier in the class. This is, of course, with respect to some fixed loss function L. Short of that, we'd love to bound the difference or ratio between the performance of h and h* by some constant. Short of such an absolute bound, we'd love to be able to prove that, with high probability, h and h* give similar values after running our algorithm for some fixed period of time.
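As a rough sketch, one common form such a statement takes is a PAC-style bound; the sample size n, confidence parameter δ, tolerance ε, and hypothesis class H below are generic placeholders, not anything specific to a particular algorithm:

    \Pr\Big[ L(h) \le \min_{h^* \in \mathcal{H}} L(h^*) + \varepsilon(n, \delta, \mathcal{H}) \Big] \ge 1 - \delta

That is, with high probability over the training sample, the learned hypothesis h is nearly as good as the best hypothesis in the class, with ε shrinking as n grows and growing with the richness of the class.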

Many existing algorithms offer strong statistical guarantees. Linear regression admits an exact solution. Logistic regression is guaranteed to converge over time. Deep learning algorithms, generally, offer nothing in the way of guarantees. Given an arbitrary starting point, I know of no theoretical proof that a neural network trained by some variant of SGD will necessarily improve over time rather than be trapped in a local minimum. There is a flurry of recent work which suggests, reasonably, that saddle points outnumber local minima on the error surfaces of neural networks (an m-dimensional surface, where m is the number of learned parameters, typically the weights on edges between nodes). However, this is not the same as proving that local minima do not exist or that they cannot be arbitrarily bad.
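To make the contrast concrete, here is a toy sketch in NumPy (the data, network size, and learning rate are made up for illustration): the linear regression step is backed by an exact, provable solution, while the small network trained by mini-batch SGD carries no comparable guarantee.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                           # 200 examples, 3 features
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

    # Linear regression: an exact solution, guaranteed by the normal equations.
    X1 = np.hstack([X, np.ones((200, 1))])                  # add an intercept column
    w_exact, *_ = np.linalg.lstsq(X1, y, rcond=None)        # provably minimizes squared error

    # A tiny one-hidden-layer network trained by mini-batch SGD: no comparable guarantee.
    W1 = rng.normal(scale=0.1, size=(3, 8))
    b1 = np.zeros(8)
    w2 = rng.normal(scale=0.1, size=8)
    b2 = 0.0
    lr = 0.01
    for step in range(2000):
        idx = rng.integers(0, len(y), size=32)              # a random mini-batch
        Xb, yb = X[idx], y[idx]
        h = np.tanh(Xb @ W1 + b1)                           # hidden activations
        pred = h @ w2 + b2
        err = pred - yb                                     # prediction error on the batch
        grad_w2 = h.T @ err / len(yb)                       # gradients of 0.5 * mean squared error
        grad_b2 = err.mean()
        dh = np.outer(err, w2) * (1 - h ** 2)               # backprop through tanh
        grad_W1 = Xb.T @ dh / len(yb)
        grad_b1 = dh.mean(axis=0)
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2
    # Whether this loop converges, and to what, depends on the initialization, the
    # learning rate, and the shape of a non-convex error surface; nothing is guaranteed.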

Problems with Guarantees

Provable mathematical properties are obviously desirable. They may have even saved machine learning, giving succor at a time when the field of AI was ill-defined, over-promising, and under-delivering. And yet many of today's best algorithms offer nothing in the way of guarantees. How is this possible?

I'll explain several reasons in the following paragraphs. They include:

    1. Guarantees are typically relative to a small class of hypotheses.
    2. Guarantees are often restricted to worst-case analysis, but the real world seldom presents the worst case.
    3. Guarantees are often predicated on incorrect assumptions about data.

Selecting a Winner from a Weak Pool

To begin, theoretical guarantees usually assure that a hypothesis is close to the best hypothesis in some given class. This in no way guarantees that there exists a hypothesis in the given class capable of performing satisfactorily.
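A standard way to see this is to split the excess error of the learned hypothesis over the best possible predictor into two pieces (the notation is generic, for illustration):

    L(\hat{h}) - L_{\mathrm{Bayes}} = \underbrace{\big( L(\hat{h}) - L(h^*) \big)}_{\text{estimation error}} + \underbrace{\big( L(h^*) - L_{\mathrm{Bayes}} \big)}_{\text{approximation error}}

A typical guarantee bounds only the estimation error, the gap to the best hypothesis in the class. If the class itself is weak, the approximation error can remain large no matter how cleverly we search within it.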

Here's a heavy-handed example: I desire a human editor to assist me in composing a document. Spell-check may come with guarantees on how it will behave. It will identify certain misspellings with 100% accuracy. But existing automated proof-reading tools cannot provide the insight offered by an intelligent human. Of course, a human offers nothing in the way of mathematical guarantees. The human may fall asleep, ignore my emails, or respond nonsensically. Nevertheless, he or she is capable of expressing a far greater range of useful ideas than Clippy.

A cynical take might be that there are two ways to improve a theoretical guarantee. One is to improve the algorithm. Another is to weaken the hypothesis class of which it is a member. While neural networks offer little in the way of guarantees, they offer a far richer set of potential hypotheses than most better-understood machine learning models. As heuristic learning techniques and more powerful computers have eroded the obstacles to effective learning, it seems clear that for many models, this increased expressiveness is essential for making predictions of practical utility.

The Worst Case May Not Matter

Guarantees are most often given in the worst case. By guaranteeing a result that's within a factor epsilon of optimal, we say that the worst case will be no worse than a factor epsilon. But in practice, the worst-case scenario may never occur. Real-world data are typically highly structured, and worst-case scenarios may have a structure such that there is no overlap between a typical and a pathological dataset. In these settings, the worst-case bound still holds, but it could be the case that all algorithms perform much better. There may be no reason to believe that the algorithm with the better worst-case guarantee will have better typical-case performance.
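Spelled out, such a worst-case statement reads roughly as follows, where cost, OPT, and the input distribution D are generic placeholders:

    \text{cost}\big(A(x)\big) \le (1 + \varepsilon) \cdot \mathrm{OPT}(x) \quad \text{for every input } x

What practitioners usually care about is the typical-case quantity \mathbb{E}_{x \sim D}\big[ \text{cost}(A(x)) \big], about which the worst-case bound says very little when typical inputs look nothing like the pathological ones.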

Predicated on Provably Incorrect Assumptions

Another reason why models with theoretical soundness may not translate into real-world performance is that the assumptions about data necessary to produce theoretical results are often known to be false. Consider Latent Dirichlet Allocation (LDA), for example, a well-understood and remarkably useful algorithm for topic modeling. Many theoretical proofs about LDA are predicated upon the assumption that a document is associated with a distribution over topics. Each topic is in turn associated with a distribution over all words in the vocabulary. The generative process then proceeds as follows. For each word in a document, first a topic is chosen stochastically according to the relative probabilities of each topic. Then, conditioned on the chosen topic, a word is chosen according to that topic's word distribution. This process repeats until all words are chosen.
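A minimal sketch of that assumed generative process, using NumPy with a made-up toy vocabulary, topic count, and document length, looks like this:

    import numpy as np

    rng = np.random.default_rng(1)

    vocab = ["data", "model", "learn", "cell", "gene", "protein"]   # toy vocabulary
    n_topics, doc_length = 2, 10
    alpha, beta = 0.5, 0.5                                          # Dirichlet hyperparameters

    # Each topic is a distribution over the whole vocabulary.
    topic_word = rng.dirichlet([beta] * len(vocab), size=n_topics)

    # Each document is a distribution over topics.
    doc_topics = rng.dirichlet([alpha] * n_topics)

    document = []
    for _ in range(doc_length):
        z = rng.choice(n_topics, p=doc_topics)          # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])     # pick a word from that topic
        document.append(vocab[w])

    print(document)   # words drawn independently given their topics: no syntax, no context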

Clearly, this assumption does not hold on any real-world natural language dataset. In real documents, words are chosen contextually and depend highly on the sentences they are placed in. Additionally, document lengths aren't arbitrarily predetermined, although this may be the case in undergraduate coursework. However, given the assumption of such a generative process, many elegant proofs about the theoretical properties of LDA hold.

To be clear, LDA is indeed a broadly useful, state-of-the-art algorithm. Further, I am convinced that theoretical investigations of the properties of algorithms, even under unrealistic assumptions, are a worthwhile and necessary step to improve our understanding and lay the groundwork for more general and powerful theorems later. In this article, I seek only to contextualize the nature of much known theory and to give intuition to data science practitioners about why the algorithms with the most favorable theoretical properties are not always the best performing empirically.

The Triumph of Empiricism

One might ask, if not guided entirely by theory, what allows methods like deep learning to prevail? Further, why are empirical methods backed by intuition so broadly successful now, even as they fell out of favor decades ago?

In answer to these questions, I believe that the existence of comparatively humongous, well-labeled datasets like ImageNet is responsible for the resurgence in heuristic methods. Given sufficiently large datasets, the risk of overfitting is low. Further, validating against test data offers a means to address the typical case, instead of focusing on the worst case. Additionally, advances in parallel computing and memory size have made it possible to follow up on many hypotheses simultaneously with empirical experiments. Empirical studies backed by strong intuition offer a path forward when we reach the limits of our formal understanding.
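As a small sketch of that point, here is a held-out evaluation in scikit-learn (the synthetic dataset and the model are placeholders): the test score estimates performance on data drawn from the distribution we actually see, i.e., the typical case rather than the worst case.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # A synthetic stand-in for a large, well-labeled dataset.
    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Empirical test accuracy: an estimate of typical-case performance on data from
    # the same distribution, not a bound on the worst possible input.
    print("held-out accuracy:", model.score(X_test, y_test))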

Caveats

For all the success of deep learning in machine perception and natural language, one could reasonably argue that the three most valuable machine learning algorithms are linear regression, logistic regression, and k-means clustering, all of which are well understood theoretically. A reasonable counter-argument to the idea of a triumph of empiricism might be that by far the best algorithms are theoretically motivated and grounded, and that empiricism is responsible only for the newest breakthroughs, not the most significant ones.

Few Things Are Guaranteed

When attainable, theoretical guarantees are beautiful. They reflect clear thinking and provide deep insight into the structure of a problem. Given a working algorithm, a theory which explains its performance deepens understanding and provides a basis for further intuition. Given the absence of a working algorithm, theory offers a path of attack.

However, there is also beauty in the idea that well-founded intuitions paired with rigorous empirical study can yield consistently functioning systems that outperform better-understood models, and sometimes even humans, at many important tasks. Empiricism offers a path forward for applications where formal analysis is stifled, and potentially opens new directions that might eventually admit deeper theoretical understanding in the future.

1 Yes, corny pun.

Zachary Chase Lipton is a PhD student in the Computer Science and Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs.

Related:

    • Not So Fast: Questioning Deep Learning IQ Results
    • The Myth of Model Interpretability
    • (Deep Learning's Deep Flaws)'s Deep Flaws
    • Data Science's Most Used, Confused, and Abused Jargon
    • Differential Privacy: How to Make Privacy and Data Mining Compatible
