Unsupervised learning, attention, and other mysteries
Get notified when our free report, "The future of machine intelligence: Perspectives from leading practitioners," is available for download. The following interview is one of many that will be included in the report.
Ilya Sutskever is a research scientist at Google and the author of numerous publications on neural networks and related topics. Sutskever is a co-founder of DNNresearch and was named Canada's first Google Fellow.
Key takeaways:
- Since humans can solve perception problems very quickly, despite our neurons being relatively slow, moderately deep and large neural networks have enabled machines to succeed in a similar fashion.
- Unsupervised learning is still a mystery, but a full understanding of this domain has the potential to fundamentally transform the field of machine learning.
- Attention models represent a promising direction for powerful learning algorithms that require ever less data to be successful on harder problems.
David Beyer: Let's start with your background. What was the evolution of your interest in machine learning, and how did you zero in on your Ph.D. work?
Ilya Sutskever: I started my Ph.D. just before deep learning became a thing. I was working on a number of different projects, mostly centered around neural networks. My understanding of the field crystallized when collaborating with James Martens on the Hessian-free optimizer. At the time, greedy layer-wise training (training one layer at a time) was extremely popular. Working on the Hessian-free optimizer helped me understand that if you just train a very large and deep neural network on a lot of data, you will almost necessarily succeed.
Taking a step back, when solving naturally occurring machine learning problems, you use some model. The fundamental question is whether you believe that this model can solve the problem for some setting of its parameters. If the answer is no, then the model will not get great results, no matter how good its learning algorithm is. If the answer is yes, then it's only a matter of getting the data and training it. In some sense, that is the primary question: can the model represent a good solution to the problem?
There is a compelling argument that large, deep neural networks should be able to represent very good solutions to perception problems. It goes like this: human neurons are slow, and yet humans can solve perception problems extremely quickly and accurately. If humans can solve useful problems in a fraction of a second, then you should need only a very small number of massively parallel steps in order to solve problems like vision and speech recognition. This is an old argument; I've seen a paper on it from the early 80s.
This suggests that if you train a large, deep neural network with ten or so layers on something like vision, then you could basically solve it. Motivated by this belief, I worked with Alex Krizhevsky toward demonstrating it. Alex had written an extremely fast implementation of 2D convolutions on a GPU, at a time when few people knew how to code for GPUs. We were able to train neural networks larger than ever before and achieve much better results than anyone else at the time.
Nowadays, everybody knows that if you want to solve a problem, you just need to get a lot of data and train a big neural net. You might not solve it perfectly, but you can definitely solve it better than you could have possibly solved it without deep learning.
DB: Not to trivialize what you're saying, but you're saying: throw a lot of data at a highly parallel system, and it will basically figure out what you need?
IS: Yes, but: although the system is highly parallel, it is its sequential nature that gives you the power. It's true we use parallel systems, but only to make it fast and large. If you think of what depth represents, depth is the sequential part.
And if you look at our networks, you'll see that each year they get deeper. It's amazing to me that these very vague, intuitive arguments turned out to correspond to what is actually happening. The networks that do best in vision are deeper than they were before. Now we have 25 layers of computational steps, or even more, depending on how you count.
DB: What are the open problems, theoretically, in making deep learning as successful as it can be?
IS: The huge open problem would be to figure out how to do more with less data. How do we make this method less data-hungry? How can you input the same amount of data, but better formed?
This ties in with one of the greatest open problems in machine learning: unsupervised learning. How do you even think about unsupervised learning? How do you benefit from it? Once our understanding improves and unsupervised learning advances, this is where we'll acquire new ideas and see a completely unimaginable explosion of new applications.
DB: What's our current understanding of unsupervised learning? And how is it limited, in your view?
IS: Unsupervised learning is mysterious. Compare it to supervised learning. We know why supervised learning works. You have a big model, and you're using a lot of data to define the cost (the training error), which you minimize. If you have a lot of data, your training error will be close to your test error. Eventually, you get to a low test error, which is what you wanted from the start.
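To make that recipe concrete, here is a minimal sketch of supervised learning as just described: a model, a cost defined as the training error, and a minimization loop. The linear model, synthetic data, and every name below are illustrative assumptions, not anything from the interview.

```python
import numpy as np

# A minimal sketch of the supervised recipe: pick a model, define the cost
# as the training error, and minimize it. All data here is synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # inputs
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy targets

w = np.zeros(5)                                # model parameters
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)          # gradient of mean squared error
    w -= 0.1 * grad                            # gradient descent step

print(np.mean((X @ w - y) ** 2))               # low training error; with enough
                                               # data, close to the test error
```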
But I can't even articulate what it is we want from unsupervised learning. You want something; you want the model to understand, whatever that means. Although we currently understand very little about unsupervised learning, I am also convinced that the explanation is right under our noses.
DB: Are you aware of any promising avenues that people are exploring toward a deeper, conceptual understanding of why unsupervised learning does what it does?
IS: There are plenty of people trying various ideas, mostly related to density modeling or generative models. If you ask any practitioner how to solve a particular problem, they will tell you to get the data and apply supervised learning. There is not yet an important application where unsupervised learning makes a profound difference.
DB: Do we have any sense of what success means? Even a rough measure of how well an unsupervised model performs?
IS: Unsupervised learning is always a means to some other end. In supervised learning, the learning itself is what you care about. You've got your cost function, which you want to minimize. In unsupervised learning, the goal is always to help some other task, like classification or categorization. For example, I might ask a computer system to passively watch a lot of YouTube videos (so unsupervised learning happens here), then ask it to recognize objects with great accuracy (that's the final supervised learning task).
Successful unsupervised learning enables the subsequent supervised learning algorithm to recognize objects with accuracy that would not be possible without the use of unsupervised learning. It's a very measurable, very visible notion of success. And we haven't achieved it yet.
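As a rough illustration of that two-stage pipeline and its measurable notion of success, here is a hedged sketch: PCA stands in for the unsupervised stage (any representation learner could take its place), followed by a small supervised classifier. The data, feature count, and all names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical setup: many unlabeled examples, only a few labeled ones.
X_unlabeled = rng.normal(size=(5000, 20))
X_labeled = rng.normal(size=(100, 20))
y_labeled = (X_labeled[:, 0] > 0).astype(float)

# Stage 1 (unsupervised): learn features from unlabeled data alone.
# PCA via SVD is a simple stand-in for a representation learner.
centered = X_unlabeled - X_unlabeled.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)

def encode(X):
    return X @ Vt[:8].T            # project onto the top 8 components

# Stage 2 (supervised): logistic regression on the learned features.
H = encode(X_labeled)
w = np.zeros(8)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-H @ w))               # predicted probabilities
    w -= 0.1 * H.T @ (p - y_labeled) / len(y_labeled)

# The measurable notion of success: accuracy on the final supervised task,
# compared against the same classifier trained without the learned features.
accuracy = np.mean(((1.0 / (1.0 + np.exp(-H @ w))) > 0.5) == y_labeled)
print(accuracy)
```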
DB: What are some other areas where you see exciting progress?
IS: A general direction that I believe to be extremely important is: can we make learning models capable of more sequential computation? I mentioned how I think deep learning is successful because it can do more sequential computation than previous ("shallow") models. And so models that can do even more sequential computation should be even more successful, because they are able to express more intricate algorithms. It's like allowing your parallel computer to run for more steps. We already see the beginning of this, in the form of attention models.
DB: And how do attention models differ from the current approach?
IS: In the current approach, you take your input vector and give it to the neural network. The neural network runs it, applies several processing stages to it, and then gets an output. In an attention model, you have a neural network, but you run the neural network for much longer. There is a mechanism in the neural network which decides which part of the input it wants to 'look' at. Normally, if the input is very large, you need a large neural network to process it. But if you have an attention model, you can decide on the best size of the neural network, independent of the size of the input.
DB: So then, how does it decide where to focus this attention in the network?
IS: Say you have a sentence, a sequence of, say, 100 words. The attention model will issue a query on the input sentence and create a distribution over the input words, such that a word that's more similar to the query will have higher probability, and words that are less similar to the query will have lower probability. Then you take the weighted average of them. Since every step is differentiable, we can train the attention model where to look with backpropagation, which is the reason for its appeal and success.
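Here is a minimal numpy sketch of that differentiable (soft) attention step; the dot product is one common choice of similarity score, and all names and sizes are illustrative. Every operation is differentiable, so in a real model the query and word vectors would be produced by, and trained through, the surrounding network with backpropagation.

```python
import numpy as np

def softmax(scores):
    # A distribution over the input words, built only from exponentials
    # and element-wise operations.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def soft_attention(query, word_vectors):
    scores = word_vectors @ query      # similarity of each word to the query
    weights = softmax(scores)          # more similar => higher probability
    return weights @ word_vectors      # weighted average of the word vectors

# Illustrative usage: a "sentence" of 100 random 64-dimensional word vectors.
rng = np.random.default_rng(0)
words = rng.normal(size=(100, 64))
query = rng.normal(size=64)
context = soft_attention(query, words)  # shape (64,), whatever the sentence length
```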
DB: What kind of changes do you need to make to the framework itself? What new code do you need in order to insert this notion of attention?
IS: Well, the great thing about attention, at least differentiable attention, is that you don't need to insert any new code into the framework. As long as your framework supports element-wise multiplication of matrices or vectors, and exponentials, that's all you need.
DB: So, attention models address the question you asked earlier: how do we make better use of existing power with less data?
IS: That's basically correct. There are many reasons to be excited about attention. One of them is that attention models simply work better, allowing us to achieve better results with less data. Also, bear in mind, humans clearly have attention. It is something that enables us to get results. It's not just an academic concept. If you imagine a really smart system, surely, it, too, would have attention.
DB: What are some of the key issues around attention?
IS: Differentiable attention is computationally expensive because it requires accessing your entire input at each step of the model's operation. And this is fine when the input is a sentence that's only, say, 100 words, but it's not practical when the input is a 10,000-word document. So, one of the main issues is speed. Attention should be fast, but differentiable attention is not fast. Reinforcement learning of attention is potentially faster, but training attentional control using reinforcement learning over thousands of objects would be non-trivial.
DB: Is there an analog, in the brain, as far as we know, for unsupervised learning?
IS: The brain is a great source of inspiration if looked at correctly. The question of whether the brain does unsupervised learning or not depends to some extent on what you consider to be unsupervised learning. In my opinion, the answer is unquestionably yes. Watch how people behave, and notice that people are not really using supervised learning at all. Humans hardly use any supervision of any kind. You start reading a book, and you understand it, and all of a sudden you can do new things that you couldn't do before. Consider a child sitting in class. It's not like the student is given a lot of input/output examples. The supervision is extremely indirect; so, there's necessarily a lot of unsupervised learning going on.
DB: Your work is inspired by the human brain and its power. How far does the neuroscientific understanding of the brain extend to the realm of theorizing about and applying machine learning?
IS: There is a lot of value in looking at the brain, but it has to be done carefully, and at the right level of abstraction. For example, our neural networks have units that have connections between them, and the idea of using slow interconnected processors was directly inspired by the brain. But it is a faint analogy.
Neural networks are designed to be computationally efficient in software implementations rather than biologically plausible. But the overall idea is inspired by the brain, and it has been successful. For example, convolutional neural networks echo our understanding that neurons in the visual cortex have very localized receptive fields. This is something that is known about the brain, and this information has been successfully carried over to our models. Overall, I think there is value in studying the brain if done carefully and responsibly.