A Conversation with Machine Learning Guru Yoshua Bengio (Part 2)
Professor Yoshua Bengio (personal homepage) is one of the leading figures in machine learning, and in deep learning in particular. Together with Geoff Hinton and Yann LeCun, he created the deep learning renaissance that began in 2006. His work focuses on advanced machine learning and is devoted to solving problems on the path to AI. He is one of the few deep learning professors who remains committed to academia; many of his peers have moved into industry, joining Google or Facebook.
As an active member of the machine learning community, Professor Bengio held an "Ask Me Anything" (AMA) session in the machine learning section of Reddit from 1 pm to 2 pm EST on February 27, answering many questions from machine learning enthusiasts and offering plenty of substance. We have compiled the session here for AI and machine learning enthusiasts on the other side of the globe to study and discuss; the order of the questions and answers was decided by Reddit users' votes. Below is the second half of the Q&A.
The first half of the Q&A: http://www.infoq.com/cn/articles/ask-yoshua-bengio
Q: As far as I know, you are the only well-known scientist in machine learning who studies sociology together with deep learning. Your paper "Culture vs Local Minima" is a fascinating read, and I have the following questions for you:
- In that paper you describe how individuals learn by being immersed in society. As we all know, individuals on their own often fail to learn much of the big picture. If you were master of the world and could choose a set of ideas for every individual to learn from childhood, how would you choose them?
- An inevitable consequence of "cultural immersion" is that the individual is not aware of the learning process itself; it is simply what the world is like. The writer David Foster Wallace once vividly compared this to fish not knowing what water is. In your view, is this phenomenon a byproduct of the structure of neural networks, or does it have some benefit?
- Do you think cultural trends affect individuals and trap them in local optima? For example, the disputes between religious institutions and Enlightenment philosophy, or the conflict between patriarchal society and women's political participation. Is this kind of phenomenon beneficial or harmful?
- What do you think about meditation and cognition?
A: I am not a sociologist or a philosopher, so read my answer with an analytical and critical eye. My view is that a great many individuals cling to their beliefs because those beliefs have become part of their identity, of the kind of group they see themselves belonging to. Changing one's beliefs is hard and frightening. I believe that a large part of our brain's work is trying to reconcile all of our experiences into a coherent worldview. Mathematically, this problem is related to inference: the individual searches for good explanations (latent variables) of the observed data. In a stochastic model, inference is carried out by stochastic exploration of possible configurations (for example, by Markov chain Monte Carlo over a Markov network). Meditation, in a way, helps us improve our inference. In meditation some ideas surface, and we then find them to be of general significance. This is precisely how science makes progress.
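To make the inference picture concrete, here is a minimal sketch (my own illustration, not code from the AMA) of stochastic exploration of a latent variable: a Metropolis-style random walk over the unknown bias of a coin, given observed flips, gradually concentrates on the explanations that fit the data.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100) < 0.7           # observed flips of a coin with true bias 0.7

def log_likelihood(theta):
    # log p(data | theta), assuming a uniform prior over theta in (0, 1)
    heads = data.sum()
    tails = len(data) - heads
    return heads * np.log(theta) + tails * np.log(1 - theta)

theta = 0.5                             # initial "explanation" of the data
samples = []
for _ in range(5000):
    proposal = np.clip(theta + rng.normal(0, 0.05), 1e-3, 1 - 1e-3)
    # Randomly accept or reject the new explanation, biased toward
    # configurations that explain the observations better.
    if np.log(rng.random()) < log_likelihood(proposal) - log_likelihood(theta):
        theta = proposal
    samples.append(theta)

print("posterior mean of the latent bias:", np.mean(samples[1000:]))
```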
Q: When discussing sum-product networks (SPNs), a member of the Google Brain team told me that he was not interested in tractable models. What do you think?
A: Different learning algorithms are intractable to different degrees. Broadly speaking, the simpler and more tractable a model is, the weaker its expressive power. I have not done an exact calculation of how much expressive power is lost when a sum-product network factorizes the joint distribution. In general, the models I know of are affected by intractability (at least in theory, the training process is very hard). Models such as SVMs do not suffer from this, but if you cannot find the right feature space, the generality of those models suffers. (Finding it is very hard; deep learning tackles exactly this problem of finding the feature space.)
A user added a follow-up question: what does it mean for a model to be tractable?
For sum-product networks, tractable means that the model's inference does not blow up exponentially in computational cost as more variables are added. Tractability comes at a price: a sum-product network can only represent certain specific distributions; for details see the paper by Poon and Domingos.
In fact, all graphical models represent a product of factors, as do deep belief networks. The tractability of a graphical model is mainly determined by its treewidth. Low-treewidth graphical models are therefore considered tractable, while high-treewidth ones are intractable, and people have to resort to MCMC, belief propagation (BP), or other approximate algorithms to get answers.
Any graphical model can be converted into an equivalent tractable form, an arithmetic circuit (AC). The problem is that, in the worst case, the resulting circuit is exponentially large. So even though inference is linear in the size of the circuit, the computation can still degrade exponentially as the graphical model grows. It is worth mentioning, however, that some exponentially large (high-treewidth) graphical models can be converted into compact arithmetic circuits, so we can still perform tractable inference on them; this discovery once made the graphical models community very excited.
We can think of ACs and SPNs as a compact way of representing the context-specific independences of a graphical model. They can represent a number of high-treewidth graphical models in compact form. The difference between an AC and an SPN is that an AC is obtained by compiling a Bayesian network, whereas an SPN represents the probability distribution directly. So instead of training a traditional graphical model and compiling it into a compact circuit (AC), we can learn a compact circuit (SPN) directly.
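As a concrete illustration of the sum-product idea (a toy example of my own, not taken from the Poon and Domingos paper), the snippet below builds a tiny SPN over two binary variables out of product nodes (factorizations) and a sum node (a mixture); any marginal can then be evaluated in a single bottom-up pass by setting unobserved leaves to 1.

```python
# Leaf distributions over two binary variables X1 and X2.
# Each leaf is a Bernoulli; passing None for a value marginalizes it out.
def leaf(p, value):
    if value is None:            # sum over both states of the leaf = 1
        return 1.0
    return p if value == 1 else 1 - p

def spn(x1, x2):
    # Product nodes combine independent children; the sum node mixes two
    # such products with weights 0.6 and 0.4.
    comp1 = leaf(0.9, x1) * leaf(0.2, x2)
    comp2 = leaf(0.1, x1) * leaf(0.7, x2)
    return 0.6 * comp1 + 0.4 * comp2

print("P(X1=1, X2=0):", spn(1, 0))
print("P(X1=1)      :", spn(1, None))   # a marginal, in the same linear-time pass
```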
Q :
- Why are deep networks better than shallow ones? It is well known that a network with a single hidden layer is already a universal approximator, yet adding more fully connected layers usually improves performance. Is there any theoretical basis for this? The papers I have come across claim practical improvements but are vague about the reason.
- What opinion do you hold that most of your colleagues do not share?
- What's the funniest or strangest paper you've ever reviewed?
- If I'm not mistaken, you teach in French. Is that a personal choice or a requirement of the university?
A: Universal approximation does not tell you how many hidden units you will need. For an arbitrary function, increasing the depth does not help. However, if the function can be decomposed into a composition of simpler pieces, depth can make a big difference, both statistically (fewer parameters, so less training data is needed) and computationally (fewer parameters, so less computation).
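The following back-of-the-envelope sketch (my own illustration; the shallow-network size is an assumed, not proven, figure) shows the kind of saving meant here: when intermediate features can be reused, a deep network's parameter count grows roughly linearly with depth, whereas a shallow network matching the same compositional function may need a very wide hidden layer.

```python
def dense_params(n_in, n_out):
    # weights + biases of one fully connected layer
    return n_in * n_out + n_out

# Deep network: four 64-unit layers stacked on a 64-dimensional input.
deep = sum(dense_params(64, 64) for _ in range(4)) + dense_params(64, 1)

# Shallow network: a single hidden layer; suppose matching the same
# compositional function needs ~64**2 hidden units (illustrative assumption).
shallow = dense_params(64, 64**2) + dense_params(64**2, 1)

print("deep parameters   :", deep)      # 16,705
print("shallow parameters:", shallow)   # 270,337
```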
I teach in French because the official language of the Université de Montréal is French. That said, three quarters of my graduate students do not have French as their first language and it does not seem to be a problem. My students have written a description of life in Montreal that applicants may find useful. Montreal is a big city with four universities, a very strong cultural scene, nature close by, and a quality of life (including safety) ranked fourth in North America, with a cost of living much lower than in comparable cities.
Q: As we all know, deep learning has made breakthroughs on images, video and sound. Do you think it will make similar progress on text classification? Most deep learning results on text classification so far look similar to those of traditional SVM or Bayes classifiers. What do you think?
A: I have a hunch that deep learning will have a very big impact on natural language processing. In fact it already has, going back to my NIPS 2000 and JMLR 2003 papers: using a learned attribute vector to represent each word, so as to model the probability distribution of word sequences in natural language text. Current work mainly learns probability distributions over words, phrases and sentences. Have a look at Richard Socher's work, which goes quite deep. Also look at the work of Tomas Mikolov, who beat the state of the art in language modeling with recurrent neural networks; the word representations he learns reveal, to some extent, nonlinear relationships between words. For example, if you take the attribute vector of "Italy", subtract the attribute vector of "Rome", and add the attribute vector of "Paris", you get a vector close to the word "France" or something with a similar meaning. Similarly, "king" minus "man" plus "woman" gives "queen". This is very exciting, because his model was never deliberately designed to do such a thing.
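Here is a toy demonstration (with made-up 3-dimensional vectors, not real learned embeddings) of the vector-arithmetic effect described above; with actual embeddings such as Mikolov's, the same nearest-neighbour query recovers the analogy.

```python
import numpy as np

vectors = {                      # made-up embeddings purely for illustration
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.8, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.2, 0.3]),
}

query = vectors["king"] - vectors["man"] + vectors["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank the remaining words by cosine similarity to the offset result.
candidates = {w: cosine(query, v) for w, v in vectors.items()
              if w not in ("king", "man", "woman")}
print(max(candidates, key=candidates.get))   # -> "queen" for these toy vectors
```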
Q: I see more and more magazines reporting on deep learning and calling it the path to true AI, with Wired magazine the chief culprit. Given the AI winters of the 1970s and 80s (and the expectations people held back then), what do you think deep learning and machine learning researchers should do to prevent history from repeating itself?
A: My view is that we should keep demonstrating progress in a scientific way (and in this respect, many companies that advertise their own deep learning research do not). Don't over-package, stay modest, don't over-spend the current achievements, and work from a long-term vision.
Q: First of all, the Theano and Pylearn2 developed by your lab are great. Four questions:
- What do you think about Hinton and LeCun moving to industry?
- How do you weigh the value of academic research and published papers against making money at a private company?
- Do you think machine learning will become like time-series analysis, a field where much of the research is closed off behind various intellectual-property restrictions?
- Given the progress made with discriminative neural network models, how do you see generative models developing in the future?
A: I think Hinton and LeCun moving into industry will drive more and better industrial-scale neural network applications that solve really interesting, large-scale problems. The pity is that the field of deep learning may end up with far fewer places for PhD students. Of course, there are many young researchers who grew up in deep learning and are willing to take on capable new students. And the deep use of deep learning in industry will lead more students to discover and understand this area and to join it.
Personally, I prefer the freedom of academia to a few extra zeros on my salary. I think academia will keep producing openly published research, and the industrial research labs are showing the same enthusiasm.
Generative models will become very important in the future. You can refer to the articles by Guillaume Alain and me on unsupervised learning (note that the two are not synonyms, but they usually go together, especially since we discovered the generative interpretation of auto-encoders).
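For readers who want to see the basic object being discussed, here is a minimal denoising auto-encoder sketch (my own simplification, not Bengio's code): the network is trained to map corrupted inputs back to the clean data, which is the setting in which the generative interpretation mentioned above applies.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 8))                         # toy data in [0, 1]
W = rng.normal(0, 0.1, (8, 4))                   # tied encoder/decoder weights
b = np.zeros(4)                                  # encoder bias
c = np.zeros(8)                                  # decoder bias
sigmoid = lambda z: 1 / (1 + np.exp(-z))

lr = 0.1
for _ in range(500):
    X_noisy = X + rng.normal(0, 0.1, X.shape)    # corrupt the input
    H = sigmoid(X_noisy @ W + b)                 # encode
    R = sigmoid(H @ W.T + c)                     # decode with tied weights
    err = R - X                                  # reconstruct the *clean* input
    dR = err * R * (1 - R)                       # backprop through the decoder
    dH = (dR @ W) * H * (1 - H)                  # and through the encoder
    W -= lr * (X_noisy.T @ dH + dR.T @ H) / len(X)
    b -= lr * dH.mean(axis=0)
    c -= lr * dR.mean(axis=0)

print("mean squared reconstruction error:", np.mean((R - X) ** 2))
```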
Q: Inspired by your work, I completed my undergraduate thesis last year on natural language processing (NLP) using probabilistic models and neural networks. I was very interested in the topic at the time and decided to pursue research in this area; I am now a graduate student and have taken some related courses.
However, after a few months I found NLP less interesting than I had imagined. Researchers in this field seem a bit dull and the field a bit stagnant; of course, that is only my one-sided personal view. What do you think are the challenges in NLP?
A: I believe that the really interesting challenge in NLP, the key to "natural language understanding", is how to design learning algorithms that can represent semantics. For example, I am now studying methods for modeling word sequences (language models) and for translating a sentence in one language into a sentence with the same meaning in another language. In both cases we are trying to learn the representation of a phrase or a sentence, not just of a word. In the translation case, you can think of it as an auto-encoder: an encoder (say, for French) maps a French sentence to its semantic representation (in a language-independent form), and a decoder (say, for English) maps that representation to a probability distribution over English sentences that have the same or nearly the same meaning as the original. The same approach obviously applies to text understanding, and with a little extra work we could do automatic question answering and other standard NLP tasks. We are not there yet; the main challenge I see lies in numerical optimization (when the training data is large, neural networks are hard to train to their full potential). There are also computational challenges: we need to train much bigger models (say 10,000 times bigger), and we obviously cannot afford training that takes 10,000 times longer. Parallelization is not easy, but it will help. As things stand, that alone will not be enough for really good natural language understanding. Good natural language understanding, the kind that passes some Turing test, requires the computer to grasp a lot of knowledge about how the world works. So we need to train models that do not consider text alone: the semantics of a word sequence can be combined with the semantic representation of an image or a video. As mentioned above, you can think of this joining as translating from one modality to another, or as comparing the semantics of the two modalities. This is how Google Image Search currently works.
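The encoder-decoder view of translation above can be sketched structurally as follows (all shapes, weights and word ids are hypothetical placeholders of mine, not a description of an actual system): an encoder maps a source sentence to a fixed semantic vector, and a decoder turns that vector into a probability distribution over target words.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_sem, vocab_fr, vocab_en = 16, 32, 1000, 1000

E_fr = rng.normal(0, 0.1, (vocab_fr, d_emb))    # French word embeddings
W_enc = rng.normal(0, 0.1, (d_emb, d_sem))      # encoder projection
W_dec = rng.normal(0, 0.1, (d_sem, vocab_en))   # decoder to English vocabulary

def encode(fr_word_ids):
    # Embed each source word and pool into one semantic representation.
    return np.tanh(E_fr[fr_word_ids].mean(axis=0) @ W_enc)

def decode_step(semantics):
    # Probability distribution over the next English word given the semantics.
    logits = semantics @ W_dec
    p = np.exp(logits - logits.max())
    return p / p.sum()

sem = encode([3, 17, 42])            # a toy French "sentence" of word ids
print(decode_step(sem)[:5])          # first few next-word probabilities
```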
Q: I am writing an undergraduate thesis on the philosophy of science and logic. I would then like to move to the computer science department for a master's degree and go on to a PhD in machine learning. Besides shoring up my weak math and programming, what do you think people like me need to do to catch a professor's attention?
A:
- Read deep learning papers and tutorials, starting with introductory texts and gradually increasing the difficulty. Take notes on what you read and summarize what you have learned regularly.
- Implement the algorithms you have learned yourself, from scratch, to make sure you understand the math. Don't just adapt pseudo-code from a paper; also implement some variants (a minimal from-scratch example is sketched after this list).
- Test these algorithms on real data, for example by taking part in Kaggle competitions. You learn a lot by getting your hands on data.
- Write up your experiences and results on a blog, get in touch with experts in the field, and ask whether they would accept your remote collaboration on their projects, or look for an internship.
- Find a deep learning lab and apply to it.
That is the road map I would suggest; is it clear enough?
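As an example of the "implement it from scratch" advice (my own toy example, not part of the AMA), here is logistic regression trained with batch gradient descent on synthetic data; writing out the gradient by hand is exactly the kind of exercise meant above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)       # a linearly separable toy task

w = np.zeros(2)
b = 0.0
lr = 0.5
for _ in range(300):
    p = 1 / (1 + np.exp(-(X @ w + b)))          # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)             # gradient of the log loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean((p > 0.5) == y)
print("training accuracy:", accuracy)           # should be close to 1.0
```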
Q: Hello Professor. The researchers of the Blue Brain project are trying to build a thinking brain by reverse-engineering the human brain. I heard that Professor Hinton attacked the idea in a talk. That gave me the impression that Professor Hinton believes the machine learning approach is more likely to produce a genuine general AI.
Suppose that at some point in the future we create a real artificial intelligence that passes the Turing test and is alive and conscious. If we could look at its code, do you think it would turn out to be a reverse-engineered human brain, or mostly man-made components?
A: I don't think Professor Hinton was really attacking reverse engineering of the brain as such; that is, he has no objection to learning from the human brain how to build intelligent machines. I suspect he was questioning that particular project, which tries to capture ever more physiological detail of the brain without a global computational theory explaining how the computation in the brain works (in particular, from a machine learning point of view). I remember an analogy he once made: imagine copying every detail of a car exactly, without understanding them, then inserting the key and expecting the car to drive on its own; it will not work at all. We have to understand what those details actually mean.
Q: Has anyone applied deep learning to machine translation? How do you see neural network based approaches replacing probability table based approaches in commercial machine translation systems?
A: I have just opened a document that lists some neural network papers for machine translation. Simply put, since neural networks have already beaten n-gram models at language modeling, you can first use them to replace the language model part of a machine translation system. Then you can use them to replace the translation table (which, after all, is just another conditional probability table). A lot of interesting work is under way. The most ambitious and most exciting direction is to abandon the current machine translation pipeline altogether and learn a translation model end to end directly with a deep model. The interesting thing here is that the output is structured (a joint distribution over word sequences), not a simple point prediction (because a source sentence has many possible translations).
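A toy sketch of the first replacement step described above (all scores here are hypothetical numbers, not the output of real models): candidate translations are reranked by adding a language-model score, which is where a neural language model can be dropped in for the n-gram model.

```python
candidates = {                     # hypothetical candidate translations
    "the cat sits on the mat": {"translation_logp": -2.1},
    "the cat sit on mat":       {"translation_logp": -1.9},
}

def neural_lm_logp(sentence):
    # Stand-in for a trained neural language model; here just a dummy table.
    dummy = {"the cat sits on the mat": -5.0, "the cat sit on mat": -9.0}
    return dummy[sentence]

best = max(candidates,
           key=lambda s: candidates[s]["translation_logp"] + neural_lm_logp(s))
print(best)    # the fluent candidate wins once the LM score is added
```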
A reader added: The New York Times has an article about English-to-Mandarin translation, produced by Microsoft.
Q: Hello Professor, I have mostly used decision trees and random forests in my projects. Can you tell me what advantages deep learning has over them?
A: I have written a paper explaining why decision trees generalize poorly. The central problem is that a decision tree (like other local machine learning algorithms) partitions the input space and allocates separate parameters to each region. As a result, the algorithm does poorly on new regions and on cases that span regions: it cannot learn a function that covers more distinct regions than it has training examples. Neural networks do not have this problem; they generalize globally because each parameter is shared across many regions.
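The following small experiment (my own illustration; it uses scikit-learn, with ordinary linear regression as a simple stand-in for a model whose parameters are shared across the whole input space) shows the region-partitioning problem in action: the tree can only repeat the constants of the regions it saw, while the shared-parameter model extrapolates the global pattern.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, (200, 1))
y_train = 3 * X_train[:, 0] + 0.5            # a simple global pattern

tree = DecisionTreeRegressor(max_depth=6).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

X_new = np.array([[2.0], [3.0]])             # regions never seen in training
print("true values :", 3 * X_new[:, 0] + 0.5)        # [6.5, 9.5]
print("tree        :", tree.predict(X_new))           # stuck near the edge value (~3.5)
print("linear model:", linear.predict(X_new))         # extrapolates the shared pattern
```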
Q : In the field of deep learning, do you have any good books or papers to recommend?
A: There are many good ones; there is a reading list for new students in our group.
Q: Will today's machine learning techniques become the cornerstone of tomorrow's AI? Where does the greatest difficulty in developing AI lie: hardware, or software and algorithms? What do you think of Ray Kurzweil's prediction that machines will pass the Turing test by 2029? He even wrote an article about a bet on it.
A: I can't say whether machines will pass the Turing test by 2029, but I am sure machine learning will be a core technology for developing future AI.
The biggest problem in developing AI is improving the machine learning algorithms. There are many obstacles to getting good enough machine learning algorithms, such as computational power and conceptual understanding, for example learning joint probabilities. I think we are still just scratching the surface of the optimization problem of training very large neural networks. Then there is reinforcement learning, which is very useful and needs to be improved. You can look at the recent work of DeepMind, where they use neural networks to play 1980s Atari games automatically; it is very interesting. The paper was presented at the NIPS workshop I organized.
Q: What do you think of Jeff Hawkins's criticism of deep learning? Hawkins is the author of On Intelligence, published in 2004, which discusses how the brain works and how to build intelligent machines modeled on it. He claims that deep learning does not model time series. The human brain thinks over streams of sensor data, and human learning mainly consists of memorizing sequential patterns; for example, when you watch a funny cat video, it is the cat's movement that makes you laugh, not static pictures of the kind Google used. See this link.
A: There is actually a lot of work on neural networks with a temporal dimension; recurrent neural networks model temporal relationships implicitly and are usually applied to speech recognition. For example, the following two papers:
[1] http://www.cs.toronto.edu/~hinton/absps/rnn13.pdf
[2] http://papers.nips.cc/paper/5166-training-and-analysing-deep-recurrent-neural-networks.pdf
And this article: http://arxiv.org/abs/1312.6026.
Sequences are also considered in natural language processing: http://arxiv.org/abs/1306.2795
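For readers unfamiliar with the recurrent models referenced above, here is a minimal forward-pass sketch (weights and sizes are arbitrary, my own illustration): the hidden state is updated at every time step, so the network carries a memory of the sequence rather than treating each frame as a static picture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16
W_xh = rng.normal(0, 0.1, (d_in, d_hidden))
W_hh = rng.normal(0, 0.1, (d_hidden, d_hidden))
b_h = np.zeros(d_hidden)

def rnn_forward(sequence):
    h = np.zeros(d_hidden)
    for x_t in sequence:                        # one input frame per time step
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    return h                                    # summary of the whole sequence

sequence = rng.normal(size=(20, d_in))          # e.g. 20 frames of features
print(rnn_forward(sequence)[:4])
```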
Q: In which areas does deep learning show promise, and where are its weaknesses? Why do stacked RBMs work well? Can the principle be explained clearly, or is it still a magic black box? And what is the connection between ensemble learning and deep learning?
A: Not a magic black box at all. I believe I have given an explanation of why stacking RBMs or auto-encoders works. See the review article I wrote with Courville and Vincent: http://arxiv.org/abs/1206.5538
Apart from the interpretation of dropout as an ensemble, I am not aware of another relationship between ensemble learning and deep learning; you can refer to this article: http://arxiv.org/abs/1312.6197
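A tiny numerical illustration (my own, not from the cited paper) of the dropout-as-ensemble interpretation mentioned above: each random mask picks out a different sub-network with shared weights, and averaging many masked forward passes comes close to the single deterministic network.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)
W = rng.normal(0, 0.5, (8, 8))
v = rng.normal(0, 0.5, 8)
relu = lambda z: np.maximum(z, 0)

def subnetwork_output(keep_prob=0.5):
    mask = rng.random(8) < keep_prob           # drop hidden units at random
    h = relu(x @ W) * mask / keep_prob         # "inverted dropout" scaling
    return h @ v

samples = [subnetwork_output() for _ in range(1000)]
full = relu(x @ W) @ v                          # deterministic network, no dropout
print("ensemble average:", np.mean(samples))
print("full network    :", full)               # the two are typically close
```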
Q: As I understand it, the success of deep neural network training depends on choosing the right hyper-parameters, such as network depth, hidden layer sizes, sparsity constraint values and so on. Some papers find these parameters by random search. It may also have to do with well-written code. Is there a place where researchers can look up good hyper-parameters for specific tasks? Starting from such values, it might be easier to search for even better ones.
A: You can see the section on hyper-parameters above. James Bergstra is continuing this line of work. I think a database that stores many recommended hyper-parameter settings would be very helpful for neural network training. The Hyperopt project on GitHub does something like that. Hyperopt focuses on neural networks and convolutional networks and gives recommendations for some hyper-parameter settings, expressed as simple factorized distributions. For example, the number of hidden layers should be 1 to 3, and the number of hidden units per layer 50 to 5000. In practice there are many more hyper-parameters, and there is also room for better hyper-parameter search algorithms. Here are more reference papers:
http://arxiv.org/abs/1306.2795
http://arxiv.org/abs/1312.6026
http://arxiv.org/abs/1308.0850
http://papers.nips.cc/paper/5166-training-and-analysing-deep-recurrent-neural-networks.pdf
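A small sketch of random search over a factorized hyper-parameter distribution of the kind described above (the ranges follow the example in the answer, but the scoring function is a made-up placeholder standing in for "train a model and return its validation score"):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    return {
        "n_layers": int(rng.integers(1, 4)),
        "n_units": int(np.exp(rng.uniform(np.log(50), np.log(5000)))),
        "learning_rate": float(np.exp(rng.uniform(np.log(1e-4), np.log(1e-1)))),
    }

def validation_score(cfg):
    # Placeholder for "train a model with cfg and return validation accuracy";
    # a dummy function so the sketch runs on its own.
    return -abs(np.log10(cfg["learning_rate"]) + 2) - abs(cfg["n_layers"] - 2)

trials = [(validation_score(c), c) for c in (sample_config() for _ in range(50))]
best_score, best_cfg = max(trials, key=lambda t: t[0])
print(best_cfg, best_score)
```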
Q : Are there any applications where traditional machine learning methods have failed and deep learning has been successful?
A: There is a constructed task, composed of two simple sub-tasks (object detection and logical reasoning), that hinges on the intermediate latent representation. Traditional black-box machine learning algorithms fail on it, some deep learning algorithms do well, and some deep learning algorithms fail as well. You can read the article. What is interesting about this task is that it is far more complicated than either of the two sub-tasks on its own.
Q: Professor Bengio, in deep learning there is a line of work that uses relatively advanced mathematics such as abstract algebra and topology. A few years ago John Healy claimed to have improved a neural network (ART1) using category theory. What do you think of this kind of attempt: a joke, or promising?
A: You can look at the work of Morton and Montufar; see these additional materials:
http://www.ece.unm.edu/~mjhealy/
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.98.6807
Tropical geometry and probabilistic models
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.242.9890
Q: Professor Bengio, I am about to complete my PhD in computational neuroscience, and I am very interested in the "grey area" between neuroscience and machine learning. Which parts of brain science do you think are relevant to machine learning? And what would you most like to learn from brain science?
A: I think that understanding the computations of the brain is strongly relevant to machine learning. We still do not know how the brain works, and understanding its efficient learning mechanisms would be of great value for designing and implementing artificial neural networks. This is therefore a very important area at the intersection of machine learning and brain science.
Original English thread: http://www.reddit.com/r/MachineLearning/comments/1ysry1/ama_yoshua_bengio/
Thanks to Bao for reviewing this article.
This article is reprinted from InfoQ.
Related article: A Conversation with Machine Learning Guru Yoshua Bengio (Part 1)