Today at school someone once again brought up "Deep Reinforcement Learning that Matters", the paper that fired the first shot of the "talk you out of DRL (deep reinforcement learning)" genre. Catching up on Twitter after a long while, I then came across the post "Deep Reinforcement Learning Doesn't Work Yet", which could be translated literally as "deep reinforcement learning still can't be made to work", or more loosely as "deep reinforcement learning is far from plug-and-play".
After reading it, many of the vague doubts I have had since falling into this pit last July were finally answered, and I felt like seeing daylight after a long night; I was too excited to contain myself for a while. There is relatively little good material on deep reinforcement learning; privately I think the best is the 智能单元 ("Intelligence Unit") column, plus lots of scattered paper walkthroughs, course notes, and Q&As, but nobody seems to have mentioned this article. It is the best stage summary of deep reinforcement learning I have seen so far, and I strongly recommend treating it as the first lesson in DRL: read it, then think carefully about whether to jump into the pit.
First, the author's background. Alex Irpan is currently a software engineer on the Google Brain Robotics team. He received his undergraduate degree in computer science from Berkeley, where he worked in the Berkeley AI Research (BAIR) Lab under DRL heavyweight Pieter Abbeel, and also worked with John Schulman.
The article opens with the claim that deep reinforcement learning is a big pit. Its success stories are few, but each one is extremely famous: the Deep Q-Network (DQN) matching or exceeding human experts on Atari games using raw pixels as the state, crushing humans at Go via self-play, greatly reducing the energy consumption of Google's data centers, and so on. As a result, researchers who have never actually worked on deep reinforcement learning hold a grand illusion about it, overestimating its capabilities and underestimating its difficulty.
Reinforcement learning by itself is a very general AI paradigm; intuitively it seems ideal for modeling all kinds of sequential decision-making tasks, such as speech and text tasks. Pair it with deep neural networks ("give me enough layers and enough neurons and I can approximate any nonlinear function") and the combination sounds like a match made in heaven. No wonder DeepMind often proclaims AI = deep learning + reinforcement learning.
But Alex tells us not to rush; let's first look at some problems:
1. Sample efficiency is very low. In other words, training a model to a given level of performance requires an enormous number of samples.
2. Final performance is often not good enough. On many tasks, non-reinforcement-learning or even non-learning methods, such as model-based control and the linear quadratic regulator (LQR), achieve much better performance. What stings most is that these methods often also have much higher sample efficiency. Of course, they sometimes come with extra assumptions, such as having a trained model available to imitate, or access to something like Monte Carlo tree search.
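For a sense of what such a non-learning baseline looks like, here is a minimal sketch of a finite-horizon, discrete-time LQR solved by backward Riccati recursion. The double-integrator dynamics and cost matrices are illustrative placeholders, not any benchmark from Alex's article.

```python
import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    """Finite-horizon discrete-time LQR via backward Riccati recursion.

    Assumes known linear dynamics x_{t+1} = A x_t + B u_t and quadratic
    cost sum_t (x_t^T Q x_t + u_t^T R u_t); returns gains K_t for u_t = -K_t x_t.
    """
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # reorder so gains[0] is the gain for t = 0

# Illustrative double-integrator (position/velocity) system, not a benchmark.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = 0.1 * np.eye(1)

x = np.array([1.0, 0.0])   # start 1 unit from the target, at rest
for K in lqr_gains(A, B, Q, R, horizon=100):
    u = -K @ x             # closed-loop control, no learning involved
    x = A @ x + B @ u
print(x)                   # state ends up close to the origin
```

Given a known linear model, this computes a near-optimal controller with zero environment interaction, which is exactly the kind of sample efficiency learned policies struggle to match.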
3. The key to DRL success is a good reward function, and good reward functions are hard to design. In "Deep Reinforcement Learning that Matters" the authors note that sometimes merely multiplying the reward by a constant makes the difference between night and day for the same model. But the ways a reward function can trip you up go well beyond that. Designing one requires ensuring that:
Appropriate priors are built in: the reward has to define the problem well and cover the right behavior in every state that can be reached. What makes this maddening is that the model will very often find a way to cheat. One example Alex gives is a task where a red Lego brick must be stacked on top of a blue one, with the reward based on the height of the red brick's bottom face. The model simply learned to flip the red brick upside down. Kid, you've gone bad; Dad is very disappointed in you.
The reward is not too sparse. In other words, it should not be the case that in most states the reward function just returns 0. It is the same with people: we need encouragement while learning, and studying for a long time with no payoff makes it easy to give up. (They say the 21st century is the century of biology—why can't I feel it yet? The 21st century has only just begun. I can't wait.)
Sometimes, in the effort to densify the reward, too much shaping is applied and new bias is introduced (a small shaping sketch that sidesteps this appears right after this list).
Ideally, there would be a reward function that everyone can use and that has nice properties. Alex does not dig very deep here, but he links to a blog post by Terence Tao that interested readers may enjoy.
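As a concrete aside on the last two points, one standard way to densify a sparse reward without changing which policies are optimal is potential-based shaping (Ng et al., 1999). The sketch below is purely illustrative: the one-dimensional task, the potential function, and the discount factor are all made up for the example.

```python
# Potential-based shaping: adding F(s, s') = gamma * phi(s') - phi(s) to the
# reward densifies the signal while leaving the set of optimal policies
# unchanged (Ng et al., 1999). Everything below is a made-up 1-D example.
GAMMA = 0.99
GOAL = 10.0

def potential(state):
    """Heuristic progress estimate: negative distance to the goal."""
    return -abs(GOAL - state)

def shaped_reward(state, next_state, env_reward):
    return env_reward + GAMMA * potential(next_state) - potential(state)

# A sparse task pays 0 everywhere except at the goal; shaping still gives
# a signal whenever the agent moves relative to the goal.
print(shaped_reward(3.0, 4.0, env_reward=0.0))  # > 0: progress was made
print(shaped_reward(4.0, 3.0, env_reward=0.0))  # < 0: moved away from the goal
```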
4. Local optima, or the exploration-exploitation trade-off handled badly. One example Alex cites is a continuous-control environment in which a four-legged, horse-like robot is learning to run; the model accidentally discovers that running with its legs splayed out also yields decent reward, so from then on all you ever see is a splay-legged horse.
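To make the exploration-exploitation trade-off concrete, here is the textbook epsilon-greedy rule in a few lines. The Q-values are made up, and this is only one of many exploration schemes, not something specific to Alex's example.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a random action (explore),
    otherwise take the currently best-looking action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Made-up Q-values for three actions; too little exploration is how an agent
# gets stuck with a "splay-legged" local optimum like the one above.
q = [0.1, 0.5, 0.3]
choices = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
print(choices.count(1) / len(choices))  # mostly exploits action 1, sometimes explores
```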
5. Overfitting to the environment. DRL models rarely generalize across environments. A DQN you trained to work on one Atari game may completely fail on another. Even transfer learning offers no guarantee of success.
6. Instability.
When you read DRL papers, you will sometimes see plots showing that, for a fraction of the random seeds tried, the model's performance eventually drops to essentially zero. By contrast, in supervised learning different hyperparameters at least produce visibly different training curves; in DRL, if you are unlucky, your model's curve may show no change at all for a very long time, because the run simply is not working.
Even with hyperparameters and random seeds under control, two implementations can behave very differently as long as there is any slight difference between them. This may be why, in "Deep Reinforcement Learning that Matters", the same algorithm performs differently on the same task across two different papers by John Schulman.
Even when everything is going well, in my experience and from an earlier conversation with a DRL researcher, model performance can suddenly collapse into not working at all once training runs long enough. I am not entirely sure of the reason; it may be related to overfitting or excessive variance.
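Because of this instability, any DRL result really needs to be reported across many random seeds. Below is a minimal sketch of such a protocol; train_and_evaluate is a hypothetical stub standing in for an actual training run, with a 30% silent-failure rate hard-coded only to mirror the flavor of the problem described above.

```python
import random
import statistics

def train_and_evaluate(seed):
    """Hypothetical stand-in for a full DRL training run; returns a final score.
    The 30% silent-failure rate is made up to mimic the behavior described above."""
    rng = random.Random(seed)
    return 0.0 if rng.random() < 0.3 else rng.gauss(100.0, 10.0)

seeds = list(range(10))
scores = [train_and_evaluate(s) for s in seeds]
print(f"mean={statistics.mean(scores):.1f}  stdev={statistics.stdev(scores):.1f}")
print("seeds that never got off the ground:", [s for s, r in zip(seeds, scores) if r == 0.0])
```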
Point 6 above, in particular, is almost catastrophic. The author mentions that his internship began with implementing the Normalized Advantage Function (NAF); it took six weeks to reproduce results and track down bugs (including in Theano itself), and that was with the NAF author sitting right next to him, available to be pestered. The reason is that DRL algorithms often simply do not work until you find good hyperparameters, so it is very hard to tell whether your code has a bug or you are just unlucky.
The author then reviews DRL's success stories, which he believes are very few. They include:
Various games: Atari games, AlphaGo / AlphaZero / Dota 2 1v1 / Super Mario / shogi, and arguably DRL's earliest success story, the 1993 backgammon player (TD-Gammon).
DeepMind's parkour agent.
Energy savings in Google's data centers.
Google's AutoML.
The author believes the lesson from these cases is that DRL is more likely to perform well under the following conditions, and the more of them hold, the better:
Data acquisition is very easy and very cheap.
There is no need to tackle the hard version right away; you can start from a simplified version of the problem.
There is a way to train through self-play.
The reward function is easy to define.
The reward signal is dense and feedback is timely.
He also points out some potential future developments and possibilities:
Local optima may be good enough. Future research may show that in most cases we do not need to worry much about local optima, because they are not much worse than the global optimum.
Hardware is king. With hardware powerful enough, we may care less about sample efficiency; sheer brute force can deliver good enough performance. This is also where all kinds of evolutionary and genetic algorithms come into play.
Artificially add some supervision signals. You can introduce intrinsic rewards for self-motivation, or add auxiliary tasks when rewards from the environment are too infrequent, as DeepMind did in Reinforcement Learning with Unsupervised Auxiliary Tasks (https://arxiv.org/abs/1611.05397). Didn't LeCun complain that reinforcement learning is only the cherry on the cake? Then let's give him a few more cherries! (A toy sketch of adding an intrinsic bonus appears after this list.)
Integrate more model-based learning to improve sample efficiency. There have already been many attempts in this direction; see the work Alex cites for details. But it is still far from mature.
Use DRL only for fine-tuning. For example, the first AlphaGo was trained with supervised learning first and then refined with reinforcement learning.
Learn the reward function automatically. This involves inverse reinforcement learning and imitation learning.
Combine transfer learning with reinforcement learning more deeply.
Good priors.
Sometimes complex tasks are actually easier to learn. The example Alex mentions is that DeepMind often likes to have models learn many variants of the same environment, to reduce overfitting to any one environment. I think this also relates to curriculum learning, i.e., starting from a simple task and gradually increasing the difficulty, which you could call a layer-by-layer progressive form of transfer learning. Another possible explanation is that the tasks humans find hard are often the opposite of the tasks machines find hard. For example, people feel pouring water is trivial, yet getting a robot to learn to pour water with reinforcement learning can be very hard; conversely, people think Go is very hard, yet machine learning models have beaten humans at it.
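To illustrate the "add supervision signals" idea a few items above, here is a toy count-based novelty bonus layered on top of a sparse extrinsic reward. It is only an illustrative stand-in, not the auxiliary tasks from the linked DeepMind paper; the state names, scale, and decay rule are all made up.

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)

def intrinsic_bonus(state, scale=0.1):
    """Count-based novelty bonus: decays as a state is visited more often."""
    visit_counts[state] += 1
    return scale / math.sqrt(visit_counts[state])

def total_reward(state, extrinsic):
    """Dense training signal = sparse environment reward + intrinsic bonus."""
    return extrinsic + intrinsic_bonus(state)

print(total_reward("s0", extrinsic=0.0))  # first visit: full bonus
print(total_reward("s0", extrinsic=0.0))  # repeat visit: smaller bonus
```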
In the end, Alex is quite optimistic overall. He says that although there are many difficulties, so that DRL may not yet be robust enough a field for everyone to jump into easily, and for many problems it is still far from being as simple and well-performing as supervised learning, perhaps in a few years when you come back DRL will just work, who knows. That is still very encouraging. Tanaka has expressed similar ideas, feeling that precisely because the field is not mature enough there are many opportunities. They are all great researchers.
Overall I was very excited reading this article, but to be honest also a little regretful: if it had existed last summer, I might have thought more carefully before starting such a topic at a point where I had not yet accumulated much in the lab and was not far from graduation and applications. The lesson is that before entering a field you should build a good understanding of it. Before this there had only been scattered voices on the internet; for example, Karpathy mentioned that he too ran into plenty of difficulties implementing vanilla policy gradients:
If it makes you feel any better, I've been doing this for a while and it took me last ~6 weeks to get a from-scratch policy gradients implementation to work 50% of the time on a bunch of RL problems. And I also have a GPU cluster available to me, and a number of friends I get lunch with every day who've been in the area for the last few years.
Also, what we know about good CNN design from supervised learning land doesn't seem to apply to reinforcement learning land, because you're mostly bottlenecked by credit assignment / supervision bitrate, not by a lack of a powerful representation. Your ResNets, batchnorms, or very deep networks have no power here.
[Supervised learning] wants to work. Even if you screw something up you'll usually get something non-random back. RL must be forced to work. If you screw something up or don't tune something well enough you're exceedingly likely to get a policy that is even worse than random. And even if it's all well tuned you'll get a bad policy 30% of the time, just because.
Long story short your failure is more due to the difficulty of deep RL, and much less due to the difficulty of "designing neural networks".
Source: https://news.ycombinator.com/item?id=13519044
But I didn't pay attention at first. In fact, my advisor kept saying he felt my project was risky; in particular he felt that, apart from Berkeley, OpenAI, and DeepMind, few labs do DRL well, which in itself suggests this direction may have some intangible barriers to entry. Now I think these might include computational resources and equipment (robots), relatively senior researchers who know the tricks and the pitfalls, and so on. Objectively, the overall caliber and engineering ability of the people at those places is also outrageously strong, so competing with them head-on is very hard. Admittedly I am on the weaker side, but these are things anyone planning to enter DRL needs to consider carefully.
At the end of the day, the main point is to recommend Alex's article. Much of what he lists may already be known to many DRL researchers, but nobody had laid it out so completely and systematically before. For students who want to do DRL it is a gospel. This post is just my first-pass thoughts and summary after reading; I skipped some parts I don't know well, and some wording may be imprecise. The original is long (it took me about 1.5 hours even though I was already familiar with most of the content), but it is also very interesting, and again, strongly recommended.
Finally, the title here is a bit of clickbait; neither I nor Alex really want to talk everyone out of the field. Alex's intention is to look at the current state of DRL research more calmly and avoid stepping into the same pits over and over. The comment section mentions that it is precisely because it is hard that it is worth doing, and friends with robotics and control backgrounds commented that they feel DRL can do anything if you have the correct hyperparameters; these comments are also worth your consideration.
Should you jump into the pit of deep reinforcement learning? Read this article before you decide!