Deep Reinforcement Learning with Double Q-learning
Google DeepMind
Abstract
The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether such overestimations are common in practice, whether they harm performance, and whether they can generally be prevented. This article answers these questions: in particular, it shows that the recent DQN algorithm does suffer from substantial overestimations when playing Atari 2600 games. It then shows that the idea behind the Double Q-learning algorithm can reduce the observed overestimations, and that the resulting algorithm achieves much better results on several games.
Introduction
The goal of reinforcement learning is to learn a good policy for sequential decision problems by optimizing a cumulative future reward signal. Q-learning is one of the most famous RL algorithms, but because it includes a maximization step over estimated action values, it tends to prefer overestimated values, which can cause it to learn unrealistically high action values.
In previous work, overestimations have been attributed to insufficiently flexible function approximation and to noise. This article unifies these viewpoints and shows that overestimations can occur whenever the action values are inaccurate, regardless of the source of the estimation error. Of course, some inaccuracy in value estimates is normal during learning, which suggests that overestimations may be much more common than previously appreciated.
Even if overestimations do occur, it remains an open question whether they affect performance in practice. Overoptimistic value estimates are not necessarily a problem in themselves: if all values were uniformly higher, the relative action preferences would be preserved, and we would not expect the resulting policy to be any worse. In addition, optimism can sometimes be a good thing: optimism in the face of uncertainty is a well-known exploration technique. However, if the overestimations are not uniform and not concentrated on the states we care about, they may negatively affect the quality of the resulting policy. Thrun and Schwartz give specific examples in which this leads to suboptimal policies.
To test whether overestimations actually occur in practice, the authors investigate the performance of the recent DQN algorithm. For details of DQN, see the related articles; I will not repeat them here. Perhaps surprisingly, even in this favorable setting DQN sometimes substantially overestimates the values of actions.
The authors show that the idea behind the Double Q-learning algorithm can be combined with arbitrary function approximation, including neural networks, and use it to construct a new algorithm called Double DQN. This algorithm not only produces more accurate value estimates, but also achieves much higher scores on several games. This shows that overestimations in DQN are real, and that reducing or eliminating them is beneficial.
Background
To solve sequential decision problems, we learn an estimate of the optimal value of each action, defined as the expected sum of future rewards obtained when taking that action and following the optimal policy thereafter. Given a policy $\pi$, the true value of an action $a$ in state $s$ is:
$Q_{\pi}(s, a) = \mathbb{E}\left[ R_1 + \gamma R_2 + \ldots \mid S_0 = s, A_0 = a, \pi \right]$.
The optimal value is then $Q_*(s, a) = \max_{\pi} Q_{\pi}(s, a)$. An optimal policy is easily derived from the optimal values by selecting the highest-valued action in each state.
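As a small illustration of that last statement (my notation, not from the paper), the greedy policy derived from the optimal values can be written as:

$\pi_*(s) = \operatorname{argmax}_{a} Q_*(s, a)$.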
The Q-learning algorithm can be used to learn estimates of the optimal action values. Most interesting problems are too large to learn all action values in all states separately. Instead, we learn a parameterized value function $Q(s, a; \theta_t)$. After taking action $A_t$ in state $S_t$ and observing the immediate reward $R_{t+1}$ and the resulting state $S_{t+1}$, the standard Q-learning update of the parameters is:

$\theta_{t+1} = \theta_t + \alpha \left( Y_t^Q - Q(S_t, A_t; \theta_t) \right) \nabla_{\theta_t} Q(S_t, A_t; \theta_t)$,
where $\alpha$ is a scalar step size and the target $Y_t^Q$ is defined as:

$Y_t^Q = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta_t)$.
This update resembles stochastic gradient descent, updating the current value $Q(S_t, A_t; \theta_t)$ toward the target value $Y_t^Q$.
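As a minimal sketch of this update (my own code, not the paper's), assuming hypothetical helpers `q_values(theta, s)` returning the vector $Q(s, \cdot; \theta)$ and `grad_q(theta, s, a)` returning $\nabla_{\theta} Q(s, a; \theta)$:

```python
import numpy as np

def q_learning_step(theta, s_t, a_t, r_tp1, s_tp1, q_values, grad_q,
                    alpha=0.01, gamma=0.99):
    """One semi-gradient Q-learning update of the parameters theta.

    q_values(theta, s) -> array of action values Q(s, ., theta)
    grad_q(theta, s, a) -> gradient of Q(s, a; theta) w.r.t. theta
    Both helpers are assumptions supplied by the caller.
    """
    # Target: Y_t^Q = R_{t+1} + gamma * max_a Q(S_{t+1}, a; theta_t)
    y_t = r_tp1 + gamma * np.max(q_values(theta, s_tp1))
    # TD error between the target and the current estimate
    td_error = y_t - q_values(theta, s_t)[a_t]
    # Step toward the target along the gradient of Q(S_t, A_t; theta)
    return theta + alpha * td_error * grad_q(theta, s_t, a_t)
```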
Deep Q-networks
A DQN is a multi-layer neural network that, for a given state $s$, outputs a vector of action values $Q(s, \cdot; \theta)$, where $\theta$ are the parameters of the network. For an $n$-dimensional state space and an action space containing $m$ actions, the neural network is a function from $\mathbb{R}^n$ to $\mathbb{R}^m$. Two important ingredients of DQN are the use of a target network and the use of experience replay. The target network, with parameters $\theta^-$, is the same as the online network except that its parameters are copied from the online network every fixed number of steps and kept fixed otherwise. The target used by DQN is:

$Y_t^{DQN} = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta_t^-)$.
For experience replay, observed transitions are stored for some time and sampled uniformly from this memory to update the network. Both the target network and experience replay dramatically improve the final performance of the algorithm.
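Here is a minimal sketch of how these two ingredients fit together (my own code, not the DeepMind implementation), assuming `target_net` is a callable mapping a batch of states to a batch of action values and `replay_buffer` is a list of (s, a, r, s_next, done) tuples:

```python
import random
import numpy as np

def sample_minibatch(replay_buffer, batch_size=32):
    """Uniformly sample stored transitions from the replay memory."""
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s_next, done = map(np.array, zip(*batch))
    return s, a, r, s_next, done

def dqn_targets(target_net, r, s_next, done, gamma=0.99):
    """Y^DQN = r + gamma * max_a Q(s_next, a; theta^-), computed with the target network."""
    q_next = target_net(s_next)          # shape: (batch_size, num_actions)
    max_q_next = q_next.max(axis=1)
    # No bootstrapping past terminal states
    return r + gamma * (1.0 - done) * max_q_next
```

Periodically copying the online weights into `target_net` keeps these targets stable between copies.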
Double Q-learning
The max operator in standard Q-learning and in DQN uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation; this is the idea behind Double Q-learning.
In the original Double Q-learning algorithm, two value functions are learned by randomly assigning each experience to update one of them, so that there are two sets of weights, $\theta$ and $\theta'$. For each update, one set of weights is used to determine the greedy policy and the other to determine its value. For a clear comparison, we can first untangle the selection and the evaluation in Q-learning and rewrite its target $Y_t^Q$ from above as:

$Y_t^Q = R_{t+1} + \gamma Q(S_{t+1}, \operatorname{argmax}_{a} Q(S_{t+1}, a; \theta_t); \theta_t)$.
The Double Q-learning target can then be written as:

$Y_t^{DoubleQ} = R_{t+1} + \gamma Q(S_{t+1}, \operatorname{argmax}_{a} Q(S_{t+1}, a; \theta_t); \theta'_t)$.
Notice that the selection of the action, in the argmax, is still due to the online weights $\theta_t$. This means that, as in Q-learning, we still estimate the value of the greedy policy according to the current values, as defined by $\theta_t$. However, we use the second set of weights $\theta'_t$ to fairly evaluate the value of this policy. The second set of weights can be updated symmetrically by switching the roles of $\theta$ and $\theta'$.
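To make the decoupling concrete, here is a small sketch (my own code), where `online_net` corresponds to $\theta_t$ and `second_net` to $\theta'_t$; in the paper's Double DQN the target network is assumed to play this second role:

```python
import numpy as np

def double_q_targets(online_net, second_net, r, s_next, done, gamma=0.99):
    """Y^DoubleQ = r + gamma * Q(s_next, argmax_a Q(s_next, a; theta); theta')."""
    # Selection: greedy actions according to the online weights theta_t
    best_actions = online_net(s_next).argmax(axis=1)
    # Evaluation: value of those actions according to the second weights theta'_t
    q_second = second_net(s_next)
    chosen = q_second[np.arange(len(best_actions)), best_actions]
    return r + gamma * (1.0 - done) * chosen
```

Compared with `dqn_targets` above, only the action selection changes: it now comes from the online network rather than from the same network that evaluates the value.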
Overoptimism due to estimation errors