Contact: 860122112@qq.com
DQN (Deep Q-Network) is a pioneering work of deep reinforcement learning (DRL). It combines deep learning with reinforcement learning to achieve end-to-end learning from perception to action. It was published by DeepMind at NIPS 2013 [1], and an improved version followed in Nature 2015 [2].

I. DRL
The reason: in ordinary Q-learning, when the state and action spaces are discrete and low-dimensional, a Q-table can store the Q value of every state-action pair; but when the state and action spaces are high-dimensional and continuous, a Q-table is no longer realistic.
The usual approach is to turn the Q-table update into a function-fitting problem, so that similar states produce similar action outputs. By updating the parameter θ, the Q function is made to approximate the optimal Q value:

Q(s, a; θ) ≈ Q′(s, a)

Deep neural networks can automatically extract complex features, so they are the most suitable choice for high-dimensional, continuous states.
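The post gives no code, but as a rough illustration, Q(s, a; θ) can be parameterized by a small neural network whose output layer has one Q value per discrete action. Below is a minimal sketch in PyTorch; the framework choice, the class name QNetwork, and the layer sizes are assumptions for illustration, not from the original.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a; theta): input is a state vector,
    output is one Q value per discrete action."""
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) -> Q values: (batch, num_actions)
        return self.net(state)
```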
DRL is the combination of deep learning (DL) and reinforcement learning (RL): it learns a control policy directly from high-dimensional raw data. DQN is one DRL algorithm. It combines a convolutional neural network (CNN) with Q-learning: the CNN's input is the raw image data (as the state), and its output is the value function (Q value) of each action.

II. Problems in combining DL and RL
1. DL needs a large number of labeled samples for supervised learning, while RL only has a reward signal, which is noisy, delayed (the reward is only returned after a delay) and sparse (the reward is 0 in many states).
2. DL samples are independent, while the states in RL are correlated over time.
3. The target distribution in DL is fixed, while the distribution in RL keeps changing; for example, in a game, one level and the next have different state distributions, so a model trained on one level has to be retrained on the next.
4. Past research has shown that using a nonlinear network to represent the value function can be unstable.

III. How DQN solves these problems
1. It uses the reward to construct labels via Q-learning (addresses problem 1).
2. It solves the correlation and non-stationary distribution problems with experience replay (an experience pool) (addresses problems 2 and 3).
3. It uses one CNN (MainNet) to produce the current Q value and another CNN (TargetNet) to produce the target Q value (addresses problem 4).
1. Constructing labels
As mentioned above, the role of the CNN in DQN is to perform function fitting over a high-dimensional, continuous state space. As in ordinary supervised learning, the general approach is to define a loss function, compute its gradient, and update the parameters by stochastic gradient descent. DQN constructs its loss function from Q-learning.
Q-learning
The basics of RL will not be repeated here; just look at the Q-learning update formula:

Q(s, a) ← Q(s, a) + α ( r + γ max_a′ Q(s′, a′) − Q(s, a) )
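For concreteness, the tabular update above can be written as a few lines of Python. This is only a sketch: the function name and the default values of α and γ are illustrative, not taken from the post.

```python
import numpy as np

def q_learning_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
                      alpha: float = 0.1, gamma: float = 0.99) -> None:
    """One tabular update: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the target
```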
The loss function of DQN is

L(θ) = E[ ( TargetQ − Q(s, a; θ) )² ]
where θ are the network parameters and the target is

TargetQ = r + γ max_a′ Q(s′, a′; θ)
Clearly, the loss function is built from the second term of the Q-learning update formula; the two formulas have the same meaning: both drive the current Q value toward the target Q value.
Next, compute the gradient of L(θ) with respect to θ and update the network parameters θ with a method such as SGD.
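A minimal sketch of this loss and gradient step in PyTorch (an assumed framework) might look as follows. The names q_net and batch, the minibatch layout, and the (1 − done) factor that stops bootstrapping at terminal states are illustrative additions, not from the original text.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma: float = 0.99) -> torch.Tensor:
    """L(theta) = E[(TargetQ - Q(s, a; theta))^2],
    with TargetQ = r + gamma * max_a' Q(s', a'; theta)."""
    s, a, r, s_next, done = batch                             # minibatch tensors
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a; theta)
    with torch.no_grad():                                     # treat TargetQ as a fixed label
        target_q = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target_q)

# One SGD step on theta:
# optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
# loss = dqn_loss(q_net, batch)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```

In the Nature 2015 version described below, the network used inside torch.no_grad() is replaced by a separate TargetNet.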
2. Experience replay (experience pool)
The experience pool mainly solves the correlation and non-stationary distribution problems. At every time step, the transition sample (s_t, a_t, r_t, s_{t+1}) obtained from the agent's interaction with the environment is stored in a replay memory, and minibatches are drawn from it at random for training. (In effect, the gameplay is stored in fragments, and random sampling during training avoids the correlation problem.)
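A minimal sketch of such an experience pool in Python; the class and method names and the default capacity are assumptions for illustration.

```python
import random
from collections import deque

class ReplayMemory:
    """Experience pool: stores transitions (s_t, a_t, r_t, s_{t+1}, done)
    and returns random minibatches to break the temporal correlation."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)    # old transitions are dropped when full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int = 32):
        return random.sample(self.buffer, batch_size)   # uniform random minibatch

    def __len__(self):
        return len(self.buffer)
```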
3. Target network
This improvement was introduced in the Nature 2015 version of DQN, which uses a separate network (called TargetNet) to produce the target Q value. Concretely, Q(s, a; θ_i) denotes the output of the current network, MainNet, and is used to evaluate the value function of the current state-action pair; Q(s, a; θ⁻_i) denotes the output of TargetNet and is substituted into the TargetQ formula above to obtain the target Q value. MainNet's parameters are updated according to the loss function above, and every N iterations MainNet's parameters are copied to TargetNet.
After TargetNet is introduced, the target Q value stays unchanged for a period of time, which reduces the correlation between the current Q value and the target Q value and improves the stability of the algorithm.
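The copy-every-N-iterations rule might be sketched like this in PyTorch; the stand-in network, the function name sync_target, and the default N are assumptions, since the post does not fix them.

```python
import copy
import torch.nn as nn

# Stand-in MainNet (any Q-network works); TargetNet starts as an exact copy.
main_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(main_net)

def sync_target(main_net: nn.Module, target_net: nn.Module, step: int, N: int = 10_000) -> None:
    """Every N iterations, copy MainNet's parameters theta_i into TargetNet (theta^-_i);
    between copies the target Q value stays fixed."""
    if step % N == 0:
        target_net.load_state_dict(main_net.state_dict())
```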
IV. DQN algorithm flow
1. Network model
The input is the most recent 4 frames, preprocessed into 84×84 grayscale images. They pass through several convolutional layers (no pooling layers) followed by two fully connected layers, and the output is the Q value of every action.
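A hedged PyTorch sketch of such a network follows; the exact filter counts, kernel sizes and strides are taken from the Nature 2015 paper [2] rather than from this post, so treat them as one possible configuration.

```python
import torch
import torch.nn as nn

class DQNNet(nn.Module):
    """Input: a stack of the 4 most recent 84x84 grayscale frames.
    Convolutions only (no pooling), then two fully connected layers;
    output: one Q value per action."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 is the spatial size after the convs
            nn.Linear(512, num_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84), pixel values in [0, 255]
        return self.fc(self.conv(x / 255.0))
```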
2. Algorithm pseudocode
NIPS 2013 version
Nature 2015 version
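The pseudocode figures are not reproduced here. As a stand-in, the condensed Python sketch below ties the pieces above into one loop in the spirit of the Nature 2015 version (experience pool, MainNet/TargetNet, ε-greedy action selection). The toy environment, network sizes and hyperparameters are placeholders, not values from the paper or the post.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEnv:
    """Placeholder environment so the sketch runs; replace with Atari frames + a CNN."""
    def reset(self):
        return torch.randn(4)
    def step(self, action: int):
        return torch.randn(4), random.random(), random.random() < 0.05   # s', r, done

num_actions = 2
main_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, num_actions))
target_net = copy.deepcopy(main_net)
optimizer = torch.optim.SGD(main_net.parameters(), lr=1e-3)
memory = deque(maxlen=10_000)                     # experience pool
gamma, epsilon, batch_size, N = 0.99, 0.1, 32, 500

env, step_count = ToyEnv(), 0
for episode in range(10):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy action selection on MainNet's Q values
        if random.random() < epsilon:
            a = random.randrange(num_actions)
        else:
            with torch.no_grad():
                a = int(main_net(s).argmax())
        s_next, r, done = env.step(a)
        memory.append((s, a, r, s_next, float(done)))   # store the transition
        s, step_count = s_next, step_count + 1

        if len(memory) >= batch_size:                   # train on a random minibatch
            batch = random.sample(memory, batch_size)
            bs = torch.stack([t[0] for t in batch])
            ba = torch.tensor([t[1] for t in batch])
            br = torch.tensor([t[2] for t in batch])
            bs_next = torch.stack([t[3] for t in batch])
            bdone = torch.tensor([t[4] for t in batch])
            q_sa = main_net(bs).gather(1, ba.unsqueeze(1)).squeeze(1)     # Q(s, a; theta)
            with torch.no_grad():                                         # TargetQ from TargetNet
                target_q = br + gamma * (1 - bdone) * target_net(bs_next).max(dim=1).values
            loss = F.mse_loss(q_sa, target_q)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

        if step_count % N == 0:                         # copy MainNet -> TargetNet every N steps
            target_net.load_state_dict(main_net.state_dict())
```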
3. Algorithm flow chart (2015 version)
Main flow chart
Construction of Loss Function
V. Summary
DQN was the first algorithm to combine deep learning models with reinforcement learning and successfully learn control policies directly from high-dimensional input.
Innovations: a loss function constructed from Q-learning (not especially new; earlier work that fitted the Q-table with linear and nonlinear functions did the same); experience replay (an experience pool) to solve the correlation and non-stationary distribution problems; and TargetNet to solve the stability problem.
Advantages: the algorithm is general and can learn to play different games; training is end-to-end; and it can generate a large number of samples for the supervised learning part.
Disadvantages: it cannot be applied to continuous action control; it can only handle problems that need short-term memory, not those that need long-term memory (follow-up work proposed improvements such as adding an LSTM); and the CNN does not necessarily converge, so careful hyperparameter tuning is required.
References
[1] Playing Atari with Deep Reinforcement Learning
[2] Human-level control through deep reinforcement learning