1 Preface
In the previous posts of this deep reinforcement learning study series, we analyzed the DQN algorithm, a value-based method, in detail. Today we will analyze another class of deep reinforcement learning algorithms: policy-based methods built on the policy gradient. The actor-critic algorithm, which combines the policy gradient with a value-based method, is currently one of the most effective deep reinforcement learning algorithms.
For studying policy gradient methods, the following online resources are worth reading:
Andrej Karpathy's blog: Deep Reinforcement Learning: Pong from Pixels
David Silver, ICML 2016: Deep Reinforcement Learning Tutorial
John Schulman: Machine Learning Summer School
David Silver's reinforcement learning course (with video and slides): http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
In fact, Andrej Karpathy's blog already analyzes the policy gradient method in great detail; here I will synthesize the material above together with my own understanding of policy gradients.

2 Why Policy Network?
We already know that DQN is a value-based approach: it computes the value of each state-action pair and then chooses the action with the highest value. This is an indirect approach. So what would a more direct approach be?
Could we update a policy network directly?
What is a policy network? It is a neural network whose input is the state and whose output is directly the action (not the Q value),
or, alternatively, a probability distribution over the actions, π(a|s).
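As a concrete illustration, here is a minimal sketch of such a policy network in PyTorch. The state dimension, number of actions, and hidden size are arbitrary placeholders, not values from the original post:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state vector to a probability distribution over discrete actions."""

    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)  # action probabilities pi(a|s)

# Sample an action from the output distribution instead of taking an argmax.
policy = PolicyNetwork()
state = torch.randn(1, 4)                     # placeholder state
probs = policy(state)
action = torch.distributions.Categorical(probs).sample()
```

Note that the action is sampled from the output distribution rather than chosen greedily, which is exactly the probabilistic behavior discussed next.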
A note on the probabilistic output: DQN is essentially a deterministic algorithm; at most it adds randomness through ε-greedy exploration. But in many cases, several actions may be equally reasonable in a given state. For example, suppose I have 20 dollars for a meal: whether I buy egg fried rice or rice topped with potato and pork, the outcome is the same, a full stomach. Outputting probabilities is therefore more general, and since DQN cannot output action probabilities, a policy network is the better choice in such cases.

3 Policy Gradient
To update a policy network with gradient descent, we need an objective function. For a policy network, the objective function is easy to state and very direct, namely the final outcome:
the expected cumulative discounted reward, J(θ) = E[ Σ_t γ^t r_t ].
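For concreteness, the cumulative discounted reward of one episode can be computed as in the small sketch below; the reward list and γ = 0.99 are illustrative values, not taken from the original post:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards weighted by gamma^t: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

print(discounted_return([1.0, 0.0, 0.0, 1.0]))  # 1 + 0.99**3 ~= 1.97
```

The objective J(θ) is the expectation of this quantity over trajectories generated by the policy.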
So the question is how to use this objective to update the parameters. At first glance, the objective has no direct connection to the policy network: the reward is given by the environment, so how can it update the parameters? Put another way, how do we compute the gradient of the objective with respect to the parameters, ∇_θ J(θ), i.e. the policy gradient?
At first there seems to be no obvious way forward, so let us look at the problem from a different angle.

4 Given a policy network and no loss function, how do we update it? Change the probability of the actions
For now, let's set everything else aside and think purely in terms of probabilities. We have a policy network: its input is the state, and its output is the probability of each action. After an action is executed, we obtain a reward, or eventually a final result. This suggests a very simple idea: if an action leads to more reward, increase its probability; if it leads to less reward, decrease its probability.
Of course, it is also clear that judging an action's quality by the immediate reward is inaccurate, and even judging it by the final result is inaccurate; after all, any reward or result is caused by a long sequence of actions. But that does not stop us from taking this approach: if we can construct a good evaluation metric for actions, we can optimize the policy by changing the probability of each action.
Suppose this evaluation metric is f(s,a), and that the policy network outputs the probability π(a|s). In practice, the log likelihood log π(a|s) is used more often than the likelihood itself; for the reasoning, see the discussion of why we consider the log likelihood instead of the likelihood in a Gaussian distribution.
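One common practical reason, shown in the toy sketch below (not from the original post), is numerical stability: a product of many probabilities underflows to zero, while the sum of their logs stays well behaved.

```python
import math

probs = [0.1] * 400          # 400 probabilities of 0.1 each

likelihood = 1.0
for p in probs:
    likelihood *= p          # underflows to 0.0 in double precision

log_likelihood = sum(math.log(p) for p in probs)  # sum of logs stays finite

print(likelihood)            # 0.0
print(log_likelihood)        # about -921.03
```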
Therefore, we can construct a loss function as follows: L(θ) = − f(s,a) · log π(a|s).
How should we understand this? Take AlphaGo as a simple example. For AlphaGo, f(s,a) is the final result: if the game is won, every move in that game is considered good; if it is lost, every move is considered bad. A good move gets f(s,a) = 1, a bad move gets f(s,a) = −1. So if action a is judged good, minimizing the loss maximizes the probability of that good action, and vice versa.
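Putting this together, here is a minimal sketch of the resulting update, using the PolicyNetwork class from the earlier sketch. The trajectory data and the ±1 scores are illustrative placeholders in the spirit of the AlphaGo example, not the author's actual setup:

```python
import torch

policy = PolicyNetwork()                      # from the earlier sketch
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One batch of (state, action actually taken, evaluation f(s, a)) tuples.
states = torch.randn(8, 4)                    # placeholder states
actions = torch.randint(0, 2, (8,))           # actions that were taken
f_sa = torch.tensor([1.0, 1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0])  # +1 win, -1 loss

probs = policy(states)
log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))

# Loss = -f(s,a) * log pi(a|s): good actions are pushed up, bad ones down.
loss = -(f_sa * log_probs).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```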
This is the most basic idea behind the policy gradient.

5 Another angle: computing the gradient directly
f(s,a) can serve both as the evaluation metric of an action and as the objective function itself. In AlphaGo, for example, the evaluation metric is whether the game is won, and the goal is precisely to win, so there is no conflict with the objective analyzed earlier. We can therefore optimize the policy using the evaluation metric f(s,a) and, in doing so, optimize f(s,a) at the same time. The problem then becomes computing the gradient of the expectation of f(s,a) with respect to the parameters. The following derivation is excerpted directly from Andrej Karpathy's blog, where his f(x) corresponds to our f(s,a):
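The derivation being referred to, reconstructed here in the notation of Karpathy's post (x is the sampled action or trajectory and p(x) its probability under the policy with parameters θ):

```latex
\nabla_\theta \, \mathbb{E}_{x \sim p(x\mid\theta)}\big[f(x)\big]
  = \nabla_\theta \sum_x p(x)\, f(x)
  = \sum_x f(x)\, \nabla_\theta p(x)
  = \sum_x p(x)\, f(x)\, \frac{\nabla_\theta p(x)}{p(x)}
  = \sum_x p(x)\, f(x)\, \nabla_\theta \log p(x)
  = \mathbb{E}_x\big[f(x)\, \nabla_\theta \log p(x)\big]
```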
From the final line of this derivation, we can see that the resulting update is consistent with the loss function constructed in the previous section.
That is how the policy gradient method is obtained.

6 Summary
This post serves as a primer on the basic idea of the policy gradient. As you can see, determining the evaluation metric is the key to making a policy gradient method work. In the next article, we will analyze the problem of choosing this evaluation metric.