1 Preface
In the previous article, we introduced two basic algorithms based on the Bellman equation: policy iteration and value iteration. These two algorithms are hard to apply directly, because they are still idealized: you need to know the state transition probabilities, and you need to traverse all the states. For the traversal requirement, we can of course give up on a full traversal and simply reach as many states as possible through exploration. But the state transition probabilities depend on the model, and obtaining the model is the hard part.
What is a state transition? Take a bullet as an example: if I knew its velocity, its current position, the air resistance, and so on, I could use Newton's laws of motion to describe its flight and predict where the bullet would be at the next moment. That description of its motion based on Newton's laws is a model, and with it we know the probabilities with which its state (position, velocity) changes.
So a reinforcement learning problem basically requires some prior knowledge about the model, at least enough to judge whether the available input can possibly produce the desired output. For example, when playing an Atari game, if the input is only half of the screen, then we know that no algorithm, however good, can be trained successfully, because the input is too limited; even a human could not play under that restriction. At the same time, humans do not need to know the exact model; humans can infer the outcome purely from observations.
Therefore, for reinforcement learning problems, or for decision-making and control problems in general, the inputs and outputs are determined by the underlying model or by prior knowledge, but the specific model itself need not be considered. To solve reinforcement learning problems better, we are thus more interested in model-free approaches. Put simply: if we do not know the state transition probabilities at all (just like humans), how do we obtain the optimal policy?
This article describes the Monte Carlo method.
2 Monte Carlo method
The Monte Carlo method only applies to episodic problems, i.e. problems that proceed in stages and terminate. For example, playing a round of a game or a game of chess: after some number of moves, it ends. Some problems do not necessarily end; driving a car, for instance, can go on indefinitely, or takes extremely long to finish. Being able to end is the key. As long as the episode can end, the reward of every step can be determined, which means the value can be computed. In chess, for example, the final win is a win and the final loss is a loss. For problems that cannot end, we can only estimate the value.
So the Monte Carlo approach only concerns itself with problems that can end reasonably quickly. The idea of Monte Carlo is simple: repeat the experiment many times and take the average. If you know the classic example of estimating pi by randomly dropping points onto a square, this is easy to understand; readers unfamiliar with it can look it up online. So how do we use this idea in reinforcement learning?
Since every episode eventually comes to an end, two things follow:
The reward of every step is known, which means the return G_t of every step can be calculated (a small sketch of this computation follows below). That is good news. We then run trials over and over, so that many states are visited, each more than once, and the returns observed in each state can be summed and averaged.
When the number of episodes tends to infinity, the resulting estimates approach the true values.
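Because the episode ends, the return of every step can be rolled up backwards from the final reward. Here is a minimal Python sketch, assuming `rewards[t]` is the reward received after step t and `gamma` is the discount factor (these names are assumptions for illustration, not from the original post):

```python
def compute_returns(rewards, gamma=1.0):
    # Work backwards from the last step: G_t = r_{t+1} + gamma * G_{t+1}
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: a chess-like episode where only the final outcome is rewarded.
# compute_returns([0, 0, 0, 1], gamma=1.0) -> [1.0, 1.0, 1.0, 1.0]
```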
In other words, the Monte Carlo method replaces the Bellman-equation computation with a statistical estimate.
The algorithm above is called first-visit MC: within each episode, only the first time step at which a state is reached is used to compute that state's return.
The other method is every-visit MC: every visit to the state within an episode contributes a return to the average.
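As a rough illustration of the difference, here is a small sketch of both variants in Python, reusing the `compute_returns` helper sketched above; the episode format (a list of (state, reward) pairs) is an assumption for the example, not the post's own notation:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    # Monte Carlo evaluation of V(s) for a fixed policy, by averaging returns.
    value_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        states = [s for s, _ in episode]
        rewards = [r for _, r in episode]
        returns = compute_returns(rewards, gamma)
        seen = set()
        for t, s in enumerate(states):
            if first_visit and s in seen:
                continue              # first-visit: only the first occurrence counts
            seen.add(s)
            value_sum[s] += returns[t]   # every-visit: every occurrence counts
            visit_count[s] += 1
    return {s: value_sum[s] / visit_count[s] for s in value_sum}
```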
So you can see that the Monte Carlo method is extremely simple. But the drawbacks are also obvious: it requires as many trials as possible, and each trial must run to the end before its returns can be computed, which costs a lot of time. Did you know, though, that AlphaGo also uses the Monte Carlo idea? This is not referring to Monte Carlo tree search, but to the use of Monte Carlo methods in reinforcement learning: AlphaGo plays every game to the end and uses only the final win or loss as the return. It is quite remarkable that, using nothing but the final outcome, it can still optimize every intermediate move.
3 Using the Monte Carlo method for control
The Monte Carlo method described above can only evaluate the current policy. Remember the policy iteration method from the previous post? We can use the Monte Carlo method for the evaluation step of policy iteration, and then perform a greedy policy update.
There are two ways to do this. One is to run many trials under a single policy, evaluate it completely, then update the policy, and then run many trials again. The other uses incomplete evaluation: after every single trial we evaluate, and right after that evaluation we update the policy:
The first approach:
The second approach:
Both approaches converge, but the second one is obviously faster.
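As a rough sketch of the second, episode-by-episode approach: Q values are updated from each finished episode, and the ε-greedy policy immediately reflects them. The `env.reset()` / `env.step(action)` interface and the `actions` list are assumptions for the example, not part of the original post:

```python
import random
from collections import defaultdict

def mc_control(env, actions, num_episodes=10000, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)            # Q[(state, action)] estimates
    N = defaultdict(int)              # visit counts for incremental averaging

    def policy(state):
        if random.random() < epsilon:                       # explore
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])    # exploit

    for _ in range(num_episodes):
        # Generate one episode with the current epsilon-greedy policy.
        trajectory, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        # Evaluate: incremental average of returns; the policy above then
        # automatically acts greedily on the new Q for the next episode.
        g = 0.0
        for state, action, reward in reversed(trajectory):
            g = reward + gamma * g
            N[(state, action)] += 1
            Q[(state, action)] += (g - Q[(state, action)]) / N[(state, action)]
    return Q
```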
A further improvement is to decay the ε of the ε-greedy policy, letting this constant shrink towards 0, so that the final policy becomes the fully greedy, optimal policy.
This algorithm is called GLIE Monte-Carlo Control (Greedy in the Limit with Infinite Exploration):
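A minimal sketch of the GLIE ingredient, assuming the common 1/k schedule: ε decays with the episode index k, so exploration never stops but vanishes in the limit.

```python
def glie_epsilon(k):
    # Assumed GLIE schedule: epsilon_k -> 0 as the episode index k -> infinity,
    # yet epsilon_k > 0 for every finite k, so all state-action pairs keep being explored.
    return 1.0 / k

# Plugged into the control loop sketched above: set epsilon = glie_epsilon(k)
# before generating the k-th episode, instead of keeping epsilon fixed.
```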
Other variants:
Monte Carlo with exploring starts: use Q(s,a), follow the second approach described above so that the policy is updated once per episode, and let the policy act greedily on the Q values directly (a minimal sketch follows after this list).
Alternatively, the policy update uses ε-greedy, with the goal of exploring the whole state space more thoroughly.
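Here is a hedged sketch of the exploring-starts variant: every episode begins from a randomly chosen state-action pair and then follows the greedy policy with respect to the current Q. The `env.reset_to(state)` / `env.step(action)` interface and the `states` / `actions` lists are hypothetical, introduced only for illustration:

```python
import random
from collections import defaultdict

def mc_exploring_starts(env, states, actions, num_episodes=10000, gamma=1.0):
    Q = defaultdict(float)
    N = defaultdict(int)
    for _ in range(num_episodes):
        # Exploring start: a random (s, a) pair guarantees every pair gets visited.
        state, action = random.choice(states), random.choice(actions)
        env.reset_to(state)
        trajectory, done = [], False
        while not done:
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
            action = max(actions, key=lambda a: Q[(state, a)])  # greedy afterwards
        # Update Q by incrementally averaging the observed returns.
        g = 0.0
        for state, action, reward in reversed(trajectory):
            g = reward + gamma * g
            N[(state, action)] += 1
            Q[(state, action)] += (g - Q[(state, action)]) / N[(state, action)]
    return Q
```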
4 Off Policy Learning
All of the methods above stay on the current policy: in order to explore the state space, they rely on a suboptimal ε-greedy policy. Can we use two policies more directly? One policy is used to explore the space, called the behaviour policy; the other is the policy we want to make optimal, called the target policy. This approach is called off-policy learning. On-policy methods are relatively simple; off-policy methods require more concepts and notation, are harder to understand, and, because the behaviour policy and the target policy are decoupled, they converge less easily. But off-policy learning is more powerful and more general; in fact, on-policy methods are a subset of off-policy methods. For example, with off-policy learning you can train a reinforcement learning model from a human expert or from a traditional control algorithm.
The key is to find the weighting relationship between the two policies and use it to update the Q values.
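As a minimal sketch of that weighting (importance sampling), under the assumption that both policies are represented as per-state dictionaries of action probabilities; none of these names come from the original post:

```python
def importance_weight(trajectory, target_pi, behaviour_mu):
    # trajectory: list of (state, action) pairs actually taken under behaviour_mu.
    # The weight is the product over time of pi(a_t|s_t) / mu(a_t|s_t).
    w = 1.0
    for state, action in trajectory:
        w *= target_pi[state][action] / behaviour_mu[state][action]
    return w

# An importance-sampling Monte Carlo estimate of the target policy's value then
# averages importance_weight(trajectory) * G over episodes generated by behaviour_mu.
```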
The details of off-policy learning will be analyzed later, together with the TD method.
Summary
This post analyzed the Monte Carlo method. This statistics-based approach is simple, but it is mostly limited to virtual environments where unlimited trials are possible, and to problems whose state space is relatively small, preferably discrete. For a simple game such as Gobang (five-in-a-row, ideally on a small board), this method can be used to learn to play.
The next post will analyze TD methods.
Note:
The figures in this article are captured from:
1 Reinforcement Learning: An Introduction
2 Reinforcement Learning Course by David Silver