"Learn the basics of learning in simplified learning notes" 4. Reinforcement learning method without model-Monte Carlo algorithm
To recap what "model-free" means: the state transition function and the reward function are unknown.
The model-based dynamic programming methods covered earlier, namely policy iteration and value iteration, can be unified under generalized policy iteration: first perform policy evaluation (compute the value function), then improve the policy based on that value function.
The state-value function and the state-action value function are both expectations. Dynamic programming can compute these expectations directly from the model; without a model, the expectations are instead estimated by empirical averaging, which is the Monte Carlo method. Because the estimate is an empirical average, every state must be visited, so exploring starts are used, as described below:
1. Initialize the value function for all states (and all state-action pairs).
2. Randomly pick a starting state and an action in that state (the exploring start), then generate an episode with the action (behavior) policy. For every state-action pair that appears in the episode, fold the return that follows it into the corresponding state-action value function using an incremental average.
3. Improve the policy greedily with respect to the value function (the improvement/target policy).
4. Repeat steps 2 and 3. (A minimal sketch of the whole procedure follows.)
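The following is a minimal sketch of Monte Carlo control with exploring starts, assuming a tiny hand-written corridor environment; the environment, state layout, and all names are illustrative and not from the original post.

```python
import random
from collections import defaultdict

N_STATES = 5          # states 0..4; state 4 is terminal
ACTIONS = [0, 1]      # 0 = move left, 1 = move right

def step(state, action):
    """Environment dynamics; the agent treats this as an unknown black box."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    if next_state == N_STATES - 1:
        return next_state, 1.0, True      # reward only on reaching the goal
    return next_state, 0.0, False

def generate_episode(start_state, start_action, policy, max_len=100):
    """Roll out one episode from an arbitrary (state, action) pair."""
    episode, state, action = [], start_state, start_action
    for _ in range(max_len):
        next_state, reward, done = step(state, action)
        episode.append((state, action, reward))
        if done:
            break
        state, action = next_state, policy[next_state]
    return episode

def mc_exploring_starts(num_episodes=5000, gamma=0.9):
    Q = defaultdict(float)      # state-action value estimates
    counts = defaultdict(int)   # visit counts for the incremental averages
    policy = {s: random.choice(ACTIONS) for s in range(N_STATES)}

    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair can begin an episode.
        s0 = random.randrange(N_STATES - 1)
        a0 = random.choice(ACTIONS)
        episode = generate_episode(s0, a0, policy)

        # Walk the episode backwards, accumulating the return G, and fold each
        # observed return into Q with an incremental average.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            counts[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / counts[(state, action)]

        # Greedy policy improvement with respect to the current Q.
        for s in range(N_STATES - 1):
            policy[s] = max(ACTIONS, key=lambda a: Q[(s, a)])

    return Q, policy

if __name__ == "__main__":
    Q, policy = mc_exploring_starts()
    print("greedy policy:", policy)   # expect action 1 (right) in states 0..3
```

This sketch updates every occurrence of a state-action pair (every-visit Monte Carlo); restricting the update to the first occurrence in each episode (first-visit Monte Carlo) is the other common choice.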
Note the incremental average update used in step 2: $V_k(s) = V_{k-1}(s) + \frac{1}{k}\left(G_k(s) - V_{k-1}(s)\right)$
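As a quick sanity check, the short loop below (with made-up return values) confirms that this incremental update reproduces the ordinary empirical mean of the observed returns:

```python
returns = [2.0, 0.0, 4.0, 6.0]   # hypothetical returns G_1..G_4 from one state

v = 0.0
for k, g in enumerate(returns, start=1):
    v += (g - v) / k             # V_k = V_{k-1} + (G_k - V_{k-1}) / k

print(v)                          # 3.0
assert abs(v - sum(returns) / len(returns)) < 1e-12
```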
If the action (behavior) policy and the improvement (target) policy are the same policy, the method is called on-policy; otherwise it is called off-policy. In the off-policy case the two policies must satisfy a coverage condition: the behavior policy must cover the target policy, i.e. every action the target policy might take has nonzero probability under the behavior policy. Because the two policies induce different distributions (the trajectory distribution under the behavior policy differs from that under the target policy), the value function has to be updated with weighted importance sampling.
Weighted importance sampling: I haven't fully figured this out yet; see the code.
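Until then, here is a minimal sketch of the weighted importance sampling estimator on a hypothetical one-step task (the task and all names are illustrative assumptions): each return collected under the behavior policy is weighted by the ratio of target-policy to behavior-policy probabilities, and the weighted sum is normalized by the total weight rather than by the episode count.

```python
import random

def target_prob(action):
    """Target (improvement) policy: deterministic, always chooses action 1."""
    return 1.0 if action == 1 else 0.0

def behavior_prob(action):
    """Behavior (action) policy: uniform random, so it covers the target policy."""
    return 0.5

def weighted_is_estimate(num_episodes=10_000):
    numerator = 0.0      # sum over episodes of (importance weight * return)
    denominator = 0.0    # sum of importance weights (the weighted-IS normalizer)
    for _ in range(num_episodes):
        action = random.choice([0, 1])                    # act with the behavior policy
        G = 1.0 if action == 1 else 0.0                   # return of this one-step episode
        W = target_prob(action) / behavior_prob(action)   # importance ratio
        numerator += W * G
        denominator += W
    return numerator / denominator if denominator > 0 else 0.0

print(weighted_is_estimate())   # ~1.0, the start-state value under the target policy
```

Normalizing by the total weight rather than the episode count is what makes this the weighted (rather than ordinary) importance sampling estimator; it has lower variance at the cost of a small bias.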
Gym implementation of a model-free reinforcement learning method: the Monte Carlo algorithm
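Before the implementation itself, here is a rough sketch of on-policy Monte Carlo evaluation of a fixed policy on Gym's Blackjack-v1; it assumes the gym >= 0.26 API (reset() returning (obs, info), step() returning (obs, reward, terminated, truncated, info)) and is only an illustration, not this post's code.

```python
import gym
from collections import defaultdict

def simple_policy(state):
    """Fixed policy to evaluate: stick (action 0) on 20 or 21, otherwise hit (1)."""
    player_sum, _, _ = state
    return 0 if player_sum >= 20 else 1

def run_episode(env, policy):
    """Play one Blackjack hand and record (state, action, reward) triples."""
    episode = []
    state, _ = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, action, reward))
        done = terminated or truncated
        state = next_state
    return episode

def mc_evaluate(num_episodes=50_000, gamma=1.0):
    env = gym.make("Blackjack-v1")
    V = defaultdict(float)      # state-value estimates
    counts = defaultdict(int)   # visit counts for incremental averaging
    for _ in range(num_episodes):
        G = 0.0
        # Walk the episode backwards, accumulating the return, and fold it into
        # V with an incremental (every-visit) average.
        for state, _, reward in reversed(run_episode(env, simple_policy)):
            G = gamma * G + reward
            counts[state] += 1
            V[state] += (G - V[state]) / counts[state]
    return V

if __name__ == "__main__":
    V = mc_evaluate()
    print(len(V), "states visited")
```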