Reinforcement Learning Notes 4: Model-Free Reinforcement Learning and the Monte Carlo Algorithm

First, a reminder of what model-free means: the state transition function and the reward function are unknown.
The model-based dynamic programming methods of the previous note, policy iteration and value iteration, can both be unified under generalized policy iteration: first perform policy evaluation (compute the value function), then improve the policy based on that value function.

The state-value function and the state-action value function are by nature expectations. The earlier dynamic programming methods could compute these expectations from the model; in the model-free case, the expectation must instead be estimated by empirical averaging, which is what the Monte Carlo method does. Because the estimate is an empirical average, every state must be visited, so exploring starts (exploratory initialization) is used here:
1. Initialize all states and the state-action value function.
2. Randomly select a starting state and an action in that state, then generate an episode with a policy (the behavior policy); for every state-action pair appearing in the episode, fold its subsequent return into the corresponding state-action value as an incremental average.
3. Improve the policy greedily (the target policy).
4. Repeat steps 2 and 3 (a minimal sketch follows the update formula below).

Note the incremental-average update: $V_k(s) = V_{k-1}(s) + \frac{1}{k}\left(G_k(s) - V_{k-1}(s)\right)$
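
Here is a minimal sketch of Monte Carlo control with exploring starts that applies the incremental-average update above to state-action values. The `step(s, a) -> (next_state, reward, done)` interface, the toy chain environment, and all hyperparameters are illustrative assumptions, not from the original post.

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(states, actions, step, episodes=5000,
                                gamma=1.0, max_steps=100):
    """Monte Carlo control with exploring starts.

    `step(s, a)` is assumed to return (next_state, reward, done);
    this interface is hypothetical, not from the original post.
    """
    Q = defaultdict(float)            # state-action value estimates
    N = defaultdict(int)              # visit counts for the incremental average
    policy = {s: random.choice(actions) for s in states}

    for _ in range(episodes):
        # Exploring start: a random state-action pair, so every pair gets visited.
        s, a = random.choice(states), random.choice(actions)
        episode = []
        for _ in range(max_steps):
            s2, r, done = step(s, a)
            episode.append((s, a, r))
            if done:
                break
            s, a = s2, policy[s2]

        # Walk the episode backwards, accumulating the return G.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = r + gamma * G
            # First-visit check: update only if (s, a) does not occur earlier.
            if all((x, y) != (s, a) for x, y, _ in episode[:t]):
                N[(s, a)] += 1
                # Incremental average: Q <- Q + (G - Q) / N
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
                # Greedy policy improvement at this state.
                policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return Q, policy

# Illustrative toy usage: a two-state chain where action 1 moves toward the goal.
states, actions = [0, 1], [0, 1]

def toy_step(s, a):
    if s == 1:
        return 1, 0.0, True           # state 1 ends the episode
    return (1, 1.0, False) if a == 1 else (0, 0.0, False)

Q, pi = mc_control_exploring_starts(states, actions, toy_step, gamma=0.9)
print(pi)                             # the greedy policy should pick action 1 in state 0
```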

If the behavior policy and the improvement (target) policy are the same policy, the method is called on-policy; otherwise it is off-policy. In the off-policy case the two policies must satisfy a coverage condition: the behavior policy must cover the target policy, i.e., any action the target policy can take must also be possible under the behavior policy. The two policies therefore induce different distributions (the trajectory probability distribution of the behavior policy differs from that of the target policy), so the value function needs to be updated using weighted importance sampling.

Weighted importance sampling: "not yet fully figured out, look at the code."
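
As a companion to that code, here is a minimal sketch of off-policy Monte Carlo evaluation with weighted importance sampling, in the standard incremental form; the names `target_prob` and `behavior_prob` and the episode format are assumptions, not from the original post.

```python
from collections import defaultdict

def off_policy_mc_eval(episodes, target_prob, behavior_prob, gamma=1.0):
    """Weighted importance sampling for off-policy Monte Carlo evaluation.

    `episodes` is a list of [(state, action, reward), ...] trajectories
    generated by the behavior policy; `target_prob(a, s)` and
    `behavior_prob(a, s)` give pi(a|s) and b(a|s).  These names are
    illustrative, not from the original post.
    """
    Q = defaultdict(float)   # value estimates under the target policy
    C = defaultdict(float)   # cumulative importance weights

    for episode in episodes:
        G, W = 0.0, 1.0      # return and importance-sampling ratio
        # Walk the episode backwards so G accumulates discounted rewards.
        for s, a, r in reversed(episode):
            G = r + gamma * G
            C[(s, a)] += W
            # Weighted-IS update: the step size W / C keeps estimates bounded.
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= target_prob(a, s) / behavior_prob(a, s)
            if W == 0.0:     # the target policy would never take this action
                break
    return Q
```

The weighted form normalizes by the cumulative weight C rather than the visit count, which trades a small bias for much lower variance than ordinary importance sampling.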

Gym Implementation of a Model-Free Reinforcement Learning Method: the Monte Carlo Algorithm
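
The original post's Gym code is not reproduced here, so the following is a stand-in sketch, assuming gymnasium's Blackjack-v1 environment and an epsilon-greedy behavior policy in place of exploring starts (Gym environments generally cannot be reset to an arbitrary state); the hyperparameters are illustrative.

```python
import random
from collections import defaultdict

import gymnasium as gym   # assumed install: pip install "gymnasium[toy-text]"

env = gym.make("Blackjack-v1")
n_actions = env.action_space.n
Q = defaultdict(lambda: [0.0] * n_actions)   # Q[obs][action]
N = defaultdict(lambda: [0] * n_actions)     # visit counts
epsilon, episodes = 0.1, 50_000

for _ in range(episodes):
    obs, _ = env.reset()
    trajectory, done = [], False
    while not done:
        # Epsilon-greedy behavior policy stands in for exploring starts.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = max(range(n_actions), key=lambda a: Q[obs][a])
        next_obs, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((obs, action, reward))
        obs, done = next_obs, terminated or truncated

    # Every-visit Monte Carlo update, walking the episode backwards (gamma = 1).
    G = 0.0
    for s, a, r in reversed(trajectory):
        G += r
        N[s][a] += 1
        # Incremental average, as in the update formula above.
        Q[s][a] += (G - Q[s][a]) / N[s][a]

print(f"learned values for {len(Q)} observed states")
```

Epsilon-greedy keeps every state-action pair reachable, playing the same role that exploratory initialization plays in the algorithm above.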
