"Learn the basics of learning in simplified learning notes" 4. Reinforcement learning method without model-Monte Carlo algorithm
To recap what "model-free" means: the state transition function and the reward function are unknown.
The model-based dynamic programming methods covered earlier, namely policy iteration and value iteration, can be unified under generalized policy iteration: first perform policy evaluation (compute the value function), then improve the policy based on that value function.
The state-value function and the state-action value function are both expectations. Dynamic programming can compute these expectations directly from the model; without a model, the expectations are instead estimated by empirical averaging, which is the Monte Carlo method. Because the estimate is an empirical average, every state must be visited, so exploring starts are used, as described below:
1. Initialize the value function for all states (and all state-action pairs).
2. Randomly pick a starting state and an action in that state (the exploring start), then generate an episode with the action (behavior) policy. For every state-action pair that appears in the episode, fold the return that follows it into the corresponding state-action value function using an incremental average.
3. Improve the policy greedily with respect to the value function (the improvement/target policy).
4. Repeat steps 2 and 3. (A minimal sketch of the whole procedure follows.)
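The following is a minimal sketch of Monte Carlo control with exploring starts, assuming a tiny hand-written corridor environment; the environment, state layout, and all names are illustrative and not from the original post.

```python
import random
from collections import defaultdict

N_STATES = 5          # states 0..4; state 4 is terminal
ACTIONS = [0, 1]      # 0 = move left, 1 = move right

def step(state, action):
    """Environment dynamics; the agent treats this as an unknown black box."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    if next_state == N_STATES - 1:
        return next_state, 1.0, True      # reward only on reaching the goal
    return next_state, 0.0, False

def generate_episode(start_state, start_action, policy, max_len=100):
    """Roll out one episode from an arbitrary (state, action) pair."""
    episode, state, action = [], start_state, start_action
    for _ in range(max_len):
        next_state, reward, done = step(state, action)
        episode.append((state, action, reward))
        if done:
            break
        state, action = next_state, policy[next_state]
    return episode

def mc_exploring_starts(num_episodes=5000, gamma=0.9):
    Q = defaultdict(float)      # state-action value estimates
    counts = defaultdict(int)   # visit counts for the incremental averages
    policy = {s: random.choice(ACTIONS) for s in range(N_STATES)}

    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair can begin an episode.
        s0 = random.randrange(N_STATES - 1)
        a0 = random.choice(ACTIONS)
        episode = generate_episode(s0, a0, policy)

        # Walk the episode backwards, accumulating the return G, and fold each
        # observed return into Q with an incremental average.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            counts[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / counts[(state, action)]

        # Greedy policy improvement with respect to the current Q.
        for s in range(N_STATES - 1):
            policy[s] = max(ACTIONS, key=lambda a: Q[(s, a)])

    return Q, policy

if __name__ == "__main__":
    Q, policy = mc_exploring_starts()
    print("greedy policy:", policy)   # expect action 1 (right) in states 0..3
```

This sketch updates every occurrence of a state-action pair (every-visit Monte Carlo); restricting the update to the first occurrence in each episode (first-visit Monte Carlo) is the other common choice.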
Note the incremental average update used in step 2: $V_k(s) = V_{k-1}(s) + \frac{1}{k}\left(G_k(s) - V_{k-1}(s)\right)$
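As a quick sanity check, the short loop below (with made-up return values) confirms that this incremental update reproduces the ordinary empirical mean of the observed returns:

```python
returns = [2.0, 0.0, 4.0, 6.0]   # hypothetical returns G_1..G_4 from one state

v = 0.0
for k, g in enumerate(returns, start=1):
    v += (g - v) / k             # V_k = V_{k-1} + (G_k - V_{k-1}) / k

print(v)                          # 3.0
assert abs(v - sum(returns) / len(returns)) < 1e-12
```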
If the action (behavior) policy and the improvement (target) policy are the same policy, the method is called on-policy; otherwise it is called off-policy. In the off-policy case the two policies must satisfy a coverage condition: the behavior policy must cover the target policy, i.e. every action the target policy might take has nonzero probability under the behavior policy. Because the two policies induce different distributions (the trajectory distribution under the behavior policy differs from that under the target policy), the value function has to be updated with weighted importance sampling.
Weighted importance sampling: I haven't fully figured this out yet; see the code.
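Until then, here is a minimal sketch of the weighted importance sampling estimator on a hypothetical one-step task (the task and all names are illustrative assumptions): each return collected under the behavior policy is weighted by the ratio of target-policy to behavior-policy probabilities, and the weighted sum is normalized by the total weight rather than by the episode count.

```python
import random

def target_prob(action):
    """Target (improvement) policy: deterministic, always chooses action 1."""
    return 1.0 if action == 1 else 0.0

def behavior_prob(action):
    """Behavior (action) policy: uniform random, so it covers the target policy."""
    return 0.5

def weighted_is_estimate(num_episodes=10_000):
    numerator = 0.0      # sum over episodes of (importance weight * return)
    denominator = 0.0    # sum of importance weights (the weighted-IS normalizer)
    for _ in range(num_episodes):
        action = random.choice([0, 1])                    # act with the behavior policy
        G = 1.0 if action == 1 else 0.0                   # return of this one-step episode
        W = target_prob(action) / behavior_prob(action)   # importance ratio
        numerator += W * G
        denominator += W
    return numerator / denominator if denominator > 0 else 0.0

print(weighted_is_estimate())   # ~1.0, the start-state value under the target policy
```

Normalizing by the total weight rather than the episode count is what makes this the weighted (rather than ordinary) importance sampling estimator; it has lower variance at the cost of a small bias.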
Gym implementation of a model-free reinforcement learning method: the Monte Carlo algorithm
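Before the implementation itself, here is a rough sketch of on-policy Monte Carlo evaluation of a fixed policy on Gym's Blackjack-v1; it assumes the gym >= 0.26 API (reset() returning (obs, info), step() returning (obs, reward, terminated, truncated, info)) and is only an illustration, not this post's code.

```python
import gym
from collections import defaultdict

def simple_policy(state):
    """Fixed policy to evaluate: stick (action 0) on 20 or 21, otherwise hit (1)."""
    player_sum, _, _ = state
    return 0 if player_sum >= 20 else 1

def run_episode(env, policy):
    """Play one Blackjack hand and record (state, action, reward) triples."""
    episode = []
    state, _ = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, action, reward))
        done = terminated or truncated
        state = next_state
    return episode

def mc_evaluate(num_episodes=50_000, gamma=1.0):
    env = gym.make("Blackjack-v1")
    V = defaultdict(float)      # state-value estimates
    counts = defaultdict(int)   # visit counts for incremental averaging
    for _ in range(num_episodes):
        G = 0.0
        # Walk the episode backwards, accumulating the return, and fold it into
        # V with an incremental (every-visit) average.
        for state, _, reward in reversed(run_episode(env, simple_policy)):
            G = gamma * G + reward
            counts[state] += 1
            V[state] += (G - V[state]) / counts[state]
    return V

if __name__ == "__main__":
    V = mc_evaluate()
    print(len(V), "states visited")
```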