David Silver Reinforcement Learning 4: Model-Free Prediction

Source: Internet
Author: User
I. Introduction

Both value iteration and policy iteration use DP to solve the control problem when the MDP model is known (i.e., the state transition matrix P and the reward function R). So when these properties of the model are unknown, how do we perform prediction and control?

This section focuses on several methods for policy evaluation without a model (model-free policy evaluation).

II. Monte-Carlo RL Method

Episodic MDP: every sequence of actions terminates within a finite number of steps.

The MC method samples complete episodes of experience (histories) generated under a policy π, and uses the empirical mean return in place of the expected return. This gives an estimate of the state-value function $v_\pi$ under the policy π. First-visit MC policy evaluation averages the return only from the first time a state is reached in each episode. Every-visit MC policy evaluation averages the return from every time a state is reached in each episode.
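The two variants described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `mc_policy_evaluation`, the episode format (a list of `(state, reward)` pairs, where `reward` is the reward received after leaving that state), and the discount factor `gamma` are all assumptions made for the example.

```python
from collections import defaultdict


def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Estimate v_pi as the average return observed from each state.

    episodes: list of complete episodes, each a list of (state, reward)
              pairs (hypothetical format chosen for this sketch).
    """
    returns_sum = defaultdict(float)  # total return accumulated per state
    visit_count = defaultdict(int)    # number of returns averaged per state

    for episode in episodes:
        # Scan backwards to compute the return G_t for every time step.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, reward = episode[t]
            g = reward + gamma * g
            returns[t] = g

        seen = set()
        for t, (state, _) in enumerate(episode):
            # First-visit: skip repeat visits within the same episode.
            if first_visit and state in seen:
                continue
            seen.add(state)
            returns_sum[state] += returns[t]
            visit_count[state] += 1

    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}
```

With an episode that visits the same state twice, the two variants can give different estimates for that state, because every-visit also averages in the return from the second visit.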

The mean can be computed incrementally: $\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$.
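The incremental form follows from writing the mean of the first k samples as $\mu_k = \frac{1}{k}\left(x_k + (k-1)\mu_{k-1}\right)$ and rearranging. A small sketch of the update (the function name `incremental_mean` is made up for this example):

```python
def incremental_mean(samples):
    """Running mean via mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    for k, x in enumerate(samples, start=1):
        # Move the current estimate toward the new sample by a 1/k step.
        mu += (x - mu) / k
    return mu
```

Only the current estimate and the sample count need to be stored, so the mean can be updated as each new sample arrives.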

Therefore, the MC policy evaluation update can be written as follows, where α takes the place of $\frac{1}{k}$ and can be seen as a decay factor. The method using this update is called incremental MC: $V(S_t) \leftarrow V(S_t) + \alpha\,(G_t - V(S_t))$

III. Temporal-Difference Learning

TD is also a model-free method, but it can learn from incomplete episodes.

The simplest TD method, TD(0), is described below.

Compared with incremental MC, it replaces the actual return $G_t$ in the update with an estimated return, $R_{t+1} + \gamma V(S_{t+1})$ (the TD target).
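A minimal sketch of TD(0) prediction, assuming the standard TD target $R_{t+1} + \gamma V(S_{t+1})$ with discount factor γ. The function name `td0_policy_evaluation` and the transition format `(state, reward, next_state, done)` are assumptions made for illustration only.

```python
from collections import defaultdict


def td0_policy_evaluation(transitions, alpha=0.1, gamma=1.0):
    """Update V(S_t) toward R_{t+1} + gamma * V(S_{t+1}) after every step.

    transitions: iterable of (state, reward, next_state, done) tuples
                 (hypothetical format chosen for this sketch).
    """
    V = defaultdict(float)  # unseen states start at value 0
    for state, reward, next_state, done in transitions:
        # A terminal successor contributes no future value.
        bootstrap = 0.0 if done else V[next_state]
        td_target = reward + gamma * bootstrap
        # Same shape as the incremental MC update, with the TD target
        # used in place of the full return G_t.
        V[state] += alpha * (td_target - V[state])
    return dict(V)
```

Because each update needs only a single transition, the estimate can be refined before an episode ends, which is why TD can learn from incomplete episodes.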
