First, Introduction
Both value iteration and policy iteration are dynamic programming (DP) methods for control that assume a known MDP model (i.e., the state transition matrix P and the reward R are given). So if these attributes of the model are unknown, how do we do prediction and control?
This section focuses on several methods for policy evaluation without a model (model-free policy evaluation).
Second, Monte-Carlo RL Method
Episodic MDP: every behavior sequence (episode) terminates within a finite number of steps.
The MC method samples from complete episodes of experience (histories) generated by following the policy π, and uses the empirical mean return in place of the expected return. This is used to estimate the state value function $v_\pi$ under policy π. First-visit MC policy evaluation averages the return only over the first time each episode reaches a state; every-visit MC policy evaluation averages the return over every time each episode reaches that state.
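As a minimal sketch of first-visit MC policy evaluation, assuming episodes are already collected under π and each episode is a list of (state, reward) pairs where the reward is the one received on leaving that state:

```python
from collections import defaultdict

def first_visit_mc_evaluation(episodes, gamma=1.0):
    """First-visit MC policy evaluation.

    episodes: list of episodes, each a list of (state, reward) pairs
              generated by following the policy pi being evaluated.
    Returns a dict mapping state -> estimated value v_pi(state).
    """
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for episode in episodes:
        # Backwards pass: compute the return G_t at every time step.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G

        # Only the first occurrence of each state in the episode counts.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            returns_sum[state] += returns[t]
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]

    return dict(V)
```

For every-visit MC, the `seen` check would simply be dropped so that every occurrence of a state contributes a return.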
An incremental formula for the mean: $\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$.
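For reference, this follows directly from the definition of the sample mean:

$$\mu_k = \frac{1}{k}\sum_{i=1}^{k} x_i = \frac{1}{k}\big(x_k + (k-1)\mu_{k-1}\big) = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$$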
Therefore, the MC policy evaluation update can be written incrementally as $V(S_t) \leftarrow V(S_t) + \alpha\,(G_t - V(S_t))$, where α replaces the $\frac{1}{k}$ factor and can be seen as a step size (attenuation factor). The method in this form is called incremental MC.
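A minimal sketch of the incremental (constant-α) MC update, assuming the same hypothetical episode format as above, with the value table updated in place:

```python
def incremental_mc_update(V, episode, alpha=0.1, gamma=1.0):
    """One incremental-MC pass over a single episode.

    V: dict mapping state -> current value estimate (updated in place).
    episode: list of (state, reward) pairs, reward received on leaving the state.
    alpha: constant step size replacing the 1/k averaging factor.
    """
    # Backwards pass to obtain the return G_t for each visit.
    G = 0.0
    targets = []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        targets.append((state, G))

    # Apply V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)) for each visit, in time order.
    for state, G_t in reversed(targets):
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G_t - v)
```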
Third, Temporal-Difference Learning
TD is also a model-free method, but it can learn from incomplete episodes.
The simplest TD method, TD (0), is described below.
Compared with incremental MC, it replaces the actual return $G_t$ in the formula with an estimated return $R_{t+1} + \gamma V(S_{t+1})$ (the TD target).
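A minimal sketch of tabular TD(0) evaluation, assuming a hypothetical environment interface where `reset()` returns a state and `step(action)` returns `(next_state, reward, done)`, and `policy(state)` returns the action chosen by π:

```python
def td0_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation.

    env: environment with reset() -> state and step(action) -> (state, reward, done)
         (assumed interface; adapt to your environment).
    policy: function mapping state -> action under the policy pi being evaluated.
    Returns a dict mapping state -> estimated value v_pi(state).
    """
    V = {}
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            v = V.get(state, 0.0)
            # Bootstrapped TD target R_{t+1} + gamma * V(S_{t+1});
            # terminal states are given value 0.
            v_next = 0.0 if done else V.get(next_state, 0.0)
            td_target = reward + gamma * v_next
            # Update toward the estimated return instead of the actual return G_t.
            V[state] = v + alpha * (td_target - v)
            state = next_state
    return V
```

Unlike the MC updates above, this update can be applied after every single step, which is what lets TD learn from incomplete episodes.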