First of all, recall the previous content:
Bellman Expectation equation:
$v_\pi(s) = \sum_{a \in A} \pi(a|s)\Big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_\pi(s')\Big)$
Bellman optimality equation:
$v_*(s) = \max_{a \in A} \Big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s')\Big)$
Why DP can solve MDP problems: dynamic programming requires optimal substructure and overlapping subproblems, and MDPs satisfy both. The Bellman equation gives a recursive decomposition, and the value function stores and reuses the solutions to the subproblems.
Policy Evaluation:
1) The problem to solve:
Evaluate the value vπ(s) of a given policy π in each state. This is also called the prediction problem. Because the policy π is given, it is a typical task that DP can solve.
2) Solution:
Iteratively apply the Bellman expectation backup; vπ is obtained in the limit of this iteration.
3) An example (the numbers in the table can be verified by hand once the principle is clear):
For example, the value -1.7 above is calculated as follows: 0.25*[(-1) + (-1)]*3 + 0.25*[(-1) + 0] = -1.75.
For example, the value -2.4 below is calculated as follows: 0.25*[(-1) + (-2)]*2 + 0.25*[(-1) + 0] + 0.25*[(-1) + (-1.7)] = -2.425.
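To make the backup concrete, here is a minimal sketch (not from the original notes) of iterative policy evaluation on the small gridworld under the uniform random policy; the assumed setup is the one from the lecture (4x4 grid, states 0 and 15 terminal, reward -1 per step, undiscounted), and the intermediate sweeps reproduce the table values above up to rounding (-1.75 at k=2, about -2.4 at k=3).

```python
import numpy as np

# Assumed setup: 4x4 gridworld, states 0 and 15 terminal, reward -1 per step,
# undiscounted (gamma = 1), uniform random policy over up/down/left/right.
N = 4
TERMINAL = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    return nr * N + nc if 0 <= nr < N and 0 <= nc < N else s

def iterative_policy_evaluation(sweeps=1000, gamma=1.0, theta=1e-6):
    v = np.zeros(N * N)
    for k in range(1, sweeps + 1):
        v_new = np.zeros_like(v)  # synchronous backup: old values on the right-hand side
        for s in range(N * N):
            if s in TERMINAL:
                continue
            # Bellman expectation backup under the random policy (probability 1/4 per action)
            v_new[s] = sum(0.25 * (-1.0 + gamma * v[step(s, a)]) for a in ACTIONS)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new, k
        v = v_new
    return v, sweeps

v, k = iterative_policy_evaluation()
print(k, np.round(v.reshape(N, N), 1))  # converges to the 0/-14/-20/-22 table from the lecture
```

Stopping after two sweeps instead of running to convergence reproduces the -1.75 value computed by hand above.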
Policy Iteration
1) The problem to solve:
No specific policy π is given; only the MDP is given. Find the optimal policy π* for the MDP and the optimal value vπ*(s) that π* achieves in each state.
As can be seen from the above description, policy evaluation is a subproblem of policy iteration.
2) Solution:
Start from the simplest policy (such as the random policy above), then alternate policy evaluation (which itself takes many sweeps) and policy improvement (for example, acting greedily, as in the second column above).
Then iterate this process: policy evaluation, policy improvement, policy evaluation, policy improvement, and so on.
(Policy evaluation needs many sweeps to stabilize; run on its own, it would have to wait for vπ to converge. Here, however, a few steps may be enough: in the second column above, evaluating up to k=3 and then doing policy improvement already yields the optimal policy π*. In practice we usually do not know how many evaluation sweeps are needed, so a common scheme is to run, say, 5 or 10 sweeps of evaluation, then do policy improvement, run another 5 or 10 sweeps on the improved policy, and proceed to the next improvement.)
3) Examples:
The second column above.
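Putting the two phases together, here is a minimal sketch of policy iteration for a generic tabular MDP; the array format P[s, a, s'] / R[s, a] and the small two-state MDP at the end are assumptions for illustration, and eval_sweeps mirrors the "5 or 10 sweeps of evaluation" heuristic mentioned above.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_sweeps=10, max_iters=1000):
    """P[s, a, s2]: transition probabilities, R[s, a]: expected immediate rewards."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)   # start from an arbitrary deterministic policy
    v = np.zeros(n_states)
    for _ in range(max_iters):
        # Policy evaluation: a fixed number of Bellman expectation sweeps
        # (it need not be run to full convergence, as noted above).
        for _ in range(eval_sweeps):
            v = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ v
                          for s in range(n_states)])
        # Policy improvement: act greedily with respect to the current value function.
        q = R + gamma * P @ v                 # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * v[s']
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # policy stable => optimal
            return policy, v
        policy = new_policy
    return policy, v

# Tiny usage example with made-up numbers: two states, two actions.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, R))
```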
Value Iteration
1) The problem to solve:
As in policy iteration, only the MDP is given; find the optimal policy π* for the MDP and the optimal value vπ*(s) that π* achieves in each state.
2) Solution:
One could follow the policy iteration scheme but run only a single sweep of policy evaluation before immediately doing policy improvement; this scheme is value iteration.
Using the principle of optimality (the recursive decomposition of an optimal policy):
as soon as we know the solution v∗(s') of the sub-problems, the solution v∗(s) can be obtained by a one-step look-ahead:
$v_*(s) \leftarrow \max_{a \in A} \Big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s')\Big)$
The idea of value iteration is to apply these updates iteratively: $v_{k+1}(s) = \max_{a} \big(R_s^a + \gamma \sum_{s'} P_{ss'}^a \, v_k(s')\big)$ for every state s, at every iteration k.
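A minimal sketch of this update loop, assuming the same P[s, a, s'] / R[s, a] array format as in the policy iteration sketch above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8, max_sweeps=10_000):
    """Bellman optimality backups: v_{k+1}(s) = max_a [R(s,a) + gamma * sum_s' P(s,a,s') v_k(s')]."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    for _ in range(max_sweeps):
        q = R + gamma * P @ v      # one-step look-ahead for every state-action pair
        v_new = q.max(axis=1)      # greedy max over actions; no explicit policy is kept
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    return v_new, q.argmax(axis=1)  # optimal values plus a greedy policy extracted at the end
```

Note that, as listed below, no policy is represented during the iteration; a greedy policy is only read off from the final value function.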
3) Benefits:
More intuitive: start with the final rewards and work backwards.
Still works with loopy, stochastic MDPs
Unlike policy iteration, there is no explicit policy (intermediate value functions may not correspond to any policy)
4) Disadvantage: because value iteration starts from the final rewards and works backwards with only a one-step look-ahead at a time, intuitively a large number of iterations may be needed to converge to the optimal policy.
5) Examples:
Finally, the summary of synchronous dynamic programming algorithms:
Prediction: Bellman expectation equation -> iterative policy evaluation.
Control: Bellman expectation equation + greedy policy improvement -> policy iteration.
Control: Bellman optimality equation -> value iteration.
Extensions to Dynamic programming
1) Asynchronous Dynamic Programming (the in-place variant is sketched after this list):
In-place Dynamic Programming
Prioritised sweeping
Real-time dynamic programming
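As an illustration (an assumption-laden sketch, using the same generic P[s, a, s'] / R[s, a] format as above), the in-place variant keeps a single value array, so each backup immediately sees the values already updated in the current sweep:

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, sweeps=100):
    n_states = P.shape[0]
    v = np.zeros(n_states)          # one array only; no separate v_new
    for _ in range(sweeps):
        for s in range(n_states):   # any state ordering is allowed; prioritised sweeping
                                    # would instead pick the state with the largest Bellman error
            v[s] = np.max(R[s] + gamma * P[s] @ v)
    return v
```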
2) Sample Backups
DP uses full-width backups, which are effective for medium-sized problems (millions of states):
Every successor state and action is considered,
using knowledge of the MDP transitions and reward function.
Sample backups instead use sample rewards and sample transitions <S, A, R, S'> (a one-line sketch follows the list of advantages below):
Advantages:
Model-free: no advance knowledge of the MDP is required.
Breaks the curse of dimensionality through sampling.
The cost of a backup is constant, independent of n = |S|.
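For illustration only (this belongs to the model-free methods treated later, not to this lecture's DP algorithms), a sample backup replaces the full-width sum over successors by a single sampled transition <S, A, R, S'>, which is why its cost does not depend on |S|. A minimal TD(0)-style sketch:

```python
def sample_backup(v, s, r, s_next, alpha=0.1, gamma=0.9):
    """One O(1) sample backup: move v(S) toward the sampled target R + gamma * v(S')."""
    v[s] += alpha * (r + gamma * v[s_next] - v[s])
    return v
```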
3) Approximate Dynamic programming
Note that the convergence of all the above methods has been proven; the proofs are not given here.
How do we know that value iteration converges to v∗?
Or that iterative policy evaluation converges to vπ?
and therefore that policy iteration converges to v∗?
Is the solution unique?
How fast do these algorithms converge?
These questions are resolved by the contraction mapping theorem.
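A brief sketch of the argument (standard material, not spelled out in the original notes): writing the Bellman expectation backup as $T^\pi v = R^\pi + \gamma P^\pi v$, for $\gamma < 1$ both backup operators are γ-contractions in the max norm,

$$\|T^{\pi}u - T^{\pi}v\|_{\infty} = \gamma \,\|P^{\pi}(u - v)\|_{\infty} \le \gamma \,\|u - v\|_{\infty}, \qquad \|T^{*}u - T^{*}v\|_{\infty} \le \gamma \,\|u - v\|_{\infty},$$

so by the contraction mapping (Banach fixed point) theorem each operator has a unique fixed point, vπ and v∗ respectively, and repeatedly applying it converges to that fixed point from any starting value function. This answers the convergence and uniqueness questions above.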