Reinforcement learning: policy evaluation, policy iteration, value iteration, and dynamic programming

Source: Internet
Author: User

First of all, recall the previous content:

Bellman expectation equation (the following forms are used repeatedly in the calculations below):

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\, q_\pi(s,a)$$

$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, v_\pi(s')$$

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, v_\pi(s') \Big)$$

Bellman optimality equation:

$$v_*(s) = \max_{a} \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, v_*(s') \Big)$$
Why DP can solve MDP problems: dynamic programming requires optimal substructure and overlapping subproblems, and MDPs have both. The Bellman equation gives the recursive decomposition, and the value function stores and reuses the solutions to subproblems.

Policy Evaluation:

1) To solve the problem:

In other words: given a policy π, evaluate its value v_π(s) in each state. This is usually called a prediction task. Because the policy π is given, classic DP can solve it.

2) Solution:

v_π is obtained by iteratively applying the Bellman expectation backup until it converges:

$$v_{k+1}(s) = \sum_{a} \pi(a|s) \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_k(s') \Big)$$

3) An example (the numbers in the value table can be verified by hand, which makes the principle clear):

For example, the −1.7 above (shown rounded) is calculated as: 0.25 × [(−1) + (−1)] × 3 + 0.25 × [(−1) + 0] = −1.75.

For example, the −2.4 below (shown rounded) is calculated as: 0.25 × [(−1) + (−2)] × 2 + 0.25 × [(−1) + 0] + 0.25 × [(−1) + (−1.7)] = −2.425.
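The hand calculations above can be checked in code. Below is a minimal sketch of iterative policy evaluation on my reconstruction of the 4×4 gridworld example (assumed setup: reward −1 per step, corner states 0 and 15 terminal, γ = 1, equiprobable random policy; the state numbering is mine):

```python
# Iterative policy evaluation on a reconstruction of the 4x4 gridworld
# example (assumptions: reward -1 per step, corner states 0 and 15 are
# terminal, gamma = 1, equiprobable random policy).
N = 4
TERMINAL = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    """Deterministic transition: bumping into a wall leaves the state unchanged."""
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = r, c
    return nr * N + nc, -1.0

def policy_evaluation(theta=1e-5, gamma=1.0):
    v = [0.0] * (N * N)
    while True:
        delta, v_new = 0.0, v[:]
        for s in range(N * N):
            if s in TERMINAL:
                continue  # terminal values stay at 0
            # Bellman expectation backup under the random policy (prob 1/4 each)
            backup = 0.0
            for a in ACTIONS:
                s2, r = step(s, a)
                backup += 0.25 * (r + gamma * v[s2])
            v_new[s] = backup
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < theta:
            return v

v = policy_evaluation()
```

Under these assumptions the values converge to the familiar −14 / −18 / −20 / −22 table for the random policy.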

Policy Iteration

1) To solve the problem:

In other words: no policy π is given, only the MDP itself. The task is to find the optimal policy π* for that MDP, and with it the optimal value v_{π*}(s) in each state.

As the description shows, policy evaluation is a subproblem of policy iteration.

2) Solution:

Start from the simplest policy (e.g. the random policy above), then alternate policy evaluation (this step itself takes many sweeps) and policy improvement (e.g. acting greedily with respect to the current value function, as in the second column above).

Then repeat this process: alternate policy evaluation and policy improvement until the policy is stable.

(Policy evaluation needs many sweeps to stabilize: if the goal were evaluation alone, you would have to wait for v_π to converge. Here, however, a few steps may be enough — in the second column above, running evaluation to k = 3 and then doing policy improvement already yields the optimal policy π*. In practice we usually do not know in advance how many evaluation sweeps are needed, so a common scheme is to run 5 or 10 sweeps of evaluation, do a policy improvement, run another 5 or 10 sweeps of evaluation on the improved policy, and so on.)

3) Examples:

The second column above.
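The alternation described above can be sketched as follows, on a tiny hypothetical 2-state MDP of my own (not the gridworld), with a fixed number of evaluation sweeps per round as suggested in the parenthetical note:

```python
# Policy iteration on a tiny hypothetical 2-state MDP (my own toy example):
# P[s][a] is a list of (prob, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
STATES, ACTIONS, GAMMA = [0, 1], [0, 1], 0.9

def q(s, a, v):
    """One-step look-ahead value of taking action a in state s."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])

def evaluate(policy, v, sweeps=10):
    """A few sweeps of iterative policy evaluation (need not fully converge)."""
    for _ in range(sweeps):
        v = [q(s, policy[s], v) for s in STATES]
    return v

def improve(v):
    """Greedy policy improvement with respect to the current value estimate."""
    return [max(ACTIONS, key=lambda a: q(s, a, v)) for s in STATES]

def policy_iteration():
    policy, v = [0] * len(STATES), [0.0] * len(STATES)
    while True:
        v = evaluate(policy, v)
        improved = improve(v)
        if improved == policy:  # policy stable => optimal
            return policy, v
        policy = improved

policy, v = policy_iteration()
```

In this toy MDP the loop settles on always taking action 1 (the action with positive reward) in both states.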

Value Iteration

1) To solve the problem:

As with policy iteration, only the MDP is given; the task is to find the optimal policy π* for the MDP and its optimal value v_{π*}(s) in each state.

2) Solution:

One could follow the policy-iteration scheme but run only a single sweep of policy evaluation before each policy improvement; collapsing evaluation and improvement into one backup in this way is exactly value iteration.

Value iteration relies on the principle of optimality: a policy is optimal from a state if and only if it acts optimally in every state reachable from it.

Thus, once we know the solutions v_*(s′) of the subproblems, the solution v_*(s) can be obtained by a one-step look-ahead:

$$v_*(s) \leftarrow \max_{a} \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_*(s') \Big)$$

The idea of value iteration is to apply this update iteratively, sweeping over all states at every iteration.
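A minimal sketch of value iteration, again on my assumed reconstruction of the 4×4 gridworld (reward −1 per step, corner states 0 and 15 terminal, γ = 1); note that only the Bellman optimality backup appears and no explicit policy is stored:

```python
# Value iteration on a reconstruction of the 4x4 gridworld (assumptions:
# reward -1 per step, corner states 0 and 15 terminal, gamma = 1).
N = 4
TERMINAL = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    if not (0 <= nr < N and 0 <= nc < N):  # wall: stay in place
        nr, nc = r, c
    return nr * N + nc, -1.0

def value_iteration(theta=1e-9, gamma=1.0):
    v = [0.0] * (N * N)
    while True:
        # Bellman optimality backup: one-step look-ahead with a max over
        # actions; no intermediate policy is maintained at any point.
        v_new = [0.0 if s in TERMINAL else
                 max(r + gamma * v[s2]
                     for s2, r in (step(s, a) for a in ACTIONS))
                 for s in range(N * N)]
        if max(abs(x - y) for x, y in zip(v_new, v)) < theta:
            return v_new
        v = v_new

v_star = value_iteration()
# v_star[s] is minus the number of steps from s to the nearest terminal corner.
```

Because the rewards are all −1, the optimal values are just negated shortest-path lengths to a terminal corner, which makes the "work backwards from the final rewards" intuition below easy to see.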

3) Benefits:

More intuitive: start from the final rewards and work backwards.

Still works with loopy, stochastic MDPs

Unlike policy iteration, there is no explicit policy; the intermediate value functions may not correspond to any policy.

4) Disadvantage: because value iteration starts from the final rewards and works backwards, looking ahead only one step at a time, it intuitively needs a large number of sweeps to converge to the optimal policy.

5) Examples:

Finally, a summary of the synchronous dynamic-programming algorithms:

Problem    | Bellman equation                                         | Algorithm
Prediction | Bellman expectation equation                             | Iterative policy evaluation
Control    | Bellman expectation equation + greedy policy improvement | Policy iteration
Control    | Bellman optimality equation                              | Value iteration

Extensions to Dynamic programming

1) Asynchronous Dynamic programming

In-place dynamic programming
Prioritised sweeping
Real-time dynamic programming

2) Sample Backups

DP uses full-width backups, which makes it effective for medium-sized problems (up to millions of states):
Every successor state and action is considered
Full knowledge of the MDP transition and reward functions is used

Sample backups instead use sampled rewards and sampled transitions <S, A, R, S′>:
Model-free: no advance knowledge of the MDP is required
Sampling breaks the curse of dimensionality
The cost of a backup is constant, independent of n = |S|
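To contrast with the full-width backups above, here is a minimal sketch of a sampled backup: TD(0) on a hypothetical 5-state random walk of my own (not from the post). Each update consumes a single sampled transition <S, A, R, S′> and costs O(1), regardless of the number of states:

```python
import random

# TD(0) sample backups on a hypothetical 5-state random walk:
# states 0..4; 0 and 4 are terminal; reward +1 only on reaching state 4.
# No transition model is used -- the environment is only sampled.
random.seed(0)
GAMMA, ALPHA = 1.0, 0.1
v = [0.0] * 5  # value estimates; terminal ends stay at 0

for episode in range(5000):
    s = 2  # start in the middle
    while s not in (0, 4):
        s2 = s + random.choice((-1, 1))          # sample one transition
        r = 1.0 if s2 == 4 else 0.0
        v[s] += ALPHA * (r + GAMMA * v[s2] - v[s])  # O(1) sample backup
        s = s2
```

Under these assumptions the estimates hover around the true values 0.25, 0.5, 0.75 for the three interior states; unlike a DP backup, no sum over all successor states ever appears.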

3) Approximate Dynamic programming

Note that convergence of all the methods above has been proven; the proofs are not given here.

How do we know that value iteration converges to v_*?
Or that iterative policy evaluation converges to v_π?
And therefore that policy iteration converges to v_*?
Is the solution unique?
How fast do these algorithms converge?
These questions are resolved by the contraction mapping theorem.
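The standard argument (summarized here; the post does not spell it out) is that the Bellman backup operators are γ-contractions in the max-norm:

```latex
(T^{\pi}v)(s) = \sum_{a}\pi(a|s)\Big(\mathcal{R}_s^a + \gamma\sum_{s'}\mathcal{P}_{ss'}^a\,v(s')\Big),
\qquad
\|T^{\pi}u - T^{\pi}v\|_{\infty} \le \gamma\,\|u - v\|_{\infty},
```

and similarly for the optimality operator with the max over actions. By the contraction mapping (Banach fixed-point) theorem, each operator then has a unique fixed point (v_π and v_* respectively), and repeated application converges to it geometrically at rate γ from any starting point.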
