First of all, recall the previous content:
Bellman Expectation equation:
$v_\pi(s) = \sum_{a \in A} \pi(a|s)\Big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_\pi(s')\Big)$
Bellman optimality equation:
$v_*(s) = \max_{a \in A} \Big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s')\Big)$
Why DP can solve MDP problems: dynamic programming requires optimal substructure and overlapping subproblems, and MDPs satisfy both. The Bellman equation gives a recursive decomposition, and the value function stores and reuses the solutions to the subproblems.
Policy Evaluation:
1) The problem to solve:
Evaluate the value vπ(s) of a given policy π in each state. This is also called the prediction problem. Because the policy π is given, it is a typical task that DP can solve.
2) Solution:
Iteratively apply the Bellman expectation backup; vπ is obtained in the limit of this iteration.
3) An example (the numbers in the table can be verified by hand once the principle is clear):
For example, the value -1.7 above is calculated as follows: 0.25*[(-1) + (-1)]*3 + 0.25*[(-1) + 0] = -1.75.
For example, the value -2.4 below is calculated as follows: 0.25*[(-1) + (-2)]*2 + 0.25*[(-1) + 0] + 0.25*[(-1) + (-1.7)] = -2.425.
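To make the backup concrete, here is a minimal sketch (not from the original notes) of iterative policy evaluation on the small gridworld under the uniform random policy; the assumed setup is the one from the lecture (4x4 grid, states 0 and 15 terminal, reward -1 per step, undiscounted), and the intermediate sweeps reproduce the table values above up to rounding (-1.75 at k=2, about -2.4 at k=3).

```python
import numpy as np

# Assumed setup: 4x4 gridworld, states 0 and 15 terminal, reward -1 per step,
# undiscounted (gamma = 1), uniform random policy over up/down/left/right.
N = 4
TERMINAL = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    return nr * N + nc if 0 <= nr < N and 0 <= nc < N else s

def iterative_policy_evaluation(sweeps=1000, gamma=1.0, theta=1e-6):
    v = np.zeros(N * N)
    for k in range(1, sweeps + 1):
        v_new = np.zeros_like(v)  # synchronous backup: old values on the right-hand side
        for s in range(N * N):
            if s in TERMINAL:
                continue
            # Bellman expectation backup under the random policy (probability 1/4 per action)
            v_new[s] = sum(0.25 * (-1.0 + gamma * v[step(s, a)]) for a in ACTIONS)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new, k
        v = v_new
    return v, sweeps

v, k = iterative_policy_evaluation()
print(k, np.round(v.reshape(N, N), 1))  # converges to the 0/-14/-20/-22 table from the lecture
```

Stopping after two sweeps instead of running to convergence reproduces the -1.75 value computed by hand above.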
Policy Iteration
1) The problem to solve:
No specific policy π is given; only the MDP is given. Find the optimal policy π* for the MDP and the optimal value vπ*(s) that π* achieves in each state.
As can be seen from the above description, policy evaluation is a subproblem of policy iteration.
2) Solution:
Start from the simplest policy (such as the random policy above), then alternate policy evaluation (which itself takes many sweeps) and policy improvement (for example, acting greedily, as in the second column above).
Then iterate this process: policy evaluation, policy improvement, policy evaluation, policy improvement, and so on.
(Policy evaluation needs many sweeps to stabilize; run on its own, it would have to wait for vπ to converge. Here, however, a few steps may be enough: in the second column above, evaluating up to k=3 and then doing policy improvement already yields the optimal policy π*. In practice we usually do not know how many evaluation sweeps are needed, so a common scheme is to run, say, 5 or 10 sweeps of evaluation, then do policy improvement, run another 5 or 10 sweeps on the improved policy, and proceed to the next improvement.)
3) Examples:
The second column above.
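Putting the two phases together, here is a minimal sketch of policy iteration for a generic tabular MDP; the array format P[s, a, s'] / R[s, a] and the small two-state MDP at the end are assumptions for illustration, and eval_sweeps mirrors the "5 or 10 sweeps of evaluation" heuristic mentioned above.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_sweeps=10, max_iters=1000):
    """P[s, a, s2]: transition probabilities, R[s, a]: expected immediate rewards."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)   # start from an arbitrary deterministic policy
    v = np.zeros(n_states)
    for _ in range(max_iters):
        # Policy evaluation: a fixed number of Bellman expectation sweeps
        # (it need not be run to full convergence, as noted above).
        for _ in range(eval_sweeps):
            v = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ v
                          for s in range(n_states)])
        # Policy improvement: act greedily with respect to the current value function.
        q = R + gamma * P @ v                 # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * v[s']
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # policy stable => optimal
            return policy, v
        policy = new_policy
    return policy, v

# Tiny usage example with made-up numbers: two states, two actions.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, R))
```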
Value Iteration
1) The problem to solve:
As in policy iteration, only the MDP is given; find the optimal policy π* for the MDP and the optimal value vπ*(s) that π* achieves in each state.
2) Solution:
One could follow the policy iteration scheme but run only a single sweep of policy evaluation before immediately doing policy improvement; this scheme is value iteration.
Using the principle of optimality (the recursive decomposition of an optimal policy):
as soon as we know the solution v∗(s') of the sub-problems, the solution v∗(s) can be obtained by a one-step look-ahead:
$v_*(s) \leftarrow \max_{a \in A} \Big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s')\Big)$
The idea of value iteration is to apply these updates iteratively: $v_{k+1}(s) = \max_{a} \big(R_s^a + \gamma \sum_{s'} P_{ss'}^a \, v_k(s')\big)$ for every state s, at every iteration k.
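A minimal sketch of this update loop, assuming the same P[s, a, s'] / R[s, a] array format as in the policy iteration sketch above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8, max_sweeps=10_000):
    """Bellman optimality backups: v_{k+1}(s) = max_a [R(s,a) + gamma * sum_s' P(s,a,s') v_k(s')]."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    for _ in range(max_sweeps):
        q = R + gamma * P @ v      # one-step look-ahead for every state-action pair
        v_new = q.max(axis=1)      # greedy max over actions; no explicit policy is kept
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    return v_new, q.argmax(axis=1)  # optimal values plus a greedy policy extracted at the end
```

Note that, as listed below, no policy is represented during the iteration; a greedy policy is only read off from the final value function.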
3) Benefits:
More intuitive: start with the final rewards and work backwards.
Still works with loopy, stochastic MDPs
Unlike policy iteration, there is no explicit policy (intermediate value functions may not correspond to any policy)
4) Disadvantage: because value iteration starts from the final rewards and works backwards with only a one-step look-ahead at a time, intuitively a large number of iterations may be needed to converge to the optimal policy.
5) Examples:
Finally, the summary of synchronous dynamic programming algorithms:
Prediction: Bellman expectation equation -> iterative policy evaluation.
Control: Bellman expectation equation + greedy policy improvement -> policy iteration.
Control: Bellman optimality equation -> value iteration.
Extensions to Dynamic programming
1) Asynchronous Dynamic Programming (the in-place variant is sketched after this list):
In-place Dynamic Programming
Prioritised sweeping
Real-time dynamic programming
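As an illustration (an assumption-laden sketch, using the same generic P[s, a, s'] / R[s, a] format as above), the in-place variant keeps a single value array, so each backup immediately sees the values already updated in the current sweep:

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, sweeps=100):
    n_states = P.shape[0]
    v = np.zeros(n_states)          # one array only; no separate v_new
    for _ in range(sweeps):
        for s in range(n_states):   # any state ordering is allowed; prioritised sweeping
                                    # would instead pick the state with the largest Bellman error
            v[s] = np.max(R[s] + gamma * P[s] @ v)
    return v
```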
2) Sample Backups
DP uses full-width backups, which are effective for medium-sized problems (millions of states):
Every successor state and action is considered,
using knowledge of the MDP transitions and reward function.
Sample backups instead use sample rewards and sample transitions <S, A, R, S'> (a one-line sketch follows the list of advantages below):
Advantages:
Model-free: no advance knowledge of the MDP is required.
Breaks the curse of dimensionality through sampling.
The cost of a backup is constant, independent of n = |S|.
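For illustration only (this belongs to the model-free methods treated later, not to this lecture's DP algorithms), a sample backup replaces the full-width sum over successors by a single sampled transition <S, A, R, S'>, which is why its cost does not depend on |S|. A minimal TD(0)-style sketch:

```python
def sample_backup(v, s, r, s_next, alpha=0.1, gamma=0.9):
    """One O(1) sample backup: move v(S) toward the sampled target R + gamma * v(S')."""
    v[s] += alpha * (r + gamma * v[s_next] - v[s])
    return v
```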
3) Approximate Dynamic programming
Note that the convergence of all the above methods has been proven; the proofs are not given here.
How do we know that value iteration converges to v∗?
Or that iterative policy evaluation converges to vπ?
and therefore that policy iteration converges to v∗?
Is the solution unique?
How fast do these algorithms converge?
These questions are resolved by the contraction mapping theorem.
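A brief sketch of the argument (standard material, not spelled out in the original notes): writing the Bellman expectation backup as $T^\pi v = R^\pi + \gamma P^\pi v$, for $\gamma < 1$ both backup operators are γ-contractions in the max norm,

$$\|T^{\pi}u - T^{\pi}v\|_{\infty} = \gamma \,\|P^{\pi}(u - v)\|_{\infty} \le \gamma \,\|u - v\|_{\infty}, \qquad \|T^{*}u - T^{*}v\|_{\infty} \le \gamma \,\|u - v\|_{\infty},$$

so by the contraction mapping (Banach fixed point) theorem each operator has a unique fixed point, vπ and v∗ respectively, and repeatedly applying it converges to that fixed point from any starting value function. This answers the convergence and uniqueness questions above.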