Recap of the last lecture:
Model-free Control. "Model-free" means that no MDP is given: the MDP is unknown, and we do not even know its transition dynamics.
The goal is to do control without a given MDP (and, ideally, without a given policy): optimise the value function of an unknown MDP.
Model-free control has two main families of methods: on-policy learning and off-policy learning; on-policy learning is further divided into on-policy MC and on-policy TD.
Two ways to solve a known MDP optimally: policy iteration and value iteration. Policy iteration: with the MDP known but the policy unknown, start from a random policy π, evaluate π, then improve it to get a better policy π', evaluate π', improve again, and keep looping. ⟹ This handles the "policy unknown" problem (pick a random π, then loop evaluation and improvement).
Model-free Prediction (Evaluation) has two methods: Monte-Carlo learning and Temporal-Difference learning. Both are based on sampling: MC must sample all the way to a terminal state and average the returns (an approximation of the true G_t), whereas TD looks only one step ahead and updates online. ⟹ This handles the "MDP unknown" problem (sampling is enough).
Combining the two, under the model-free control assumptions (policy unknown, MDP unknown) we can: pick a random policy π, use an MC or TD method (sampling gives an approximate G_t rather than the true one) for policy evaluation, then perform greedy policy improvement, and loop this process. ⟹ This process is model-free policy iteration.
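One point worth recalling about that loop (standard in the model-free setting, stated here for completeness): because the model is unknown, the greedy improvement step acts on the action-value function Q rather than on V, and exploration is preserved with ε-greedy:
\[
\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a),
\qquad
\pi_{\varepsilon}(a \mid s) =
\begin{cases}
1 - \varepsilon + \varepsilon / |\mathcal{A}| & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s, a'), \\
\varepsilon / |\mathcal{A}| & \text{otherwise.}
\end{cases}
\]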
Content of this lecture:
Everything we said before assumed that the state set S and action set A are finite (and small). Even under the model-free assumption, S and A are still only reached through sampling (some states or actions may never be visited at all).
In real problems, however: 1) S and A may be finite but so large that it is impossible to enumerate every case; 2) S and A may themselves be continuous/infinite.
Facing these two real-world challenges, the content of this lecture is how to scale up the model-free methods for prediction and control from the previous lectures.
The solution is: estimate the value function with function approximation.
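Concretely (the standard formulation, stated here for reference): fit a parameterised function with weight vector w such that
\[
\hat{v}(s, \mathbf{w}) \approx v_\pi(s), \qquad \hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a),
\]
so that values learned from visited states generalise to states that were never sampled, and only the (comparatively small) weight vector w has to be stored and updated.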
Value Function approximation:
1) Several common forms
By inspection, the first form (state s in, v̂(s, w) out) and the third form (state s in, one q̂(s, a_i, w) output per action) are used most often. The third form is popular because actions are often hard to describe with features, so each action gets its own output; its drawback is that it only suits a small, discrete set of actions. A sketch of these two forms is given below.
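A minimal sketch of the two common forms, using a plain linear parameterisation in NumPy (the feature function `features` and the sizes are placeholders chosen for illustration, not part of the original notes):
```python
import numpy as np

n_features = 8      # size of the feature vector x(s)  (illustrative)
n_actions = 4       # size of a small discrete action set (illustrative)

def features(state):
    """Placeholder feature map x(s); in practice this is problem-specific."""
    return np.asarray(state, dtype=float)

# Form 1: state in -> scalar state value out,  v_hat(s, w) = x(s)^T w
w_v = np.zeros(n_features)
def v_hat(state):
    return features(state) @ w_v

# Form 3: state in -> one action value per action,  q_hat(s, a_i, W) = x(s)^T W[:, i]
W_q = np.zeros((n_features, n_actions))
def q_hat_all(state):
    return features(state) @ W_q     # vector of length n_actions
```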
2) Commonly used function classes
The approximator is required to be differentiable; commonly used choices include:
Linear Combinations of features
Neural network
3) Requirements on the training method
We need a training method that is suitable for non-stationary, non-iid data (the targets change as the policy improves, and successive samples along a trajectory are correlated).
Two families are used: incremental methods and batch methods.
State-value function approximation by stochastic gradient descent:
1) First, represent a state by a feature vector:
x(s) = (x_1(s), x_2(s), ..., x_n(s))^T
For example:
Distance of robot from landmarks
Trends in the stock market
Piece and pawn configurations in chess
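For instance, a robot-localisation feature vector might simply stack the distances to a few known landmarks (a toy illustration; the landmark positions below are made up):
```python
import numpy as np

# Illustrative landmark positions (made up for this example)
landmarks = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])

def features(robot_xy):
    """x(s): distances from the robot to each landmark."""
    robot_xy = np.asarray(robot_xy, dtype=float)
    return np.linalg.norm(landmarks - robot_xy, axis=1)

print(features([1.0, 1.0]))   # -> array of 3 distances
```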
2) If we use linear value function approximation, we get:
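(Reconstructed here; these are the standard linear-VFA formulas, not copied from the original notes.) The value is a dot product of features and weights, and minimising the mean-squared error between v̂ and v_π by stochastic gradient descent gives an update of step-size × prediction error × feature vector:
\[
\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(s)\, w_j
\]
\[
J(\mathbf{w}) = \mathbb{E}_\pi\!\left[ \big( v_\pi(S) - \hat{v}(S, \mathbf{w}) \big)^2 \right],
\qquad
\Delta \mathbf{w} = \alpha \big( v_\pi(S) - \hat{v}(S, \mathbf{w}) \big)\, \mathbf{x}(S)
\]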
3) In actual RL, the state-value function is what is commonly used for prediction, but the true target v_π(s) is unknown and has to be replaced by a sampled target.
The incremental prediction algorithms are as follows:
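The standard substitutions for the unknown v_π(S_t) are (G_t is the full return, G_t^λ the λ-return):
\[
\text{MC:}\qquad \Delta \mathbf{w} = \alpha \big( G_t - \hat{v}(S_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})
\]
\[
\text{TD(0):}\qquad \Delta \mathbf{w} = \alpha \big( R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})
\]
\[
\text{TD}(\lambda)\text{ (forward view):}\qquad \Delta \mathbf{w} = \alpha \big( G_t^{\lambda} - \hat{v}(S_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})
\]
(In the linear case the gradient is simply x(S_t).)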
Action-value function approximation by stochastic gradient descent:
1) First, represent a (state, action) pair by a feature vector:
x(s, a) = (x_1(s, a), x_2(s, a), ..., x_n(s, a))^T
2) If we use linear action-value function approximation, we get:
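(Again reconstructed; the standard linear action-value formulas.) The case is exactly analogous to the state-value one:
\[
\hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(s, a)\, w_j,
\qquad
\Delta \mathbf{w} = \alpha \big( q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}) \big)\, \mathbf{x}(S, A)
\]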
3) In actual RL, the action-value function is what is commonly used for control, but the true target q_π(s, a) is unknown and has to be replaced by a sampled target.
The incremental control algorithms are as follows:
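As in prediction, the standard substitutions for the unknown q_π(S_t, A_t) are:
\[
\text{MC:}\qquad \Delta \mathbf{w} = \alpha \big( G_t - \hat{q}(S_t, A_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})
\]
\[
\text{TD(0) (SARSA-style):}\qquad \Delta \mathbf{w} = \alpha \big( R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})
\]
\[
\text{TD}(\lambda)\text{ (forward view):}\qquad \Delta \mathbf{w} = \alpha \big( q_t^{\lambda} - \hat{q}(S_t, A_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})
\]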
4) The complete optimisation process under the control framework:
In plain English: start from a random policy, use an MC/TD method to adjust w and obtain an approximate value function, perform ε-greedy policy improvement based on that approximate value function, and keep looping.
Therefore, value function approximation only plays the role of policy evaluation (prediction); policy improvement still requires a separate mechanism such as ε-greedy. A code sketch of the full loop is given below.
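A minimal sketch of this loop, assuming a linear q̂(s, a, w) with one weight vector per action (the third form above), ε-greedy improvement, and a toy environment whose `reset()` returns a state and whose `step(a)` returns `(next_state, reward, done)`; all names and hyperparameters here are illustrative, not the lecture's reference code:
```python
import numpy as np

def epsilon_greedy(q_values, eps, rng):
    """Pick argmax with prob. 1-eps, otherwise a uniformly random action."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def semi_gradient_sarsa(env, features, n_features, n_actions,
                        episodes=500, alpha=0.01, gamma=0.99, eps=0.1, seed=0):
    """Model-free control with linear action-value approximation.

    Policy evaluation is the TD(0)/SARSA update on q_hat(s, a, w) = features(s) @ W[a];
    policy improvement is epsilon-greedy over the current q_hat.
    """
    rng = np.random.default_rng(seed)
    W = np.zeros((n_actions, n_features))

    for _ in range(episodes):
        s = env.reset()
        x = features(s)
        a = epsilon_greedy(W @ x, eps, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x_next = features(s_next)
            if done:
                td_target = r                      # no bootstrap at the terminal state
                a_next = None
            else:
                a_next = epsilon_greedy(W @ x_next, eps, rng)
                td_target = r + gamma * (W[a_next] @ x_next)
            # semi-gradient SARSA update: step-size * TD error * feature vector
            W[a] += alpha * (td_target - W[a] @ x) * x
            s, x, a = s_next, x_next, a_next
    return W
```
Here `features` could be, for instance, the landmark-distance map sketched earlier; the evaluation step uses the TD(0)/SARSA target from the list above, and the improvement step is ε-greedy, matching the loop described in the text.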