Reinforcement Learning: Value Function Approximation




Recap of the previous lecture:

Model-free Control. "Model-free" means that no MDP is given (that is, the MDP is unknown; we are not even given the MDP's dynamics).

The goal is control without a given MDP (and without a given policy): optimise the value function of an unknown MDP.

Model-free control has two main approaches: on-policy learning and off-policy learning; on-policy learning is further divided into on-policy MC and on-policy TD.

Two ways to solve an MDP optimally: policy iteration and value iteration. Policy iteration: with the MDP known but the policy unknown, take a random policy π, evaluate π, improve it to get a better policy π', then evaluate π', improve again, and loop this process. => This solves the "policy π unknown" problem (pick a random π, then loop evaluation and improvement).

Model-free prediction (evaluation) has two methods: Monte-Carlo learning and temporal-difference learning. Both are sampling-based: MC must sample all the way to a terminal state and averages the resulting returns (an approximation of the true G_t), while TD only looks one step ahead and updates online. => This solves the "MDP unknown" problem (sampling suffices).

Combining the two, under the model-free control assumptions (policy unknown, MDP unknown), we can: take a random policy π, use an MC or TD method for policy evaluation (sampling yields an approximate G_t rather than the true G_t), then perform greedy policy improvement, and loop this process. => This process is model-free policy iteration.
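As a concrete reference, here is a minimal tabular sketch of that loop (every-visit Monte-Carlo evaluation plus ε-greedy improvement). The environment interface env.reset() / env.step(a) returning (next_state, reward, done), and hashable states, are assumptions for illustration, not part of the original text.

```python
import numpy as np
from collections import defaultdict

def mc_control(env, n_actions, episodes=10000, gamma=1.0, eps=0.1):
    """Tabular model-free policy iteration: MC evaluation + eps-greedy improvement."""
    Q = defaultdict(lambda: np.zeros(n_actions))   # action-value estimates
    N = defaultdict(lambda: np.zeros(n_actions))   # visit counts

    def policy(s):
        # eps-greedy improvement w.r.t. the current Q estimate
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        # 1) sample one episode with the current policy (model-free: no MDP needed)
        episode, s, done = [], env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # 2) MC evaluation: move Q towards the sampled return G_t
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / N[s][a]   # incremental mean
    return Q
```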





Content of this lecture:

So far we have assumed that S and A are finite. Even under the model-free assumption, S and A are still only sampled (some states or actions may never show up in your samples).

In realistic problems, however: 1) S and A may be finite, but the space is so large that enumerating every case is impossible; 2) S and A may themselves be infinite (continuous).

Facing these two real-world challenges, the topic of this lecture is how to scale up the model-free methods for prediction and control from the previous lectures.

The solution: estimate the value function with function approximation.











Value Function Approximation:

1) Several common forms

Three architectures are common:

State in -> approximate state value v̂(s, w)
State and action in -> approximate action value q̂(s, a, w)
State in -> one approximate action value q̂(s, a_i, w) per action

In practice the first and third are used most often (the third is popular because actions are often hard to describe with features, so each action is modelled separately; this becomes awkward when there are many actions ...).
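Below is a minimal sketch of the three forms, using linear parameterizations for concreteness; the feature maps x(s) and x(s, a) and the weight shapes are assumptions for illustration, not taken from the original figure.

```python
import numpy as np

# Minimal linear sketches of the three forms; x and x_sa are feature maps
# returning numpy vectors, w is a weight vector, W has one row per action.

def v_hat(s, w, x):
    """Form 1: state in -> approximate state value v_hat(s, w)."""
    return float(x(s) @ w)

def q_hat_action_in(s, a, w, x_sa):
    """Form 2: state and action in -> approximate action value q_hat(s, a, w)."""
    return float(x_sa(s, a) @ w)

def q_hat_action_out(s, W, x):
    """Form 3: state in -> one approximate value per action (each action has its own weights)."""
    return W @ x(s)                     # vector with one entry per action
```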

2) Commonly used function approximators

The approximator must be differentiable; common choices (both sketched below) are:

Linear combinations of features
Neural networks
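The linear case is sketched above; as a complement, here is a hypothetical one-hidden-layer network approximator (sizes and weights are purely illustrative). It is likewise differentiable in its parameters, so the same gradient-based training applies.

```python
import numpy as np

def v_mlp(x_s, params):
    """v_hat(s, w) with a tiny one-hidden-layer network; differentiable in all weights."""
    W1, b1, w2, b2 = params
    h = np.tanh(W1 @ x_s + b1)       # hidden layer
    return float(w2 @ h + b2)        # scalar state value

# Hypothetical sizes: 4 input features, 8 hidden units.
rng = np.random.default_rng(0)
params = (rng.normal(size=(8, 4)), np.zeros(8), rng.normal(size=8), 0.0)
print(v_mlp(np.ones(4), params))
```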


3) Requirements on the training method

We need a training method that is suitable for non-stationary, non-iid data.
Training methods come in incremental and batch variants.




State-Value Function Approximation by Stochastic Gradient Descent:

1) First, a state is represented as a feature vector:

x(s) = (x_1(s), x_2(s), ..., x_n(s))

For example:

Distance of robot from landmarks
Trends in the stock market
Piece and pawn configurations in chess
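For instance, a hypothetical feature map for the "distance of robot from landmarks" example might look like this (the landmark coordinates are made up for illustration):

```python
import numpy as np

# The state is the robot's 2-D position; each feature is its distance to one
# fixed landmark. Landmark positions are illustrative assumptions.
LANDMARKS = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])

def x(state):
    """Return the feature vector x(s) = (x_1(s), ..., x_n(s))."""
    pos = np.asarray(state, dtype=float)
    return np.linalg.norm(LANDMARKS - pos, axis=1)   # one distance per landmark

print(x([1.0, 2.0]))   # -> 4 features, one per landmark
```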


2) If linear value function approximation is used:

v̂(s, w) = x(s)^T w = Σ_j x_j(s) · w_j

The objective J(w) = E_π[(v_π(S) − v̂(S, w))^2] is then quadratic in w, and the stochastic gradient descent update is Δw = α · (v_π(S) − v̂(S, w)) · x(S).



3) In actual RL, a state-value function is commonly used for prediction:

The true v_π(s) is not available, so a sampled target is substituted into the update above. The incremental prediction algorithms are:

MC: the target is the return G_t
TD(0): the target is R_{t+1} + γ · v̂(S_{t+1}, w)
TD(λ): the target is the λ-return G_t^λ
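A minimal sketch of one of these, semi-gradient TD(0) prediction with linear approximation, assuming an environment with reset()/step(a) returning (next_state, reward, done) plus a given feature map x and policy:

```python
import numpy as np

def td0_prediction(env, x, n_features, policy, episodes=1000,
                   alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) prediction with linear v_hat(s, w) = x(s)^T w.
    The true v_pi(s) is unknown, so the TD target R + gamma * v_hat(S', w) is used."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v = x(s) @ w
            target = r if done else r + gamma * (x(s_next) @ w)
            # SGD step: delta_w = alpha * (target - v_hat(s,w)) * x(s)
            w += alpha * (target - v) * x(s)
            s = s_next
    return w
```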




Action-Value Function Approximation by Stochastic Gradient Descent:

1) First, a (state, action) pair is represented as a feature vector:

x(s, a) = (x_1(s, a), x_2(s, a), ..., x_n(s, a))


2) If linear value function approximation is used:

q̂(s, a, w) = x(s, a)^T w = Σ_j x_j(s, a) · w_j

The stochastic gradient descent update is Δw = α · (q_π(S, A) − q̂(S, A, w)) · x(S, A).


3) In actual RL, an action-value function is commonly used for control:

The true q_π(s, a) is not available, so a sampled target is substituted. The incremental control algorithms are:

MC: the target is the return G_t
SARSA / TD(0): the target is R_{t+1} + γ · q̂(S_{t+1}, A_{t+1}, w)


4) The complete optimisation loop under the control framework:


In plain English: start from a random policy; use an MC/TD method to adjust w and obtain an approximate value function; based on that approximate value function, perform ε-greedy policy improvement; then loop these iterations.

Therefore value function approximation only plays the role of policy evaluation (prediction); policy improvement (optimisation) still requires a separate strategy such as ε-greedy.
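A minimal sketch of this complete loop, using semi-gradient SARSA with a linear q̂ and ε-greedy improvement (the environment interface and the feature map x_sa are assumptions for illustration):

```python
import numpy as np

def sarsa_control(env, x_sa, n_features, n_actions, episodes=5000,
                  alpha=0.01, gamma=0.99, eps=0.1):
    """Semi-gradient SARSA: linear q_hat(s, a, w) = x(s, a)^T w as policy evaluation,
    eps-greedy over q_hat as policy improvement, looped every step."""
    w = np.zeros(n_features)

    def q(s, a):
        return x_sa(s, a) @ w

    def eps_greedy(s):
        # policy improvement: act greedily w.r.t. the current approximation, with exploration
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r
            else:
                a_next = eps_greedy(s_next)
                target = r + gamma * q(s_next, a_next)
            # policy evaluation: SGD step towards the SARSA target
            w += alpha * (target - q(s, a)) * x_sa(s, a)
            if not done:
                s, a = s_next, a_next
    return w
```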







