Recap of the last lecture:
Model-free Control. "Model-free" means that no MDP is given: the MDP is unknown, and we do not even know its transition dynamics.
The goal is to do control without a given MDP (and, ideally, without a given policy): optimise the value function of an unknown MDP.
Model-free control has two main families of methods: on-policy learning and off-policy learning; on-policy learning is further divided into on-policy MC and on-policy TD.
Two ways to solve a known MDP optimally: policy iteration and value iteration. Policy iteration: with the MDP known but the policy unknown, start from a random policy π, evaluate π, then improve it to get a better policy π', evaluate π', improve again, and keep looping. ⟹ This handles the "policy unknown" problem (pick a random π, then loop evaluation and improvement).
Model-free Prediction (Evaluation) has two methods: Monte-Carlo learning and Temporal-Difference learning. Both are based on sampling: MC must sample all the way to a terminal state and average the returns (an approximation of the true G_t), whereas TD looks only one step ahead and updates online. ⟹ This handles the "MDP unknown" problem (sampling is enough).
Combining the two, under the model-free control assumptions (policy unknown, MDP unknown) we can: pick a random policy π, use an MC or TD method (sampling gives an approximate G_t rather than the true one) for policy evaluation, then perform greedy policy improvement, and loop this process. ⟹ This process is model-free policy iteration.
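One point worth recalling about that loop (standard in the model-free setting, stated here for completeness): because the model is unknown, the greedy improvement step acts on the action-value function Q rather than on V, and exploration is preserved with ε-greedy:
\[
\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a),
\qquad
\pi_{\varepsilon}(a \mid s) =
\begin{cases}
1 - \varepsilon + \varepsilon / |\mathcal{A}| & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s, a'), \\
\varepsilon / |\mathcal{A}| & \text{otherwise.}
\end{cases}
\]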
Content of this lecture:
Everything we said before assumed that the state set S and action set A are finite (and small). Even under the model-free assumption, S and A are still only reached through sampling (some states or actions may never be visited at all).
In real problems, however: 1) S and A may be finite but so large that it is impossible to enumerate every case; 2) S and A may themselves be continuous/infinite.
Facing these two real-world challenges, the content of this lecture is how to scale up the model-free methods for prediction and control from the previous lectures.
The solution is: estimate the value function with function approximation.
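Concretely (the standard formulation, stated here for reference): fit a parameterised function with weight vector w such that
\[
\hat{v}(s, \mathbf{w}) \approx v_\pi(s), \qquad \hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a),
\]
so that values learned from visited states generalise to states that were never sampled, and only the (comparatively small) weight vector w has to be stored and updated.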
Value Function approximation:
1) Several common forms
By inspection, the first form (state s in, v̂(s, w) out) and the third form (state s in, one q̂(s, a_i, w) output per action) are used most often. The third form is popular because actions are often hard to describe with features, so each action gets its own output; its drawback is that it only suits a small, discrete set of actions. A sketch of these two forms is given below.
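A minimal sketch of the two common forms, using a plain linear parameterisation in NumPy (the feature function `features` and the sizes are placeholders chosen for illustration, not part of the original notes):
```python
import numpy as np

n_features = 8      # size of the feature vector x(s)  (illustrative)
n_actions = 4       # size of a small discrete action set (illustrative)

def features(state):
    """Placeholder feature map x(s); in practice this is problem-specific."""
    return np.asarray(state, dtype=float)

# Form 1: state in -> scalar state value out,  v_hat(s, w) = x(s)^T w
w_v = np.zeros(n_features)
def v_hat(state):
    return features(state) @ w_v

# Form 3: state in -> one action value per action,  q_hat(s, a_i, W) = x(s)^T W[:, i]
W_q = np.zeros((n_features, n_actions))
def q_hat_all(state):
    return features(state) @ W_q     # vector of length n_actions
```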
2) Commonly used function classes
The approximator is required to be differentiable; commonly used choices include:
Linear Combinations of features
Neural network
3) Requirements on the training method
We need a training method that is suitable for non-stationary, non-iid data (the targets change as the policy improves, and successive samples along a trajectory are correlated).
Two families are used: incremental methods and batch methods.
State-value function approximation by stochastic gradient descent:
1) First, represent a state by a feature vector:
x(s) = (x_1(s), x_2(s), ..., x_n(s))^T
For example:
Distance of robot from landmarks
Trends in the stock market
Piece and pawn configurations in chess
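For instance, a robot-localisation feature vector might simply stack the distances to a few known landmarks (a toy illustration; the landmark positions below are made up):
```python
import numpy as np

# Illustrative landmark positions (made up for this example)
landmarks = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])

def features(robot_xy):
    """x(s): distances from the robot to each landmark."""
    robot_xy = np.asarray(robot_xy, dtype=float)
    return np.linalg.norm(landmarks - robot_xy, axis=1)

print(features([1.0, 1.0]))   # -> array of 3 distances
```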
2) If we use linear value function approximation, we get:
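(Reconstructed here; these are the standard linear-VFA formulas, not copied from the original notes.) The value is a dot product of features and weights, and minimising the mean-squared error between v̂ and v_π by stochastic gradient descent gives an update of step-size × prediction error × feature vector:
\[
\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(s)\, w_j
\]
\[
J(\mathbf{w}) = \mathbb{E}_\pi\!\left[ \big( v_\pi(S) - \hat{v}(S, \mathbf{w}) \big)^2 \right],
\qquad
\Delta \mathbf{w} = \alpha \big( v_\pi(S) - \hat{v}(S, \mathbf{w}) \big)\, \mathbf{x}(S)
\]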
3) In actual RL, the state-value function is what is commonly used for prediction, but the true target v_π(s) is unknown and has to be replaced by a sampled target.
The incremental prediction algorithms are as follows:
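The standard substitutions for the unknown v_π(S_t) are (G_t is the full return, G_t^λ the λ-return):
\[
\text{MC:}\qquad \Delta \mathbf{w} = \alpha \big( G_t - \hat{v}(S_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})
\]
\[
\text{TD(0):}\qquad \Delta \mathbf{w} = \alpha \big( R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})
\]
\[
\text{TD}(\lambda)\text{ (forward view):}\qquad \Delta \mathbf{w} = \alpha \big( G_t^{\lambda} - \hat{v}(S_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})
\]
(In the linear case the gradient is simply x(S_t).)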
Action-value function approximation by stochastic gradient descent:
1) First, represent a (state, action) pair by a feature vector:
x(s, a) = (x_1(s, a), x_2(s, a), ..., x_n(s, a))^T
2) If we use linear action-value function approximation, we get:
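(Again reconstructed; the standard linear action-value formulas.) The case is exactly analogous to the state-value one:
\[
\hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(s, a)\, w_j,
\qquad
\Delta \mathbf{w} = \alpha \big( q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}) \big)\, \mathbf{x}(S, A)
\]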
3) In actual RL, the action-value function is what is commonly used for control, but the true target q_π(s, a) is unknown and has to be replaced by a sampled target.
The incremental control algorithms are as follows:
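As in prediction, the standard substitutions for the unknown q_π(S_t, A_t) are:
\[
\text{MC:}\qquad \Delta \mathbf{w} = \alpha \big( G_t - \hat{q}(S_t, A_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})
\]
\[
\text{TD(0) (SARSA-style):}\qquad \Delta \mathbf{w} = \alpha \big( R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})
\]
\[
\text{TD}(\lambda)\text{ (forward view):}\qquad \Delta \mathbf{w} = \alpha \big( q_t^{\lambda} - \hat{q}(S_t, A_t, \mathbf{w}) \big)\, \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})
\]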
4) The complete optimisation process under the control framework:
In plain English: start from a random policy, use an MC/TD method to adjust w and obtain an approximate value function, perform ε-greedy policy improvement based on that approximate value function, and keep looping.
Therefore, value function approximation only plays the role of policy evaluation (prediction); policy improvement still requires a separate mechanism such as ε-greedy. A code sketch of the full loop is given below.
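A minimal sketch of this loop, assuming a linear q̂(s, a, w) with one weight vector per action (the third form above), ε-greedy improvement, and a toy environment whose `reset()` returns a state and whose `step(a)` returns `(next_state, reward, done)`; all names and hyperparameters here are illustrative, not the lecture's reference code:
```python
import numpy as np

def epsilon_greedy(q_values, eps, rng):
    """Pick argmax with prob. 1-eps, otherwise a uniformly random action."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def semi_gradient_sarsa(env, features, n_features, n_actions,
                        episodes=500, alpha=0.01, gamma=0.99, eps=0.1, seed=0):
    """Model-free control with linear action-value approximation.

    Policy evaluation is the TD(0)/SARSA update on q_hat(s, a, w) = features(s) @ W[a];
    policy improvement is epsilon-greedy over the current q_hat.
    """
    rng = np.random.default_rng(seed)
    W = np.zeros((n_actions, n_features))

    for _ in range(episodes):
        s = env.reset()
        x = features(s)
        a = epsilon_greedy(W @ x, eps, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x_next = features(s_next)
            if done:
                td_target = r                      # no bootstrap at the terminal state
                a_next = None
            else:
                a_next = epsilon_greedy(W @ x_next, eps, rng)
                td_target = r + gamma * (W[a_next] @ x_next)
            # semi-gradient SARSA update: step-size * TD error * feature vector
            W[a] += alpha * (td_target - W[a] @ x) * x
            s, x, a = s_next, x_next, a_next
    return W
```
Here `features` could be, for instance, the landmark-distance map sketched earlier; the evaluation step uses the TD(0)/SARSA target from the list above, and the improvement step is ε-greedy, matching the loop described in the text.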