Andrew Ng's machine learning course. Course resources: CS229 courseware, NetEase Open Class video.
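Mathematical model of the problem:
A five-tuple {S, A, Psa, γ, R}, whose elements are, respectively, {the set of states, the set of actions, the state transition probabilities (the distribution over next states when action a is taken in state s), the discount factor, the reward function}.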
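Optimization objective:
Choose a policy that maximizes the expected discounted reward: $E[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots]$. Because the discount factor γ satisfies 0 ≤ γ < 1, a reward received later is worth less, so the policy is encouraged to collect rewards as early as possible.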
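To make the effect of γ concrete, here is a minimal sketch that sums a finite reward sequence with discounting. The function name `discounted_return` and the two sample sequences are made up for illustration; they are not from the lecture.

```python
def discounted_return(rewards, gamma):
    """Return sum of gamma**t * rewards[t] over a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# The same reward of 1 is worth more when it arrives earlier.
early = discounted_return([1.0, 0.0, 0.0], gamma=0.9)  # reward at step 0
late = discounted_return([0.0, 0.0, 1.0], gamma=0.9)   # reward at step 2, about 0.81
print(early, late)
```

With γ = 0.9 the reward that arrives two steps later is scaled by 0.9 squared, which is why a policy that reaches the reward sooner scores higher.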
Optimization function:
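According to the Bellman equation, the value function of a policy π satisfies:
$$V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s')\, V^{\pi}(s')$$
R(s) is the immediate reward for being in state s, and the summation that follows is the expected discounted return from subsequent behavior under the policy.
The optimal value function satisfies:
$$V^{*}(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s')\, V^{*}(s')$$
Then the optimal policy in state s is the action that satisfies the following equation:
$$\pi^{*}(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V^{*}(s')$$
In this way, the value function can be computed iteratively: repeatedly apply the update for V* to every state until the values stop changing.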
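The iterative calculation can be sketched as follows. This is only an illustration under assumptions: the transition probabilities are taken as known and stored in a dict `P[(s, a)] -> {next_state: probability}`, and the two-state MDP ("A", "B" with actions "stay" and "go") is a made-up toy, not the robot example from the lecture.

```python
def expected_value(P, V, s, a):
    """Expected next-state value: sum over s' of P[(s, a)][s'] * V[s']."""
    return sum(p * V[s_next] for s_next, p in P[(s, a)].items())


def value_iteration(states, actions, P, R, gamma, tol=1e-8, max_sweeps=10000):
    """Repeat V(s) := R(s) + gamma * max_a E[V(s')] until the values settle."""
    V = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        new_V = {
            s: R[s] + gamma * max(expected_value(P, V, s, a) for a in actions)
            for s in states
        }
        change = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if change < tol:
            break
    return V


def greedy_policy(states, actions, P, V):
    """In each state, pick the action with the largest expected next value."""
    return {
        s: max(actions, key=lambda a: expected_value(P, V, s, a))
        for s in states
    }


# Hypothetical toy MDP: "go" switches state, "stay" keeps it; only B pays.
states = ["A", "B"]
actions = ["stay", "go"]
R = {"A": 0.0, "B": 1.0}
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 1.0},
}

V = value_iteration(states, actions, P, R, gamma=0.9)
policy = greedy_policy(states, actions, P, V)
print(V, policy)
```

In this toy problem the values approach 10 for B (a reward of 1 forever, discounted by 0.9) and 9 for A, and the greedy policy moves from A to B and then stays there.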
Solution Method:
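In practice, however, Psa is unknown, so it has to be estimated by counting. For the robot-movement example in class, Ng explained that you can let the robot walk around and count how many times it reaches each state: the estimate of Psa(s') is the number of times action a taken in state s led to s', divided by the number of times action a was taken in state s.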
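The counting step might look like the sketch below. The log format (a list of `(state, action, next_state)` triples) and the function name `estimate_transitions` are assumptions for illustration. For a state-action pair that was never tried, the sketch falls back to a uniform distribution over states, which is one common convention and should be read as an assumption here.

```python
from collections import defaultdict


def estimate_transitions(log, states, actions):
    """Estimate P[(s, a)][s'] as count(s, a -> s') / count(s, a)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in log:
        counts[(s, a)][s_next] += 1

    P_hat = {}
    for s in states:
        for a in actions:
            total = sum(counts[(s, a)].values())
            if total == 0:
                # Assumption: no data for this pair, so use a uniform guess.
                P_hat[(s, a)] = {s2: 1.0 / len(states) for s2 in states}
            else:
                P_hat[(s, a)] = {
                    s2: counts[(s, a)][s2] / total for s2 in states
                }
    return P_hat


# Hypothetical observations from letting the robot walk around.
log = [("A", "go", "B"), ("A", "go", "B"), ("A", "go", "A"), ("B", "stay", "B")]
P_hat = estimate_transitions(log, states=["A", "B"], actions=["stay", "go"])
print(P_hat[("A", "go")])
```

Here "go" from A was tried three times and reached B twice, so the estimate for that transition is 2/3.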
So the complete reinforcement learning procedure is this:
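One way to put the pieces together is sketched below: act in the environment, update the counts, re-estimate Psa, re-solve for V, and make the policy greedy. This is an illustrative loop under assumptions, not necessarily the exact procedure shown in the lecture. The names `learn_mdp` and `toy_step`, the 30% random exploration, the fixed number of sweeps, and the two-state environment are all hypothetical.

```python
import random


def learn_mdp(states, actions, R, step, gamma=0.9, rounds=5,
              steps_per_round=50, explore=0.3, seed=0):
    """Alternate between gathering experience and re-planning on the estimate."""
    rng = random.Random(seed)
    counts = {(s, a): {s2: 0 for s2 in states} for s in states for a in actions}
    policy = {s: rng.choice(actions) for s in states}  # random initial policy
    V = {s: 0.0 for s in states}

    def q(P, s, a):
        # Expected next-state value under the estimated model.
        return sum(p * V[s2] for s2, p in P[(s, a)].items())

    for _ in range(rounds):
        # (a) Run the current policy, with some random actions (assumption).
        s = rng.choice(states)
        for _ in range(steps_per_round):
            a = rng.choice(actions) if rng.random() < explore else policy[s]
            s_next = step(s, a, rng)
            counts[(s, a)][s_next] += 1
            s = s_next
        # (b) Re-estimate Psa from counts; uniform if a pair was never tried.
        P = {}
        for key, nxt in counts.items():
            n = sum(nxt.values())
            P[key] = {s2: (c / n if n else 1.0 / len(states))
                      for s2, c in nxt.items()}
        # (c) Value iteration on the estimated model (fixed number of sweeps).
        for _ in range(200):
            V = {s0: R[s0] + gamma * max(q(P, s0, a) for a in actions)
                 for s0 in states}
        # (d) Make the policy greedy with respect to the new values.
        policy = {s0: max(actions, key=lambda a: q(P, s0, a)) for s0 in states}
    return policy, V


def toy_step(s, a, rng):
    """Hypothetical environment: 'go' switches between A and B, 'stay' does not."""
    if a == "go":
        return "B" if s == "A" else "A"
    return s


policy, V = learn_mdp(["A", "B"], ["stay", "go"], {"A": 0.0, "B": 1.0}, toy_step)
print(policy)
```

The learner never sees the true transition rule; it only sees where each action led, yet in this toy setting it ends up moving from A to B and staying in B, where the reward is.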
CS229 Machine Learning 12: Reinforcement Learning Notes