First, some concepts
Two dynamic-programming problems for an MDP:
Prediction: given the MDP and a policy π, find the value function vπ.
Control: given the MDP, find the optimal value function v∗ and the optimal policy π∗.
Policy Evaluation:
Given a policy π, start from an initial value function v0 and iterate v0, v1, …: the value function at step k+1 is obtained from the step-k one via the Bellman expectation equation,

v_{k+1}(s) = ∑_{a∈A} π(a|s) [ R_s^a + γ ∑_{s′∈S} P_{ss′}^a v_k(s′) ]

The sequence converges to vπ, which completes the evaluation of policy π.
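The iterative backup above can be sketched in Python. The toy two-state MDP, the transition format P[s][a] = [(prob, next_state, reward), …], and all names here are illustrative assumptions, not anything given in these notes:

```python
# Hypothetical toy MDP: P[s][a] lists (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.9

def policy_evaluation(P, policy, gamma, theta=1e-8):
    """Sweep the Bellman expectation backup until the largest
    per-state change drops below theta."""
    V = {s: 0.0 for s in P}  # v_0: all zeros
    while True:
        delta = 0.0
        for s in P:
            # v_{k+1}(s) = sum_a pi(a|s) [ R_s^a + gamma * sum_s' P_ss'^a v_k(s') ]
            v = sum(policy[s][a] * sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

# Evaluate the uniform random policy pi(a|s) = 0.5.
policy = {s: {a: 0.5 for a in P[s]} for s in P}
V = policy_evaluation(P, policy, gamma)
```

State 1 earns the larger rewards in this toy model, so the converged values satisfy vπ(1) > vπ(0).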
Policy Iteration:
1. Policy evaluation: update the value function for the current policy, using the evaluation method above.
2. Policy improvement: update the policy greedily with respect to the value function from step 1.
3. Repeat these two steps until they converge to the optimal value function v∗ and the optimal policy π∗.
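A minimal sketch of these three steps, again assuming a hypothetical dict-based MDP with P[s][a] = [(prob, next_state, reward), …]; every name is illustrative:

```python
# Hypothetical toy MDP: P[s][a] lists (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.9

def q_value(P, V, s, a, gamma):
    """One-step lookahead: R_s^a + gamma * sum_s' P_ss'^a V(s')."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def policy_iteration(P, gamma, theta=1e-8):
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary starting policy
    while True:
        # Step 1: evaluate the current deterministic policy.
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v = q_value(P, V, s, policy[s], gamma)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Step 2: improve greedily with respect to V.
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
            if best != policy[s]:
                policy[s], stable = best, False
        # Step 3: stop once improvement no longer changes the policy.
        if stable:
            return policy, V

policy, V = policy_iteration(P, gamma)
```

On this toy MDP the iteration settles on "go" in state 0 and "stay" in state 1, with v∗(1) = 2/(1 − γ) = 20 and v∗(0) = 1 + γ·20 = 19.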
Value Iteration:
Apply the Bellman optimality equation directly, updating the value function on every sweep. No explicit policy is maintained; the max over actions in the optimality backup plays the greedy-improvement role:

v_{k+1}(s) = max_{a∈A} [ R_s^a + γ ∑_{s′∈S} P_{ss′}^a v_k(s′) ]
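The optimality backup can be sketched the same way; the toy MDP and all names are illustrative assumptions:

```python
# Hypothetical toy MDP: P[s][a] lists (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.9

def value_iteration(P, gamma, theta=1e-8):
    """Sweep v_{k+1}(s) = max_a [ R_s^a + gamma * sum_s' P_ss'^a v_k(s') ]
    until convergence; no policy is stored during the loop."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # A greedy policy can be read off the converged values afterwards.
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in P}
    return V, policy

V, policy = value_iteration(P, gamma)
```

Unlike policy iteration, there is no inner evaluation loop: each sweep evaluates and improves implicitly through the max.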