As stated earlier, the aim of reinforcement learning is to find the optimal policy of a Markov decision process (MDP), that is, the policy that obtains the maximum value vπ(s) from any initial state. (This article does not consider reinforcement learning in non-Markov environments or partially observable Markov decision processes (POMDPs).)
So how do we solve for the optimal policy? There are three basic methods:
Dynamic programming methods
Monte Carlo methods
Temporal-difference methods.
Dynamic programming is the most basic of these and the foundation for understanding the algorithms that follow, so this article introduces the dynamic programming solution of an MDP. Throughout, we assume complete knowledge of the MDP model M = (S, A, Psa, R).
1. Bellman Equation
In the previous article we obtained the expressions for vπ and qπ, which can be written in the following form.
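One standard way of writing them (following Sutton and Barto, with γ the discount factor and r(t+1) the immediate reward) is:

    vπ(s) = Eπ[ r(t+1) + γ vπ(s(t+1)) | s(t) = s ]
    qπ(s,a) = Eπ[ r(t+1) + γ vπ(s(t+1)) | s(t) = s, a(t) = a ]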
In dynamic programming, these two formulas are called the Bellman equations; they express the relationship between the value function of the current state and the value function of the successor state.
The optimization objective, the optimal policy π*, can be expressed as:
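In standard notation, the optimal policy is the one whose value function is at least as large as that of any other policy in every state:

    π* = argmax_π vπ(s),   for every s∈S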
The state-value function and action-value function corresponding to the optimal policy π* are v*(s) and q*(s,a). From their definitions it is easy to see that v*(s) and q*(s,a) satisfy the following relationship:
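Writing Psa(s') for the probability of reaching state s' after taking action a in state s, and R(s,a,s') for the corresponding expected immediate reward (one common convention; the exact signature of R may differ between formulations), the relationship is:

    v*(s) = max_a q*(s,a)
    q*(s,a) = Σ_{s'} Psa(s') [ R(s,a,s') + γ v*(s') ]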
The state-value function and the action-value function each satisfy a Bellman optimality equation:
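In the same notation:

    v*(s) = max_a Σ_{s'} Psa(s') [ R(s,a,s') + γ v*(s') ]
    q*(s,a) = Σ_{s'} Psa(s') [ R(s,a,s') + γ max_{a'} q*(s',a') ]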
With the Bellman equations and the Bellman optimality equations in hand, we can use dynamic programming to solve the MDP.
2. Policy Evaluation
First, for an arbitrary policy π, how do we compute its state-value function vπ(s)? This problem is called policy evaluation.
As mentioned earlier, for a deterministic policy the value function is:
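With a = π(s) the single action the policy chooses in state s, the standard form is:

    vπ(s) = Σ_{s'} Psa(s') [ R(s,a,s') + γ vπ(s') ],   where a = π(s)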
Now extend this to the more general case: if the policy π is stochastic, so that in state s several actions are possible and action a is chosen with probability π(a|s), then the state-value function is defined as follows:
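In this case the expectation is also taken over the choice of action, so the standard form is:

    vπ(s) = Σ_a π(a|s) Σ_{s'} Psa(s') [ R(s,a,s') + γ vπ(s') ]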
In general, an iterative method is used to compute the state-value function: first initialize all vπ(s) to 0 (other states may in fact be given any initial value, but absorbing states must be assigned the value 0), and then use the following formula to update the value of every state s (the (k+1)-th iteration):
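In the notation above, the update is:

    v_{k+1}(s) ← Σ_a π(a|s) Σ_{s'} Psa(s') [ R(s,a,s') + γ v_k(s') ]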
There are two ways to carry out this update of vπ(s).
The first: keep the state values of the k-th iteration [v_k(s1), v_k(s2), v_k(s3), ...] in one array, compute the (k+1)-th values v_{k+1}(s) from the k-th values v_k(s'), and store the results in a second array.
The second: use only one array to hold the state values and overwrite the old value as soon as a new one is obtained, e.g. [v_{k+1}(s1), v_{k+1}(s2), v_k(s3), ...]; in this case v_{k+1}(s) may be computed using values v_{k+1}(s') that have already been updated within the same sweep.
In general, the second method is used, because it makes use of the new values immediately and therefore converges faster. The overall policy evaluation procedure is sketched in the code below.
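As a concrete illustration, here is a minimal sketch of in-place iterative policy evaluation (the second update scheme above). The array-based model representation is my own assumption, not something fixed by the article: P[s, a, s'] holds the transition probabilities Psa(s'), R[s, a, s'] the expected immediate rewards, and pi[s, a] the probabilities π(a|s).

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """In-place iterative policy evaluation.

    P[s, a, s'] : transition probabilities Psa(s')
    R[s, a, s'] : expected immediate rewards
    pi[s, a]    : probability of taking action a in state s under policy pi
    Returns an estimate of v_pi, indexed by state."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)            # all values start at 0 (absorbing states stay at 0)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = v[s]
            # v(s) <- sum_a pi(a|s) sum_s' Psa(s') [ R(s,a,s') + gamma * v(s') ]
            v[s] = sum(pi[s, a] * np.sum(P[s, a] * (R[s, a] + gamma * v))
                       for a in range(n_actions))
            delta = max(delta, abs(v_old - v[s]))
        if delta < theta:             # largest change in this sweep is below the threshold
            return v
```

The loop sweeps the states one by one and overwrites v[s] immediately, so later states in the same sweep already use the new values; the sweep stops when the largest change falls below theta.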
3. Policy Improvement
The purpose of the policy evaluation in the previous section is to provide the basis for finding better policies; the process of finding a better policy is called policy improvement.
Suppose we have a policy π and have determined its value function vπ(s) for all states. For some state s the policy prescribes the action a0 = π(s). Would it be better not to take a0 in state s but some other action a ≠ π(s) instead? To judge this we need to compute the action-value function qπ(s,a), whose formula we gave earlier:
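In the notation used above:

    qπ(s,a) = Σ_{s'} Psa(s') [ R(s,a,s') + γ vπ(s') ]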
The criterion is whether qπ(s,a) is greater than vπ(s). If qπ(s,a) > vπ(s), then the new policy "take action a in state s, and follow policy π in all other states" is better overall than the old policy "follow policy π in all states".
Policy improvement theorem: let π and π' be two deterministic policies. If qπ(s, π'(s)) ≥ vπ(s) for all states s∈S, then policy π' is better than, or at least as good as, policy π; that is, vπ'(s) ≥ vπ(s) for all s.
Using this way of improving the policy at a single state together with the policy improvement theorem, we can sweep over all states and all possible actions and act greedily to obtain a new policy π'. That is, for every s∈S we apply the following update:
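That is, in every state we pick the action with the largest action value under the old policy:

    π'(s) = argmax_a qπ(s,a) = argmax_a Σ_{s'} Psa(s') [ R(s,a,s') + γ vπ(s') ]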
This process of obtaining a new policy that improves on the old one by acting greedily with respect to the old policy's value function is called policy improvement.
Finally, you may wonder whether this greedy procedure really converges to the optimal policy. Suppose the policy improvement process has converged, i.e. vπ'(s) equals vπ(s) for all s. Then, by the update formula above, the following holds for all s∈S:
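Because the greedy update no longer changes the policy, the greedy action value equals the state value:

    vπ'(s) = max_a Σ_{s'} Psa(s') [ R(s,a,s') + γ vπ'(s') ]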
But this is exactly the Bellman optimality equation from Section 1, so π and π' must both be optimal policies. Magical, isn't it?
4. Policy Iteration
The policy iteration algorithm simply combines the previous two sections. Suppose we have a policy π; we use policy evaluation to obtain its value function vπ(s), then use policy improvement to obtain a better policy π', then compute vπ'(s), then obtain an even better policy π'', and so on. The whole process forms the sequence π → vπ → π' → vπ' → π'' → ... → π* → v*.
A sketch of the complete algorithm is given below:
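A minimal sketch of the full loop, under the same assumed array representation (P[s, a, s'], R[s, a, s'], discount gamma); it reuses the policy_evaluation function from the sketch in Section 2 and is only meant to illustrate the structure of the algorithm.

```python
import numpy as np

def greedy_policy(P, R, v, gamma=0.9):
    """Greedy (policy improvement) step: pi'(s) = argmax_a q(s, a),
    returned as a deterministic policy matrix pi[s, a] in {0, 1}."""
    n_states, n_actions, _ = P.shape
    q = np.einsum('sak,sak->sa', P, R + gamma * v)   # q[s, a]
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), q.argmax(axis=1)] = 1.0
    return pi

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement
    until the policy no longer changes (it is then optimal)."""
    n_states, n_actions, _ = P.shape
    pi = np.ones((n_states, n_actions)) / n_actions  # start from the uniform random policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)       # policy evaluation (earlier sketch)
        new_pi = greedy_policy(P, R, v, gamma)       # policy improvement
        if np.array_equal(new_pi, pi):               # policy stable -> optimal
            return new_pi, v
        pi = new_pi
```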
5. Value Iteration
From the above we can see that the policy iteration algorithm contains a policy evaluation step, and policy evaluation requires sweeping over all states several times; this large amount of computation directly affects the efficiency of policy iteration. Do we really have to compute vπ exactly? In fact, there are several ways to truncate the policy evaluation step while still guaranteeing convergence of the algorithm.
Value iteration is one of the most important of these. Each of its iterations sweeps over every state only once, updating all s∈S according to the following formula:
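In the notation used above:

    v_{k+1}(s) ← max_a Σ_{s'} Psa(s') [ R(s,a,s') + γ v_k(s') ]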
That is, at the (k+1)-th iteration the maximum value obtainable over all actions is assigned directly to v_{k+1}(s). Value iteration updates the current v(s) directly from the values v(s') of the successor states s' it can move to, and the algorithm does not even need to store the policy π. In effect, each update improves the implicit policy πk together with the value estimate v_k(s). Only when the algorithm terminates do we use the final v values to recover the optimal policy π.
In addition, value iteration can be interpreted as solving the Bellman optimality equation from Section 1 by iteration.
A sketch of the complete value iteration algorithm is given below:
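A minimal sketch under the same assumed representation (P[s, a, s'], R[s, a, s'], discount gamma, tolerance theta); the last two lines extract the greedy policy from the converged values.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Value iteration: apply the Bellman optimality backup until the
    values stop changing, then read off the greedy (optimal) policy."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # q[a] = sum_s' Psa(s') * (R(s,a,s') + gamma * v(s'))
            q = np.einsum('ak,ak->a', P[s], R[s] + gamma * v)
            delta = max(delta, abs(q.max() - v[s]))
            v[s] = q.max()                      # Bellman optimality backup
        if delta < theta:
            break
    # pi*(s) = argmax_a sum_s' Psa(s') * (R(s,a,s') + gamma * v(s'))
    pi = np.einsum('sak,sak->sa', P, R + gamma * v).argmax(axis=1)
    return pi, v
```

Here the returned pi is simply an array of action indices, one per state.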
As the sketch shows, the last step of value iteration is to extract the optimal policy π* from v*(s).
In general, both value iteration and policy iteration would require an unbounded number of iterations to converge exactly to v* and π*. In practice we usually set a threshold as the stopping condition: when the change in vπ(s) between successive iterations becomes very small, we consider the optimal policy to have been found. For discounted finite MDPs, both algorithms converge to an optimal policy π*.
At this point we have covered the dynamic programming solution of Markov decision processes. The advantage of dynamic programming is that it is mathematically well grounded, but it requires a completely known environment model, which is difficult to obtain in reality. In addition, when the number of states is large, the efficiency of dynamic programming also becomes a problem. The next article introduces the Monte Carlo method, whose advantage is that it does not require a complete environment model.
PS: If anything is unclear, feel free to point it out and I will add further explanation.
Resources:
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[2] Xu, Reinforcement Learning and Its Application in Navigation and Control of Mobile Robots, Ph.D. dissertation, 2002.