In Reinforcement Learning (4): Solving with the Monte Carlo Method (MC), we discussed solving reinforcement learning problems with the Monte Carlo method. Although the Monte Carlo method is flexible and does not need a state transition probability model of the environment, it requires every sample to be a complete state sequence. If we do not have complete state sequences, we cannot solve the problem with the Monte Carlo method. In this article we discuss a method that solves the reinforcement learning problem without complete state sequences: Temporal Difference (TD).
This article corresponds to Chapter 6 of Sutton's book and to Parts 4 and 5 of the UCL reinforcement learning course.
1. Introduction to Temporal Difference (TD)
Like the Monte Carlo method, the temporal difference method is a model-free method for solving reinforcement learning problems, so the earlier definitions of the model-free prediction problem and control problem still apply here.
Prediction problem: given the 5 elements of a reinforcement learning task: the state set $S$, the action set $A$, the instant reward $R$, the decay factor $\gamma$, and a given policy $\pi$, solve for the state value function $v_{\pi}$ of that policy.
Control problem: that is, to find the optimal value function and policy. Given the 5 elements of a reinforcement learning task: the state set $S$, the action set $A$, the instant reward $R$, the decay factor $\gamma$, and the exploration rate $\epsilon$, solve for the optimal action value function $q_{*}$ and the optimal policy $\pi_{*}$.
The Monte Carlo method computes the return of a state as: $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... + \gamma^{T-t-1} R_{T}$$
But the temporal difference method has no complete state sequence, only part of one, so how can we approximate the return of a state? Recall the Bellman equation from Reinforcement Learning (2): Markov Decision Processes (MDP): $$v_{\pi}(s) = \mathbb{E}_{\pi}(R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s)$$
This suggests using $R_{t+1} + \gamma V(S_{t+1})$ as an approximate substitute for the return $G_t$. We generally call $R_{t+1} + \gamma V(S_{t+1})$ the TD target, and $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ the TD error. The process of replacing the return $G_t$ with the TD target approximation is called bootstrapping. With it, we only need two consecutive states and the corresponding reward to try to solve the reinforcement learning problem.
Now that we have an approximate expression for the return $G_t$, we can solve the prediction problem and the control problem with temporal difference.
2. Solving the Prediction Problem with TD
Solving the prediction problem with temporal difference is similar to the Monte Carlo method, with two differences. The first is that the expression for $G_t$ differs; for temporal difference it is: $$G_t = R_{t+1} + \gamma V(S_{t+1})$$
The second is that the coefficient in the iteration differs slightly. Recall the iterative formula of the Monte Carlo method: $$V(S_t) = V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t))$$
Since the temporal difference method has no complete sequences, there is no corresponding visit count $N(S_t)$; it is usually replaced by a coefficient $\alpha \in [0,1]$. The iterative formulas for the value functions of temporal difference are then: $$V(S_t) = V(S_t) + \alpha(G_t - V(S_t))$$$$Q(S_t, A_t) = Q(S_t, A_t) + \alpha(G_t - Q(S_t, A_t))$$
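As a concrete illustration, here is a minimal sketch of tabular TD(0) prediction built from the update formula above. The environment and policy interfaces (`env.reset`, `env.step`, `policy`) are hypothetical assumptions made only for this sketch; only the update rule itself comes from the text.

```python
import collections

def td0_prediction(env, policy, gamma=1.0, alpha=0.1, num_episodes=1000):
    """Tabular TD(0) prediction: estimate V_pi from sampled transitions.

    Assumed interfaces: env.reset() -> state,
    env.step(action) -> (next_state, reward, done); policy(state) -> action.
    """
    V = collections.defaultdict(float)          # state-value estimates, initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target: R_{t+1} + gamma * V(S_{t+1}); a terminal state has value 0
            td_target = reward + (0.0 if done else gamma * V[next_state])
            td_error = td_target - V[state]     # the TD error delta_t
            V[state] += alpha * td_error        # V(S_t) <- V(S_t) + alpha * delta_t
            state = next_state
    return V
```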
Here is a simple example showing the difference between the Monte Carlo method and the temporal difference method when solving the prediction problem.
Assume our reinforcement learning problem has two states, A and B; the model is unknown, and no policies or actions are involved, only state transitions and instant rewards. There are 8 complete state sequences in total, as follows:
① A,0,B,0  ② B,1  ③ B,1  ④ B,1  ⑤ B,1  ⑥ B,1  ⑦ B,1  ⑧ B,0
Only the first state sequence contains two states; the remaining 7 contain only one state. Set the decay factor $\gamma = 1$.
First, we solve the prediction problem with the Monte Carlo method. Since only the first sequence contains state A, the value of A can only be computed from the first sequence, and it equals the return of state A in that sequence: $$V(A) = G(A) = R_A + \gamma R_B = 0$$
For B, we average its return over the 8 sequences, and the result is 6/8.
Now look at the temporal difference method. When computing the value of a state in a sequence, its return uses the estimated value of its successor state. B always terminates and has no successor state, so its value is simply the average of its returns over the 8 sequences, which is also 6/8.
For A, which appears only in the first sequence, the value is: $$V(A) = R_A + \gamma V(B) = \frac{6}{8}$$
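The following sketch reproduces the two estimates above from the 8 episodes, written as lists of (state, reward) pairs; this is only a hand-rolled check of the example, not a general algorithm.

```python
# The 8 episodes as lists of (state, reward) pairs, with gamma = 1.
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

# Monte Carlo: average the actual returns observed after each visit to a state.
returns = {'A': [], 'B': []}
for ep in episodes:
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(r for _, r in ep[i:]))   # gamma = 1, so a plain sum
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)   # {'A': 0.0, 'B': 0.75}

# TD-style estimate: B is always the last state, so V(B) is its average reward;
# A bootstraps from V(B): V(A) = R_A + gamma * V(B).
V_td = {'B': sum(ep[-1][1] for ep in episodes) / len(episodes)}
V_td['A'] = 0 + 1.0 * V_td['B']
print(V_td)   # {'B': 0.75, 'A': 0.75}
```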
From this example we can also see the differences between the Monte Carlo method and the temporal difference method when solving the prediction problem.
First, the temporal difference method can learn before the final outcome is known, can learn when there is no final outcome at all, and can learn in continuing environments, while the Monte Carlo method must wait for the final outcome before learning. The temporal difference method can therefore update state value estimates faster and more flexibly, which has very important practical significance in some settings.
Second, when updating state values, the temporal difference method uses the TD target, i.e. the instant reward plus the estimated value of the next state, in place of the return observed at the end of the state sequence; this is a biased estimate of the current state's value. The Monte Carlo method uses the actual return to update the state value, which is an unbiased estimate of the state value under the given policy. On this point the Monte Carlo method has the advantage.
Third, although the values obtained by the temporal difference method are biased, their variance is lower than that of the Monte Carlo method. The temporal difference method is more sensitive to initial values, but it is usually more efficient than the Monte Carlo method.
From the above it can be seen that the advantages of the temporal difference method are relatively large, so mainstream reinforcement learning methods are based on temporal difference. Later articles will also mainly extend the discussion from the temporal difference method.
3. n-Step Temporal Difference
In the temporal difference method of Section 2, we used $R_{t+1} + \gamma V(S_{t+1})$ as an approximate substitute for the return $G_t$, i.e. we looked one step ahead. Can we look two steps ahead? Of course; the approximate expression for the return $G_t$ then becomes: $$G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$$
Going from two steps to three, and then to n, we conclude that the n-step temporal difference return $G_t^{(n)}$ is: $$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$
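A small helper that computes the n-step return $G_t^{(n)}$ from a recorded trajectory; the trajectory format (a list of rewards and a list of non-terminal states) is an assumption made only for illustration.

```python
def n_step_return(rewards, states, V, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n}).

    rewards[k] is R_{k+1}; states[k] is S_k (terminal state excluded);
    V maps states to current value estimates.
    If fewer than n steps remain, this reduces to the full Monte Carlo return.
    """
    T = len(rewards)                          # number of transitions in the episode
    steps = min(n, T - t)                     # do not step past the terminal state
    g = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < T:                             # bootstrap only if S_{t+n} is non-terminal
        g += gamma ** n * V[states[t + n]]
    return g
```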
When n grows large and tends to infinity, or when it spans a complete state sequence, n-step temporal difference is equivalent to the Monte Carlo method.
Compared with ordinary temporal difference, n-step temporal difference differs only in how the return is computed. Given this extra parameter n, how many steps is best, and how do we measure whether a choice of n is good? We discuss that in the next section.
4. TD($\lambda$)
Choosing the number of steps n in n-step temporal difference is a hyperparameter tuning problem that requires experimentation. To take the predictions of all step lengths into account without increasing the computational complexity, we introduce a new parameter $\lambda \in [0,1]$ and define the $\lambda$-return as the weighted sum of the n-step returns for all n from 1 to $\infty$, where the weight of the n-step return is $(1-\lambda)\lambda^{n-1}$. The formula for the $\lambda$-return is then: $$G_t^{\lambda} = (1-\lambda)\sum\limits_{n=1}^{\infty} \lambda^{n-1}G_t^{(n)}$$
We then obtain the iterative formulas for the value functions of TD($\lambda$): $$V(S_t) = V(S_t) + \alpha(G_t^{\lambda} - V(S_t))$$$$Q(S_t, A_t) = Q(S_t, A_t) + \alpha(G_t^{\lambda} - Q(S_t, A_t))$$
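Under the same trajectory conventions as the n-step sketch above (and reusing the hypothetical `n_step_return` helper), the forward-view $\lambda$-return can be computed directly; for a finite episode the leftover weight collapses onto the full return, as discussed just below.

```python
def lambda_return(rewards, states, V, t, lam, gamma=1.0):
    """Forward-view lambda-return for an episodic trajectory:
    G_t^lambda = (1-lambda) * sum_{n=1}^{T-t-1} lambda^(n-1) * G_t^(n) + lambda^(T-t-1) * G_t,
    where G_t is the full Monte Carlo return (the weight left over at termination).
    """
    T = len(rewards)
    g_lambda = 0.0
    for n in range(1, T - t):                 # truncated n-step returns
        g_lambda += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, states, V, t, n, gamma)
    full_return = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
    g_lambda += lam ** (T - t - 1) * full_return   # terminal weight goes to the actual return
    return g_lambda
```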
Why is the weight of each n-step return defined as $(1-\lambda)\lambda^{n-1}$? As shown in the figure, as n increases the weight of the n-step return decays exponentially. When the terminal state is reached at time T, all the unassigned weight is given to the actual return at termination. This makes the weights of all n-step returns in a complete state sequence sum to 1, and the further a step is from the current state, the smaller its weight.
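For completeness, the geometric-series check behind the claim that the weights sum to 1 (the second form is the episodic case, where the leftover mass $\lambda^{T-t-1}$ is assigned to the final return):
$$\sum_{n=1}^{\infty}(1-\lambda)\lambda^{n-1} = (1-\lambda)\cdot\frac{1}{1-\lambda} = 1, \qquad (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1} + \lambda^{T-t-1} = (1-\lambda^{T-t-1}) + \lambda^{T-t-1} = 1$$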
From the forward view of TD($\lambda$), the value $V(S_t)$ of a state is obtained from $G_t^{\lambda}$, and $G_t^{\lambda}$ is computed indirectly from all subsequent state values, so updating the value of a state requires knowing the values of all subsequent states. That is, we must go through a complete state sequence and obtain the instant reward of every state, including the terminal state, before updating the value of the current state. This is the same requirement as the Monte Carlo method, so from the forward view TD($\lambda$) has the same drawback as the Monte Carlo method. When $\lambda = 0$ it is the ordinary temporal difference method of Section 2, and when $\lambda = 1$ it is the Monte Carlo method.
From the backward view, TD($\lambda$) lets us analyze the impact of a state on subsequent states. For example, a mouse receives an electric shock after hearing 3 bells in a row followed by 1 light signal. In analyzing the cause of the shock, is the bell factor more important, or the light factor? If we attribute the shock to the bell because it occurred more often beforehand, this is called the frequency heuristic; if we attribute it to the most recent states, this is called the recency heuristic.
If we introduce a value for each state, the eligibility (E), to indicate the state's influence on subsequent states, we can use both heuristics at once. The eligibility values of all states are collectively called the eligibility traces, defined as: $$E_0(s) = 0$$$$E_t(s) = \gamma\lambda E_{t-1}(s) + 1(S_t = s), \;\;s.t.\; \lambda,\gamma \in [0,1]$$
The backward-view value function update of TD($\lambda$) can then be expressed, for every state $s$, as: $$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$$$V(s) = V(s) + \alpha\delta_t E_t(s)$$
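A minimal sketch of the backward-view update with eligibility traces, again assuming the hypothetical `env`/`policy` interface from the TD(0) sketch; after each step, every state's value is adjusted in proportion to its trace, and the traces then decay.

```python
import collections

def td_lambda_backward(env, policy, lam=0.8, gamma=1.0, alpha=0.1, num_episodes=1000):
    """Tabular backward-view TD(lambda) prediction with eligibility traces."""
    V = collections.defaultdict(float)
    for _ in range(num_episodes):
        E = collections.defaultdict(float)        # eligibility traces, E_0(s) = 0
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD error delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            E[state] += 1.0                       # E_t(s) = gamma*lambda*E_{t-1}(s) + 1(S_t = s)
            for s in list(E):
                V[s] += alpha * delta * E[s]      # V(s) <- V(s) + alpha * delta_t * E_t(s)
                E[s] *= gamma * lam               # decay the trace toward the next step
            state = next_state
    return V
```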
Some may ask: the forward-view and backward-view formulas look different, so aren't they based on different logic? In fact, the two are equivalent. Let us derive the backward-view update from the forward view:
$$\begin{align} G_t^{\lambda} - V(S_t) &= -V(S_t) + (1-\lambda)\lambda^{0}(R_{t+1} + \gamma V(S_{t+1})) \\ &+ (1-\lambda)\lambda^{1}(R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})) \\ &+ (1-\lambda)\lambda^{2}(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3})) \\ &+ ... \\ &= -V(S_t) + (\gamma\lambda)^0(R_{t+1} + \gamma V(S_{t+1}) - \gamma\lambda V(S_{t+1})) \\ &+ (\gamma\lambda)^1(R_{t+2} + \gamma V(S_{t+2}) - \gamma\lambda V(S_{t+2})) \\ &+ (\gamma\lambda)^2(R_{t+3} + \gamma V(S_{t+3}) - \gamma\lambda V(S_{t+3})) \\ &+ ... \\ &= (\gamma\lambda)^0(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)) \\ &+ (\gamma\lambda)^1(R_{t+2} + \gamma V(S_{t+2}) - V(S_{t+1})) \\ &+ (\gamma\lambda)^2(R_{t+3} + \gamma V(S_{t+3}) - V(S_{t+2})) \\ &+ ... \\ &= \delta_t + \gamma\lambda\delta_{t+1} + (\gamma\lambda)^2\delta_{t+2} + ... \end{align}$$
It can be seen that the forward-view and backward-view TD errors are in fact consistent.
5. Solving the Control Problem with TD
Now let us return to ordinary temporal difference and see how it solves the control problem. Recall the on-policy control of the Monte Carlo method, where we used the $\epsilon$-greedy method for value iteration. For temporal difference we can also use the $\epsilon$-greedy method for value iteration; the main difference from Monte Carlo on-policy control is how the return is obtained. The most common on-policy temporal difference control algorithm is SARSA, which we will explain separately in the next article.
Besides on-policy control, we can also do off-policy control. The main difference between them is that on-policy control generally uses only one policy (most commonly the $\epsilon$-greedy method), while off-policy control generally uses two policies: one, most commonly the $\epsilon$-greedy method, is used to select new actions, and the other, most commonly the greedy method, is used to update the value function. The most common off-policy temporal difference control method is the Q-learning algorithm, which we will also explain separately in a later article.
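As a brief preview (both algorithms are covered in detail in the following articles), the two update rules differ only in how the bootstrap target is chosen; the function and argument names below are hypothetical, chosen only to contrast the targets.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy target: bootstrap from the action a_next actually chosen by the behaviour policy."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy target: bootstrap from the greedy action in s_next, whatever is executed next."""
    best = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```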
6. Temporal Difference Summary
Compared with the Monte Carlo method, temporal difference is more flexible and has stronger learning ability, so it is the mainstream approach to solving reinforcement learning problems; most reinforcement learning, and even deep reinforcement learning, is now based on the idea of temporal difference. We will therefore focus on it in later articles.
In the next article we will discuss the on-policy temporal difference control algorithm SARSA.
(Reposting is welcome; please indicate the source. Feel free to reach out: liujianping-ok@163.com)