Reinforcement Learning (iv): Solving with the Monte Carlo Method (MC)


In Reinforcement Learning (iii): solving with dynamic programming (DP), we discussed how dynamic programming solves the reinforcement learning prediction and control problems. However, every time dynamic programming updates the value of a state, it has to back up over all possible successor states of that state, which leads to a large amount of computation for complex problems. Moreover, when we do not even know the environment's state transition model $P$, dynamic programming cannot be used at all. How do we solve the reinforcement learning problem then? The Monte Carlo (MC) method discussed in this article is one feasible approach.

The Monte Carlo method corresponds to Chapter 5 of Sutton's book and to parts of Lecture 4 and Lecture 5 of the UCL reinforcement learning course.

1. Definition of the model-free reinforcement learning problem

In the dynamic programming article, the two problems of reinforcement learning were defined as follows:

Prediction problem: given the 6 elements of reinforcement learning: state set $S$, action set $A$, model state transition probability matrix $P$, immediate reward $R$, decay factor $\gamma$, and a policy $\pi$, solve for the state value function $v_{\pi}$ of that policy.

Control problem: solve for the optimal value function and policy. Given the 5 elements of reinforcement learning: state set $S$, action set $A$, model state transition probability matrix $P$, immediate reward $R$, and decay factor $\gamma$, solve for the optimal state value function $v_{*}$ and the optimal policy $\pi_{*}$.

It can be seen that the model state transition probability matrix $P$ is always known, that is, the MDP is known. Such reinforcement learning problems are generally called model-based reinforcement learning problems.

However, for many reinforcement learning problems there is no way to obtain the model state transition probability matrix $P$. If we still want to solve such problems, they are model-free reinforcement learning problems, whose two problems are generally defined as:

Prediction problem: given the 5 elements of reinforcement learning: state set $S$, action set $A$, immediate reward $R$, decay factor $\gamma$, and a policy $\pi$, solve for the state value function $v_{\pi}$ of that policy.

Control problem: solve for the optimal value function and policy. Given the 5 elements of reinforcement learning: state set $S$, action set $A$, immediate reward $R$, decay factor $\gamma$, and exploration rate $\epsilon$, solve for the optimal action value function $q_{*}$ and the optimal policy $\pi_{*}$.

The Monte Carlo method discussed in this article addresses the model-free reinforcement learning problems defined above.

2. Characteristics of solving with the Monte Carlo method

The term Monte Carlo has come up in earlier posts, especially in the previous MCMC series. It refers to solving problems approximately through sampling. The Monte Carlo method here is different from MCMC, but the idea of sampling is the same. So how do we sample?

The Monte Carlo method estimates the true value of a state by sampling a number of complete state sequences (episodes). "Complete" means the sequence must reach a terminal state: for example, a game of chess ends in a win or a loss, and a driving task ends either by reaching the destination or by failing. With many complete state sequences of experience, we can approximate the state values and then solve the prediction and control problems.
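To make the notion of a complete sequence concrete, the sketch below collects one episode by following a policy until a terminal state. It is a minimal illustration only: the `env.reset()` / `env.step()` interface and the `policy` callable are hypothetical, not something defined in this article.

```python
def sample_episode(env, policy):
    """Follow `policy` until termination and return the episode as a list of
    (state, action, reward) steps, where `reward` is the reward received
    after taking `action` in `state`."""
    episode = []
    state = env.reset()                               # hypothetical environment interface
    done = False
    while not done:
        action = policy(state)                        # pick an action under the given policy
        next_state, reward, done = env.step(action)   # hypothetical step signature
        episode.append((state, action, reward))
        state = next_state
    return episode
```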

From these characteristics we can see that, first, compared with dynamic programming, the Monte Carlo method does not rely on the model's state transition probabilities; second, it learns from complete sequences, and the more complete episodes we experience, the better the learning effect.

3. Solving the reinforcement learning prediction problem with the Monte Carlo method

Here we first discuss how the Monte Carlo method solves the reinforcement learning prediction problem, that is, policy evaluation. A complete state sequence with $T$ states under a given policy $\pi$ looks like: $$S_1, A_1, R_2, S_2, A_2, \ldots, S_t, A_t, R_{t+1}, \ldots, R_T, S_T$$

Recall the definition of the value function $v_{\pi}(s)$ in a Markov decision process (MDP) from Reinforcement Learning (ii): $$v_{\pi}(s) = \mathbb{E}_{\pi}(G_t | S_t = s) = \mathbb{E}_{\pi}(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots | S_t = s)$$

It can be seen that the value function of a state equals the expectation of the return from that state, and the return is obtained by summing the subsequent rewards weighted by the corresponding powers of the decay factor. For the Monte Carlo method, to estimate the value of a state, we simply average the returns of that state over all complete sequences in which it appears, i.e.: $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots + \gamma^{T-t-1} R_{T}$$ $$v_{\pi}(s) \approx \text{average}(G_t), \text{ s.t. } S_t = s$$
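Since $G_t = R_{t+1} + \gamma G_{t+1}$, the returns of every step in an episode can be computed with a single backward pass. A minimal sketch (the reward list is assumed to be in the time order of the episode above):

```python
def compute_returns(rewards, gamma):
    """Given the rewards of one episode in time order, return the list of
    returns G_t for every step, using G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = []
    g = 0.0
    for r in reversed(rewards):    # walk backwards so later returns feed earlier ones
        g = r + gamma * g
        returns.append(g)
    returns.reverse()              # restore time order: returns[t] corresponds to step t
    return returns
```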

It can be seen that the solution to the prediction problem is quite simple. However, there are several points that can be optimized.

The first point is that the same state may recur in a complete state sequence, so how should the return of that state be calculated? There are two solutions. The first includes only the return of the first occurrence of the state in the sequence when computing the average return; the second computes a return for every occurrence of the state in the sequence and includes all of them in the average. The corresponding Monte Carlo methods are called first-visit and every-visit Monte Carlo. The second method requires more computation than the first, but it is more suitable when the number of complete sampled sequences is small.
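A sketch of Monte Carlo policy evaluation covering both variants, reusing the hypothetical `sample_episode` and `compute_returns` helpers from the earlier sketches; the only difference between first-visit and every-visit is whether a state's return is recorded once per episode or at every occurrence:

```python
from collections import defaultdict

def mc_prediction(env, policy, gamma, num_episodes, first_visit=True):
    """Estimate v_pi(s) by averaging sampled returns over complete episodes."""
    returns_sum = defaultdict(float)   # sum of returns observed for each state
    returns_cnt = defaultdict(int)     # number of returns observed for each state
    for _ in range(num_episodes):
        episode = sample_episode(env, policy)
        returns = compute_returns([r for (_, _, r) in episode], gamma)
        seen = set()
        for (state, _, _), g in zip(episode, returns):
            if first_visit and state in seen:
                continue               # first-visit: only the first occurrence counts
            seen.add(state)
            returns_sum[state] += g    # every-visit: each occurrence contributes
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```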

The second point is the incremental mean update. In the prediction formula above we have an averaging step, which seems to require storing all the returns of a state and averaging at the end. That wastes too much storage. A better approach is to compute the mean return iteratively: keep the mean and the count from the previous iteration, and when the current return arrives, compute the new mean and count from them. This process is easy to follow from the formula: $$\mu_k = \frac{1}{k}\sum\limits_{j=1}^k x_j = \frac{1}{k}\left(x_k + \sum\limits_{j=1}^{k-1}x_j\right) = \frac{1}{k}\left(x_k + (k-1)\mu_{k-1}\right) = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$$
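A quick numerical check of the incremental mean formula, with arbitrary toy numbers:

```python
xs = [4.0, 7.0, 1.0, 6.0]          # arbitrary sample values

mu = 0.0
for k, x in enumerate(xs, start=1):
    mu = mu + (x - mu) / k         # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k

print(mu, sum(xs) / len(xs))       # both print 4.5: incremental mean equals batch mean
```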

The state value update above can therefore be rewritten as: $$N(S_t) = N(S_t) + 1$$ $$V(S_t) = V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t))$$

In this way, no matter how much data there is, the memory required by the algorithm is essentially fixed.

Sometimes, especially in distributed iteration over massive data, we may not be able to keep an exact count $N(S_t)$. In that case we can use a coefficient $\alpha$ instead: $$V(S_t) = V(S_t) + \alpha(G_t - V(S_t))$$

The action value function $Q(S_t, A_t)$ is handled similarly; for example, the action value version of the last formula above is: $$Q(S_t, A_t) = Q(S_t, A_t) + \alpha(G_t - Q(S_t, A_t))$$
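These incremental updates translate directly into code. A minimal sketch with dictionary-based tables and a constant step size $\alpha$ (the table layout and function names are illustrative assumptions):

```python
def update_state_value(V, state, g, alpha):
    """V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))"""
    v = V.get(state, 0.0)
    V[state] = v + alpha * (g - v)

def update_action_value(Q, state, action, g, alpha):
    """Q(S_t, A_t) <- Q(S_t, A_t) + alpha * (G_t - Q(S_t, A_t))"""
    q = Q.get((state, action), 0.0)
    Q[(state, action)] = q + alpha * (g - q)
```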

That is the whole process of solving the prediction problem with the Monte Carlo method. Next we look at solving the control problem.

4. Solving the reinforcement learning control problem with the Monte Carlo method

The idea behind solving the control problem with the Monte Carlo method is similar to that of dynamic programming value iteration. Recall the idea of value iteration in dynamic programming: each iteration first evaluates the policy, computing the values $v_k(s)$, and then updates the current policy $\pi$ by some rule (such as the greedy method). Finally the optimal value function $v_{*}$ and the optimal policy $\pi_{*}$ are obtained.

Compared with dynamic programming, the Monte Carlo method differs in three points. First, the policy evaluation used for the prediction problem is different, as described in Section 3. Second, the Monte Carlo method generally optimizes the optimal action value function $q_{*}$ rather than the optimal state value function $v_{*}$. Third, dynamic programming generally updates the policy with the greedy method, whereas the Monte Carlo method generally uses the $\epsilon$-greedy method. This $\epsilon$ is the eighth model element we introduced in Reinforcement Learning (i): model basics. The $\epsilon$-greedy method sets a small $\epsilon$ value, chooses the action currently believed to have the maximum action value with probability $1-\epsilon$, and chooses an action uniformly at random from all $m$ available actions with probability $\epsilon$. As a formula: $$\pi(a|s) = \begin{cases} \epsilon/m + 1-\epsilon & \text{if } a = a^{*} = \arg\max_{a \in A}Q(s,a)\\ \epsilon/m & \text{else} \end{cases}$$
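A sketch of $\epsilon$-greedy action selection over $m$ candidate actions (table layout and names as in the earlier sketches):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick uniformly among all m actions,
    otherwise pick the action with the largest Q(state, action)."""
    if random.random() < epsilon:
        return random.choice(actions)                            # each action: epsilon / m
    return max(actions, key=lambda a: Q.get((state, a), 0.0))    # greedy choice
```

Note that the greedy action can also be drawn in the random branch, so its total probability is $1-\epsilon+\epsilon/m$, exactly as in the formula above.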

When actually solving the control problem, in order for the algorithm to converge, $\epsilon$ is generally decreased gradually as the iterations proceed, tending to 0. Thus in the early iterations we encourage exploration, and later, having explored enough, we become conservative and mainly exploit greedily, so that the algorithm converges stably. In this way we obtain an iteration picture similar to the one for dynamic programming.

5. Algorithm flow of Monte Carlo control

Here we summarize the algorithm flow of the Monte Carlo method for solving the reinforcement learning control problem. The algorithm below is the on-policy version; there is also an off-policy version. The difference between on-policy and off-policy will be discussed in later articles. We use every-visit here, that is, every occurrence of the same state in a state sequence contributes its own return.

The algorithm flow of on-policy Monte Carlo control is as follows (a code sketch of the whole loop is given after the steps):

Input: state set $S$, action set $A$, immediate reward $R$, decay factor $\gamma$, exploration rate $\epsilon$

Output: optimal action value function $q_{*}$ and optimal policy $\pi_{*}$

1. Initialize all action values $Q(s,a) = 0$ and all counts $N(s,a) = 0$, set the number of sampled episodes $k = 0$, and randomly initialize a policy $\pi$.

2. Set $k = k+1$ and, following the policy $\pi$, perform the $k$-th Monte Carlo sampling to obtain a complete state sequence: $$S_1, A_1, R_2, S_2, A_2, \ldots, S_t, A_t, R_{t+1}, \ldots, R_T, S_T$$

3. For each state-action pair $(S_t, A_t)$ appearing in the sequence, compute its return $G_t$, then update its count $N(S_t, A_t)$ and its action value function $Q(S_t, A_t)$: $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots + \gamma^{T-t-1}R_{T}$$ $$N(S_t, A_t) = N(S_t, A_t) + 1$$ $$Q(S_t, A_t) = Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t, A_t))$$

4. Based on the newly computed action values, update the current $\epsilon$-greedy policy: $$\epsilon = \frac{1}{k}$$ $$\pi(a|s) = \begin{cases} \epsilon/m + 1-\epsilon & \text{if } a = a^{*} = \arg\max_{a \in A}Q(s,a)\\ \epsilon/m & \text{else} \end{cases}$$

5. If all $Q(s,a)$ have converged, then all $Q(s,a)$ correspond to the optimal action value function $q_{*}$, and the corresponding policy $\pi(a|s)$ is the optimal policy $\pi_{*}$. Otherwise, go back to step 2.
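The whole loop can be sketched as follows, tying together the hypothetical helpers from the earlier sections (every-visit updates, $\epsilon = 1/k$). For simplicity the sketch runs a fixed number of episodes instead of testing $Q$ for convergence as in step 5:

```python
from collections import defaultdict

def mc_control(env, actions, gamma, num_episodes):
    """On-policy every-visit Monte Carlo control with an epsilon-greedy policy."""
    Q = defaultdict(float)   # action value table Q(s, a)
    N = defaultdict(int)     # visit counts N(s, a)
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                                          # step 4: epsilon decays with k
        policy = lambda s: epsilon_greedy(Q, s, actions, epsilon)  # current epsilon-greedy policy
        episode = sample_episode(env, policy)                      # step 2: one complete episode
        returns = compute_returns([r for (_, _, r) in episode], gamma)
        for (state, action, _), g in zip(episode, returns):        # step 3: every-visit updates
            N[(state, action)] += 1
            Q[(state, action)] += (g - Q[(state, action)]) / N[(state, action)]
    # read off the greedy policy from the learned action values
    states = {s for (s, _) in Q}
    greedy_policy = {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in states}
    return Q, greedy_policy
```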

6. Summary of the Monte Carlo method for solving reinforcement learning problems

The Monte Carlo method is the second method we have covered for solving reinforcement learning problems, and the first model-free one. It avoids the computational complexity of dynamic programming and can be applied to massive data and complex models without knowing the environment's transition model in advance. But it also has its own shortcoming: every sample must be a complete state sequence. If we have no complete state sequences, or it is hard to obtain many of them, the Monte Carlo method is of limited use; that is to say, we still need other, more flexible model-free methods for reinforcement learning.

In the next article, we discuss solving the reinforcement learning prediction and control problems with temporal-difference methods.

(Reposting is welcome; please indicate the source. Comments and exchanges are also welcome: liujianping-ok@163.com)
