Reinforcement Learning (6): The Temporal-Difference Online Control Algorithm SARSA

In Reinforcement Learning (5): Solving with the Temporal-Difference Method (TD), we discussed using temporal differences to solve the reinforcement learning prediction problem, but did not go deeply into the control algorithms. This article gives a detailed discussion of SARSA, the temporal-difference online control algorithm.

This article corresponds to Chapter 6 of Sutton's book and Part 5 of the UCL reinforcement learning course.

1. Introduction to the SARSA Algorithm

The SARSA algorithm is a method that uses temporal differences to solve the reinforcement learning control problem. To recap, the control problem can be stated as follows: given the five elements of reinforcement learning, the state set $S$, the action set $A$, the instant reward $R$, the decay (discount) factor $\gamma$, and the exploration rate $\epsilon$, solve for the optimal action value function $q_{*}$ and the optimal policy $\pi_{*}$.

This class of problem does not require a model of the environment's state transitions; it is a model-free method for solving reinforcement learning problems. For the control problem, the approach is similar to the Monte Carlo method: it is value iteration. That is, we update the current policy through updates to the value function, then use the new policy to generate new states and instant rewards, and then update the value function again. This continues until the value function and the policy both converge.

Recall that the control problem for the temporal-difference method can be divided into two categories. One is online control (on-policy control), where we always use a single policy both to update the value function and to select new actions. The other is offline control (off-policy control), which uses two policies: one to select new actions and another to update the value function.

Our SARSA algorithm belongs to the online control category: it always uses one policy to update the value function and to choose new actions, and that policy is the $\epsilon$-greedy method. In Reinforcement Learning (4): Solving with the Monte Carlo Method (MC), we explained the $\epsilon$-greedy method in detail: with a small value of $\epsilon$, we greedily select the action currently believed to have the greatest action value with probability $1-\epsilon$, and randomly select an action from all $m$ available actions with probability $\epsilon$. The formula can be expressed as: $$\pi(a|s) = \begin{cases} \epsilon/m + 1 - \epsilon & \text{if } a = a^{*} = \arg\max_{a' \in A} Q(s,a') \\ \epsilon/m & \text{else} \end{cases}$$
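As a concrete illustration, the selection rule above can be written as a few lines of Python over a tabular $Q$. This is only a sketch; the array name q_row and the helper name epsilon_greedy are placeholders for illustration, not code discussed later in this article.

    import numpy as np

    def epsilon_greedy(q_row, epsilon):
        """Select an action given a 1-D array of Q values for one state.

        With probability epsilon choose uniformly at random among the m
        actions, otherwise choose greedily (ties broken at random)."""
        m = len(q_row)
        if np.random.rand() < epsilon:
            return np.random.randint(m)
        best = np.flatnonzero(q_row == np.max(q_row))
        return np.random.choice(best)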

2. SARSA Algorithm Overview

As the name SARSA itself suggests, the algorithm is built from the letters S, A, R, S, A. Here S, A, R stand for state, action, and reward, the symbols we have been using all along. The name reflects the following process:

In an iteration, we first select an action $A$ in the current state $S$ based on the $\epsilon$-greedy method. The system then moves to a new state $S'$ and gives us an instant reward $R$. In the new state $S'$, we again use the $\epsilon$-greedy method to select an action $A'$. Note, however, that at this point we do not yet execute $A'$; it is only used to update our value function. The update formula for the value function is: $$Q(S,A) = Q(S,A) + \alpha\left(R + \gamma Q(S',A') - Q(S,A)\right)$$

Here $\gamma$ is the decay factor and $\alpha$ is the iteration step size. The main difference between this iterative formula for the online control problem and the Monte Carlo method is that the expression for the return $G_t$ is different: for the temporal-difference method, the return $G_t$ is expressed as $R + \gamma Q(S',A')$. This value function update uses the Bellman equation we discussed in Section 2 of Reinforcement Learning (5): Solving with the Temporal-Difference Method (TD).

Apart from this expression for the return $G_t$, the SARSA algorithm is basically similar to the Monte Carlo online control algorithm.
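To make the update concrete, here is a minimal sketch of a single SARSA backup on one $(S, A, R, S', A')$ tuple; the names q, alpha, and gamma are assumptions for illustration rather than code from this article.

    def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
        """One tabular SARSA backup:
        Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A)).
        q is assumed to be a table (e.g. a numpy array) indexed by (state, action)."""
        td_target = r + gamma * q[s_next, a_next]   # the TD version of the return G_t
        q[s, a] += alpha * (td_target - q[s, a])    # move Q(S,A) toward the target
        return q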

3. Sarsa Algorithm Flow

Below we summarize the flow of the SARSA algorithm; a minimal code sketch follows the step list.

Algorithm input: number of iterations $T$, state set $S$, action set $A$, step size $\alpha$, decay factor $\gamma$, exploration rate $\epsilon$.

Output: the action value $Q$ for all states and actions.

1. Randomly initialize the value $Q$ of all states and actions. For the terminal state, initialize its $Q$ value to 0.

2. For $i$ from 1 to $T$, iterate:

a) Initialize $S$ as the first state of the current episode. Set $A$ to the action selected by the $\epsilon$-greedy method in the current state $S$.

b) Execute the current action $A$ in state $S$, obtaining the new state $S'$ and reward $R$.

c) Use the $\epsilon$-greedy method to select a new action $A'$ in state $S'$.

d) Update the value function $Q(S,A)$: $$Q(S,A) = Q(S,A) + \alpha\left(R + \gamma Q(S',A') - Q(S,A)\right)$$

e) $S = S'$, $A = A'$.

f) If $S'$ is the terminal state, the current episode is complete; otherwise go to step b).

One thing to note here is that the step size $\alpha$ generally needs to decrease gradually as the iterations proceed so that the action value function $Q$ can converge. When $Q$ converges, our $\epsilon$-greedy policy also converges.
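Putting the steps above together, a minimal tabular SARSA loop might look like the following sketch. The environment interface (reset() returning the first state, step(action) returning (next_state, reward, done)) is an assumption in the style of common RL toolkits, not something defined in this article.

    import numpy as np

    def sarsa(env, n_states, n_actions, episodes=500,
              alpha=0.5, gamma=1.0, epsilon=0.1):
        """Tabular SARSA following the flow summarized above (sketch)."""
        q = np.zeros((n_states, n_actions))  # step 1: Q = 0, including terminal states

        def select(s):
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                return np.random.randint(n_actions)
            return np.random.choice(np.flatnonzero(q[s] == q[s].max()))

        for _ in range(episodes):            # step 2: iterate over episodes
            s = env.reset()                  # a) first state of the episode
            a = select(s)                    #    choose A with epsilon-greedy
            done = False
            while not done:
                s_next, r, done = env.step(a)   # b) execute A, observe R and S'
                a_next = select(s_next)         # c) choose A' with epsilon-greedy
                # d) SARSA backup of Q(S,A)
                q[s, a] += alpha * (r + gamma * q[s_next, a_next] - q[s, a])
                s, a = s_next, a_next           # e) move on to (S', A')
            # f) episode ends when the terminal state is reached
        return q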

4. Sarsa Algorithm Example: Windy Gridworld

Below we use the well-known Windy Gridworld example to study the SARSA algorithm.

Consider a 10x7 rectangular grid world with a marked start position S and a terminal goal position G. The number below each column indicates the strength of the wind in that column. When the agent enters a cell in such a column, it is automatically pushed upward (in the direction shown by the arrows in the figure) by that number of cells; this simulates the wind in this world. The grid world has boundaries, and at any moment the agent can only be in one cell inside the world. The agent knows nothing about the structure of the world or the wind: it does not know that the grid is rectangular, does not know where the boundaries are, does not know the position of the next cell relative to the previous one after it moves, and of course does not know the exact locations of the start and the goal. However, the agent remembers the cells it has visited, and when it enters a cell again it can recognize that it has been there before. The actions the agent can perform are to move up, down, left, or right by one step. Each step that does not enter the goal position yields a reward of -1; upon entering the goal position it receives a reward of 0 and stays there permanently. The problem to solve is what policy the agent should follow to reach the goal from the start position as quickly as possible.

The logic is not complicated, and the complete code is here. Below I mainly look at the key parts of the code.
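The snippets below refer to a handful of module-level names. A plausible setup, reconstructed from those snippets, is shown here; the concrete numbers (grid size, wind strengths, start and goal cells, EPSILON, ALPHA) follow the standard version of the Windy Gridworld example and are assumptions as far as this article's code is concerned.

    import numpy as np

    # grid dimensions and the wind strength of each column
    WORLD_HEIGHT = 7
    WORLD_WIDTH = 10
    WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]

    # the four available moves
    ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT = 0, 1, 2, 3
    ACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]

    # start and goal cells, given as [row, column]
    START = [3, 0]
    GOAL = [3, 7]

    # exploration rate, step size, and per-step reward
    EPSILON = 0.1
    ALPHA = 0.5
    REWARD = -1.0

    # tabular action value function Q(s, a)
    q_value = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, len(ACTIONS)))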

In step a) of part 2 of the algorithm, initializing $S$ and using the $\epsilon$-greedy method to select an action in the current state $S$ looks like this:

    # initialize state
    state = START
    # choose an action based on the epsilon-greedy algorithm
    if np.random.binomial(1, EPSILON) == 1:
        action = np.random.choice(ACTIONS)
    else:
        values_ = q_value[state[0], state[1], :]
        action = np.random.choice([action_ for action_, value_ in enumerate(values_)
                                   if value_ == np.max(values_)])

In step b) of part 2, executing the current action $A$ in state $S$ to get the new state $S'$ works as follows. Because the reward for every non-terminal step is -1, it does not need to be computed separately:

    def step(state, action):
        i, j = state
        if action == ACTION_UP:
            return [max(i - 1 - WIND[j], 0), j]
        elif action == ACTION_DOWN:
            return [max(min(i + 1 - WIND[j], WORLD_HEIGHT - 1), 0), j]
        elif action == ACTION_LEFT:
            return [max(i - WIND[j], 0), max(j - 1, 0)]
        elif action == ACTION_RIGHT:
            return [max(i - WIND[j], 0), min(j + 1, WORLD_WIDTH - 1)]
        else:
            assert False

In step c) of part 2, using the $\epsilon$-greedy method to select the new action $A'$ in state $S'$ looks like this:

    # inside the episode loop
    next_state = step(state, action)
    if np.random.binomial(1, EPSILON) == 1:
        next_action = np.random.choice(ACTIONS)
    else:
        values_ = q_value[next_state[0], next_state[1], :]
        next_action = np.random.choice([action_ for action_, value_ in enumerate(values_)
                                        if value_ == np.max(values_)])

In steps d) and e) of part 2, updating the value function $Q(S,A)$ and then updating the current state and action looks like this:

    # SARSA update
    q_value[state[0], state[1], action] += \
        ALPHA * (REWARD + q_value[next_state[0], next_state[1], next_action] -
                 q_value[state[0], state[1], action])
    state = next_state
    action = next_action

The code is very simple. If you compare it against the algorithm and run it, you can easily obtain the optimal solution of this problem and thereby work through the whole flow of the SARSA algorithm.

5. SARSA($\lambda$)

In Reinforcement Learning (5): Solving with the Temporal-Difference Method (TD), we introduced the multi-step temporal-difference value function iteration method TD($\lambda$); the corresponding multi-step temporal-difference online control algorithm is SARSA($\lambda$).

TD($\lambda$) has two value function iteration views, forward and backward, and of course they are equivalent. For solving the control problem, the SARSA($\lambda$) algorithm based on the backward view can learn online effectively, and the data can be discarded after learning. Therefore the SARSA($\lambda$) algorithm defaults to iterating the value function based on the backward view.

In the previous article we discussed the backward-view iteration of the TD($\lambda$) state value function, namely: $$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$ $$V(S_t) = V(S_t) + \alpha\delta_t E_t(S_t)$$

By analogy, the iterative formula for the corresponding action value function is: $$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$$ $$Q(S_t, A_t) = Q(S_t, A_t) + \alpha\delta_t E_t(S_t, A_t)$$

Apart from this update rule for the action value function $Q(s,a)$, the multi-step parameter $\lambda$, and the eligibility trace $E(s,a)$ introduced by the backward view, the rest of the algorithm's ideas are similar to SARSA. Here we summarize the algorithm flow of SARSA($\lambda$).

Algorithm input: number of iterations $T$, state set $S$, action set $A$, step size $\alpha$, decay factor $\gamma$, exploration rate $\epsilon$, multi-step parameter $\lambda$.

Output: the action value $Q$ for all states and actions.

1. Randomly initialize the value $Q$ of all states and actions. For the terminal state, initialize its $Q$ value to 0.

2. For $i$ from 1 to $T$, iterate:

a) Initialize the eligibility trace $E$ of all states and actions to 0, and initialize $S$ as the first state of the current episode. Set $A$ to the action selected by the $\epsilon$-greedy method in the current state $S$.

b) Execute the current action $A$ in state $S$, obtaining the new state $S'$ and reward $R$.

c) Use the $\epsilon$-greedy method to select a new action $A'$ in state $S'$.

d) Update the eligibility trace function $E(S,A)$ and the TD error $\delta$: $$E(S,A) = E(S,A) + 1$$ $$\delta = R + \gamma Q(S',A') - Q(S,A)$$

e) For every state $s$ and corresponding action $a$ that has appeared in the current episode, update the value function $Q(s,a)$ and the eligibility trace function $E(s,a)$: $$Q(s,a) = Q(s,a) + \alpha\delta E(s,a)$$ $$E(s,a) = \gamma\lambda E(s,a)$$

f) $S = S'$, $A = A'$.

g) If $S'$ is the terminal state, the current episode is complete; otherwise go to step b).

As with SARSA, the step size $\alpha$ generally needs to decrease gradually with the iterations so that the action value function $Q$ converges.
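As with plain SARSA, the flow above can be sketched in a few lines. The following is a minimal sketch of tabular SARSA($\lambda$) with accumulating eligibility traces; the environment interface (reset/step) and the table shapes are assumptions, not code from this article.

    import numpy as np

    def sarsa_lambda(env, n_states, n_actions, episodes=500,
                     alpha=0.5, gamma=1.0, epsilon=0.1, lam=0.9):
        """Tabular SARSA(lambda), backward view with accumulating traces (sketch)."""
        q = np.zeros((n_states, n_actions))

        def select(s):
            if np.random.rand() < epsilon:
                return np.random.randint(n_actions)
            return np.random.choice(np.flatnonzero(q[s] == q[s].max()))

        for _ in range(episodes):
            e = np.zeros_like(q)                  # a) reset all traces to 0
            s = env.reset()
            a = select(s)
            done = False
            while not done:
                s_next, r, done = env.step(a)     # b) execute A, observe R and S'
                a_next = select(s_next)           # c) choose A' with epsilon-greedy
                e[s, a] += 1.0                    # d) bump the trace of (S, A) ...
                delta = r + gamma * q[s_next, a_next] - q[s, a]  # ... and compute the TD error
                q += alpha * delta * e            # e) update Q for all (s, a) ...
                e *= gamma * lam                  #    ... and decay all traces
                s, a = s_next, a_next             # f) move on to (S', A')
            # g) episode ends at the terminal state
        return q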

6. Sarsa Summary

Compared with dynamic programming methods, the SARSA algorithm does not need a state transition model of the environment; compared with the Monte Carlo method, it does not need complete episodes, so it is more flexible. It is widely used among traditional reinforcement learning methods.

However, the SARSA algorithm also has a problem common to traditional reinforcement learning methods: it cannot handle problems that are too complex. In the SARSA algorithm, the values of $Q(s,a)$ are stored in a large table. If our states and actions number in the millions or even tens of millions, the table to keep in memory becomes enormous and may even overflow, so SARSA is not well suited to solving large-scale problems. Of course, for problems that are not especially complex, SARSA is still a good method for solving reinforcement learning problems.

In the next article, we discuss SARSA's sister algorithm, the temporal-difference offline control algorithm Q-Learning.

(Reposting is welcome; please indicate the source. Comments and exchanges are welcome: liujianping-ok@163.com)
