Introduction
The previous article covered Monte Carlo (MC) reinforcement learning. The MC method overcomes the difficulty of evaluating a policy when the model is unknown by sampling trajectories, but it has the drawback that the policy can only be updated after a complete trajectory has been sampled. The MC method therefore does not make full use of the MDP structure of the learning task, whereas temporal-difference (TD) learning does, which makes it more efficient than MC. This article introduces the TD algorithms SARSA and Q-learning.
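The difference shows up directly in the value-function updates: Monte Carlo has to wait for the complete return of an episode, while TD(0) bootstraps from the estimate of the next state after every single step:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ G_t - V(s_t) \right] \qquad \text{(MC)}$$

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] \qquad \text{(TD(0))}$$

where G_t is the return of the whole episode, α is the learning rate and γ is the discount factor.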
SARSA Algorithm
SARSA is an on-policy method: its behavior policy and the policy used in its update are the same. Unlike MC, it does not need to sample a complete trajectory before improving the policy; the value function can be updated after every single action.
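Concretely, after observing the transition (s, a, r, s', a'), which gives the algorithm its name, SARSA applies the update

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$$

where a' is the action that the ε-greedy behavior policy actually takes in s'.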
Q-learning Algorithm
Q-learning is an off-policy method: its behavior policy and the policy used in the value-function update are different. Like SARSA, it does not need to sample a complete trajectory before updating the policy; unlike SARSA, Q-learning uses the greedy policy rather than the ε-greedy policy when updating the value function.
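Its update therefore bootstraps from the best action in the next state, regardless of which action is actually taken there:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$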
The Difference Between the Two
To compare the SARSA algorithm with the Q-learning algorithm, we use a concrete problem, described as follows:
As shown in the figure above, there is a 12×4 grid. The agent starts from the lower-left corner and can move up, down, left, or right (without leaving the grid); the game ends when it reaches the lower-right corner. The bottom row contains 10 marked cells (the cliff); once the agent steps into this area, it is sent back to the starting point and the game continues.
Game Analysis:
1. The reward for stepping into the marked area is -100, and the reward for every other step is -1.
2. The state of the game is the agent's coordinate pair (x, y).
With this we can model the game environment. We set up coordinates from the start point S: the start is placed at (0, 0), the x-axis points to the right, and the y-axis points up. The environment is built as follows:
    class CliffEnvironment(object):

        def __init__(self):
            self.width = 12
            self.height = 4
            self.move = [[0, 1], [0, -1], [-1, 0], [1, 0]]  # up, down, left, right
            self.nA = 4
            self._reset()

        def _reset(self):
            self.x = 0
            self.y = 0
            self.end_x = 11
            self.end_y = 0
            self.done = False

        def observation(self):
            return tuple((self.x, self.y)), self.done

        def clip(self, x, y):
            # Keep the agent inside the grid.
            x = max(x, 0)
            x = min(x, self.width - 1)
            y = max(y, 0)
            y = min(y, self.height - 1)
            return x, y

        def _step(self, action):
            self.done = False
            self.x += self.move[action][0]
            self.y += self.move[action][1]
            self.x, self.y = self.clip(self.x, self.y)
            if self.x >= 1 and self.x <= 10 and self.y == 0:
                # Stepped into the cliff: heavy penalty and back to the start.
                reward = -100
                self._reset()
            elif self.x == self.width - 1 and self.y == 0:
                # Reached the goal in the lower-right corner.
                reward = 0
                self.is_destination = True
                self.done = True
            else:
                reward = -1
            return tuple((self.x, self.y)), reward, self.done
Here _step takes one step according to the given action, _reset restores the starting layout, and observation returns the current state.
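A short interaction illustrates these dynamics; the action indices follow self.move, so action 3 is a step to the right and action 0 a step up:

    env = CliffEnvironment()
    env._reset()
    print(env.observation())   # ((0, 0), False): the start state

    # Action 3 is "right"; from (0, 0) this steps onto the cliff,
    # so the reward is -100 and the agent is sent back to the start.
    print(env._step(3))        # ((0, 0), -100, False)

    # Action 0 is "up"; an ordinary move that costs the step penalty of -1.
    print(env._step(0))        # ((0, 1), -1, False)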
SARSA Algorithm
The code to update the policy using the SARSA algorithm is as follows:
    import sys
    from collections import defaultdict

    import numpy as np

    def sarsa(env, episode_nums, discount_factor=1.0, alpha=0.5, epsilon=0.1):
        env = CliffEnvironment()
        Q = defaultdict(lambda: np.zeros(env.nA))
        rewards = []
        for i_episode in range(1, 1 + episode_nums):
            if i_episode % 1000 == 0:
                print("\rEpisode {}/{}.".format(i_episode, episode_nums))
                sys.stdout.flush()
            env._reset()
            state, done = env.observation()
            probs = epsilon_greedy_policy(Q, state, env.nA)
            action = np.random.choice(np.arange(env.nA), p=probs)
            sum_reward = 0.0
            while not done:
                new_state, reward, done = env._step(action)
                if done:
                    Q[state][action] += alpha * (reward + discount_factor * 0.0 - Q[state][action])
                    break
                else:
                    # On-policy: the TD target uses the action actually chosen
                    # by the epsilon-greedy policy in the next state.
                    probs = epsilon_greedy_policy(Q, new_state, env.nA)
                    new_action = np.random.choice(np.arange(env.nA), p=probs)
                    Q[state][action] += alpha * (reward + discount_factor * Q[new_state][new_action] - Q[state][action])
                    state = new_state
                    action = new_action
                sum_reward += reward
            rewards.append(sum_reward)
        return Q, rewards
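The code relies on two helper functions, epsilon_greedy_policy and greedy_policy, whose definitions are not listed here; a minimal sketch consistent with how they are called (the first returns a probability vector over the nA actions, the second the greedy action) would be:

    def epsilon_greedy_policy(Q, state, nA, epsilon=0.1):
        # Spread probability epsilon uniformly over all actions and put the
        # remaining (1 - epsilon) mass on the currently best action.
        probs = np.ones(nA) * epsilon / nA
        best_action = np.argmax(Q[state])
        probs[best_action] += 1.0 - epsilon
        return probs

    def greedy_policy(Q, state):
        # Deterministically return the action with the highest Q-value.
        return np.argmax(Q[state])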
Here both the behavior policy and the policy used in the Q[state][action] update are ε-greedy. After training for 1000 episodes, we obtain the following policy:
[Learned SARSA policy, printed as the greedy action (up/down/left/right) for each cell of the 12×4 grid: the agent moves up away from the cliff and follows the safe upper rows toward the goal.]
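A grid like this can be printed by taking the greedy action from the learned Q table in every state; print_policy below is a hypothetical helper written for illustration, not part of the original code:

    def print_policy(Q, env):
        # Action indices follow CliffEnvironment.move: 0=up, 1=down, 2=left, 3=right.
        names = ['up', 'down', 'left', 'right']
        for y in reversed(range(env.height)):  # print the top row first
            print(' '.join(names[np.argmax(Q[(x, y)])] for x in range(env.width)))

    Q, rewards = sarsa(CliffEnvironment(), 1000)
    print_policy(Q, CliffEnvironment())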
Here up, down, left, and right denote the four actions. It can be seen that SARSA learns the safest path, i.e. the "safe path" in the first figure. We run the SARSA algorithm 10 times, each run using 1000 episodes, average the per-episode reward over the 10 runs, and then plot the curve using every 10th averaged reward:
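The averaging and plotting code is not shown in the article; a minimal sketch with matplotlib, following the description above, might look like this:

    import matplotlib.pyplot as plt

    num_runs, episode_nums = 10, 1000
    avg_rewards = np.zeros(episode_nums)

    # Average the per-episode reward over 10 independent runs of SARSA.
    for _ in range(num_runs):
        _, rewards = sarsa(CliffEnvironment(), episode_nums)
        avg_rewards += np.array(rewards) / num_runs

    # Plot every 10th averaged reward to smooth the curve.
    plt.plot(range(0, episode_nums, 10), avg_rewards[::10], label='SARSA')
    plt.xlabel('Episode')
    plt.ylabel('Average reward per episode')
    plt.legend()
    plt.show()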
Q-learning Algorithm
The code that uses the Q-learning update algorithm is as follows:
    def q_learning(env, episode_nums, discount_factor=1.0, alpha=0.5, epsilon=0.1):
        env = CliffEnvironment()
        Q = defaultdict(lambda: np.zeros(env.nA))
        rewards = []
        for i_episode in range(1, 1 + episode_nums):
            if i_episode % 1000 == 0:
                print("\rEpisode {}/{}.".format(i_episode, episode_nums))
                sys.stdout.flush()
            env._reset()
            state, done = env.observation()
            sum_reward = 0.0
            while not done:
                probs = epsilon_greedy_policy(Q, state, env.nA)
                action = np.random.choice(np.arange(env.nA), p=probs)
                new_state, reward, done = env._step(action)
                if done:
                    Q[state][action] += alpha * (reward + discount_factor * 0.0 - Q[state][action])
                    break
                else:
                    # Off-policy: the TD target uses the greedy action in the next
                    # state, even though the agent behaves epsilon-greedily.
                    new_action = greedy_policy(Q, new_state)
                    Q[state][action] += alpha * (reward + discount_factor * Q[new_state][new_action] - Q[state][action])
                    state = new_state
                sum_reward += reward
            rewards.append(sum_reward)
        return Q, rewards
As can be seen, Q-learning's behavior policy is ε-greedy, while the greedy policy is used when updating the value function, so Q-learning is clearly an off-policy algorithm. As with the SARSA algorithm, we train for 1000 episodes and obtain the following policy:
[Learned Q-learning policy, printed in the same grid form: the agent follows the shortest path right along the edge of the cliff toward the goal.]
The resulting policy is consistent with the "optimal path" in the first figure, which means that Q-learning tends to choose the optimal policy. However, during learning Q-learning sometimes falls into the marked (cliff) area, which can be seen from the rewards it collects. Plotting them in the same way as for the SARSA algorithm gives the following figure:
As can be seen from the graph, the reward of Q-learning is lower than that of the SARSA algorithm, because Q-learning occasionally falls off the cliff while following the optimal path (its behavior policy is still ε-greedy and therefore sometimes explores).
Although Q-learning actually learns the values of the optimal policy, its online performance is worse than that of SARSA, which learns the roundabout (safer) policy. Of course, if ε is gradually decreased, both methods eventually converge to the optimal policy.
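One simple way to realize a shrinking ε is to decay it per episode inside the loop of sarsa or q_learning; the 1/t schedule below is an illustrative choice rather than part of the original code:

    def decayed_epsilon(i_episode):
        # A 1/t schedule: exploration shrinks toward zero as training proceeds,
        # which is one common way to let both methods converge to the optimal policy.
        return 1.0 / i_episode

    # Inside the episode loop one would then call, for example:
    # probs = epsilon_greedy_policy(Q, state, env.nA, decayed_epsilon(i_episode))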
All the code in this article can be found here