Reinforcement Learning Series IV: Temporal Difference (TD) Learning

Introduction

The previous article covered Monte Carlo (MC) reinforcement learning. MC methods overcome the difficulty of evaluating a policy when the model is unknown by sampling trajectories, but they have a drawback: the policy can only be updated after a complete trajectory has been sampled. MC therefore does not make full use of the MDP structure of the learning task, whereas temporal difference (TD) learning does, which makes it more efficient than MC. This article introduces the TD control algorithms SARSA and Q-learning.
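
For reference (these formulas are not written out in the original text), the contrast can be made concrete with the value-function updates: Monte Carlo must wait for the full return G_t of an episode, whereas TD(0) updates after a single step by bootstrapping from the current estimate of the next state:

    V(s_t) \leftarrow V(s_t) + \alpha [G_t - V(s_t)]                              (MC)
    V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]      (TD(0))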

The SARSA algorithm is as follows:

SARSA is an on-policy method: the behaviour policy and the policy being updated are the same. Unlike MC, its update does not require sampling a complete trajectory; the value function can be updated after every single action.
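
For reference (the algorithm figure referenced above is not reproduced here), the standard SARSA update after taking action a in state s, observing reward r and next state s', and choosing the next action a' with the same \epsilon-greedy policy is:

    Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma Q(s',a') - Q(s,a)]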


Q-learning Algorithm

Q-learning is an off-policy method: the behaviour policy and the policy used to update the value function are not the same. Like SARSA, it does not need to sample a whole trajectory before updating; unlike SARSA, however, Q-learning uses the greedy policy rather than the \epsilon-greedy policy when updating the value function.
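
For reference, the corresponding standard Q-learning update bootstraps from the greedy (maximum) action value of the next state, regardless of which action the behaviour policy actually takes next:

    Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]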

The Difference Between the Two

To compare the SARSA and Q-learning algorithms, we use a concrete problem, described as follows:


As shown in the figure above, the environment is a 12×4 grid. The agent starts from the lower-left corner and can move up, down, left, or right (without leaving the grid); reaching the lower-right corner ends the game. The bottom row contains 10 marked cells (the cliff): stepping into this area sends the agent back to the starting point, and the game continues.

Game Analysis:
1. Stepping into the marked (cliff) area gives a reward of -100, reaching the goal gives 0, and every other step gives -1.
2. The state of the game is the agent's coordinates (x, y).

With this we can model the game environment in code: take the start point S as the origin, so the starting point is (0, 0), with the x-axis pointing right and the y-axis pointing up. The environment is built as follows:

class CliffEnvironment(object):
    def __init__(self):
        self.width = 12
        self.height = 4
        self.move = [[0, 1], [0, -1], [-1, 0], [1, 0]]  # up, down, left, right
        self.nA = 4
        self._reset()

    def _reset(self):
        # Put the agent back at the starting point.
        self.x = 0
        self.y = 0
        self.end_x = 11
        self.end_y = 0
        self.done = False

    def observation(self):
        # Return the current state and whether the episode has ended.
        return tuple((self.x, self.y)), self.done

    def clip(self, x, y):
        # Keep the agent inside the grid.
        x = max(x, 0)
        x = min(x, self.width - 1)
        y = max(y, 0)
        y = min(y, self.height - 1)
        return x, y

    def _step(self, action):
        self.done = False
        self.x += self.move[action][0]
        self.y += self.move[action][1]
        self.x, self.y = self.clip(self.x, self.y)
        if 1 <= self.x <= 10 and self.y == 0:
            # Fell off the cliff: large penalty and back to the start.
            reward = -100
            self._reset()
        elif self.x == self.width - 1 and self.y == 0:
            # Reached the goal in the lower-right corner.
            reward = 0
            self.is_destination = True
            self.done = True
        else:
            reward = -1
        return tuple((self.x, self.y)), reward, self.done
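
As a quick sanity check (not part of the original article), the environment can be exercised directly:

env = CliffEnvironment()
state, done = env.observation()
print(state, done)                   # (0, 0) False
state, reward, done = env._step(0)   # action 0 = up, away from the cliff
print(state, reward, done)           # (0, 1) -1 False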
 

Here _step takes one step according to the chosen action, _reset restores the initial layout, and observation returns the current state.

SARSA Algorithm
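
The SARSA and Q-learning code below calls two policy helpers, epsilon_greedy_policy and greedy_policy, whose definitions are not included in the excerpt. A minimal sketch of what they might look like, assuming Q maps each state to a NumPy array of action values:

import numpy as np

def epsilon_greedy_policy(Q, state, nA, epsilon=0.1):
    # Return action probabilities: explore uniformly with probability epsilon,
    # otherwise prefer the currently greedy action.
    probs = np.ones(nA) * epsilon / nA
    probs[np.argmax(Q[state])] += 1.0 - epsilon
    return probs

def greedy_policy(Q, state):
    # Return the index of the action with the highest estimated value.
    return int(np.argmax(Q[state]))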

The code to update the policy using the SARSA algorithm is as follows:

import sys
import numpy as np
from collections import defaultdict

def sarsa(env, episode_nums, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    env = CliffEnvironment()
    Q = defaultdict(lambda: np.zeros(env.nA))
    rewards = []
    for i_episode in range(1, 1 + episode_nums):
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, episode_nums))
            sys.stdout.flush()
        env._reset()
        state, done = env.observation()
        # Choose the first action with the epsilon-greedy behaviour policy.
        probs = epsilon_greedy_policy(Q, state, env.nA, epsilon)
        action = np.random.choice(np.arange(env.nA), p=probs)
        sum_reward = 0.0
        while not done:
            new_state, reward, done = env._step(action)
            if done:
                Q[state][action] = Q[state][action] + alpha * (reward + discount_factor * 0.0 - Q[state][action])
                break
            else:
                # The next action is chosen by the same epsilon-greedy policy (on-policy).
                probs = epsilon_greedy_policy(Q, new_state, env.nA, epsilon)
                new_action = np.random.choice(np.arange(env.nA), p=probs)
                Q[state][action] = Q[state][action] + alpha * (reward + discount_factor * Q[new_state][new_action] - Q[state][action])
                state = new_state
                action = new_action
                sum_reward += reward
        rewards.append(sum_reward)
    return Q, rewards
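
The policy grid shown below can be produced by printing the greedy action for every cell; a possible sketch (print_policy is not part of the original code):

def print_policy(Q, width=12, height=4):
    # Print the greedy action for each cell, top row first (y decreases downward).
    names = ['up', 'down', 'left', 'right']   # same order as env.move
    for y in reversed(range(height)):
        print('  '.join(names[int(np.argmax(Q[(x, y)]))] for x in range(width)))

Q, rewards = sarsa(CliffEnvironment(), 1000)
print_policy(Q)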

Here both the behaviour policy and the policy used to update Q[state][action] are \epsilon-greedy. After 1000 episodes, we obtain the learned policy:

[Printed 4×12 policy grid omitted: the SARSA policy steers the agent up, away from the cliff, and then right along the upper part of the grid.]

In the grid, up, down, left and right denote the greedy action in each cell. It can be seen that SARSA learns the safest route, i.e. the "safe path" in the first figure. We run the SARSA algorithm 10 times, each run for 1000 episodes, average the reward of each episode over the 10 runs, and then plot the curve, sampling one point every 10 episodes:
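
A possible way to produce this curve (the article's plotting code is not shown; this sketch assumes matplotlib and averages the per-episode reward over 10 independent runs, sampling every 10th episode):

import numpy as np
import matplotlib.pyplot as plt

def averaged_rewards(algorithm, runs=10, episode_nums=1000):
    # Average the per-episode rewards of several independent runs.
    all_rewards = [algorithm(CliffEnvironment(), episode_nums)[1] for _ in range(runs)]
    return np.mean(all_rewards, axis=0)

avg = averaged_rewards(sarsa)
plt.plot(np.arange(0, len(avg), 10), avg[::10], label='SARSA')
plt.xlabel('Episode')
plt.ylabel('Average reward per episode')
plt.legend()
plt.show()
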
Q-learning Algorithm

The code to update the policy using the Q-learning algorithm is as follows:

def q_learning(env, episode_nums, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    env = CliffEnvironment()
    Q = defaultdict(lambda: np.zeros(env.nA))
    rewards = []
    for i_episode in range(1, 1 + episode_nums):
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, episode_nums))
            sys.stdout.flush()
        env._reset()
        state, done = env.observation()
        sum_reward = 0.0
        while not done:
            # Behaviour policy: epsilon-greedy.
            probs = epsilon_greedy_policy(Q, state, env.nA, epsilon)
            action = np.random.choice(np.arange(env.nA), p=probs)
            new_state, reward, done = env._step(action)
            if done:
                Q[state][action] = Q[state][action] + alpha * (reward + discount_factor * 0.0 - Q[state][action])
                break
            else:
                # Update target uses the greedy action (off-policy).
                new_action = greedy_policy(Q, new_state)
                Q[state][action] = Q[state][action] + alpha * (reward + discount_factor * Q[new_state][new_action] - Q[state][action])
                state = new_state
                sum_reward += reward
        rewards.append(sum_reward)
    return Q, rewards

It can be seen that Q-learning's behaviour policy is \epsilon-greedy, while the greedy policy is used when updating the value function, so Q-learning is clearly an off-policy algorithm. As with SARSA, we run 1000 episodes and obtain its policy:

[Printed 4×12 policy grid omitted: the Q-learning policy heads straight toward the goal along the row directly above the cliff.]

The resulting policy is consistent with the "optimal path" in the first figure, which means that Q-learning tends to choose the optimal policy; during learning, however, it sometimes falls into the marked (cliff) area. This can be seen from the rewards returned by Q-learning. Plotting them in the same way as for the SARSA algorithm gives the following figure:

As can be seen from the graph, Q-learning's average reward is lower than SARSA's, because while following the optimal path Q-learning occasionally falls into the cliff (its behaviour policy is still \epsilon-greedy and therefore sometimes explores).

Although Q-learning actually learns the value of the optimal policy, its online performance is worse than that of SARSA, which learns the roundabout (safe) policy. Of course, if \epsilon is gradually reduced, both methods eventually converge to the optimal policy.
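
One possible decay schedule (an assumption for illustration, not something given in the article) that could be passed into the training loops:

def decayed_epsilon(i_episode, eps_start=0.1, eps_min=0.01, decay=0.995):
    # Epsilon shrinks geometrically with the episode index but never below eps_min.
    return max(eps_min, eps_start * decay ** i_episode)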

All the code in this article can be found here
