Reinforcement learning, guess who I am --- Deep Q-Network ^_^

Source: Internet
Author: User

Deep Q-Network and Q-learning look so alike -- does the difference matter?

Yes: Deep Q-Network is essentially Q-learning fused with a neural network.

This time we will use a flying game as our example to explain Deep Q-Network. Fly what? Hehe, keep reading.

Brief

Deep Q-Network is abbreviated DQN.

What is the neural network for? In Q-learning we used a Q-table to record experience. With a neural network we no longer need the table: we feed the state into the network, the network analyzes it, and out come the action values. In a complex environment our machine simply could not hold such a huge Q-table, so the neural network is a great helper.
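To see just how huge, here is a rough back-of-the-envelope sketch in plain Python (the 128-byte RAM observation is what the BeamRider-ram-v0 environment used below provides):

```python
# Each RAM observation is 128 bytes, each byte taking one of 256 values.
state_bytes = 128
values_per_byte = 256

# Number of distinct states a tabular Q-table would have to index.
num_states = values_per_byte ** state_bytes  # 256**128 == 2**1024

# Far beyond anything storable: more rows than atoms in the universe.
print(num_states > 10 ** 300)  # -> True
```

A table with one row per state is hopeless here; a network that generalizes across similar states is the only practical option.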

Notice the similarity? DQN really is just Q-learning with a few small things added on top:

    • Memory library (for repeated learning)
    • A neural network to compute the Q-values
    • Temporarily frozen q_target parameters (to cut correlations)

The core of DQN is the memory library: it records every step we have taken, and then we learn from those steps over and over again.
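A minimal standalone sketch of such a memory library (the transitions here are fake placeholders, but the deque and sampling mirror what the agent code below does):

```python
import random
from collections import deque

# a memory library of at most 5000 transitions; old ones fall off automatically
memory = deque(maxlen=5000)

# store some fake (state, action, reward, next_state, done) transitions
for step in range(100):
    memory.append((step, 0, 1.0, step + 1, False))

# draw a random minibatch to learn from, again and again
batch = random.sample(memory, 32)
print(len(memory), len(batch))  # -> 100 32
```

Sampling randomly rather than replaying steps in order is the point: it breaks the correlation between consecutive frames.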

Game Start

First we build the environment. In gym we create a flying game:

    env = gym.make('BeamRider-ram-v0')

Now for the heavy artillery: our core, the implementation of DQN.

First we initialize the parameters of the DQN:

    def __init__(self):
        self.alpha = 0.001
        self.gamma = 0.95
        self.esplion = 1.0          # exploration rate (epsilon)
        self.esplion_decay = 0.99
        self.esplion_min = 0.0001
        self.action_size = env.action_space.n
        self.state_size = env.observation_space.shape[0]
        self.model = self._build_model()
        self.memory = deque(maxlen=5000)

Did you notice the two extra attributes, model and memory? model is our neural network model, and memory is, yes, our memory library.

We create a simple neural network model. It is deliberately bare-bones, just for demonstration; here I use Keras:

    def _build_model(self):
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=self.alpha))
        return model

Through the neural network, we get our next move.

    def choose_action(self, observation):
        # explore randomly with probability esplion
        if np.random.uniform() < self.esplion:
            return env.action_space.sample()
        # otherwise let the neural network pick the action
        action = self.model.predict(observation)
        return np.argmax(action[0])

Now we start adding our experience to the memory library:

    def update_memory(self, observation, action, reward, observation_, done):
        self.memory.append((observation, action, reward, observation_, done))

Training works differently from Q-learning: we sample a batch of experience from the memory library and learn from that:

    def learn(self):
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                # discounted future value, as in Q-learning
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.esplion > self.esplion_min:
            self.esplion *= self.esplion_decay

To score well in the future we cannot be short-sighted and look only at the immediate reward: the line reward + self.gamma * np.amax(...) adds the discounted future value, exactly the same long-term term as in Q-learning and Sarsa.
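To make that target concrete, here it is computed by hand with made-up numbers (the next-state Q-values are invented for illustration):

```python
import numpy as np

gamma = 0.95
reward = 1.0
next_q = np.array([0.2, 0.8, 0.5])  # hypothetical network output for next_state

# Bellman target: immediate reward plus discounted best future value
target = reward + gamma * np.amax(next_q)
print(round(float(target), 2))  # -> 1.76
```

The network is then trained so that its output for the chosen action moves toward this target.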

We also found two more parameters, esplion_decay and esplion_min. What are they for? So that our program does not explore forever: esplion_decay gradually shrinks our esplion value, and esplion_min sets the minimum exploration probability; once esplion falls below it, we stop decaying.
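As a quick sanity check on those two values, we can compute how long exploration lasts (esplion decays once per learn() call in the code above):

```python
esplion, esplion_decay, esplion_min = 1.0, 0.99, 0.0001

steps = 0
while esplion > esplion_min:
    esplion *= esplion_decay
    steps += 1

# with a 0.99 decay the exploration rate only reaches the 0.0001 floor
# after roughly nine hundred decay steps
print(steps)  # -> 917
```

So early episodes are almost entirely random exploration, and the agent only gradually shifts to trusting its network.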

We run this program

At first our fighter keeps getting shot down, but gradually it learns to shoot down the enemy planes.

After a period of fighting, the fighters became more and more powerful.

Double DQN

DQN has many variants; one of them is Double DQN.

Because Q-learning takes a max over Q-values (Qmax), it tends to overestimate them (over-estimation): whenever the neural network outputs a Q-value that is too large, the max operator latches onto that error instead of averaging it away.

In the original DQN the target is computed as:

    target = r + gamma * max_a Q(s', a)          # one network both selects and evaluates

In Double DQN the target becomes:

    target = r + gamma * Q_target(s', argmax_a Q(s', a))   # online net selects, target net evaluates

So we let the online Q-estimate network choose the action and the target network evaluate it; separating action selection from evaluation is what prevents the overestimation.
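A tiny numeric sketch of the effect, with made-up numbers and numpy only: the true value of every action is 0, our estimates are pure noise, yet the max of one noisy estimate is biased upward, while selecting with one independent estimate and evaluating with another is not:

```python
import numpy as np

rng = np.random.RandomState(0)
single, double = [], []
for _ in range(10000):
    # two independent noisy estimates of action values whose true value is 0
    q_a = rng.normal(0.0, 1.0, size=5)
    q_b = rng.normal(0.0, 1.0, size=5)
    single.append(q_a.max())             # DQN-style: same estimate selects and evaluates
    double.append(q_b[np.argmax(q_a)])   # Double-DQN-style: q_a selects, q_b evaluates

print(np.mean(single) > 0.5)        # -> True  (biased upward, around 1.16)
print(abs(np.mean(double)) < 0.1)   # -> True  (roughly unbiased)
```

The two networks in Double DQN play exactly the roles of q_a and q_b here.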

So when initializing the parameters, we add a second neural network with the same structure (the target model), and periodically copy the online model's weights into it.

When we run plain DQN we find it is not very stable; with Double DQN it is noticeably more stable.

Game Over

Here is the complete code; feel free to try it out:

    # coding: utf-8
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import Adam
    from keras.utils import plot_model
    from keras.callbacks import Callback
    from collections import deque
    import numpy as np
    import gym
    import random
    import matplotlib.pyplot as plt

    batch_size = 32
    losses = []

    class LossHistory(Callback):
        def on_batch_end(self, batch, logs=None):
            losses.append(logs.get('loss'))

    class Agent(object):
        def __init__(self):
            self.alpha = 0.001
            self.gamma = 0.95
            self.esplion = 1.0          # exploration rate (epsilon)
            self.esplion_decay = 0.99
            self.esplion_min = 0.001
            self.action_size = env.action_space.n
            self.state_size = env.observation_space.shape[0]
            self.memory = deque(maxlen=5000)
            self.model = self._build_model()

        def _build_model(self):
            model = Sequential()
            model.add(Dense(24, input_dim=self.state_size, activation='relu'))
            model.add(Dense(24, activation='relu'))
            model.add(Dense(self.action_size, activation='linear'))
            model.compile(loss='mse', optimizer=Adam(lr=self.alpha))
            return model

        def choose_action(self, observation):
            # explore with probability esplion, otherwise ask the network
            if np.random.uniform() < self.esplion:
                return env.action_space.sample()
            action = self.model.predict(observation)
            return np.argmax(action[0])

        def update_memory(self, observation, action, reward, observation_, done):
            self.memory.append((observation, action, reward, observation_, done))

        def plot_model(self):
            plot_model(self.model, to_file='./save_graph/model.png')

    class DDQNAgent(Agent):
        def __init__(self):
            super(DDQNAgent, self).__init__()
            self.target_model = self._build_model()
            self.update_target_model()

        def update_target_model(self):
            # copy the online weights into the frozen target network
            self.target_model.set_weights(self.model.get_weights())

        def learn(self):
            if len(self.memory) < batch_size:
                return
            minibatch = random.sample(self.memory, batch_size)
            for state, action, reward, next_state, done in minibatch:
                target = self.model.predict(state)
                if done:
                    target[0][action] = reward
                else:
                    # online network selects the action, target network evaluates it
                    online_q = self.model.predict(next_state)[0]
                    target_q = self.target_model.predict(next_state)[0]
                    target[0][action] = reward + self.gamma * target_q[np.argmax(online_q)]
                self.model.fit(state, target, epochs=1, verbose=0)
            if self.esplion > self.esplion_min:
                self.esplion *= self.esplion_decay

    class DQNAgent(Agent):
        def learn(self):
            if len(self.memory) < batch_size:
                return
            minibatch = random.sample(self.memory, batch_size)
            for state, action, reward, next_state, done in minibatch:
                target = reward
                if not done:
                    target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
                target_f = self.model.predict(state)
                target_f[0][action] = target
                self.model.fit(state, target_f, epochs=1, verbose=0)
            if self.esplion > self.esplion_min:
                self.esplion *= self.esplion_decay

    history_loss = LossHistory()
    env = gym.make('BeamRider-ram-v0')
    # agent = DQNAgent()
    agent = DDQNAgent()
    total = 0
    for e in range(50001):
        observation = env.reset()
        observation = np.reshape(observation, [1, agent.state_size])
        done = False
        index = 0
        while not done:
            # env.render()
            action = agent.choose_action(observation)
            observation_, reward, done, info = env.step(action)
            observation_ = np.reshape(observation_, [1, agent.state_size])
            reward = -10 if done else reward
            agent.update_memory(observation, action, reward, observation_, done)
            observation = observation_
            index += 1
            total += reward
        agent.update_target_model()
        # if len(losses) != 0:
        #     plt.plot(range(len(losses)), losses)
        #     plt.savefig('./save_graph/loss.png')
        if e % 50 == 0:
            agent.model.save('./airraid_model.h5')
        agent.learn()
        agent.plot_model()
        print('episode {}, reward {}, esplion {}'.format(e, total / index, agent.esplion))

This article is intended to lead you into the temple of reinforcement learning ^_^

