Deep Q-Network and Q-learning look so similar, so how are they related?
Well, Deep Q-Network is essentially Q-learning fused with a neural network.
This time we will use a flying game to explain Deep Q-Network. Flying what? Hehe, keep reading.
Brief
Deep Q-Network is abbreviated as DQN.
What is the neural network for? In Q-learning we use a Q table to record our experience. With a neural network we no longer need the Q table: we feed the state into the network, and after the network's analysis we get the action we want. In a complex environment our machine may not be able to hold such a huge Q table, and that is exactly where the neural network becomes a very good helper.
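To see the difference, here is a minimal sketch with made-up numbers (a tiny dictionary stands in for the Q table, and a random linear map stands in for the network; none of these names come from the code below):

```python
import numpy as np

# Q-learning: experience is stored in a table with one row per state
q_table = {0: np.array([0.1, 0.5, 0.2])}   # toy Q values for state 0 and 3 actions
print(np.argmax(q_table[0]))               # best action for state 0 -> 1

# DQN: a function approximator replaces the table, so huge state spaces still fit in memory
weights = np.random.rand(4, 3)             # 4 state features -> 3 actions
state = np.array([0.3, -1.2, 0.7, 0.0])
print(np.argmax(state @ weights))          # best action according to the "network"
```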
See the similarity? DQN really just adds a few small things on top of Q-learning:
- Memory library (for repetitive learning)
- Neural network calculates Q value
- Temporarily frozen q_target parameters (to break correlations)
The core of DQN is the memory library (replay memory), which records all the steps we have gone through so that we can learn from them over and over again.
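A minimal sketch of that idea (the real update_memory and learn methods come further down): every step goes into a fixed-size deque, and learning repeatedly samples random minibatches from it.

```python
import random
from collections import deque

memory = deque(maxlen=5000)                          # old steps fall out automatically
for step in range(100):                              # pretend we played 100 steps
    memory.append((step, 0, 1.0, step + 1, False))   # (state, action, reward, next_state, done)

minibatch = random.sample(memory, 32)                # relearn from a random batch of past steps
print(len(minibatch))                                # 32
```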
Game Start
First we build the environment. In gym we create a flying shooter game:
```python
env = gym.make('BeamRider-ram-v0')
```
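If you want a quick look at what we are dealing with, gym exposes the action and observation spaces (this is just for inspection, not part of the agent):

```python
print(env.action_space.n)            # number of discrete actions the fighter can take
print(env.observation_space.shape)   # (128,) for the RAM version of the Atari game
```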
Now for the main event, our core: the implementation of DQN.
First we initialize the parameters of DQN:
```python
def __init__(self):
    self.alpha = 0.001                    # learning rate
    self.gamma = 0.95                     # discount factor
    self.epsilon = 1.0                    # exploration rate
    self.epsilon_decay = 0.99
    self.epsilon_min = 0.0001
    self.action_size = env.action_space.n
    self.state_size = env.observation_space.shape[0]
    self.model = self._build_model()
    self.memory = deque(maxlen=5000)
```
Have you spotted the two extra attributes, model and memory? model is our neural network model, and memory is, yes, our memory.
We create a simple neural network model. It is quite bare and only meant for demonstration; here I use Keras:
```python
def _build_model(self):
    model = Sequential()
    # Dense needs a unit count as its first argument; 24 matches the hidden layer below
    model.add(Dense(24, input_dim=self.state_size, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(self.action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(lr=self.alpha))
    return model
```
Through the neural network we get our next move:
```python
def choose_action(self, observation):
    # with probability epsilon, explore by picking a random action
    if np.random.uniform() < self.epsilon:
        return env.action_space.sample()
    # otherwise get the action from the neural network
    action = self.model.predict(observation)
    return np.argmax(action[0])
```
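One detail worth calling out (it shows up again in the main loop at the end of the article): model.predict expects a batch dimension, so the observation from gym has to be reshaped to (1, state_size) first. A minimal sketch, assuming an agent has already been created as in the full code below:

```python
observation = env.reset()
observation = np.reshape(observation, [1, agent.state_size])  # add the batch dimension
q_values = agent.model.predict(observation)                    # shape: (1, action_size)
action = np.argmax(q_values[0])
```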
Next we start adding our experience to the memory store:
```python
def update_memory(self, observation, action, reward, observation_, done):
    self.memory.append((observation, action, reward, observation_, done))
```
Training works a little differently: we take some experience out of the memory library and then learn from it:
```python
def learn(self):
    if len(self.memory) < batch_size:
        return
    minibatch = random.sample(self.memory, batch_size)
    for state, action, reward, next_state, done in minibatch:
        target = reward
        if not done:
            # long-term view: immediate reward plus discounted best future Q value
            target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
        target_f = self.model.predict(state)
        target_f[0][action] = target
        self.model.fit(state, target_f, epochs=1, verbose=0)
    if self.epsilon > self.epsilon_min:
        self.epsilon *= self.epsilon_decay
```
To get a good score in the future we cannot only look at the immediate reward, we also need a long-term view. This is the same idea as in Q-learning and Sarsa: the line computing reward plus the discounted future Q value above.
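To make that target concrete, here is a tiny worked example with made-up numbers (not from the game):

```python
import numpy as np

reward = 1.0
gamma = 0.95
next_q = np.array([[0.5, 2.0, 1.2]])          # pretend Q values for next_state
target = reward + gamma * np.amax(next_q[0])  # 1.0 + 0.95 * 2.0 = 2.9
print(target)
```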
We also have two more parameters, epsilon_min and epsilon_decay. Why do we need them? So that our program does not explore forever: epsilon_decay gradually reduces our epsilon value, and epsilon_min is the minimum exploration rate; once epsilon falls below it we stop decaying.
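A quick back-of-the-envelope check (illustrative arithmetic only, not part of the agent): starting from 1.0 and multiplying by 0.99 after every learning step, epsilon reaches the 0.0001 floor used in the __init__ above after roughly 917 steps.

```python
epsilon = 1.0
steps = 0
while epsilon > 0.0001:   # same stopping rule as epsilon_min in the agent
    epsilon *= 0.99       # same shrink factor as epsilon_decay
    steps += 1
print(steps)              # 917
```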
Now we run the program.
At first our fighter kept getting shot down, but gradually it learned to shoot down enemy planes.
After a period of fighting, the fighter became more and more powerful.
Double DQN
DQN has many variants; one of them is Double DQN.
Because of the Qmax operation in Q-learning, the Q values tend to be overestimated. If you find that the Q values output by the DQN's neural network are very large, that is overestimation.
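A toy numpy illustration of why the max causes this (made-up numbers, not from the game): suppose the true Q value of every action is 0, but the network's estimates are noisy; the max of the noisy estimates is systematically above 0.

```python
import numpy as np

np.random.seed(0)
noisy_q = np.random.normal(0.0, 1.0, size=(10000, 5))  # 10000 states, 5 actions, true Q = 0
print(noisy_q.max(axis=1).mean())                       # about 1.16, a clear upward bias
```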
This is the Q target in the original DQN: target = reward + gamma * max_a Q(next_state, a), where the max is taken over the network's own estimates.
This is the Q target in Double DQN: target = reward + gamma * Q_target(next_state, argmax_a Q(next_state, a)), where the online network chooses the action and the target network evaluates it.
So we take the action chosen by the Q-estimate (the online network) and plug it into the Q target (the target network) to evaluate it; this is how we prevent overestimation.
So when initializing the parameters, we add a second neural network model with the same structure.
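In the full code below, the DDQNAgent subclass does exactly this: it builds a second model with the same structure and copies the online model's weights into it from time to time.

```python
class DDQNAgent(Agent):
    def __init__(self):
        super(DDQNAgent, self).__init__()
        self.target_model = self._build_model()   # same structure as self.model
        self.update_target_model()

    def update_target_model(self):
        # freeze a snapshot of the online network's weights in the target network
        self.target_model.set_weights(self.model.get_weights())
```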
When we use DQN we find that training is not very stable; with DDQN it is noticeably more stable.
Game over
Here is all the code; feel free to try it yourself:
```python
# coding: utf-8
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.utils import plot_model
from keras.callbacks import Callback
from collections import deque
import numpy as np
import gym
import random
import matplotlib.pyplot as plt

batch_size = 32
losses = []


class LossHistory(Callback):
    # records the loss after each batch (defined here but not attached to model.fit in this script)
    def on_batch_end(self, batch, logs=None):
        losses.append(logs.get('loss'))


class Agent(object):
    def __init__(self):
        self.alpha = 0.001
        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_decay = 0.99
        self.epsilon_min = 0.001
        self.action_size = env.action_space.n
        self.state_size = env.observation_space.shape[0]
        self.memory = deque(maxlen=5000)
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=self.alpha))
        return model

    def choose_action(self, observation):
        if np.random.uniform() < self.epsilon:
            return env.action_space.sample()
        action = self.model.predict(observation)
        return np.argmax(action[0])

    def update_memory(self, observation, action, reward, observation_, done):
        self.memory.append((observation, action, reward, observation_, done))

    def plot_model(self):
        plot_model(self.model, to_file='./save_graph/model.png')


class DDQNAgent(Agent):
    def __init__(self):
        super(DDQNAgent, self).__init__()
        self.target_model = self._build_model()
        self.update_target_model()

    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())

    def learn(self):
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = self.model.predict(state)
            if done:
                target[0][action] = reward
            else:
                # the online model picks the action, the target model evaluates it
                old_model = self.model.predict(next_state)[0]
                new_model = self.target_model.predict(next_state)[0]
                target[0][action] = reward + self.gamma * new_model[np.argmax(old_model)]
            self.model.fit(state, target, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay


class DQNAgent(Agent):
    def learn(self):
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay


history_loss = LossHistory()
env = gym.make('BeamRider-ram-v0')
# agent = DQNAgent()
agent = DDQNAgent()
total = 0

for e in range(50001):
    observation = env.reset()
    observation = np.reshape(observation, [1, agent.state_size])
    done = False
    index = 0
    while not done:
        # env.render()
        action = agent.choose_action(observation)
        observation_, reward, done, info = env.step(action)
        observation_ = np.reshape(observation_, [1, agent.state_size])
        reward = -10 if done else reward
        agent.update_memory(observation, action, reward, observation_, done)
        observation = observation_
        index += 1
        total += reward
        if done:
            agent.update_target_model()
            # if len(losses) != 0:
            #     plt.plot(range(len(losses)), losses)
            #     plt.savefig('./save_graph/loss.png')
            if e % 50 == 0:
                agent.model.save('./airraid_model.h5')
            agent.learn()
            agent.plot_model()
            print('episode {}, reward {}, epsilon {}'.format(e, total / index, agent.epsilon))
```
This article is meant to lead you into the temple of reinforcement learning ^_^
Reinforcement learning, guess who I am: Deep Q-Network ^_^