We are now going to use the Q-learning algorithm introduced earlier to build something interesting.
1. Algorithm effect
What we want to build is a car like the one shown here. At any moment the car can push to the left or to the right, and our goal is to get it to the top of the hill. In the beginning the car can only move around randomly; after training for a while it is able to reach the goal we set.
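To see this behavior directly, here is a minimal sketch (using the classic gym API that the full code below also relies on) that creates the environment and lets an untrained car take random actions; the step count is only illustrative.

import gym

env = gym.make('MountainCar-v0')
state = env.reset()
for _ in range(200):
    env.render()
    action = env.action_space.sample()           # random action: the untrained car just wanders
    state, reward, done, _ = env.step(action)
    if done:
        break
env.close()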
2. A Brief Introduction to the Deep Q-Learning Algorithm
As briefly introduced in the previous chapter, the algorithm we use is the simplest form of deep Q-learning, shown in the following diagram.
We can see that this algorithm has several main elements.
1. Replay_buffer
As the agent keeps interacting with the system, it produces a lot of training data. Even though this data was not generated by the best possible strategy at the time, it still records real experience of interacting with the environment, which is very helpful for training. So we set up a replay_buffer: new interaction data is added, the oldest data is discarded, and each time we randomly sample a batch from the replay_buffer to train our network.
Each record in the replay_buffer contains the following items: state, the situation the system is facing at that moment; action, the behavior our agent chooses when confronted with that state; reward, the return the environment gives after the agent takes the chosen action; next_state, the state the system transitions to after the action is taken; and done, which indicates whether this episode has ended.
We will use these records to train our neural network.
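As a minimal sketch of how such a buffer can work (using the same deque-based approach as the full code below; the capacity value here is only illustrative):

import random
from collections import deque

replay_buffer = deque()
REPLAY_SIZE = 50000                      # illustrative capacity

def store(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))
    if len(replay_buffer) > REPLAY_SIZE:
        replay_buffer.popleft()          # discard the oldest record

def sample_batch(batch_size):
    return random.sample(replay_buffer, batch_size)   # uniform random batch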
Treating every record equally when sampling is not always the most effective approach; some data is obviously more useful (for example, the transitions that actually score). We can optimize this point with a prioritized_replay_buffer, which we will cover in a separate article.
2. Neural network
Why do we use a neural network here?
Because, given the state of the system at a particular moment, we need to estimate the value that each action in the action set would produce if taken in that state.
After comparing these values, we can then choose an action according to our chosen policy.
The input of the neural network is a state of the system.
The output of the neural network is the value that each action in the action set would produce in the current state.
The input is given by the system, while the output is our own estimate; we use an improved estimate as the target to replace the previous output, and optimize step by step.
With this data, we can optimize the neural network.
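The following is a minimal sketch of that interface, written in the same TensorFlow 1.x style as the code later in the article; the hidden layer size and action count here are assumptions for illustration, and the complete version appears in the full code below.

import tensorflow as tf

state_dim = 2      # MountainCar observation: position and velocity
action_dim = 3     # illustrative number of actions
hidden_num = 100   # illustrative hidden layer size

input_layer = tf.placeholder(tf.float32, [None, state_dim])
W1 = tf.Variable(tf.truncated_normal([state_dim, hidden_num], stddev=0.01))
b1 = tf.Variable(tf.constant(0.01, shape=[hidden_num]))
hidden = tf.nn.relu(tf.matmul(input_layer, W1) + b1)
W2 = tf.Variable(tf.truncated_normal([hidden_num, action_dim], stddev=0.01))
b2 = tf.Variable(tf.constant(0.01, shape=[action_dim]))
Q_value = tf.matmul(hidden, W2) + b2   # shape [None, action_dim]: one estimated value per action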
But once we have the value of each action, what policy should we follow? In the basic Q-learning algorithm we use the simplest one, the epsilon-greedy strategy.
3. Epsilon_greedy
This strategy is simple but effective, often performing even better than many more complex policies.
For a detailed introduction you can read this article: https://zhuanlan.zhihu.com/p/21388070. Here we give only a brief summary.
We set a threshold, epsilon, with an initial value of, say, 0.8. This means that when we select an action, with probability 80% we pick a random action from the action set, and with probability 20% we compute the value of each action with the neural network and pick the one with the largest value.
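As a toy, self-contained illustration of this selection rule (the Q-value estimates here are made up):

import random
import numpy as np

epsilon = 0.8                            # current exploration threshold
q_values = np.array([0.1, 0.5, -0.2])    # illustrative per-action estimates

if random.random() < epsilon:
    action = random.randint(0, len(q_values) - 1)   # explore: pick a random action
else:
    action = int(np.argmax(q_values))                # exploit: pick the highest estimated value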
As the learning process progresses, however, we gradually lower epsilon, so random choices become less and less frequent, until in the end we rarely explore at random at all.
3. Key Code Analysis
Q_value_batch = self.Q_value.eval(feed_dict={self.input_layer: next_state_batch})
for i in range(BATCH_SIZE):
    if done_batch[i]:
        y_batch.append(reward_batch[i])
    else:
        y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i]))
As mentioned before, the job of the neural network is to estimate the value of each action taken in the current state. Here the input of the neural network is next_state, the output is the value of each action in next_state, and the maximum over these actions is taken as the best value that next_state can achieve.
So what we are implementing here is exactly the Q-learning relation we talked about:
V^{\pi}(s_0) = E[r(s_0) + \gamma V^{\pi}(s_1)]
If this state is the last state of the current episode, the target value is just the immediate reward; if there is a following state, the target equals the immediate reward plus the discounted value of the next state.
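Written in terms of the Q-values that the code estimates, the training target y_i for the i-th sampled record is the same relation:

y_i = r_i, \quad \text{if } s_{i+1} \text{ is terminal}; \qquad y_i = r_i + \gamma \max_{a'} Q(s_{i+1}, a'), \quad \text{otherwise}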
self.optimizer.run(feed_dict={
    self.input_layer: state_batch,
    self.action_input: action_batch,
    self.y_input: y_batch
})
We then use the computed target values to train the neural network.
self.Q_value = tf.matmul(hidden_layer3, W4) + b4
self.action_input = tf.placeholder("float", [None, self.action_dim])
self.y_input = tf.placeholder("float", [None])
q_action = tf.reduce_sum(tf.mul(self.Q_value, self.action_input), reduction_indices=1)
self.cost = tf.reduce_mean(tf.square(self.y_input - q_action))
self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6).minimize(self.cost)
This part only involves the most basic operations of TensorFlow; readers unfamiliar with it can read an introductory article first.
The idea behind this code is as follows.
Q_value is the output of the neural network, a vector of shape [1 * k], where k is the number of actions.
action_input is the action that was actually taken, encoded as a one_hot_action: the whole vector is 0 except for a 1 at the index of the chosen action. This makes it easy to pick out the corresponding Q value with a single inner product.
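For example, with made-up numbers:

import numpy as np

Q_value = np.array([1.2, 3.4, 0.5])        # estimated value of each action
action_input = np.array([0., 1., 0.])      # one-hot: the action actually taken was index 1
q_action = np.sum(Q_value * action_input)  # inner product picks out 3.4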
We then use the target value computed earlier as the ground truth to optimize the neural network.
You can choose the optimizer and its parameters yourself.
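For instance, swapping RMSProp for Adam (our illustrative choice, not something the original code uses) would only change one line:

self.optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(self.cost)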
if self.epsilon >= FINAL_EPSILON:
    self.epsilon -= (INIT_EPSILON - FINAL_EPSILON) / 10000
if random.random() < self.epsilon:
    return random.randint(0, self.action_dim - 1)
else:
    return self.get_greedy_action(state)
This is the epsilon-greedy algorithm we mentioned earlier.
for episode in range(EPISODE):
    state = env.reset()
    total_reward = 0
    debug_reward = 0
    for step in range(STEP):
        env.render()
        action = agent.get_action(state)
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
        agent.percieve(state, action, reward, next_state, done)
        state = next_state
        if done:
            break
This is the main program: in each episode we interact with the environment, collect the data generated by the interaction, and train the neural network.
Here are a few APIs that may need some explanation.
next_state, reward, done, ob = env.step(action)
Here we give the environment an action, and the environment returns the next state that the action leads to, the reward the action produces, and whether the episode has ended. The last value returned is an observation, a quantity that can be observed directly from the environment; although the environment returns it to us, the agent does not have a god's-eye view, so we do not use it here.
4. Full Code
import tensorflow as tf
import numpy as np
import gym
import random
from collections import deque

EPISODE = 10000
STEP = 10000
ENV_NAME = 'MountainCar-v0'
BATCH_SIZE = 32          # value lost in the original text; 32 is an assumed, typical choice
INIT_EPSILON = 1.0
FINAL_EPSILON = 0.1
REPLAY_SIZE = 50000
TRAIN_START_SIZE = 200   # value lost in the original text; assumed, must exceed BATCH_SIZE
GAMMA = 0.9

def get_weights(shape):
    weights = tf.truncated_normal(shape=shape, stddev=0.01)
    return tf.Variable(weights)

def get_bias(shape):
    bias = tf.constant(0.01, shape=shape)
    return tf.Variable(bias)

class DQN():
    def __init__(self, env):
        self.epsilon_step = (INIT_EPSILON - FINAL_EPSILON) / 10000
        self.action_dim = env.action_space.n
        print(env.observation_space)
        self.state_dim = env.observation_space.shape[0]
        self.neuron_num = 100    # hidden layer size; value lost in the original text, assumed
        self.replay_buffer = deque()
        self.epsilon = INIT_EPSILON
        self.sess = tf.InteractiveSession()
        self.init_network()
        self.sess.run(tf.initialize_all_variables())

    def init_network(self):
        self.input_layer = tf.placeholder(tf.float32, [None, self.state_dim])
        self.action_input = tf.placeholder(tf.float32, [None, self.action_dim])
        self.y_input = tf.placeholder(tf.float32, [None])
        W1 = get_weights([self.state_dim, self.neuron_num])
        b1 = get_bias([self.neuron_num])
        hidden_layer = tf.nn.relu(tf.matmul(self.input_layer, W1) + b1)
        W2 = get_weights([self.neuron_num, self.action_dim])
        b2 = get_bias([self.action_dim])
        self.Q_value = tf.matmul(hidden_layer, W2) + b2
        value = tf.reduce_sum(tf.mul(self.Q_value, self.action_input), reduction_indices=1)
        self.cost = tf.reduce_mean(tf.square(value - self.y_input))
        self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6).minimize(self.cost)
        return

    def percieve(self, state, action, reward, next_state, done):
        one_hot_action = np.zeros([self.action_dim])
        one_hot_action[action] = 1
        self.replay_buffer.append([state, one_hot_action, reward, next_state, done])
        if len(self.replay_buffer) > REPLAY_SIZE:
            self.replay_buffer.popleft()
        if len(self.replay_buffer) > TRAIN_START_SIZE:
            self.train()

    def train(self):
        mini_batch = random.sample(self.replay_buffer, BATCH_SIZE)
        state_batch = [data[0] for data in mini_batch]
        action_batch = [data[1] for data in mini_batch]
        reward_batch = [data[2] for data in mini_batch]
        next_state_batch = [data[3] for data in mini_batch]
        done_batch = [data[4] for data in mini_batch]
        y_batch = []
        next_state_reward = self.Q_value.eval(feed_dict={self.input_layer: next_state_batch})
        for i in range(BATCH_SIZE):
            if done_batch[i]:
                y_batch.append(reward_batch[i])
            else:
                y_batch.append(reward_batch[i] + GAMMA * np.max(next_state_reward[i]))
        self.optimizer.run(feed_dict={
            self.input_layer: state_batch,
            self.action_input: action_batch,
            self.y_input: y_batch
        })
        return

    def get_greedy_action(self, state):
        value = self.Q_value.eval(feed_dict={self.input_layer: [state]})[0]
        return np.argmax(value)

    def get_action(self, state):
        if self.epsilon > FINAL_EPSILON:
            self.epsilon -= self.epsilon_step
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        else:
            return self.get_greedy_action(state)

def main():
    env = gym.make(ENV_NAME)
    agent = DQN(env)
    for episode in range(EPISODE):
        total_reward = 0
        state = env.reset()
        for step in range(STEP):
            env.render()
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            total_reward += reward
            agent.percieve(state, action, reward, next_state, done)
            if done:
                break
            state = next_state
        print 'Total reward this episode is: ', total_reward

if __name__ == "__main__":
    main()
Reference materials
Https://zhuanlan.zhihu.com/p/21477488?refer=intelligentunit