We are now going to use the Q-learning algorithm introduced earlier to build something interesting.
1. Algorithm effect
What we want to build is a car like the one shown here. At any moment the car can push to the left or to the right, and our goal is to get it to the top of the hill. In the beginning the car can only move around randomly; after training for a while it is able to reach the goal we set.
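To see this behavior directly, here is a minimal sketch (using the classic gym API that the full code below also relies on) that creates the environment and lets an untrained car take random actions; the step count is only illustrative.

import gym

env = gym.make('MountainCar-v0')
state = env.reset()
for _ in range(200):
    env.render()
    action = env.action_space.sample()           # random action: the untrained car just wanders
    state, reward, done, _ = env.step(action)
    if done:
        break
env.close()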
2. A Brief Introduction to the Deep Q-Learning Algorithm
As briefly introduced in the previous chapter, the algorithm we use is the simplest form of deep Q-learning, shown in the following diagram.
We can see that this algorithm has several main elements.
1. Replay_buffer
As the agent keeps interacting with the system, it produces a lot of training data. Even though this data was not generated by the best possible strategy at the time, it still records real experience of interacting with the environment, which is very helpful for training. So we set up a replay_buffer: new interaction data is added, the oldest data is discarded, and each time we randomly sample a batch from the replay_buffer to train our network.
Each record in the replay_buffer contains the following items: state, the situation the system is facing at that moment; action, the behavior our agent chooses when confronted with that state; reward, the return the environment gives after the agent takes the chosen action; next_state, the state the system transitions to after the action is taken; and done, which indicates whether this episode has ended.
We will use these records to train our neural network.
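As a minimal sketch of how such a buffer can work (using the same deque-based approach as the full code below; the capacity value here is only illustrative):

import random
from collections import deque

replay_buffer = deque()
REPLAY_SIZE = 50000                      # illustrative capacity

def store(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))
    if len(replay_buffer) > REPLAY_SIZE:
        replay_buffer.popleft()          # discard the oldest record

def sample_batch(batch_size):
    return random.sample(replay_buffer, batch_size)   # uniform random batch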
Treating every record equally when sampling is not always the most effective approach; some data is obviously more useful (for example, the transitions that actually score). We can optimize this point with a prioritized_replay_buffer, which we will cover in a separate article.
2. Neural network
Why do we use a neural network here?
Because, given the state of the system at a particular moment, we need to estimate the value that each action in the action set would produce if taken in that state.
After comparing these values, we can then choose an action according to our chosen policy.
The input of the neural network is a state of the system.
The output of the neural network is the value that each action in the action set would produce in the current state.
The input is given by the system, while the output is our own estimate; we use an improved estimate as the target to replace the previous output, and optimize step by step.
With this data, we can optimize the neural network.
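The following is a minimal sketch of that interface, written in the same TensorFlow 1.x style as the code later in the article; the hidden layer size and action count here are assumptions for illustration, and the complete version appears in the full code below.

import tensorflow as tf

state_dim = 2      # MountainCar observation: position and velocity
action_dim = 3     # illustrative number of actions
hidden_num = 100   # illustrative hidden layer size

input_layer = tf.placeholder(tf.float32, [None, state_dim])
W1 = tf.Variable(tf.truncated_normal([state_dim, hidden_num], stddev=0.01))
b1 = tf.Variable(tf.constant(0.01, shape=[hidden_num]))
hidden = tf.nn.relu(tf.matmul(input_layer, W1) + b1)
W2 = tf.Variable(tf.truncated_normal([hidden_num, action_dim], stddev=0.01))
b2 = tf.Variable(tf.constant(0.01, shape=[action_dim]))
Q_value = tf.matmul(hidden, W2) + b2   # shape [None, action_dim]: one estimated value per action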
But once we have the value of each action, what policy should we follow? In the basic Q-learning algorithm we use the simplest one, the epsilon-greedy strategy.
3. Epsilon_greedy
This strategy is simple but effective, often performing even better than many more complex policies.
For a detailed introduction you can read this article: https://zhuanlan.zhihu.com/p/21388070. Here we give only a brief summary.
We set a threshold, epsilon, with an initial value of, say, 0.8. This means that when we select an action, with probability 80% we pick a random action from the action set, and with probability 20% we compute the value of each action with the neural network and pick the one with the largest value.
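As a toy, self-contained illustration of this selection rule (the Q-value estimates here are made up):

import random
import numpy as np

epsilon = 0.8                            # current exploration threshold
q_values = np.array([0.1, 0.5, -0.2])    # illustrative per-action estimates

if random.random() < epsilon:
    action = random.randint(0, len(q_values) - 1)   # explore: pick a random action
else:
    action = int(np.argmax(q_values))                # exploit: pick the highest estimated value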
As the learning process progresses, however, we gradually lower epsilon, so random choices become less and less frequent, until in the end we rarely explore at random at all.
3. Key Code Analysis
Q_value_batch = self.Q_value.eval(feed_dict={self.input_layer: next_state_batch})
for i in range(BATCH_SIZE):
    if done_batch[i]:
        y_batch.append(reward_batch[i])
    else:
        y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i]))
As mentioned before, the job of the neural network is to estimate the value of each action taken in the current state. Here the input of the neural network is next_state, the output is the value of each action in next_state, and the maximum over these actions is taken as the best value that next_state can achieve.
So what we are implementing here is exactly the Q-learning relation we talked about:
V^{\pi}(s_0) = E[r(s_0) + \gamma V^{\pi}(s_1)]
If this state is the last state of the current episode, the target value is just the immediate reward; if there is a following state, the target equals the immediate reward plus the discounted value of the next state.
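Written in terms of the Q-values that the code estimates, the training target y_i for the i-th sampled record is the same relation:

y_i = r_i, \quad \text{if } s_{i+1} \text{ is terminal}; \qquad y_i = r_i + \gamma \max_{a'} Q(s_{i+1}, a'), \quad \text{otherwise}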
self.optimizer.run(feed_dict={
    self.input_layer: state_batch,
    self.action_input: action_batch,
    self.y_input: y_batch
})
We then use the computed target values to train the neural network.
self.Q_value = tf.matmul(hidden_layer3, W4) + b4
self.action_input = tf.placeholder("float", [None, self.action_dim])
self.y_input = tf.placeholder("float", [None])
q_action = tf.reduce_sum(tf.mul(self.Q_value, self.action_input), reduction_indices=1)
self.cost = tf.reduce_mean(tf.square(self.y_input - q_action))
self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6).minimize(self.cost)
This part only involves the most basic operations of TensorFlow; readers unfamiliar with it can read an introductory article first.
The idea behind this code is as follows.
Q_value is the output of the neural network, a vector of shape [1 * k], where k is the number of actions.
action_input is the action that was actually taken, encoded as a one_hot_action: the whole vector is 0 except for a 1 at the index of the chosen action. This makes it easy to pick out the corresponding Q value with a single inner product.
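For example, with made-up numbers:

import numpy as np

Q_value = np.array([1.2, 3.4, 0.5])        # estimated value of each action
action_input = np.array([0., 1., 0.])      # one-hot: the action actually taken was index 1
q_action = np.sum(Q_value * action_input)  # inner product picks out 3.4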
We then use the target value computed earlier as the ground truth to optimize the neural network.
You can choose the optimizer and its parameters yourself.
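For instance, swapping RMSProp for Adam (our illustrative choice, not something the original code uses) would only change one line:

self.optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(self.cost)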
if self.epsilon >= FINAL_EPSILON:
    self.epsilon -= (INIT_EPSILON - FINAL_EPSILON) / 10000
if random.random() < self.epsilon:
    return random.randint(0, self.action_dim - 1)
else:
    return self.get_greedy_action(state)
This is the epsilon-greedy algorithm we mentioned earlier.
for episode in range(EPISODE):
    state = env.reset()
    total_reward = 0
    debug_reward = 0
    for step in range(STEP):
        env.render()
        action = agent.get_action(state)
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
        agent.percieve(state, action, reward, next_state, done)
        state = next_state
        if done:
            break
This is the main program: in each episode we interact with the environment, collect the data generated by the interaction, and train the neural network.
Here are a few APIs that may need some explanation.
next_state, reward, done, ob = env.step(action)
Here we give the environment an action, and the environment returns the next state that the action leads to, the reward the action produces, and whether the episode has ended. The last value returned is an observation, a quantity that can be observed directly from the environment; although the environment returns it to us, the agent does not have a god's-eye view, so we do not use it here.
4. Full Code
import tensorflow as tf
import numpy as np
import gym
import random
from collections import deque

EPISODE = 10000
STEP = 10000
ENV_NAME = 'MountainCar-v0'
BATCH_SIZE = 32          # value lost in the original text; 32 is an assumed, typical choice
INIT_EPSILON = 1.0
FINAL_EPSILON = 0.1
REPLAY_SIZE = 50000
TRAIN_START_SIZE = 200   # value lost in the original text; assumed, must exceed BATCH_SIZE
GAMMA = 0.9

def get_weights(shape):
    weights = tf.truncated_normal(shape=shape, stddev=0.01)
    return tf.Variable(weights)

def get_bias(shape):
    bias = tf.constant(0.01, shape=shape)
    return tf.Variable(bias)

class DQN():
    def __init__(self, env):
        self.epsilon_step = (INIT_EPSILON - FINAL_EPSILON) / 10000
        self.action_dim = env.action_space.n
        print(env.observation_space)
        self.state_dim = env.observation_space.shape[0]
        self.neuron_num = 100    # hidden layer size; value lost in the original text, assumed
        self.replay_buffer = deque()
        self.epsilon = INIT_EPSILON
        self.sess = tf.InteractiveSession()
        self.init_network()
        self.sess.run(tf.initialize_all_variables())

    def init_network(self):
        self.input_layer = tf.placeholder(tf.float32, [None, self.state_dim])
        self.action_input = tf.placeholder(tf.float32, [None, self.action_dim])
        self.y_input = tf.placeholder(tf.float32, [None])
        W1 = get_weights([self.state_dim, self.neuron_num])
        b1 = get_bias([self.neuron_num])
        hidden_layer = tf.nn.relu(tf.matmul(self.input_layer, W1) + b1)
        W2 = get_weights([self.neuron_num, self.action_dim])
        b2 = get_bias([self.action_dim])
        self.Q_value = tf.matmul(hidden_layer, W2) + b2
        value = tf.reduce_sum(tf.mul(self.Q_value, self.action_input), reduction_indices=1)
        self.cost = tf.reduce_mean(tf.square(value - self.y_input))
        self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6).minimize(self.cost)
        return

    def percieve(self, state, action, reward, next_state, done):
        one_hot_action = np.zeros([self.action_dim])
        one_hot_action[action] = 1
        self.replay_buffer.append([state, one_hot_action, reward, next_state, done])
        if len(self.replay_buffer) > REPLAY_SIZE:
            self.replay_buffer.popleft()
        if len(self.replay_buffer) > TRAIN_START_SIZE:
            self.train()

    def train(self):
        mini_batch = random.sample(self.replay_buffer, BATCH_SIZE)
        state_batch = [data[0] for data in mini_batch]
        action_batch = [data[1] for data in mini_batch]
        reward_batch = [data[2] for data in mini_batch]
        next_state_batch = [data[3] for data in mini_batch]
        done_batch = [data[4] for data in mini_batch]
        y_batch = []
        next_state_reward = self.Q_value.eval(feed_dict={self.input_layer: next_state_batch})
        for i in range(BATCH_SIZE):
            if done_batch[i]:
                y_batch.append(reward_batch[i])
            else:
                y_batch.append(reward_batch[i] + GAMMA * np.max(next_state_reward[i]))
        self.optimizer.run(feed_dict={
            self.input_layer: state_batch,
            self.action_input: action_batch,
            self.y_input: y_batch
        })
        return

    def get_greedy_action(self, state):
        value = self.Q_value.eval(feed_dict={self.input_layer: [state]})[0]
        return np.argmax(value)

    def get_action(self, state):
        if self.epsilon > FINAL_EPSILON:
            self.epsilon -= self.epsilon_step
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        else:
            return self.get_greedy_action(state)

def main():
    env = gym.make(ENV_NAME)
    agent = DQN(env)
    for episode in range(EPISODE):
        total_reward = 0
        state = env.reset()
        for step in range(STEP):
            env.render()
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            total_reward += reward
            agent.percieve(state, action, reward, next_state, done)
            if done:
                break
            state = next_state
        print 'Total reward this episode is: ', total_reward

if __name__ == "__main__":
    main()
Reference materials
Https://zhuanlan.zhihu.com/p/21477488?refer=intelligentunit