Recently AlphaZero came out, and it appears to be even stronger than AlphaGo Zero. Compared with the original AlphaGo, AlphaZero and AlphaGo Zero use a new approach: the value network and the policy network are fused into a single network that produces two different outputs, so the two networks share weights and are updated at the same time. To deepen my understanding, I tried this out on the simplest game, CartPole. Getting the value network and the policy network to converge here should be fairly easy. One small point to note: in the earlier separate implementation the value network and the policy network used different learning rates, so the combined network needs to adopt the smaller of the two.
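To make the weight sharing concrete, here is a condensed sketch of just the network-building part of the code below (the placeholders, layer sizes and the 0.01 learning rate are taken from the full code; names such as policy_loss, value_loss and train_op are only illustrative):

import tensorflow as tf

state      = tf.placeholder("float", [None, 4])   # CartPole observation
actions    = tf.placeholder("float", [None, 2])   # one-hot action that was taken
advantages = tf.placeholder("float", [None, 1])   # discounted return minus value estimate
newvals    = tf.placeholder("float", [None, 1])   # discounted return, target of the value head

# shared hidden layer: both heads read from h1, so its weights receive
# gradients from both losses
w1 = tf.get_variable("w1", [4, 10])
b1 = tf.get_variable("b1", [10])
h1 = tf.nn.relu(tf.matmul(state, w1) + b1)

# policy head
w2 = tf.get_variable("w2", [10, 2])
b2 = tf.get_variable("b2", [2])
probabilities = tf.nn.softmax(tf.matmul(h1, w2) + b2)

# value head
w3 = tf.get_variable("w3", [10, 1])
b3 = tf.get_variable("b3", [1])
value = tf.matmul(h1, w3) + b3

# one combined loss, one optimizer, one (small) learning rate
good_probabilities = tf.reduce_sum(probabilities * actions, reduction_indices=[1])
policy_loss = -tf.reduce_sum(tf.log(good_probabilities) * advantages)
value_loss = tf.nn.l2_loss(value - newvals)
train_op = tf.train.AdamOptimizer(0.01).minimize(policy_loss + value_loss)

The complete code is given below (also available at the link):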
https://github.com/zhly0/policy_value.py
import tensorflow as tf
import numpy as np
import random
import gym
import math
import matplotlib.pyplot as plt


def softmax(x):
    e_x = np.exp(x - np.max(x))
    out = e_x / e_x.sum()
    return out


def policy_value():
    with tf.variable_scope("policy_value"):
        state = tf.placeholder("float", [None, 4])
        # newvals is the future (discounted) reward, the target of the value head
        newvals = tf.placeholder("float", [None, 1])
        w1 = tf.get_variable("w1", [4, 10])
        b1 = tf.get_variable("b1", [10])
        # shared hidden layer, used by both heads
        h1 = tf.nn.relu(tf.matmul(state, w1) + b1)
        w2 = tf.get_variable("w2", [10, 2])
        b2 = tf.get_variable("b2", [2])
        w3 = tf.get_variable("w3", [10, 1])
        b3 = tf.get_variable("b3", [1])
        # policy gradient
        calculated = tf.matmul(h1, w2) + b2
        probabilities = tf.nn.softmax(calculated)
        actions = tf.placeholder("float", [None, 2])
        advantages = tf.placeholder("float", [None, 1])
        good_probabilities = tf.reduce_sum(tf.multiply(probabilities, actions), reduction_indices=[1])
        eligibility = tf.log(good_probabilities) * advantages
        loss1 = -tf.reduce_sum(eligibility)
        # value gradient
        calculated1 = tf.matmul(h1, w3) + b3
        diffs = calculated1 - newvals
        loss2 = tf.nn.l2_loss(diffs)
        # policy loss + value loss, minimized by one optimizer with the smaller learning rate
        loss = loss1 + loss2
        optimizer = tf.train.AdamOptimizer(0.01).minimize(loss)
        return probabilities, calculated1, actions, state, advantages, newvals, optimizer, loss1, loss2


def run_episode(env, policy_value, sess, is_train=True):
    p_probabilities, v_calculated, p_actions, pv_state, p_advantages, v_newvals, pv_optimizer, loss1, loss2 = policy_value
    observation = env.reset()
    totalreward = 0
    states = []
    actions = []
    advantages = []
    transitions = []
    update_vals = []

    for _ in range(200):  # CartPole-v0 episodes end after at most 200 steps
        # calculate action according to current state
        obs_vector = np.expand_dims(observation, axis=0)
        probs = sess.run(p_probabilities, feed_dict={pv_state: obs_vector})
        action = 1 if probs[0][0] < probs[0][1] else 0
        # take a random action when training
        if is_train:
            action = 0 if random.uniform(0, 1) < probs[0][0] else 1
        # record the transition
        states.append(observation)
        actionblank = np.zeros(2)
        actionblank[action] = 1
        actions.append(actionblank)
        # take the action in the environment
        old_observation = observation
        observation, reward, done, info = env.step(action)
        transitions.append((old_observation, action, reward))
        totalreward += reward
        if done:
            break

    # return totalreward if it is testing
    if not is_train:
        return totalreward

    # training
    for index, trans in enumerate(transitions):
        obs, action, reward = trans
        # calculate the discounted Monte-Carlo return from this step onwards
        future_reward = 0
        future_transitions = len(transitions) - index
        decrease = 1
        for index2 in range(future_transitions):
            future_reward += transitions[index2 + index][2] * decrease
            decrease = decrease * 0.97
        obs_vector = np.expand_dims(obs, axis=0)
        # value head: estimate of the reward obtainable from the current state
        currentval = sess.run(v_calculated, feed_dict={pv_state: obs_vector})[0][0]
        # advantage: how much better this action is than the value head's estimate.
        # As training goes on, currentval becomes a more accurate estimate of the
        # reward obtainable from the current state, so it can be used to judge an
        # action: future_reward minus this estimate is the label for the action.
        # If it is positive, the action was better than expected and its gradient
        # is followed; if it is negative, the gradient is reversed (multiplied by
        # a negative value), so that the next time a similar state is encountered
        # this action becomes less likely. (See the small worked example after
        # the listing.)
        advantages.append(future_reward - currentval)
        # advantages.append(future_reward - 2.0)
        update_vals.append(future_reward)

    # update value function and policy together
    update_vals_vector = np.expand_dims(update_vals, axis=1)
    advantages_vector = np.expand_dims(advantages, axis=1)
    # train network
    _, print_loss1, print_loss2 = sess.run(
        [pv_optimizer, loss1, loss2],
        feed_dict={pv_state: states, v_newvals: update_vals_vector,
                   p_advantages: advantages_vector, p_actions: actions})
    print("policy loss", print_loss1)
    print("value loss", print_loss2)
    return totalreward


env = gym.make('CartPole-v0')
policyvalue = policy_value()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

# train for 1500 episodes
for i in range(1500):
    reward = run_episode(env, policyvalue, sess)

# evaluate the average reward over 1000 test episodes
t = 0
for _ in range(1000):
    # env.render()
    reward = run_episode(env, policyvalue, sess, False)
    t += reward
print(t / 1000)
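For reference, here is a small worked example (not part of the original code; the numbers are made up) of how the training targets are built for one episode: each step's discounted Monte-Carlo return is the target for the value head, and the return minus the value head's estimate is the advantage fed to the policy head.

rewards = [1.0, 1.0, 1.0]          # a toy 3-step CartPole episode, reward 1 per step
value_estimates = [2.5, 2.0, 0.5]  # hypothetical outputs of the value head
gamma = 0.97                       # same discount factor as in the code above

returns, advantages = [], []
for t in range(len(rewards)):
    # discounted return from step t onwards
    g = sum(r * gamma ** k for k, r in enumerate(rewards[t:]))
    returns.append(g)
    # positive advantage -> reinforce the chosen action, negative -> discourage it
    advantages.append(g - value_estimates[t])

print(returns)     # approximately [2.911, 1.97, 1.0]
print(advantages)  # approximately [0.411, -0.03, 0.5]

The value head is trained towards returns (update_vals in the code), while the policy gradient is weighted by advantages, which is exactly what the feed_dict at the end of run_episode supplies.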