Recently AlphaZero came out, and it appears to be even stronger than AlphaGo Zero. Compared with the original AlphaGo, AlphaZero and AlphaGo Zero use a new approach: the value network and the policy network are fused into a single network that produces two different outputs, so the two networks share weights and are updated at the same time. To deepen my understanding, I tried this out on the simplest game, CartPole. Getting the value network and the policy network to converge here should be fairly easy. One small point to note: in the earlier separate implementation the value network and the policy network used different learning rates, so the combined network needs to adopt the smaller of the two.
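To make the weight sharing concrete, here is a condensed sketch of just the network-building part of the code below (the placeholders, layer sizes and the 0.01 learning rate are taken from the full code; names such as policy_loss, value_loss and train_op are only illustrative):

import tensorflow as tf

state      = tf.placeholder("float", [None, 4])   # CartPole observation
actions    = tf.placeholder("float", [None, 2])   # one-hot action that was taken
advantages = tf.placeholder("float", [None, 1])   # discounted return minus value estimate
newvals    = tf.placeholder("float", [None, 1])   # discounted return, target of the value head

# shared hidden layer: both heads read from h1, so its weights receive
# gradients from both losses
w1 = tf.get_variable("w1", [4, 10])
b1 = tf.get_variable("b1", [10])
h1 = tf.nn.relu(tf.matmul(state, w1) + b1)

# policy head
w2 = tf.get_variable("w2", [10, 2])
b2 = tf.get_variable("b2", [2])
probabilities = tf.nn.softmax(tf.matmul(h1, w2) + b2)

# value head
w3 = tf.get_variable("w3", [10, 1])
b3 = tf.get_variable("b3", [1])
value = tf.matmul(h1, w3) + b3

# one combined loss, one optimizer, one (small) learning rate
good_probabilities = tf.reduce_sum(probabilities * actions, reduction_indices=[1])
policy_loss = -tf.reduce_sum(tf.log(good_probabilities) * advantages)
value_loss = tf.nn.l2_loss(value - newvals)
train_op = tf.train.AdamOptimizer(0.01).minimize(policy_loss + value_loss)

The complete code is given below (also available at the link):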
https://github.com/zhly0/policy_value.py
import tensorflow as tf
import numpy as np
import random
import gym
import math
import matplotlib.pyplot as plt


def softmax(x):
    e_x = np.exp(x - np.max(x))
    out = e_x / e_x.sum()
    return out


def policy_value():
    with tf.variable_scope("policy_value"):
        state = tf.placeholder("float", [None, 4])
        # newvals is the future (discounted) reward, the target of the value head
        newvals = tf.placeholder("float", [None, 1])
        w1 = tf.get_variable("w1", [4, 10])
        b1 = tf.get_variable("b1", [10])
        # shared hidden layer, used by both heads
        h1 = tf.nn.relu(tf.matmul(state, w1) + b1)
        w2 = tf.get_variable("w2", [10, 2])
        b2 = tf.get_variable("b2", [2])
        w3 = tf.get_variable("w3", [10, 1])
        b3 = tf.get_variable("b3", [1])
        # policy gradient
        calculated = tf.matmul(h1, w2) + b2
        probabilities = tf.nn.softmax(calculated)
        actions = tf.placeholder("float", [None, 2])
        advantages = tf.placeholder("float", [None, 1])
        good_probabilities = tf.reduce_sum(tf.multiply(probabilities, actions), reduction_indices=[1])
        eligibility = tf.log(good_probabilities) * advantages
        loss1 = -tf.reduce_sum(eligibility)
        # value gradient
        calculated1 = tf.matmul(h1, w3) + b3
        diffs = calculated1 - newvals
        loss2 = tf.nn.l2_loss(diffs)
        # policy loss + value loss, minimized by one optimizer with the smaller learning rate
        loss = loss1 + loss2
        optimizer = tf.train.AdamOptimizer(0.01).minimize(loss)
        return probabilities, calculated1, actions, state, advantages, newvals, optimizer, loss1, loss2


def run_episode(env, policy_value, sess, is_train=True):
    p_probabilities, v_calculated, p_actions, pv_state, p_advantages, v_newvals, pv_optimizer, loss1, loss2 = policy_value
    observation = env.reset()
    totalreward = 0
    states = []
    actions = []
    advantages = []
    transitions = []
    update_vals = []

    for _ in range(200):  # CartPole-v0 episodes end after at most 200 steps
        # calculate action according to current state
        obs_vector = np.expand_dims(observation, axis=0)
        probs = sess.run(p_probabilities, feed_dict={pv_state: obs_vector})
        action = 1 if probs[0][0] < probs[0][1] else 0
        # take a random action when training
        if is_train:
            action = 0 if random.uniform(0, 1) < probs[0][0] else 1
        # record the transition
        states.append(observation)
        actionblank = np.zeros(2)
        actionblank[action] = 1
        actions.append(actionblank)
        # take the action in the environment
        old_observation = observation
        observation, reward, done, info = env.step(action)
        transitions.append((old_observation, action, reward))
        totalreward += reward
        if done:
            break

    # return totalreward if it is testing
    if not is_train:
        return totalreward

    # training
    for index, trans in enumerate(transitions):
        obs, action, reward = trans
        # calculate the discounted Monte-Carlo return from this step onwards
        future_reward = 0
        future_transitions = len(transitions) - index
        decrease = 1
        for index2 in range(future_transitions):
            future_reward += transitions[index2 + index][2] * decrease
            decrease = decrease * 0.97
        obs_vector = np.expand_dims(obs, axis=0)
        # value head: estimate of the reward obtainable from the current state
        currentval = sess.run(v_calculated, feed_dict={pv_state: obs_vector})[0][0]
        # advantage: how much better this action is than the value head's estimate.
        # As training goes on, currentval becomes a more accurate estimate of the
        # reward obtainable from the current state, so it can be used to judge an
        # action: future_reward minus this estimate is the label for the action.
        # If it is positive, the action was better than expected and its gradient
        # is followed; if it is negative, the gradient is reversed (multiplied by
        # a negative value), so that the next time a similar state is encountered
        # this action becomes less likely. (See the small worked example after
        # the listing.)
        advantages.append(future_reward - currentval)
        # advantages.append(future_reward - 2.0)
        update_vals.append(future_reward)

    # update value function and policy together
    update_vals_vector = np.expand_dims(update_vals, axis=1)
    advantages_vector = np.expand_dims(advantages, axis=1)
    # train network
    _, print_loss1, print_loss2 = sess.run(
        [pv_optimizer, loss1, loss2],
        feed_dict={pv_state: states, v_newvals: update_vals_vector,
                   p_advantages: advantages_vector, p_actions: actions})
    print("policy loss", print_loss1)
    print("value loss", print_loss2)
    return totalreward


env = gym.make('CartPole-v0')
policyvalue = policy_value()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

# train for 1500 episodes
for i in range(1500):
    reward = run_episode(env, policyvalue, sess)

# evaluate the average reward over 1000 test episodes
t = 0
for _ in range(1000):
    # env.render()
    reward = run_episode(env, policyvalue, sess, False)
    t += reward
print(t / 1000)
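For reference, here is a small worked example (not part of the original code; the numbers are made up) of how the training targets are built for one episode: each step's discounted Monte-Carlo return is the target for the value head, and the return minus the value head's estimate is the advantage fed to the policy head.

rewards = [1.0, 1.0, 1.0]          # a toy 3-step CartPole episode, reward 1 per step
value_estimates = [2.5, 2.0, 0.5]  # hypothetical outputs of the value head
gamma = 0.97                       # same discount factor as in the code above

returns, advantages = [], []
for t in range(len(rewards)):
    # discounted return from step t onwards
    g = sum(r * gamma ** k for k, r in enumerate(rewards[t:]))
    returns.append(g)
    # positive advantage -> reinforce the chosen action, negative -> discourage it
    advantages.append(g - value_estimates[t])

print(returns)     # approximately [2.911, 1.97, 1.0]
print(advantages)  # approximately [0.411, -0.03, 0.5]

The value head is trained towards returns (update_vals in the code), while the policy gradient is weighted by advantages, which is exactly what the feed_dict at the end of run_episode supplies.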