A Simple Fusion of the Value Network and the Policy Network

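AlphaZero has recently been released and appears to be even stronger than AlphaGo Zero. Both AlphaZero and
AlphaGo Zero use a relatively new approach that fuses the value network and the policy network: a single
network produces two different outputs, so the two networks share weights and are updated at the same time.
To understand this better, I tried it on the simplest game, CartPole. Fusing the value network and the policy
network should be fairly simple to implement. One small point to watch: the earlier, separate value and policy
networks used different learning rates, so the fused network needs a smaller learning rate. The full code follows.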
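Before the full TensorFlow listing, the weight-sharing idea can be sketched in plain NumPy. This is only an illustration under assumptions, not the post's implementation: the layer sizes (4 inputs, 10 hidden units, 2 actions, 1 value) come from the post, while the function names `forward` and `combined_loss`, the random initialization, and the sample batch are hypothetical. It shows one hidden layer feeding both a policy head and a value head, with the two losses added into a single objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared trunk: 4 CartPole state features -> 10 hidden units (sizes from the post).
w1, b1 = rng.normal(scale=0.1, size=(4, 10)), np.zeros(10)
# Policy head: 10 hidden units -> 2 action logits.
w2, b2 = rng.normal(scale=0.1, size=(10, 2)), np.zeros(2)
# Value head: 10 hidden units -> 1 scalar value estimate.
w3, b3 = rng.normal(scale=0.1, size=(10, 1)), np.zeros(1)


def forward(states):
    """Return (action probabilities, value estimates) for a batch of states."""
    h1 = np.maximum(states @ w1 + b1, 0.0)               # shared ReLU features
    logits = h1 @ w2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    values = h1 @ w3 + b3
    return probs, values


def combined_loss(states, actions_onehot, advantages, returns):
    """Policy-gradient loss plus value loss, added together as in the post."""
    probs, values = forward(states)
    # keepdims=True keeps shape (N, 1), matching the (N, 1) advantages column.
    chosen = (probs * actions_onehot).sum(axis=1, keepdims=True)
    policy_loss = -np.sum(np.log(chosen) * advantages)
    value_loss = 0.5 * np.sum((values - returns) ** 2)   # sum of squares / 2
    return policy_loss + value_loss


# Hypothetical batch of 3 states, just to exercise the two heads.
batch = rng.normal(size=(3, 4))
probs, values = forward(batch)
print(probs.shape, values.shape)
```

Because both heads read the same `h1`, a gradient step on the summed loss updates `w1` and `b1` from both objectives at once, which is why a single, smaller learning rate has to serve both.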

https://github.com/zhly0/policy_value.py

import tensorflow as tf
import numpy as np
import random
import gym
import math
import matplotlib.pyplot as plt


def softmax(x):
    e_x = np.exp(x - np.max(x))
    out = e_x / e_x.sum()
    return out


def policy_value():
    with tf.variable_scope("policy_value"):
        state = tf.placeholder("float", [None, 4])
        # newvals is future reward
        newvals = tf.placeholder("float", [None, 1])
        # shared hidden layer
        w1 = tf.get_variable("w1", [4, 10])
        b1 = tf.get_variable("b1", [10])
        h1 = tf.nn.relu(tf.matmul(state, w1) + b1)
        # policy head weights
        w2 = tf.get_variable("w2", [10, 2])
        b2 = tf.get_variable("b2", [2])
        # value head weights
        w3 = tf.get_variable("w3", [10, 1])
        b3 = tf.get_variable("b3", [1])

        # policy gradient
        calculated = tf.matmul(h1, w2) + b2
        probabilities = tf.nn.softmax(calculated)
        actions = tf.placeholder("float", [None, 2])
        advantages = tf.placeholder("float", [None, 1])
        good_probabilities = tf.reduce_sum(tf.multiply(probabilities, actions), reduction_indices=[1])
        # reshape [N] -> [N, 1] so the product with advantages is element-wise
        # instead of broadcasting to an [N, N] matrix
        good_probabilities = tf.reshape(good_probabilities, [-1, 1])
        eligibility = tf.log(good_probabilities) * advantages
        loss1 = -tf.reduce_sum(eligibility)

        # value gradient
        calculated1 = tf.matmul(h1, w3) + b3
        diffs = calculated1 - newvals
        loss2 = tf.nn.l2_loss(diffs)

        # policy loss + value loss
        loss = loss1 + loss2
        # AdamOptimizer
        optimizer = tf.train.AdamOptimizer(0.01).minimize(loss)
        return probabilities, calculated1, actions, state, advantages, newvals, optimizer, loss1, loss2


def run_episode(env, policy_value, sess, is_train=True):
    p_probabilities, v_calculated, p_actions, pv_state, p_advantages, v_newvals, pv_optimizer, loss1, loss2 = policy_value
    observation = env.reset()
    totalreward = 0
    states = []
    actions = []
    advantages = []
    transitions = []
    update_vals = []

    for _ in range(200):
        # calculate policy
        obs_vector = np.expand_dims(observation, axis=0)
        # calculate action according to current state
        probs = sess.run(p_probabilities, feed_dict={pv_state: obs_vector})
        action = 1 if probs[0][0] < probs[0][1] else 0
        # take a random action when training
        if is_train:
            action = 0 if random.uniform(0, 1) < probs[0][0] else 1
        # record the transition
        states.append(observation)
        actionblank = np.zeros(2)
        actionblank[action] = 1
        actions.append(actionblank)
        # take the action in the environment
        old_observation = observation
        observation, reward, done, info = env.step(action)
        transitions.append((old_observation, action, reward))
        totalreward += reward
        if done:
            break

    # return totalreward if it is testing
    if not is_train:
        return totalreward

    # training
    for index, trans in enumerate(transitions):
        obs, action, reward = trans
        # calculate discounted monte-carlo return
        future_reward = 0
        future_transitions = len(transitions) - index
        decrease = 1
        for index2 in range(future_transitions):
            future_reward += transitions[(index2) + index][2] * decrease
            decrease = decrease * 0.97
        obs_vector = np.expand_dims(obs, axis=0)
        # value function: estimate the return obtainable from the current state
        currentval = sess.run(v_calculated, feed_dict={pv_state: obs_vector})[0][0]
        # advantage: how much better was this action than normal,
        # i.e. how much the observed future_reward exceeds the value function's estimate.
        # Later in training, currentval becomes a fairly accurate estimate of the
        # maximum (or average) reward obtainable from the current state.
        # Subtracting that estimate from the actual return tells us whether the action
        # was good or bad, so currentval acts as the baseline for judging an action.
        # future_reward minus the estimate gives the label for this action:
        # if the return is higher than the estimate, the update reinforces the action;
        # if it is lower, the action was below average, so its gradient is applied in
        # reverse (the difference is negative) and the action becomes less likely
        # the next time a similar state is encountered.
        advantages.append(future_reward - currentval)
        # advantages.append(future_reward-2.0)
        update_vals.append(future_reward)

    # update value function
    update_vals_vector = np.expand_dims(update_vals, axis=1)
    advantages_vector = np.expand_dims(advantages, axis=1)
    # train network
    _, print_loss1, print_loss2 = sess.run(
        [pv_optimizer, loss1, loss2],
        feed_dict={pv_state: states, v_newvals: update_vals_vector,
                   p_advantages: advantages_vector, p_actions: actions})
    print("policy loss ", print_loss1)
    print("value loss ", print_loss2)
    return totalreward


env = gym.make('CartPole-v0')
PolicyValue = policy_value()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

# train for 1500 episodes
for i in range(1500):
    reward = run_episode(env, PolicyValue, sess)

# evaluate: average total reward over 1000 greedy episodes
t = 0
for _ in range(1000):
    # env.render()
    reward = run_episode(env, PolicyValue, sess, False)
    t += reward
print(t / 1000)
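The discounted Monte-Carlo return in run_episode is computed with a nested loop, which costs quadratic time in the episode length. As a hedged alternative, and not part of the original code, the same values can be obtained in one backward pass using the recurrence G_t = r_t + 0.97 * G_{t+1}. The helper names `discounted_returns` and `discounted_returns_nested` below are hypothetical; the second one mirrors the nested loop only so the two can be compared.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.97):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each step reuses the return already computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def discounted_returns_nested(rewards, gamma=0.97):
    """Reference version that mirrors the nested loop in run_episode."""
    out = []
    for index in range(len(rewards)):
        future_reward, decrease = 0.0, 1.0
        for index2 in range(len(rewards) - index):
            future_reward += rewards[index + index2] * decrease
            decrease *= gamma
        out.append(future_reward)
    return np.array(out)


# CartPole gives a reward of 1 per step, so a 5-step episode looks like this.
rewards = [1.0] * 5
print(discounted_returns(rewards))
```

With episodes capped at 200 steps the nested loop is cheap enough, so this matters mainly if the same training loop is reused on environments with longer episodes.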
