DDPG (6)

1. Importing the Python libraries

import gym
import tensorflow as tf
import numpy as np
from ou_noise import OUNoise
from critic_network import CriticNetwork
from actor_network_bn import ActorNetwork
from replay_buffer import ReplayBuffer

2. Defining the parameters

# Hyper Parameters:
REPLAY_BUFFER_SIZE = 1000000
REPLAY_START_SIZE = 10000
BATCH_SIZE = 64
GAMMA = 0.99

3. Defining the class

class DDPG:
    """docstring for DDPG"""
    def __init__(self, env):
        self.name = 'DDPG' # name for uploading results
        self.environment = env
        # Randomly initialize actor network and critic network
        # with both their target networks
        self.state_dim = env.observation_space.shape[0]

(All of the functions below are defined inside the DDPG class.)

3.1 The initialization function

    def __init__(self, env):
        self.name = 'DDPG' # name for uploading results
        self.environment = env
        # Randomly initialize actor network and critic network
        # with both their target networks
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.shape[0]

        self.sess = tf.InteractiveSession()

        self.actor_network = ActorNetwork(self.sess,self.state_dim,self.action_dim)
        self.critic_network = CriticNetwork(self.sess,self.state_dim,self.action_dim)

        # initialize replay buffer
        self.replay_buffer = ReplayBuffer(REPLAY_BUFFER_SIZE)

        # Initialize a random process the Ornstein-Uhlenbeck process for action exploration
        self.exploration_noise = OUNoise(self.action_dim)

This initializes the state and action dimensions, actor_network, critic_network, the replay buffer, and the OU noise.

tf.InteractiveSession()

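An InteractiveSession installs itself as the default session when it is created, so new operations can be added to the graph and evaluated while the session is running. This is convenient when working in an interactive environment.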

3.2 The train() function

    def train(self):
        #print "train step",self.time_step
        # Sample a random minibatch of N transitions from replay buffer
        minibatch = self.replay_buffer.get_batch(BATCH_SIZE)
        state_batch = np.asarray([data[0] for data in minibatch])
        action_batch = np.asarray([data[1] for data in minibatch])
        reward_batch = np.asarray([data[2] for data in minibatch])
        next_state_batch = np.asarray([data[3] for data in minibatch])
        done_batch = np.asarray([data[4] for data in minibatch])
        # The lines above unpack the transitions sampled from the replay buffer

        # for action_dim = 1
        action_batch = np.resize(action_batch,[BATCH_SIZE,self.action_dim])

        # Calculate y_batch
        next_action_batch = self.actor_network.target_actions(next_state_batch)
        # The Q-values come from the target critic network (deterministic policy gradient)
        q_value_batch = self.critic_network.target_q(next_state_batch,next_action_batch)
        y_batch = []
        for i in range(len(minibatch)):
            if done_batch[i]:
                y_batch.append(reward_batch[i])
            else:
                # Compute y from the sampled replay data
                y_batch.append(reward_batch[i] + GAMMA * q_value_batch[i])
        y_batch = np.resize(y_batch,[BATCH_SIZE,1])

        # Update critic by minimizing the loss L
        # (the critic is adjusted by minimizing the squared error)
        self.critic_network.train(y_batch,state_batch,action_batch)

        # Update the actor policy using the sampled gradient:
        # the actor produces actions from the states in the replay batch
        action_batch_for_gradients = self.actor_network.actions(state_batch)
        # the critic computes the gradient of Q with respect to a for these state-action pairs
        q_gradient_batch = self.critic_network.gradients(state_batch,action_batch_for_gradients)
        # the actor is adjusted using that gradient and the states
        self.actor_network.train(q_gradient_batch,state_batch)

        # Update the target networks
        self.actor_network.update_target()
        self.critic_network.update_target()

This is one complete actor-critic training step (compare it with the DDPG pseudocode).
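The target computation inside train() can also be written in vectorized NumPy form. The following is a minimal sketch, not the author's code: the helper name `compute_targets` and the sample numbers are hypothetical, and only the rule itself (y = r for terminal transitions, y = r + GAMMA * Q' otherwise) is taken from the loop above.

```python
import numpy as np

GAMMA = 0.99  # same discount factor as in the hyperparameters above

def compute_targets(reward_batch, done_batch, q_value_batch, gamma=GAMMA):
    """Hypothetical helper: vectorized version of the y_batch loop in train()."""
    rewards = np.asarray(reward_batch, dtype=np.float64).reshape(-1, 1)
    dones = np.asarray(done_batch, dtype=bool).reshape(-1, 1)
    q_next = np.asarray(q_value_batch, dtype=np.float64).reshape(-1, 1)
    # Terminal transitions keep only the reward; the others bootstrap from
    # the target critic's estimate of the next state-action value.
    return np.where(dones, rewards, rewards + gamma * q_next)

# Made-up numbers, for illustration only.
y = compute_targets(
    reward_batch=[1.0, 0.5, -1.0],
    done_batch=[False, False, True],
    q_value_batch=[[2.0], [4.0], [10.0]],
)
print(y.shape)  # (3, 1)
```

The third transition is terminal, so its target is just the reward and the large Q-value of 10.0 is ignored.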

np.asarray

np.asarray converts an input data structure (here, a Python list) into an ndarray.
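As a small illustration of how train() uses np.asarray, the sketch below unpacks a made-up minibatch. The tuple layout (state, action, reward, next_state, done) is inferred from the data[0] ... data[4] indexing in train(); the values themselves are invented for the example.

```python
import numpy as np

# Made-up transitions in the (state, action, reward, next_state, done) layout
# that train() indexes with data[0] ... data[4].
minibatch = [
    ([0.1, 0.2, 0.3], [0.5], 1.0, [0.2, 0.3, 0.4], False),
    ([0.4, 0.5, 0.6], [-0.5], 0.0, [0.5, 0.6, 0.7], True),
]

state_batch = np.asarray([data[0] for data in minibatch])
reward_batch = np.asarray([data[2] for data in minibatch])
done_batch = np.asarray([data[4] for data in minibatch])

print(state_batch.shape)   # (2, 3)
print(reward_batch.shape)  # (2,)
print(done_batch.dtype)    # bool

# When the input is already an ndarray and no dtype is requested,
# np.asarray returns it without copying.
already_array = np.zeros(3)
same_object = np.asarray(already_array) is already_array
print(same_object)  # True
```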

np.resize

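np.resize returns a new array with the requested shape and leaves the original unchanged. If the new shape holds more elements than the original, the data is repeated to fill it.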
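The difference between np.resize and np.reshape is easy to miss, so here is a short sketch with a made-up array. The variable names are illustrative only.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Same number of elements: the data is simply laid out in the new shape.
column = np.resize(a, [4, 1])
print(column.shape)  # (4, 1)

# More elements requested than available: np.resize repeats the data.
repeated = np.resize(a, [2, 3])
print(repeated.tolist())  # [[1, 2, 3], [4, 1, 2]]

# np.reshape refuses a shape whose size differs from the input.
reshape_failed = False
try:
    np.reshape(a, [2, 3])
except ValueError:
    reshape_failed = True
print(reshape_failed)  # True

# np.resize returns a new array, so the original keeps its shape.
print(a.shape)  # (4,)
```

Because np.resize pads or truncates silently, a batch whose length differs from BATCH_SIZE would not raise an error in train(). Whether that can happen depends on how get_batch is implemented, which this article does not show.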

3.3 About actions

    def noise_action(self,state):
        # Select action a_t according to the current policy and exploration noise
        action = self.actor_network.action(state)
        return action+self.exploration_noise.noise()

Returns an action with exploration noise added, so the result is stochastic. (exploration_noise is the OU noise created earlier in __init__.)
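The article imports OUNoise from ou_noise but does not show its contents. As a rough idea of what such a class might look like, here is a minimal sketch of a discretized Ornstein-Uhlenbeck process exposing the same two methods used above, noise() and reset(). The class name `SimpleOUNoise`, the constructor arguments, and the default values of mu, theta, and sigma are all assumptions, not the contents of ou_noise.py.

```python
import numpy as np

class SimpleOUNoise:
    """Hypothetical stand-in for OUNoise; not the contents of ou_noise.py."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, seed=None):
        self.action_dim = action_dim
        self.mu = mu          # long-run mean the process is pulled toward
        self.theta = theta    # strength of the pull toward mu
        self.sigma = sigma    # scale of the random perturbation
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Restart the process at its mean, e.g. when an episode ends.
        self.state = np.full(self.action_dim, self.mu, dtype=np.float64)

    def noise(self):
        # Discretized OU update with a unit time step:
        # x <- x + theta * (mu - x) + sigma * N(0, 1)
        dx = (self.theta * (self.mu - self.state)
              + self.sigma * self.rng.standard_normal(self.action_dim))
        self.state = self.state + dx
        return self.state

ou = SimpleOUNoise(action_dim=2, seed=0)
samples = np.array([ou.noise() for _ in range(5)])
print(samples.shape)  # (5, 2)
```

Unlike independent Gaussian noise, consecutive samples here are correlated, because each value starts from the previous one. In noise_action() a vector like this is added to the actor's output.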

    def action(self,state):
        action = self.actor_network.action(state)
        return action

Returns an action without noise, so the result is deterministic.

3.4 The perceive() function

    def perceive(self,state,action,reward,next_state,done):
        # Store transition (s_t,a_t,r_t,s_{t+1}) in replay buffer
        self.replay_buffer.add(state,action,reward,next_state,done)

        # Store transitions to replay start size then start training
        if self.replay_buffer.count() > REPLAY_START_SIZE:
            self.train()

        #if self.time_step % 10000 == 0:
            #self.actor_network.save_network(self.time_step)
            #self.critic_network.save_network(self.time_step)

        # Re-initialize the random process when an episode ends
        if done:
            self.exploration_noise.reset()

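Stores the transition in the replay buffer. Training starts once the buffer holds more than REPLAY_START_SIZE transitions, not when the buffer is completely full. The exploration noise is reset whenever an episode ends.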
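The ReplayBuffer class is imported from replay_buffer and not shown in this article. To make the warm-up threshold concrete, the sketch below uses a hypothetical stand-in, `SimpleReplayBuffer`, that offers the same three methods called above (add, count, get_batch). Its internals, a bounded deque with uniform sampling, are an assumption, and the sizes are shrunk for the demo.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Hypothetical stand-in for ReplayBuffer; not the contents of replay_buffer.py."""

    def __init__(self, buffer_size):
        # A bounded deque drops the oldest transition once it is full.
        self.buffer = deque(maxlen=buffer_size)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def count(self):
        return len(self.buffer)

    def get_batch(self, batch_size):
        # Uniform sampling without replacement.
        return random.sample(list(self.buffer), batch_size)

REPLAY_START_SIZE = 3  # tiny value for the demo; the article uses 10000
buffer = SimpleReplayBuffer(buffer_size=5)
trained_at = []
for step in range(6):
    buffer.add([step], [0.0], 1.0, [step + 1], False)
    if buffer.count() > REPLAY_START_SIZE:
        trained_at.append(step)  # this is where perceive() would call train()

print(trained_at)      # [3, 4, 5]
print(buffer.count())  # 5
```

Training begins at the fourth stored transition, well before the buffer reaches its capacity of 5, which mirrors the gap between REPLAY_START_SIZE and REPLAY_BUFFER_SIZE in the article's settings.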
