Trust Region Policy Optimization (ICML, 2015)

1 Introduction

  1. policy optimization categories

    1. policy iteration (GPI)
    2. policy gradient (PG) methods (e.g. TRPO)
    3. derivative-free optimization methods

2 Preliminaries

  1. Consider an infinite-horizon discounted MDP

    1. instead of an average-reward one
  2. expected-return objective $\eta(\tilde{\pi})$ of a new policy $\tilde{\pi}$, expressed relative to the current policy $\pi$ (Kakade & Langford, 2002)
    $$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_{0}, a_{0}, \cdots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}\left(s_{t}, a_{t}\right)\right] = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$

    1. improving the expectation on the right-hand side leads to policy improvement
    2. but $\rho_{\tilde{\pi}}(s)$, the discounted state visitation under the new policy, is hard to estimate
    3. so use a local approximation instead
  3. local approximation $L_{\pi}(\tilde{\pi})$ of the objective (Kakade & Langford, 2002)
    $$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$

    1. matches $\eta(\tilde{\pi})$ to first order around the current policy (see the conditions after this list)
    2. Kakade & Langford derived a lower bound on $\eta$, but only for mixture policies
    3. i.e. conservative policy iteration, a limited form of policy improvement
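
To make "matches to first order" concrete, the paper states that for a parameterized policy $\pi_\theta$ the local approximation agrees with the true objective in value and gradient at the current parameters $\theta_0$:

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \left.\nabla_{\theta} L_{\pi_{\theta_0}}(\pi_{\theta})\right|_{\theta=\theta_0} = \left.\nabla_{\theta}\, \eta(\pi_{\theta})\right|_{\theta=\theta_0}$$

So a sufficiently small step that improves $L_{\pi_{\theta_0}}$ also improves $\eta$; the open question is how large the step may be, which Section 3 answers.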

3 Monotonic Improvement Guarantee for General Stochastic Policies

  1. Theorem 1: lower bound with General Stochastic Policies
    $$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}},\ \epsilon = \max_{s, a}\left|A_{\pi}(s, a)\right|$$

    • the KL term can be seen as a penalty
    • $C$ is a large constant, which forces a very small step size if the penalty is used directly
  2. Algorithm 1: policy iteration with a monotonic improvement guarantee (see the surrogate argument after this list)
    • the advantages must be evaluated (near-)exactly for the guarantee to hold
    • can be extended to continuous state and action spaces
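
Why Algorithm 1 is monotonic, in the paper's minorization-maximization (MM) form: define the surrogate

$$M_i(\pi) := L_{\pi_i}(\pi) - C\, D_{\mathrm{KL}}^{\max}(\pi_i, \pi)$$

Theorem 1 gives $\eta(\pi) \geq M_i(\pi)$ for all $\pi$, and $\eta(\pi_i) = M_i(\pi_i)$, so

$$\eta(\pi_{i+1}) - \eta(\pi_i) \geq M_i(\pi_{i+1}) - M_i(\pi_i) \geq 0$$

whenever $\pi_{i+1}$ is chosen to (weakly) improve $M_i$, i.e. the true return never decreases.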

4 Optimization of Parameterized Policies

  1. Approximations

    1. trust region constraint

      • the step size implied by the penalty coefficient $C$ in Algorithm 1 can be very small
      • so change the KL penalty into a hard constraint (a trust region) on the KL divergence
    2. average KL
      • the max over states would require traversing the whole state space
      • use the average KL under the old state visitation distribution instead (see the sketch after this list)
        $$\bar{D}_{\mathrm{KL}}^{\rho_{old}}\left(\theta_{1}, \theta_{2}\right) := \mathbb{E}_{s \sim \rho_{old}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{1}}(\cdot \mid s) \,\|\, \pi_{\theta_{2}}(\cdot \mid s)\right)\right]$$
  2. optimization problem up to now
    $$\underset{\theta}{\operatorname{maximize}}\ \sum_{s} \rho_{\theta_{\text{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s) A_{\theta_{\text{old}}}(s, a) \quad \text{subject to } \bar{D}_{\mathrm{KL}}^{\rho_{\text{old}}}\left(\theta_{\text{old}}, \theta\right) \leq \delta$$
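
For intuition, a minimal sketch of the averaged KL, assuming diagonal-Gaussian policies and states sampled from $\rho_{old}$; the `old_policy` / `new_policy` callables and their (mean, log-std) outputs are illustrative assumptions, not the paper's code:

```python
import numpy as np

def gaussian_kl(mu1, log_std1, mu2, log_std2):
    """Per-state KL( N(mu1, std1^2) || N(mu2, std2^2) ), summed over action dimensions."""
    var1, var2 = np.exp(2 * log_std1), np.exp(2 * log_std2)
    kl = log_std2 - log_std1 + (var1 + (mu1 - mu2) ** 2) / (2 * var2) - 0.5
    return kl.sum(axis=-1)

def mean_kl(old_policy, new_policy, states):
    """Sample estimate of the averaged KL: mean over states drawn from rho_old."""
    mu_old, log_std_old = old_policy(states)   # parameters of pi_{theta_old}(.|s)
    mu_new, log_std_new = new_policy(states)   # parameters of pi_theta(.|s)
    return gaussian_kl(mu_old, log_std_old, mu_new, log_std_new).mean()
```

This is exactly the quantity that replaces $D_{\mathrm{KL}}^{\max}$ in the constraint: an expectation over visited states instead of a maximum over all states.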

5 Sample-Based Estimation of the Objective and Constraint

$$\underset{\theta}{\operatorname{maximize}}\ \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim q}\left[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{\mathrm{old}}}(s, a)\right] \quad \text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta$$

  1. $s \sim \rho_{\theta_{\mathrm{old}}}$

    • the weighted sum over states becomes an expectation, by the definition of expectation
  2. $a \sim q$
    • the sum over actions becomes an expectation via importance sampling from a behavior distribution $q$
  3. $Q_{\theta_{\mathrm{old}}}(s, a)$ in place of $A_{\theta_{\mathrm{old}}}(s, a)$
    • by the fact that

      1. $A(s, a) = Q(s, a) - V(s)$
      2. the resulting extra term $\sum_{s} \rho_{\theta_{\mathrm{old}}}(s) V_{\theta_{\mathrm{old}}}(s)$ is a constant independent of $\theta$, so the swap only shifts the objective
    • so Monte Carlo returns can be used to estimate $Q$ directly (there is no equally direct MC estimator for the advantage, which would also require a value baseline)
    • estimation methods (a sample-based sketch follows this list)
      1. single path
      2. vine
        • lower-variance estimates
        • but needs many more calls to the simulator
        • limited to systems whose state can be reset to arbitrary points, i.e. essentially simulation
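
A minimal single-path sketch of the sample-based objective, assuming discrete actions and that `old_probs` / `new_probs` hold the probabilities of the taken actions under $\pi_{\theta_{\mathrm{old}}}$ and $\pi_{\theta}$ (all names are illustrative, not from the paper's code):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Single-path Monte Carlo estimate of Q(s_t, a_t): the discounted reward-to-go."""
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

def surrogate_objective(new_probs, old_probs, q_values):
    """Importance-sampled surrogate E[ (pi_theta / q) * Q_old ], with q = pi_theta_old."""
    ratios = new_probs / old_probs            # pi_theta(a_t|s_t) / pi_old(a_t|s_t)
    return np.mean(ratios * q_values)
```

In the single-path scheme the sampling distribution is simply $q = \pi_{\theta_{\mathrm{old}}}$ and $Q$ is the empirical reward-to-go; vine instead branches several short rollouts from common states to get lower-variance estimates at the cost of more simulation.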

6 Practical Algorithm

  1. Estimate Q from collected trajectories (single path or vine)
  2. Calculate the sample-based objective and its KL constraint
  3. Solve the constrained problem approximately with the conjugate gradient algorithm followed by a line search (a sketch follows the references)
    • references

      1. paper Appendix C
      2. post with theory and implementation: https://www.telesens.co/2018/06/09/efficiently-computing-the-fisher-vector-product-in-trpo/
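
A minimal sketch of the update computation, assuming a function `fisher_vector_product(v)` that returns $Fv$ (e.g. via a Hessian-vector product of the averaged KL, as in Appendix C and the post above) and the policy-gradient vector `g`; the names and the omitted backtracking line search are illustrative assumptions:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()            # residual g - F x (x starts at 0)
    p = g.copy()            # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def trpo_step_direction(fvp, g, delta):
    """Maximize g^T s subject to 0.5 * s^T F s <= delta (quadratic approximation of the KL)."""
    s = conjugate_gradient(fvp, g)               # s ~= F^{-1} g
    beta = np.sqrt(2.0 * delta / (s @ fvp(s)))   # largest scale allowed by the constraint
    return beta * s
```

In the full algorithm the returned step is then shrunk by a backtracking line search until the exact averaged KL is below $\delta$ and the surrogate objective actually improves.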

8 Experiments

  1. TRPO is related to prior methods (e.g. the natural policy gradient) but makes several changes, most notably using a fixed KL divergence constraint rather than a fixed penalty coefficient.

    • These results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress than using a fixed penalty.
  2. TRPO can obtain high-quality locomotion controllers from scratch, which is considered a hard problem.
  3. the proposed method is scalable and has strong theoretical foundations.
