Reinforcement Learning: Soft Actor-Critic (SAC)

  • 1. Basic Concepts
    • 1.1 soft Q-value
    • 1.2 soft state value function
    • 1.3 Soft Policy Evaluation
    • 1.4 policy improvement
    • 1.5 soft policy improvement
    • 1.6 soft policy iteration
  • 2. soft actor critic
    • 2.1 soft value function
    • 2.2 soft Q-function
    • 2.3 policy improvement
  • 3. Algorithm Flow

1. Basic Concepts

1.1 soft Q-value

$$\tau^\pi Q(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}\big[V(s_{t+1})\big]$$

1.2 soft state value function

$$V(s_t)=E_{a_t\sim\pi}\big[Q(s_t,a_t)-\alpha\cdot\log\pi(a_t\mid s_t)\big]$$
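As a quick sanity check of these two definitions, the sketch below evaluates the soft value of one state with three discrete actions under a softmax (maximum-entropy) policy, for which the soft value reduces to a log-sum-exp of the Q-values. The Q-values and the temperature $\alpha$ are made-up illustrations, not taken from this post.

```python
import numpy as np

# Soft state value V(s) = E_a[Q(s,a) - alpha * log pi(a|s)] for one state
# with 3 discrete actions (illustrative numbers only).
alpha = 0.5
Q = np.array([1.0, 2.0, 0.5])                      # Q(s, a) for the 3 actions
pi = np.exp(Q / alpha)
pi /= pi.sum()                                     # softmax (maximum-entropy) policy

V = np.sum(pi * (Q - alpha * np.log(pi)))          # soft value by definition
V_lse = alpha * np.log(np.sum(np.exp(Q / alpha)))  # alpha * log-sum-exp(Q / alpha)
print(V, V_lse)                                    # identical for the softmax policy
```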

1.3 Soft Policy Evaluation

$$Q^{k+1}=\tau^\pi Q^k$$
As $k\to\infty$, $Q^k$ converges to the soft Q-value of $\pi$.
Proof: define the entropy-augmented reward
$$r_\pi(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}\big[\mathcal H(\pi(\cdot\mid s_{t+1}))\big]$$
Expanding the backup with this reward,
$$Q(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}\big[\mathcal H(\pi(\cdot\mid s_{t+1}))\big]+\gamma\cdot E_{s_{t+1},a_{t+1}\sim\rho_\pi}\big[Q(s_{t+1},a_{t+1})\big]$$
$$Q(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1},a_{t+1}\sim\rho_\pi}\big[-\log\pi(a_{t+1}\mid s_{t+1})\big]+\gamma\cdot E_{s_{t+1},a_{t+1}\sim\rho_\pi}\big[Q(s_{t+1},a_{t+1})\big]$$
$$Q(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1},a_{t+1}\sim\rho_\pi}\big[Q(s_{t+1},a_{t+1})-\log\pi(a_{t+1}\mid s_{t+1})\big]$$
When $|A|<\infty$ the entropy term is bounded, so the entropy-augmented reward is bounded and the standard policy-evaluation convergence argument applies.
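This fixed-point iteration can be run directly in the tabular case. Below is a minimal sketch on a hypothetical 2-state, 2-action MDP with $\alpha$ fixed to 1; the rewards, transition probabilities, and policy are random placeholders, not from the post.

```python
import numpy as np

# Tabular soft policy evaluation (alpha = 1) on a tiny made-up MDP with
# 2 states and 2 actions: iterate Q <- tau^pi Q until it stops changing.
rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9
r = rng.uniform(0.0, 1.0, size=(nS, nA))          # r(s, a)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # P(s' | s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)          # fixed policy pi(a | s)

Q = np.zeros((nS, nA))
for k in range(1000):
    V = np.sum(pi * (Q - np.log(pi)), axis=1)     # soft state value of each state
    Q_new = r + gamma * (P @ V)                   # soft Bellman backup tau^pi Q
    if np.max(np.abs(Q_new - Q)) < 1e-10:         # contraction => convergence
        break
    Q = Q_new
print(f"converged after {k} iterations to\n{Q}")
```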

1.4 policy improvement

$$\pi_{new}=\arg\min_{\pi'\in\Pi} D_{KL}\!\left(\pi'(\cdot\mid s_t)\,\Big\|\,\frac{\exp\!\big(Q^{\pi_{old}}(s_t,\cdot)\big)}{Z^{\pi_{old}}(s_t)}\right)$$
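For a finite action set and an unrestricted policy class $\Pi$, this KL projection has a closed form: the minimizer is simply the softmax of $Q^{\pi_{old}}(s_t,\cdot)$. A tiny illustrative sketch (the Q-values are made up):

```python
import numpy as np

# In the discrete, unrestricted-policy case the KL projection has a closed form:
# the new policy is softmax(Q^{pi_old}(s, .)).  Illustrative numbers only.
Q_old = np.array([1.0, 2.0, 0.5])      # Q^{pi_old}(s, a) for one state
Z = np.sum(np.exp(Q_old))              # partition function Z^{pi_old}(s)
pi_new = np.exp(Q_old) / Z             # argmin of the KL divergence
print(pi_new)                          # higher-Q actions get more probability mass
```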

1.5 soft policy improvement

$$Q^{\pi_{new}}(s_t,a_t)\ge Q^{\pi_{old}}(s_t,a_t)$$
subject to:
$$\pi_{old}\in\Pi,\quad (s_t,a_t)\in S\times A,\quad |A|<\infty$$
The proof is as follows:
$$\pi_{new}=\arg\min_{\pi'\in\Pi} D_{KL}\Big(\pi'(\cdot\mid s_t)\,\Big\|\,\exp\big(Q^{\pi_{old}}(s_t,\cdot)-\log Z^{\pi_{old}}(s_t)\big)\Big)=\arg\min_{\pi'\in\Pi} J_{\pi_{old}}\big(\pi'(\cdot\mid s_t)\big)$$
$$J_{\pi_{old}}\big(\pi'(\cdot\mid s_t)\big)=E_{a_t\sim\pi'}\big[\log\pi'(a_t\mid s_t)-Q^{\pi_{old}}(s_t,a_t)+\log Z^{\pi_{old}}(s_t)\big]$$
Since $\pi'=\pi_{old}$ is always a feasible choice, the minimizer satisfies $J_{\pi_{old}}(\pi_{new})\le J_{\pi_{old}}(\pi_{old})$, and the $\log Z^{\pi_{old}}(s_t)$ term cancels from both sides because it does not depend on the action, so:
$$E_{a_t\sim\pi_{new}}\big[\log\pi_{new}(a_t\mid s_t)-Q^{\pi_{old}}(s_t,a_t)\big]\le E_{a_t\sim\pi_{old}}\big[\log\pi_{old}(a_t\mid s_t)-Q^{\pi_{old}}(s_t,a_t)\big]$$

The right-hand side is exactly $-V^{\pi_{old}}(s_t)$, hence
$$E_{a_t\sim\pi_{new}}\big[\log\pi_{new}(a_t\mid s_t)-Q^{\pi_{old}}(s_t,a_t)\big]\le -V^{\pi_{old}}(s_t)$$
$$E_{a_t\sim\pi_{new}}\big[Q^{\pi_{old}}(s_t,a_t)-\log\pi_{new}(a_t\mid s_t)\big]\ge V^{\pi_{old}}(s_t)$$
Repeatedly applying this bound inside the Bellman backup:
$$Q^{\pi_{old}}(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}\big[V^{\pi_{old}}(s_{t+1})\big]\le r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}\,E_{a_{t+1}\sim\pi_{new}}\big[Q^{\pi_{old}}(s_{t+1},a_{t+1})-\log\pi_{new}(a_{t+1}\mid s_{t+1})\big]\le\cdots\le Q^{\pi_{new}}(s_t,a_t)$$

1.6 soft policy iteration

Assumptions: $|A|<\infty$ and $\pi\in\Pi$.
By repeatedly alternating soft policy evaluation and soft policy improvement, the policy converges to $\pi^\star$, which satisfies
$$Q^{\pi^\star}(s_t,a_t)\ge Q^{\pi}(s_t,a_t)\quad\text{for all }\pi\in\Pi\text{ and }(s_t,a_t)\in S\times A$$
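A minimal sketch of the full iteration, alternating the tabular evaluation and the softmax improvement from the previous two snippets until the policy stops changing (again a hypothetical 2-state, 2-action MDP with $\alpha=1$):

```python
import numpy as np

# Tabular soft policy iteration (alpha = 1): alternate soft policy evaluation
# with the softmax improvement step until the policy stops changing.
rng = np.random.default_rng(1)
nS, nA, gamma = 2, 2, 0.9
r = rng.uniform(0.0, 1.0, size=(nS, nA))
P = rng.dirichlet(np.ones(nS), size=(nS, nA))

pi = np.full((nS, nA), 1.0 / nA)                  # start from the uniform policy
for _ in range(100):
    Q = np.zeros((nS, nA))
    for _ in range(2000):                         # soft policy evaluation
        V = np.sum(pi * (Q - np.log(pi)), axis=1)
        Q = r + gamma * (P @ V)
    pi_new = np.exp(Q - Q.max(axis=1, keepdims=True))
    pi_new /= pi_new.sum(axis=1, keepdims=True)   # softmax improvement (KL projection)
    if np.max(np.abs(pi_new - pi)) < 1e-8:
        break
    pi = pi_new
print("soft-optimal policy:\n", pi)
```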

2. soft actor critic

2.1 soft value function

  1. loss function
    $$J_V(\psi)=E_{s_t\sim D}\Big[\tfrac{1}{2}\Big(V_\psi(s_t)-E_{a_t\sim\pi_\phi}\big[Q_\theta(s_t,a_t)-\log\pi_\phi(a_t\mid s_t)\big]\Big)^2\Big]$$
  2. gradient (a code sketch follows this list)
    $$\hat\nabla_\psi J_V(\psi)=\nabla_\psi V_\psi(s_t)\cdot\big(V_\psi(s_t)-Q_\theta(s_t,a_t)+\log\pi_\phi(a_t\mid s_t)\big)$$
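A minimal PyTorch sketch of this value-function update; the small MLP, batch shapes, and the placeholder Q-values and log-probabilities are illustrative assumptions, not code from the original post.

```python
import torch

# Estimate of J_V(psi) and its gradient for a small MLP value network.
# In a real agent, q and logp come from Q_theta and pi_phi.
V = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

s = torch.randn(8, 3)                         # batch of states sampled from D
q = torch.randn(8, 1)                         # Q_theta(s, a) with a ~ pi_phi (placeholder)
logp = torch.randn(8, 1)                      # log pi_phi(a | s) (placeholder)

target = (q - logp).detach()                  # E_a[Q - log pi]; detached so only psi is trained
loss_v = 0.5 * ((V(s) - target) ** 2).mean()  # J_V(psi)
loss_v.backward()                             # gradient matches the estimator above
```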

2.2 soft Q-function

  1. loss function
    $$J_Q(\theta)=E_{(s_t,a_t)\sim D}\Big[\tfrac{1}{2}\big(Q_\theta(s_t,a_t)-\hat Q(s_t,a_t)\big)^2\Big]$$
    $$\hat Q(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}\big[V_{\bar\psi}(s_{t+1})\big]$$
  2. gradient (a code sketch follows this list)
    $$\hat\nabla_\theta J_Q(\theta)=\nabla_\theta Q_\theta(s_t,a_t)\cdot\big[Q_\theta(s_t,a_t)-r(s_t,a_t)-\gamma\cdot V_{\bar\psi}(s_{t+1})\big]$$
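A matching PyTorch sketch of the soft Q-function update; the networks and batch contents are placeholders, with $V_{\bar\psi}$ represented by a separate (target) network.

```python
import torch

# One-step TD target and loss for the soft Q-function (illustrative shapes only).
Q = torch.nn.Sequential(torch.nn.Linear(3 + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
V_bar = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

s, a = torch.randn(8, 3), torch.randn(8, 1)        # (s_t, a_t) from the replay buffer D
r, s_next = torch.randn(8, 1), torch.randn(8, 3)   # r(s_t, a_t) and s_{t+1}
gamma = 0.99

with torch.no_grad():                              # hat Q uses the target value network psi-bar
    q_hat = r + gamma * V_bar(s_next)
loss_q = 0.5 * ((Q(torch.cat([s, a], dim=1)) - q_hat) ** 2).mean()   # J_Q(theta)
loss_q.backward()
```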

2.3 policy improvement

  1. loss function
    $$J_\pi(\phi)=E_{s_t\sim D}\Big[D_{KL}\Big(\pi_\phi(\cdot\mid s_t)\,\Big\|\,\frac{\exp\big(Q_\theta(s_t,\cdot)\big)}{Z_\theta(s_t)}\Big)\Big]$$
    Reparameterize the policy:
    $$a_t=f_\phi(\epsilon_t;s_t)=f_\phi^\mu(s_t)+\epsilon_t\cdot f_\phi^\sigma(s_t)$$
    $$J_\pi(\phi)=E_{s_t\sim D,\;\epsilon_t\sim\mathcal N}\big[\log\pi_\phi\big(f_\phi(\epsilon_t;s_t)\mid s_t\big)-Q_\theta\big(s_t,f_\phi(\epsilon_t;s_t)\big)\big]$$
  2. gradient (via the reparameterization trick; a code sketch follows this list)
    $$\nabla_\theta E_{q_\theta(Z)}\big[f_\theta(Z)\big]=E_{q_\theta(Z)}\Big[\frac{\partial f_\theta(Z)}{\partial\theta}\Big]+E_{q_\theta(Z)}\Big[\frac{d f_\theta(Z)}{dZ}\cdot\frac{dZ}{d\theta}\Big]$$
    $$\hat\nabla_\phi J_\pi(\phi)=\nabla_\phi\log\pi_\phi(a_t\mid s_t)+\nabla_\phi f_\phi(\epsilon_t;s_t)\cdot\big(\nabla_{a_t}\log\pi_\phi(a_t\mid s_t)-\nabla_{a_t}Q_\theta(s_t,a_t)\big)$$
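A PyTorch sketch of the reparameterized policy update for a diagonal Gaussian policy; the networks are placeholders, and the tanh squashing used in the actual SAC implementation is omitted for brevity.

```python
import torch

# Reparameterised policy loss J_pi(phi) for a diagonal Gaussian policy
# a_t = mu_phi(s_t) + eps_t * sigma_phi(s_t).  Everything here is a sketch.
mu_net = torch.nn.Linear(3, 1)                      # f_phi^mu
log_std = torch.nn.Parameter(torch.zeros(1))        # log f_phi^sigma (state-independent here)
Q = torch.nn.Sequential(torch.nn.Linear(3 + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

s = torch.randn(8, 3)
eps = torch.randn(8, 1)                             # eps_t ~ N(0, I)
a = mu_net(s) + eps * log_std.exp()                 # a_t = f_phi(eps_t; s_t), differentiable in phi
dist = torch.distributions.Normal(mu_net(s), log_std.exp())
logp = dist.log_prob(a).sum(dim=1, keepdim=True)    # log pi_phi(a_t | s_t)

loss_pi = (logp - Q(torch.cat([s, a], dim=1))).mean()   # J_pi(phi)
loss_pi.backward()                                  # both gradient paths flow through a_t
```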

3. Algorithm Flow
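Putting the three updates together, here is a minimal end-to-end training-loop sketch. The toy environment, replay buffer, network sizes, and hyper-parameters are all illustrative assumptions; the published algorithm also uses two Q-networks (taking their minimum) and a tanh-squashed policy, both omitted here to keep the sketch short.

```python
import torch, random, collections

# SAC training-loop sketch (original-paper variant with a separate value network).
S_DIM, A_DIM, GAMMA, TAU, BATCH = 3, 1, 0.99, 0.005, 64

def mlp(inp, out):
    return torch.nn.Sequential(torch.nn.Linear(inp, 64), torch.nn.ReLU(), torch.nn.Linear(64, out))

V, V_bar, Qf = mlp(S_DIM, 1), mlp(S_DIM, 1), mlp(S_DIM + A_DIM, 1)
V_bar.load_state_dict(V.state_dict())               # target value network psi-bar
mu, log_std = mlp(S_DIM, A_DIM), torch.nn.Parameter(torch.zeros(A_DIM))
opt_v = torch.optim.Adam(V.parameters(), lr=3e-4)
opt_q = torch.optim.Adam(Qf.parameters(), lr=3e-4)
opt_pi = torch.optim.Adam(list(mu.parameters()) + [log_std], lr=3e-4)
buffer = collections.deque(maxlen=100_000)

def sample_action(s):
    dist = torch.distributions.Normal(mu(s), log_std.exp())
    a = dist.rsample()                               # reparameterised sample f_phi(eps; s)
    return a, dist.log_prob(a).sum(-1, keepdim=True)

def fake_env_step(s, a):                             # stand-in for a real environment
    return torch.randn(S_DIM), float(-(a ** 2).sum())

s = torch.randn(S_DIM)
for _ in range(10_000):
    with torch.no_grad():
        a, _ = sample_action(s.unsqueeze(0))
    s_next, r = fake_env_step(s, a.squeeze(0))
    buffer.append((s, a.squeeze(0), r, s_next))      # store transition in D
    s = s_next
    if len(buffer) < BATCH:
        continue

    batch = random.sample(buffer, BATCH)
    S = torch.stack([b[0] for b in batch]); A = torch.stack([b[1] for b in batch])
    R = torch.tensor([[b[2]] for b in batch]); S2 = torch.stack([b[3] for b in batch])

    # 1) soft value update: target is E_a[Q(s,a) - log pi(a|s)]
    with torch.no_grad():
        a_new, logp = sample_action(S)
        v_target = Qf(torch.cat([S, a_new], -1)) - logp
    loss_v = 0.5 * ((V(S) - v_target) ** 2).mean()
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()

    # 2) soft Q update: TD target uses the slow-moving target network psi-bar
    with torch.no_grad():
        q_hat = R + GAMMA * V_bar(S2)
    loss_q = 0.5 * ((Qf(torch.cat([S, A], -1)) - q_hat) ** 2).mean()
    opt_q.zero_grad(); loss_q.backward(); opt_q.step()

    # 3) policy update via the reparameterisation trick
    a_new, logp = sample_action(S)
    loss_pi = (logp - Qf(torch.cat([S, a_new], -1))).mean()
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()

    # 4) Polyak averaging of the target value network
    with torch.no_grad():
        for p, p_bar in zip(V.parameters(), V_bar.parameters()):
            p_bar.mul_(1 - TAU).add_(TAU * p)
```

The Polyak step at the end maintains the slowly moving target network $V_{\bar\psi}$ that appears in the Q-function target above.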


By CyrusMay 2022.09.06
However vast the world, it is still just you and me
Building a universe out of the smallest of memories
(五月天, "因为你 所以我")
