Reinforcement Learning — Soft Actor-Critic (SAC)
- 1. Basic concepts
- 1.1 soft Q-value
- 1.2 soft state value function
- 1.3 Soft Policy Evaluation
- 1.4 policy improvement
- 1.5 soft policy improvement
- 1.6 soft policy iteration
- 2. soft actor critic
- 2.1 soft value function
- 2.2 soft Q-function
- 2.3 policy improvement
- 3. Algorithm flow
1. Basic concepts
1.1 soft Q-value
$$\tau^\pi Q(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}[V(s_{t+1})]$$
1.2 soft state value function
$$V(s_t)=E_{a_t\sim\pi}[Q(s_t,a_t)-\alpha\cdot\log\pi(a_t|s_t)]$$
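As a quick sanity check on this definition (not from the original post), the soft value of a single state with two discrete actions can be computed directly; the Q-values and α below are made-up numbers:

```python
import numpy as np

# Hypothetical Q-values for one state with two discrete actions.
Q = np.array([1.0, 2.0])
alpha = 1.0

# Softmax policy pi(a) ∝ exp(Q(a)/alpha), the entropy-regularized optimum.
logits = Q / alpha
pi = np.exp(logits - logits.max())
pi /= pi.sum()

# Soft state value: V = E_{a~pi}[Q(a) - alpha * log pi(a)].
V = np.sum(pi * (Q - alpha * np.log(pi)))

# For the softmax policy this equals alpha * log sum_a exp(Q(a)/alpha).
V_closed_form = alpha * np.log(np.sum(np.exp(Q / alpha)))
print(V, V_closed_form)
```

Note that V exceeds max_a Q(s,a): the entropy bonus turns the soft value into a smoothed maximum, and for the softmax policy it equals α·log Σ_a exp(Q(s,a)/α) exactly.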
1.3 Soft Policy Evaluation
$$Q^{k+1}=\tau^\pi Q^k$$
As $k\to\infty$, $Q^k$ converges to the soft Q-value of $\pi$.
Proof:
$$r_\pi(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}[H(\pi(\cdot|s_{t+1}))]$$
$$Q(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}[H(\pi(\cdot|s_{t+1}))]+\gamma\cdot E_{s_{t+1},a_{t+1}\sim\rho_\pi}[Q(s_{t+1},a_{t+1})]$$
$$Q(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1},a_{t+1}\sim\rho_\pi}[-\log\pi(a_{t+1}|s_{t+1})]+\gamma\cdot E_{s_{t+1},a_{t+1}\sim\rho_\pi}[Q(s_{t+1},a_{t+1})]$$
$$Q(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1},a_{t+1}\sim\rho_\pi}[Q(s_{t+1},a_{t+1})-\log\pi(a_{t+1}|s_{t+1})]$$
When |A| < ∞, the entropy term is bounded, so the usual contraction argument for policy evaluation guarantees convergence.
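The contraction can be checked numerically. A minimal sketch on a made-up two-state, two-action MDP (rewards and transitions are random placeholders, α = 1):

```python
import numpy as np

# Tiny illustrative MDP: 2 states, 2 actions.
nS, nA, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)
r = rng.uniform(0, 1, size=(nS, nA))          # reward r(s, a)
P = rng.dirichlet(np.ones(nS), size=(nS, nA)) # transition p(s' | s, a)
pi = np.full((nS, nA), 0.5)                   # fixed stochastic policy

# Repeatedly apply the soft Bellman backup: Q <- r + gamma * E[V(s')],
# with V(s) = E_{a~pi}[Q(s,a) - log pi(a|s)].
Q = np.zeros((nS, nA))
for _ in range(500):
    V = np.sum(pi * (Q - np.log(pi)), axis=1)  # soft state value
    Q_new = r + gamma * P @ V                  # soft Bellman backup
    diff = np.abs(Q_new - Q).max()
    Q = Q_new
print("final backup change:", diff)
```

The per-iteration change shrinks geometrically (rate γ), so after a few hundred backups Q has converged to the soft Q-value of π.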
1.4 policy improvement
$$\pi_{new}=\arg\min_{\pi'\in\Pi}D_{KL}\Big(\pi'(\cdot|s_t)\,\Big\|\,\frac{\exp(Q^{\pi_{old}}(s_t,\cdot))}{Z^{\pi_{old}}(s_t)}\Big)$$
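In the discrete case with Π unrestricted, the KL minimizer is simply the softmax of the Q-values (Z is just the normalizer). A small example with made-up numbers (α = 1):

```python
import numpy as np

# Made-up Q-values for one state with three discrete actions.
Q_old = np.array([0.5, 1.5, -0.2])

# KL projection onto all distributions: pi_new(a) ∝ exp(Q_old(a)).
pi_new = np.exp(Q_old - Q_old.max())
pi_new /= pi_new.sum()

# Compare the soft value E_pi[Q - log pi] before (uniform) and after.
pi_old = np.full(3, 1 / 3)
soft_value = lambda p: np.sum(p * (Q_old - np.log(p)))
print(soft_value(pi_old), soft_value(pi_new))
```

The projected policy attains the highest possible soft value, log Σ_a exp(Q(a)), which is the mechanism behind the improvement result proved next.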
1.5 soft policy improvement
$$Q^{\pi_{new}}(s_t,a_t)\ge Q^{\pi_{old}}(s_t,a_t)$$
subject to:
$$\pi_{old}\in\Pi,\quad (s_t,a_t)\in S\times A,\quad |A|<\infty$$
Proof:
$$\pi_{new}=\arg\min_{\pi'\in\Pi}D_{KL}\big(\pi'(\cdot|s_t)\,\big\|\,\exp(Q^{\pi_{old}}(s_t,\cdot)-\log Z^{\pi_{old}}(s_t))\big)=\arg\min_{\pi'\in\Pi}J_{\pi_{old}}(\pi'(\cdot|s_t))$$
$$J_{\pi_{old}}(\pi'(\cdot|s_t))=E_{a_t\sim\pi'}[\log\pi'(a_t|s_t)-Q^{\pi_{old}}(s_t,a_t)+\log Z^{\pi_{old}}(s_t)]$$
Since one can always choose $\pi_{new}=\pi_{old}$, the minimizer must satisfy $J_{\pi_{old}}(\pi_{new})\le J_{\pi_{old}}(\pi_{old})$; the $\log Z^{\pi_{old}}(s_t)$ term does not depend on the policy and cancels, leaving:
$$E_{a_t\sim\pi_{new}}[\log\pi_{new}(a_t|s_t)-Q^{\pi_{old}}(s_t,a_t)]\le E_{a_t\sim\pi_{old}}[\log\pi_{old}(a_t|s_t)-Q^{\pi_{old}}(s_t,a_t)]$$
The right-hand side is exactly $-V^{\pi_{old}}(s_t)$, so:
$$E_{a_t\sim\pi_{new}}[\log\pi_{new}(a_t|s_t)-Q^{\pi_{old}}(s_t,a_t)]\le -V^{\pi_{old}}(s_t)$$
$$E_{a_t\sim\pi_{new}}[Q^{\pi_{old}}(s_t,a_t)-\log\pi_{new}(a_t|s_t)]\ge V^{\pi_{old}}(s_t)$$
Repeatedly expanding the soft Bellman equation and applying this bound:
$$Q^{\pi_{old}}(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}[V^{\pi_{old}}(s_{t+1})]\le r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}E_{a_{t+1}\sim\pi_{new}}[Q^{\pi_{old}}(s_{t+1},a_{t+1})-\log\pi_{new}(a_{t+1}|s_{t+1})]\le\cdots\le Q^{\pi_{new}}(s_t,a_t)$$
1.6 soft policy iteration
Assumptions:
$$|A|<\infty,\quad \pi\in\Pi$$
Alternating soft policy evaluation and soft policy improvement converges to an optimal policy $\pi^\star$ satisfying:
$$Q^{\pi^\star}(s_t,a_t)\ge Q^{\pi}(s_t,a_t)\quad\text{for all }\pi\in\Pi$$
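A tabular sketch of the full iteration, on the same kind of made-up two-state MDP as before (α = 1; rewards and transitions are random placeholders):

```python
import numpy as np

# Tabular soft policy iteration on an illustrative 2-state / 2-action MDP.
nS, nA, gamma = 2, 2, 0.9
rng = np.random.default_rng(1)
r = rng.uniform(0, 1, size=(nS, nA))
P = rng.dirichlet(np.ones(nS), size=(nS, nA))

pi = np.full((nS, nA), 0.5)
for _ in range(200):
    # Soft policy evaluation: iterate the soft Bellman backup to convergence.
    Q = np.zeros((nS, nA))
    for _ in range(300):
        V = np.sum(pi * (Q - np.log(pi)), axis=1)
        Q = r + gamma * P @ V
    # Soft policy improvement: KL projection = per-state softmax of Q.
    pi_new = np.exp(Q - Q.max(axis=1, keepdims=True))
    pi_new /= pi_new.sum(axis=1, keepdims=True)
    if np.abs(pi_new - pi).max() < 1e-12:
        break
    pi = pi_new
print("converged policy:\n", pi)
```

At convergence the policy is a fixed point: improving once more leaves it unchanged, which is the tabular analogue of π*.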
2. soft actor critic
2.1 soft value function
- loss function
$$J_V(\psi)=E_{s_t\sim D}\Big[\tfrac{1}{2}\big(V_\psi(s_t)-E_{a_t\sim\pi_\phi}[Q_\theta(s_t,a_t)-\log\pi_\phi(a_t|s_t)]\big)^2\Big]$$
- gradient
$$\hat\nabla_\psi J_V(\psi)=\nabla_\psi V_\psi(s_t)\cdot\big(V_\psi(s_t)-Q_\theta(s_t,a_t)+\log\pi_\phi(a_t|s_t)\big)$$
2.2 soft Q-function
- loss function
$$J_Q(\theta)=E_{(s_t,a_t)\sim D}\Big[\tfrac{1}{2}\big(Q_\theta(s_t,a_t)-\hat Q(s_t,a_t)\big)^2\Big]$$
$$\hat Q(s_t,a_t)=r(s_t,a_t)+\gamma\cdot E_{s_{t+1}\sim p}[V_{\bar\psi}(s_{t+1})]$$
- gradient
$$\hat\nabla_\theta J_Q(\theta)=\nabla_\theta Q_\theta(s_t,a_t)\cdot\big(Q_\theta(s_t,a_t)-r(s_t,a_t)-\gamma\cdot V_{\bar\psi}(s_{t+1})\big)$$
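The one-sample gradient above has the generic "feature times TD error" shape. A minimal check (not from the post) for a linear Q_θ(s,a) = θ·x(s,a), with made-up features, reward, and frozen target value, comparing it against finite differences of the loss:

```python
import numpy as np

# Illustrative single-sample check of grad J_Q for a linear Q-function.
rng = np.random.default_rng(2)
theta = rng.normal(size=4)
x = rng.normal(size=4)            # made-up feature vector for one (s_t, a_t)
r_t, gamma, V_next = 0.7, 0.99, 1.3  # made-up reward and target V(s_{t+1})

Q = lambda th: th @ x
loss = lambda th: 0.5 * (Q(th) - (r_t + gamma * V_next)) ** 2

# Analytic form from the formula: grad_theta Q * (Q - r - gamma * V(s')).
grad = x * (Q(theta) - r_t - gamma * V_next)

# Central finite differences of the loss, coordinate by coordinate.
eps = 1e-6
fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
               for e in np.eye(4)])
print(np.abs(grad - fd).max())
```

The same check applies verbatim to the value-function gradient in 2.1, since both losses are squared errors against a target that is held fixed during differentiation.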
2.3 policy improvement
- loss function
$$J_\pi(\phi)=E_{s_t\sim D}\Big[D_{KL}\Big(\pi_\phi(\cdot|s_t)\,\Big\|\,\frac{\exp(Q_\theta(s_t,\cdot))}{Z_\theta(s_t)}\Big)\Big]$$
reparameterize the policy
$$a_t=f_\phi(\epsilon_t;s_t)=f_\phi^\mu(s_t)+\epsilon_t\cdot f_\phi^\sigma(s_t)$$
$$J_\pi(\phi)=E_{s_t\sim D,\,\epsilon_t\sim\mathcal N}[\log\pi_\phi(f_\phi(\epsilon_t;s_t)|s_t)-Q_\theta(s_t,f_\phi(\epsilon_t;s_t))]$$
- gradient
$$\nabla_\theta E_{q_\theta(Z)}[f_\theta(Z)]=E_{q_\theta(Z)}\Big[\frac{\partial f_\theta(Z)}{\partial\theta}\Big]+E_{q_\theta(Z)}\Big[\frac{df_\theta(Z)}{dZ}\cdot\frac{dZ}{d\theta}\Big]$$
$$\hat\nabla_\phi J_\pi(\phi)=\nabla_\phi\log\pi_\phi(a_t|s_t)+\nabla_\phi f_\phi(\epsilon_t;s_t)\cdot\big(\nabla_{a_t}\log\pi_\phi(a_t|s_t)-\nabla_{a_t}Q_\theta(s_t,a_t)\big),\quad a_t=f_\phi(\epsilon_t;s_t)$$
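The reparameterization trick can be illustrated on a toy objective (not from the post): estimating d/dμ E[a²] for a ~ N(μ, σ²) by writing a = μ + σε so the gradient passes through the sample. The analytic answer is 2μ; μ and σ below are made-up:

```python
import numpy as np

# Reparameterization-trick sketch: grad wrt mu of E[a^2], a ~ N(mu, sigma^2).
rng = np.random.default_rng(3)
mu, sigma = 0.5, 0.3
eps = rng.standard_normal(200_000)  # noise is sampled independently of mu
a = mu + sigma * eps                # reparameterized sample, differentiable in mu
grad_est = np.mean(2 * a)           # chain rule: d(a^2)/da * da/dmu = 2a * 1
print(grad_est, 2 * mu)
```

This is the same mechanism that lets ∇̂_φ J_π flow through a_t = f_φ(ε_t; s_t) into both the log-probability and the Q-function terms.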
3. Algorithm flow
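In pseudocode, the SAC training loop (Algorithm 1 of Haarnoja et al., 2018) alternates environment steps with gradient steps on the three objectives above; λ_V, λ_Q, λ_π are learning rates and τ is the target-smoothing coefficient:

```
initialize ψ (value), ψ̄ ← ψ (target value), θ (Q-function), φ (policy); D ← ∅
for each iteration:
    for each environment step:
        a_t ~ π_φ(·|s_t)                         # sample action from the policy
        s_{t+1} ~ p(·|s_t, a_t)                  # step the environment
        D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}
    for each gradient step (on a minibatch from D):
        ψ ← ψ − λ_V ∇̂_ψ J_V(ψ)                  # soft value function
        θ ← θ − λ_Q ∇̂_θ J_Q(θ)                  # soft Q-function(s)
        φ ← φ − λ_π ∇̂_φ J_π(φ)                  # policy
        ψ̄ ← τ ψ + (1 − τ) ψ̄                     # moving-average target update
```

The paper trains two Q-functions and uses their minimum in the value target to reduce overestimation bias; the loop structure is otherwise unchanged.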
By CyrusMay 2022.09.06
The world, however vast, is no more than you and me,
a universe built from the smallest memories.
———— Mayday (五月天), "因为你 所以我" ————