Learning Robust Rewards With Adversarial Inverse RL

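Overview
1. Background
- 1.1 GAN-IRL
- 1.2 Reward Shaping
2. Main Logic
- 2.1 Problem Definition
- 2.2 Theory in the Paper
- 2.3 How AIRL Works
  - 2.3.1 Definition of the IRL Problem
  - 2.3.2 How GCL Handles the IRL Problem
  - 2.3.3 AIRL
3. Experimental Design
4. Summary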

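Overview

This paper builds on and compares against GAIL (NIPS 2016) and GCL (ICML 2016), both covered in earlier close readings, and proposes AIRL, a reward function recovery method that is robust to dynamics changes.

The bridging article that links those earlier posts to this one is here: https://blog.csdn.net/weixin_40056577/article/details/104738587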

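This IRL algorithm is obtained by optimizing a particular formulation of adversarial reward learning.

The distinguishing feature of AIRL is that the recovered reward is portable and generalizable. The specific term is disentangled rewards: the reward that AIRL extracts from demonstrations is insensitive to changes in the environment dynamics. (So what else could an extracted reward be made insensitive to? Perturbations of the expert behaviors?)

The problem behind this is in fact a classic one: how can multiple independent pieces of information be extracted from one aggregated body of information? In more technical terms, how do we decouple (disentangle) them?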

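1. Background

1.1 GAN-IRL

The earlier close readings mainly covered algorithms related to IRL and Imitation Learning: how they work and what their procedures look like. To recap:

GAIL can be viewed as Imitation Learning that uses a GAN for data augmentation. It approaches the problem from a statistic of the expert data, the occupancy measure, uses that statistic to guide the policy update, and ends up with a policy whose occupancy measure matches that of the expert data.
GCL starts its analysis from the theoretical background of IRL. It builds a probabilistic graphical model (PGM) over the expert data to represent the expert trajectory distribution, and approximates the partition function by importance sampling over trajectories generated by the policy.
The GAN-IRL paper shows that GAN and IRL are mathematically equivalent.
AIRL then brings the idea of disentanglement into this combined IRL and GAN setting to make it more robust.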

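The IRL problem faces two key difficulties when trying to recover, from demonstrations, a reasonable and principled reward function that correctly represents the human prior:

Many optimal policies can explain a given set of expert demonstrations.
Many reward functions can explain a given optimal policy.

For the first issue, Ziebart's MaxEnt IRL framework is the soft optimal policy discussed earlier, $p(a \mid s, O_{1:T})$: from the expert data it infers a sub-optimal, stochastic policy, which handles the fact that many optimal policies can explain the data.

For the second issue, the natural question is how to pick out the reward function that truly expresses the optimal policy. A reward function is easily affected by the environment dynamics, so this paper tries to extract, from the possible reward functions, one that is robust to the environment dynamics, on the view that such a reward captures something meaningful.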

1.2 Reward Shaping

At ICML 1999, Andrew Ng et al. proposed a reward transformation:
$$\hat r(s,a,s') = r(s,a,s') + \gamma\Phi(s') - \Phi(s)$$

The function $\Phi: S \rightarrow \mathbb{R}$ can be arbitrary. The paper proves that applying this transformation to the reward does not change the corresponding optimal policy.

The ICLR 2018 AIRL paper then verifies empirically that such transformed reward functions are not robust to changes in the environment dynamics.
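The invariance claimed for the shaping transformation can be checked numerically. The following is a minimal sketch on a made-up 3-state, 2-action MDP; the transition table `TRANS`, the potential `PHI`, and the reward values are all hypothetical and chosen only for illustration. It runs plain value iteration on the original and on the shaped reward and compares the greedy policies.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2

# Hypothetical toy MDP: TRANS[s][a] is a list of (probability, next_state).
TRANS = [
    [[(0.8, 1), (0.2, 0)], [(1.0, 2)]],
    [[(1.0, 2)], [(0.5, 2), (0.5, 1)]],
    [[(1.0, 2)], [(0.9, 0), (0.1, 1)]],
]
PHI = [3.0, -1.0, 0.5]  # arbitrary potential function Phi(s)


def reward(s, a, s_next):
    """Original reward r(s, a, s') (made-up values)."""
    return [0.0, 1.0, -0.5][s_next] + 0.1 * a


def shaped_reward(s, a, s_next):
    """r_hat(s, a, s') = r(s, a, s') + gamma * Phi(s') - Phi(s)."""
    return reward(s, a, s_next) + GAMMA * PHI[s_next] - PHI[s]


def optimal_q(reward_fn, iters=500):
    """Plain value iteration; returns the table Q*(s, a)."""
    v = [0.0] * N_STATES
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(iters):
        q = [[sum(p * (reward_fn(s, a, s2) + GAMMA * v[s2])
                  for p, s2 in TRANS[s][a])
              for a in range(N_ACTIONS)]
             for s in range(N_STATES)]
        v = [max(row) for row in q]
    return q


def greedy_policy(q):
    """Greedy action per state from a Q table."""
    return [max(range(N_ACTIONS), key=lambda a: row[a]) for row in q]


q_plain = optimal_q(reward)
q_shaped = optimal_q(shaped_reward)
print(greedy_policy(q_plain), greedy_policy(q_shaped))
```

In this sketch the two Q tables differ by exactly $\Phi(s)$ per state, which is why the argmax over actions is unchanged. Note that the check holds the dynamics fixed; it says nothing about what happens when the dynamics change.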

2. Main Logic

2.1 Problem Definition

How is the optimal policy represented? A policy is usually described in one of two ways: a policy function $\pi(a \mid s)$ or a Q-function $Q(s,a)$ (roughly equivalent descriptions).
In an MDP $(S, A, T, \gamma, R)$, the two things that affect policy learning are the transition dynamics $T = p(s_{t+1} \mid s_t, a_t)$ and the reward, which is the source of the supervision signal. The optimal policy is therefore written $Q^*_{r,T}(s,a)$ or $\pi^*_{r,T}(a \mid s)$.
Definition of disentangled rewards: a learned reward $r'$ is disentangled with respect to the ground-truth reward $r$ and a set of dynamics if, for every $T$ in that set, both rewards yield the same optimal policy, i.e. $\pi^*_{r,T}(a \mid s) = \pi^*_{r',T}(a \mid s)$, where $r'$ is the model reward and $r$ is the ground-truth reward. Equivalently, the optimal Q-functions differ only by a function of the state: $Q^*_{r',T}(s,a) = Q^*_{r,T}(s,a) - f(s)$.

2.2 Theory in the Paper

Theorem 1

If the environment's dynamics model satisfies a decomposability condition, and IRL recovers a reward that depends only on the state, $r'(s)$, which produces an optimal policy satisfying
$$Q^*_{r',T}(s,a) = Q^*_{r,T}(s,a) - f(s)$$
then $r'(s)$ is disentangled.

(In short: put a constraint on the environment dynamics $T$ and assume the reward depends only on the state; this guarantees that the reward obtained by IRL is disentangled, i.e. the optimal policy satisfies $Q^*_{r',T}(s,a) = Q^*_{r,T}(s,a) - f(s)$.)

Theorem 2

If a reward function $r'(s,a,s')$ is disentangled for all dynamics functions, then it must be a state-only reward. (Not very useful in practice.)

(In short: if a reward is disentangled under all dynamics, then it has a state-only form.)

The main takeaway of the paper's analysis: when learning in only a single MDP, make the reward depend only on the state as far as possible.
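To make the state-only recommendation concrete, here is a toy sketch of how a state-action reward can become entangled with the dynamics it was learned under. Everything in it is hypothetical: a 3-state deterministic world (`START`, `GOAL`, `PIT`), a ground-truth state-only reward, and an "entangled" reward built by shaping the true reward with the optimal value function of the training dynamics and storing it as a table over $(s,a)$. At test time the effects of the two actions in the start state are swapped.

```python
GAMMA = 0.9
START, GOAL, PIT = 0, 1, 2
ACTIONS = (0, 1)

# Deterministic dynamics as next_state[s][a]; GOAL and PIT are absorbing.
T_TRAIN = [[GOAL, PIT], [GOAL, GOAL], [PIT, PIT]]
T_TEST = [[PIT, GOAL], [GOAL, GOAL], [PIT, PIT]]  # actions swapped at START

TRUE_R = [0.0, 1.0, -1.0]  # hypothetical ground-truth state-only reward r(s)


def solve(next_state, reward_fn, iters=500):
    """Value iteration for deterministic dynamics; returns (Q, V)."""
    v = [0.0, 0.0, 0.0]
    q = [[0.0, 0.0] for _ in range(3)]
    for _ in range(iters):
        q = [[reward_fn(s, a) + GAMMA * v[next_state[s][a]] for a in ACTIONS]
             for s in range(3)]
        v = [max(row) for row in q]
    return q, v


def best_action(q, s):
    return max(ACTIONS, key=lambda a: q[s][a])


def state_only(s, a):
    return TRUE_R[s]


_, v_train = solve(T_TRAIN, state_only)

# Entangled reward: r(s) + gamma * V*(T_train(s, a)) - V*(s), frozen as a
# function of (s, a). The training dynamics are baked into the table.
ENTANGLED = [[TRUE_R[s] + GAMMA * v_train[T_TRAIN[s][a]] - v_train[s]
              for a in ACTIONS] for s in range(3)]


def entangled(s, a):
    return ENTANGLED[s][a]


train_choice = best_action(solve(T_TRAIN, entangled)[0], START)
test_choice_state_only = best_action(solve(T_TEST, state_only)[0], START)
test_choice_entangled = best_action(solve(T_TEST, entangled)[0], START)
print(train_choice, test_choice_state_only, test_choice_entangled)
```

Under the training dynamics both rewards pick the action that leads to `GOAL`. After the swap, re-optimizing the state-only reward switches to the other action, while the entangled table still favors the old action, which now leads to `PIT`.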

2.3 How AIRL Works

The GAN-IRL-Energy Model article revealed the connection between GAN and IRL. A brief recap:

2.3.1 Definition of the IRL Problem

Model the expert trajectory data with $p_\theta(\tau)$, where the parameterized object is the reward $r_\theta$:

$$p_\theta(\tau) = \frac{1}{Z}\exp(-c_\theta(\tau)) = \frac{1}{Z}\exp(r_\theta(\tau))$$

The maximum likelihood objective on the expert data:

$$\begin{aligned} \min_\theta L_{cost}(\theta) &= \min_\theta E_{\tau\sim p}[-\log p_\theta(\tau)] \\ &\iff \max_\theta E_{\tau\sim p}[\log p_\theta(\tau)] \\ &= \max_\theta E_{\tau\sim p}[r_\theta(\tau)] - \log Z \\ Z &= \int \exp(r_\theta(\tau))\,d\tau \end{aligned}$$

Written in state-action form:
$$p_\theta(\tau) \propto p(s_0)\prod_{t=0}^{T}\exp(r_\theta(s_t,a_t))\,p(s_{t+1} \mid s_t,a_t)$$
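As a small worked instance of this objective, the sketch below assumes a toy setting in which the trajectory space can be enumerated, so $Z$ is an exact sum rather than an integral. The four feature vectors in `FEATURES`, the demonstration indices in `DEMOS`, and the linear reward $r_\theta(\tau) = \theta^\top \phi(\tau)$ are hypothetical choices for illustration, not something taken from the paper.

```python
import math

# Hypothetical toy problem: four enumerable trajectories, each summarized by
# a 2-d feature vector, with linear reward r_theta(tau) = theta . phi(tau).
FEATURES = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
DEMOS = [0, 0, 2, 1, 0, 3]  # indices of the expert trajectories


def traj_reward(theta, i):
    return sum(t * f for t, f in zip(theta, FEATURES[i]))


def log_partition(theta):
    """log Z = log sum_tau exp(r_theta(tau)), via a stable log-sum-exp."""
    rewards = [traj_reward(theta, i) for i in range(len(FEATURES))]
    m = max(rewards)
    return m + math.log(sum(math.exp(r - m) for r in rewards))


def traj_probs(theta):
    """p_theta(tau) = exp(r_theta(tau)) / Z for every trajectory."""
    log_z = log_partition(theta)
    return [math.exp(traj_reward(theta, i) - log_z)
            for i in range(len(FEATURES))]


def log_likelihood(theta):
    """E_expert[r_theta(tau)] - log Z."""
    mean_reward = sum(traj_reward(theta, i) for i in DEMOS) / len(DEMOS)
    return mean_reward - log_partition(theta)


def grad_log_likelihood(theta):
    """Expert feature mean minus the model's expected features."""
    probs = traj_probs(theta)
    grad = []
    for k in range(len(theta)):
        expert = sum(FEATURES[i][k] for i in DEMOS) / len(DEMOS)
        model = sum(p * FEATURES[i][k] for i, p in enumerate(probs))
        grad.append(expert - model)
    return grad


theta = [0.3, -0.2]
print(traj_probs(theta), log_likelihood(theta), grad_log_likelihood(theta))
```

For a linear reward the gradient is a difference of feature expectations, so the likelihood is maximized when the model's expected features match the expert's. In realistic problems the sum over trajectories is intractable, which is where the partition function becomes the bottleneck.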

2.3.2 How GCL Handles the IRL Problem

GCL introduces a sampling distribution $q(\tau)$ to deal with computing the partition function $Z$:

$$\begin{aligned} &\max_\theta E_{\tau\sim p}[r_\theta(\tau)] - \log Z \\ &= \max_\theta E_{\tau\sim p}[r_\theta(\tau)] - \log\Big(E_{\tau\sim q(\tau)}\Big[\frac{\exp(r_\theta(\tau))}{q(\tau)}\Big]\Big) \end{aligned}$$
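The importance-sampling identity $Z = E_{\tau\sim q}[\exp(r_\theta(\tau))/q(\tau)]$ can be sanity-checked on a discrete toy case. In the sketch below, the four trajectory rewards in `REWARDS` and the sampling distribution `Q_PROBS` are made-up numbers; with a finite trajectory set the exact $Z$ is available for comparison.

```python
import math
import random

REWARDS = [1.0, 0.5, -0.3, 0.0]  # hypothetical r_theta(tau) for 4 trajectories
Q_PROBS = [0.4, 0.3, 0.2, 0.1]   # hypothetical sampling distribution q(tau)


def exact_z():
    """Z as a plain sum, possible only because the toy set is enumerable."""
    return sum(math.exp(r) for r in REWARDS)


def estimate_z(n_samples, rng):
    """Monte Carlo estimate of E_{tau ~ q}[exp(r(tau)) / q(tau)]."""
    idx = rng.choices(range(len(REWARDS)), weights=Q_PROBS, k=n_samples)
    weights = [math.exp(REWARDS[i]) / Q_PROBS[i] for i in idx]
    return sum(weights) / n_samples


rng = random.Random(0)
z_hat = estimate_z(20000, rng)
print(exact_z(), z_hat)
```

The estimator is unbiased for any $q$ with full support, but its variance depends on how closely $q(\tau)$ tracks $\exp(r_\theta(\tau))$, which is one motivation for adapting the sampler alongside the reward.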

This naturally turns into a GAN optimization problem that alternates between the introduced sampling distribution and the reward function. The reward acts as the discriminator and the sampling distribution as the generator. The discriminator has the form:

$$D_\theta(\tau) = \frac{\exp(r_\theta(\tau))}{\exp(r_\theta(\tau)) + q(\tau)}$$

This is a trajectory-centric formulation. The paper proposes changing it to the following per-state-action form:
$$D_\theta(s,a) = \frac{\exp(f_\theta(s,a))}{\exp(f_\theta(s,a)) + \pi(a \mid s)}$$
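A discriminator of this form is a sigmoid of $f_\theta(s,a) - \log\pi(a \mid s)$, which suggests evaluating it in log space. The sketch below is one way to do that in plain Python, assuming the standard GAN binary cross-entropy objective (expert samples labeled 1, policy samples labeled 0); the function names and the sample values are illustrative only.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_d(f_value, log_pi):
    """log D, where D = exp(f) / (exp(f) + pi) = sigmoid(f - log pi)."""
    return -softplus(-(f_value - log_pi))


def log_one_minus_d(f_value, log_pi):
    """log (1 - D) for the same discriminator."""
    return -softplus(f_value - log_pi)


def discriminator_loss(expert_batch, policy_batch):
    """Binary cross-entropy over two batches of (f_value, log_pi) pairs."""
    expert_term = sum(log_d(f, lp) for f, lp in expert_batch) / len(expert_batch)
    policy_term = sum(log_one_minus_d(f, lp)
                      for f, lp in policy_batch) / len(policy_batch)
    return -(expert_term + policy_term)


# Made-up values of f(s, a) and log pi(a|s) for a few samples.
expert = [(0.5, math.log(0.2)), (1.0, math.log(0.5))]
policy = [(-1.0, math.log(0.6)), (0.0, math.log(0.3))]
print(discriminator_loss(expert, policy))
```

When $\exp(f_\theta(s,a)) = \pi(a \mid s)$ the discriminator outputs exactly 0.5, meaning it can no longer tell expert samples from policy samples.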

2.3.3 AIRL

The theory says that a state-only reward is more likely to be robust to dynamics, and that a shaped reward is not robust to dynamics. So an additional shaping term $h$, with parameters $\phi$, is parameterized separately:
$$D_{\theta,\phi}(s,a,s') = \frac{\exp(f_{\theta,\phi}(s,a,s'))}{\exp(f_{\theta,\phi}(s,a,s')) + \pi(a \mid s)}$$

Here $\pi(a \mid s)$ is the sampling policy and $f_{\theta,\phi}(s,a,s')$ is the reward, given by:
$$f_{\theta,\phi}(s,a,s') = g_\theta(s,a) + \gamma h_\phi(s') - h_\phi(s)$$

$g_\theta(s,a)$ is the reward approximator and $h_\phi(s)$ is a shaping term.
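One way to see what the decomposition buys is that the $h_\phi$ terms telescope along a trajectory: the discounted sum of $f$ equals the discounted sum of $g$ plus a boundary term that depends only on the first and last states. The sketch below checks this with hypothetical lookup tables `G` and `H` standing in for the learned networks $g_\theta$ and $h_\phi$, on a made-up trajectory.

```python
GAMMA = 0.9

# Hypothetical tabular stand-ins for the learned networks g_theta and h_phi.
G = {(0, 0): 0.2, (0, 1): -0.1, (1, 0): 1.0,
     (1, 1): 0.4, (2, 0): -0.5, (2, 1): 0.0}
H = {0: 1.5, 1: -0.7, 2: 2.0}


def f(s, a, s_next):
    """f(s, a, s') = g(s, a) + gamma * h(s') - h(s)."""
    return G[(s, a)] + GAMMA * H[s_next] - H[s]


def discounted_sum(values):
    return sum(GAMMA ** t * v for t, v in enumerate(values))


# A made-up trajectory as consecutive (s, a, s_next) transitions.
trajectory = [(0, 1, 2), (2, 0, 1), (1, 1, 1), (1, 0, 0)]

sum_f = discounted_sum(f(s, a, s2) for s, a, s2 in trajectory)
sum_g = discounted_sum(G[(s, a)] for s, a, _ in trajectory)

# Telescoped shaping contribution: gamma^T * h(s_T) - h(s_0).
first_state = trajectory[0][0]
last_state = trajectory[-1][2]
boundary = GAMMA ** len(trajectory) * H[last_state] - H[first_state]

print(sum_f, sum_g + boundary)
```

Because the shaping contribution collapses to a boundary term, $h_\phi$ can absorb shaping-like components of the learned signal while leaving $g_\theta$ as the reward approximator.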

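3. Experimental Design

The experiments revolve around two questions:

Can AIRL learn a disentangled reward that is robust to the environment dynamics? (Tested by changing the dynamics and evaluating the learned reward.)
Can AIRL solve high-dimensional continuous control tasks? Is it efficient and scalable?

The first experiment verifies whether the disentangled reward is robust under transfer, and whether the state-only or the state-action form of the reward function works better.

The second experiment is not run in a transfer setting; it is evaluated in the training setting, and mainly serves to compare whether AIRL is suitable for high-dimensional continuous control tasks.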

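4. Summary

This work continues the line of GAIL (NIPS 2016) and GCL (ICML 2016). It first extends the problem to the transfer setting, then goes deeper purely within IRL to recover a reasonably robust reward.
The main contribution is an analysis, from the IRL perspective, of what kind of dynamics allow a transferable and portable reward function to be recovered, in contrast to GAIL, which leans toward Imitation Learning.
The more interesting discussions: reward shaping is not very robust to dynamics, and the relation between the form of the reward and the constraints on the dynamics.

One-sentence summary: use IRL to recover a disentangled reward function that is robust to dynamics and suited to the transfer setting; it is an algorithm that recovers a fairly complete reward function at the state-action level.

Points worth borrowing:

How is a dynamics-robust disentangled reward introduced into the IRL problem?
Once the disentangled reward is defined, how is the dynamics-related theory explored and proved?
How is the reward shaping term parameterized under the guidance of the theory? Why does reward shaping affect the robustness of the optimal policy to dynamics?

Code: https://sites.google.com/view/adversarial-irl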
