Video-to-Video Synthesis (NeurIPS 2018)

Image-to-image translation is a widely studied problem, whereas video-to-video synthesis, its extension to the temporal domain, has received far less attention.

Applying an image-to-image translation method directly, without modeling temporal dynamics, produces temporally incoherent, low-quality video.

1 Introduction

To the authors' knowledge, no prior work had proposed a general-purpose solution to video-to-video synthesis.

The paper formulates video-to-video synthesis as a distribution matching problem.

3 Video-to-Video Synthesis

Let the source video frames be $\mathbf{s}_1^T=\left\{\mathbf{s}_1, \mathbf{s}_2, \cdots, \mathbf{s}_T\right\}$, the corresponding real video frames (the ground truth) be $\mathbf{x}_1^T=\left\{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_T\right\}$, and the generated output video frames be $\tilde{\mathbf{x}}_1^T=\left\{\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \cdots, \tilde{\mathbf{x}}_T\right\}$.

The learning objective is to match the two conditional distributions:
$$p\left(\tilde{\mathbf{x}}_1^T\mid \mathbf{s}_1^T\right)=p\left(\mathbf{x}_1^T\mid \mathbf{s}_1^T\right) \qquad (1)$$

Define a generator $G$ that models $p\left(\tilde{\mathbf{x}}_1^T\mid \mathbf{s}_1^T\right)$, with $\tilde{\mathbf{x}}_1^T=G(\mathbf{s}_1^T)$, and a discriminator $D$. The GAN objective can then be written as
$$\underset{D}{\max}\ \underset{G}{\min}\ E_{\left(\mathbf{x}_1^T,\mathbf{s}_1^T\right)}\left[\log D\left(\mathbf{x}_1^T, \mathbf{s}_1^T\right)\right]+E_{\mathbf{s}_1^T}\left[\log\left(1-D\left(G\left(\mathbf{s}_1^T\right), \mathbf{s}_1^T\right)\right)\right] \qquad (2)$$
Note that $D$ takes two inputs: a video and the source sequence it is conditioned on.
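As a rough illustration of objective (2), the sketch below computes a Monte Carlo estimate of the value function from a batch of discriminator outputs. The function name `gan_value` and the toy probabilities are assumptions made for this example and do not come from the paper.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the value in objective (2).

    d_real: outputs D(x, s) on real (video, source) pairs, each in (0, 1).
    d_fake: outputs D(G(s), s) on generated pairs, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes the value towards 0 (its upper bound).
confident = gan_value([0.99, 0.98], [0.01, 0.02])

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives log(0.5) + log(0.5).
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```

$D$ tries to increase this value, while $G$ tries to decrease it by making $D(G(\mathbf{s}), \mathbf{s})$ large.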

Sequential generator

To simplify the problem, a Markov assumption is made, which factorizes the conditional probability $p\left(\tilde{\mathbf{x}}_1^T\mid \mathbf{s}_1^T\right)$ as
$$p\left(\tilde{\mathbf{x}}_1^T\mid \mathbf{s}_1^T\right)=\prod_{t=1}^{T}p\left(\tilde{\mathbf{x}}_t\mid \tilde{\mathbf{x}}_{t-L}^{t-1}, \mathbf{s}_{t-L}^t\right) \qquad (3)$$
In words: assuming the first $t-1$ frames $\tilde{\mathbf{x}}_1^{t-1}$ have already been generated, the $t$-th frame $\tilde{\mathbf{x}}_t$ is generated from the following information:

1. the current source frame $\mathbf{s}_t$
2. the past $L$ source frames $\mathbf{s}_{t-L}^{t-1}$
3. the past $L$ generated frames $\tilde{\mathbf{x}}_{t-L}^{t-1}$

The first two items can be merged into $\mathbf{s}_{t-L}^t$. $L$ is a hyperparameter: a small value makes training unstable, while a large value increases GPU cost, so the experiments use $L=2$.

The conditional probability $p\left(\tilde{\mathbf{x}}_t\mid \tilde{\mathbf{x}}_{t-L}^{t-1}, \mathbf{s}_{t-L}^t\right)$ in Eq. (3) is modeled by a network $F$, with $\tilde{\mathbf{x}}_t=F\left(\tilde{\mathbf{x}}_{t-L}^{t-1}, \mathbf{s}_{t-L}^t\right)$, so the video can be generated frame by frame by applying $F$ recursively.
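The recursion can be sketched as a simple loop. This is a minimal illustration under assumptions: `generate_video` and `toy_F` are hypothetical names, and clipping the window at the start of the video is one possible way to handle the first frames, not necessarily what the paper does.

```python
def generate_video(F, source_frames, L=2):
    """Generate frames one at a time under the Markov assumption of Eq. (3).

    F(past_generated, source_window) returns the next frame, where
    past_generated holds up to L previous outputs and source_window
    holds the current source frame plus up to L past source frames.
    """
    generated = []
    for t in range(len(source_frames)):
        lo = max(0, t - L)                        # clip the window at the video start
        past_generated = generated[lo:t]          # plays the role of x~_{t-L}^{t-1}
        source_window = source_frames[lo:t + 1]   # plays the role of s_{t-L}^{t}
        generated.append(F(past_generated, source_window))
    return generated

# Hypothetical stand-in for the real network: it only records the window sizes it saw.
def toy_F(past_generated, source_window):
    return (len(past_generated), len(source_window))

frames = generate_video(toy_F, source_frames=list("abcde"), L=2)
```

Once the window is full, every step sees exactly $L$ generated frames and $L+1$ source frames, regardless of how long the video is.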

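Consecutive video frames are usually highly similar, which motivates the use of optical flow: if the flow between two consecutive frames is known, the next frame can be estimated by warping the previous one.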

Concretely, the network $F$ is defined as
$$F\left(\tilde{\mathbf{x}}_{t-L}^{t-1}, \mathbf{s}_{t-L}^t\right)=\left(1-\tilde{\mathbf{m}}_t\right)\odot \tilde{\mathbf{w}}_{t-1}\left(\tilde{\mathbf{x}}_{t-1}\right)+\tilde{\mathbf{m}}_t\odot \tilde{\mathbf{h}}_t \qquad (4)$$
The first term is the previous frame warped by the optical flow, the second term is a newly synthesized image (it hallucinates new pixels), and the mask $\tilde{\mathbf{m}}_t$ weighs the two.
Q: $\tilde{\mathbf{w}}_{t-1}\left(\tilde{\mathbf{x}}_{t-1}\right)$ denotes the warp operation; can it be understood directly as a matrix multiplication?
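A possible way to look at this question, offered as a side remark and not as something the paper states: for a fixed flow field, warping resamples the previous frame at displaced positions. Assuming bilinear interpolation (an assumption here, as is the sign convention of the displacement $\mathbf{u}$), each output pixel $\mathbf{p}$ is a weighted sum of at most four input pixels:

$$\left[\tilde{\mathbf{w}}_{t-1}\left(\tilde{\mathbf{x}}_{t-1}\right)\right](\mathbf{p})=\sum_{\mathbf{q}\in\mathcal{N}\left(\mathbf{p}+\mathbf{u}(\mathbf{p})\right)}\alpha_{\mathbf{p},\mathbf{q}}\,\tilde{\mathbf{x}}_{t-1}(\mathbf{q}),\qquad \alpha_{\mathbf{p},\mathbf{q}}\ge 0,\quad \sum_{\mathbf{q}}\alpha_{\mathbf{p},\mathbf{q}}=1$$

Here $\mathcal{N}(\cdot)$ denotes the four grid neighbours of a sub-pixel location. On the flattened image this is a linear map, $\mathrm{vec}\left(\tilde{\mathbf{w}}_{t-1}(\tilde{\mathbf{x}}_{t-1})\right)=A\,\mathrm{vec}\left(\tilde{\mathbf{x}}_{t-1}\right)$, with a sparse matrix $A$ whose entries are determined by the flow. So it can be read as a matrix multiplication in that sense, but not as a product between the flow field and the image, and it is normally implemented as a gather or resampling operation instead of an explicit matrix.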

Computing $F$ in turn involves three networks: $W$, $H$, and $M$.

$\tilde{\mathbf{w}}_{t-1}=W\left(\tilde{\mathbf{x}}_{t-L}^{t-1}, \mathbf{s}_{t-L}^t\right)$ is the optical flow from frame $\tilde{\mathbf{x}}_{t-1}$ to $\tilde{\mathbf{x}}_t$, predicted by the optical flow prediction network $W$.
$\tilde{\mathbf{h}}_t=H\left(\tilde{\mathbf{x}}_{t-L}^{t-1}, \mathbf{s}_{t-L}^t\right)$ is the hallucinated image produced by the generator $H$.
$\tilde{\mathbf{m}}_t=M\left(\tilde{\mathbf{x}}_{t-L}^{t-1}, \mathbf{s}_{t-L}^t\right)$ is the occlusion mask produced by the mask prediction network $M$. Unoccluded regions can be handled by the optical flow, while occluded regions can only be filled in from $\tilde{\mathbf{h}}_t$.
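The composition in Eq. (4) can be sketched with NumPy as follows. This is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, the warp uses nearest-neighbour sampling with border clipping (real implementations typically use bilinear sampling), and the flow is read as a backward lookup offset `(dx, dy)`.

```python
import numpy as np

def warp_nearest(prev_frame, flow):
    """Backward-warp prev_frame with a per-pixel flow (nearest neighbour).

    prev_frame: (H, W, C) array; flow: (H, W, 2) array of (dx, dy) offsets.
    Output pixel (y, x) is read from prev_frame at (y + dy, x + dx),
    clipped to the image border.
    """
    h, w = prev_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_frame[src_y, src_x]

def compose_frame(prev_frame, flow, hallucinated, mask):
    """Eq. (4): (1 - m) * warp(prev) + m * h, with mask of shape (H, W, 1)."""
    return (1.0 - mask) * warp_nearest(prev_frame, flow) + mask * hallucinated

# Toy arrays standing in for the outputs of W, H and M.
prev = np.arange(12, dtype=float).reshape(2, 2, 3)
flow = np.zeros((2, 2, 2))
flow[..., 0] = 1.0                      # every pixel reads from one pixel to its right
hallucinated = np.full((2, 2, 3), -1.0)
mask = np.zeros((2, 2, 1))
mask[0, 0, 0] = 1.0                     # only pixel (0, 0) is marked as occluded
frame = compose_frame(prev, flow, hallucinated, mask)
```

In this toy case pixel (0, 0) is taken entirely from the hallucinated image, while all other pixels are copied from the warped previous frame.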

The network $F$ has to be trained in a coarse-to-fine manner.

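[About optical flow]
Optical flow is defined as the velocity at which pixels move in the image. The flow between two consecutive frames has to be computed with a dedicated algorithm (for example, Gunnar Farnebäck's algorithm); this paper uses FlowNet2.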

Conditional image discriminator

Define an image-level conditional discriminator $D_I$ that distinguishes real pairs $\left(\mathbf{x}_t, \mathbf{s}_t\right)$ from fake pairs $\left(\tilde{\mathbf{x}}_t, \mathbf{s}_t\right)$.

Conditional video discriminator

Define a video-level conditional discriminator $D_V$ that, conditioned on optical flow, distinguishes real output frames from fake ones.

Concretely, for $K$ consecutive real images $\mathbf{x}_{t-K}^{t-1}$ with optical flow sequence $\mathbf{w}_{t-K}^{t-2}$ (there are $K-1$ flows between $K$ frames), $D_V$ distinguishes real pairs $\left(\mathbf{x}_{t-K}^{t-1}, \mathbf{w}_{t-K}^{t-2}\right)$ from fake pairs $\left(\tilde{\mathbf{x}}_{t-K}^{t-1}, \mathbf{w}_{t-K}^{t-2}\right)$.
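To make the indexing of the two discriminators concrete, the sketch below assembles their inputs from toy arrays. The function names, the array shapes, and the channel-wise concatenation used to form an image pair are assumptions made for illustration; the notes do not say how the pairs are fed to the networks.

```python
import numpy as np

def image_pairs(real, fake, source, t):
    """Inputs for D_I at frame t: (x_t, s_t) versus (x~_t, s_t)."""
    real_pair = np.concatenate([real[t], source[t]], axis=-1)
    fake_pair = np.concatenate([fake[t], source[t]], axis=-1)
    return real_pair, fake_pair

def video_pairs(real, fake, flows, t, K):
    """Inputs for D_V: K consecutive frames ending at t-1 plus their K-1 flows.

    flows[i] is the flow from real frame i to real frame i+1.
    """
    frames_real = real[t - K:t]          # x_{t-K}^{t-1}
    frames_fake = fake[t - K:t]          # x~_{t-K}^{t-1}
    window_flows = flows[t - K:t - 1]    # w_{t-K}^{t-2}
    return (frames_real, window_flows), (frames_fake, window_flows)

# Toy video: T = 6 frames of size 4x4, RGB outputs, 1-channel source maps.
T, H, W = 6, 4, 4
real = np.zeros((T, H, W, 3))
fake = np.ones((T, H, W, 3))
source = np.zeros((T, H, W, 1))
flows = np.zeros((T - 1, H, W, 2))

real_img, fake_img = image_pairs(real, fake, source, t=5)
(real_clip, clip_flows), (fake_clip, _) = video_pairs(real, fake, flows, t=5, K=3)
```

Both the real and the fake clip are paired with the same flow sequence, so $D_V$ can only tell them apart by checking whether the frames are consistent with that motion.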

Foreground-background prior

When semantic segmentation masks are used as the source video, the semantic classes can be split into foreground and background, and this information can be used to generate better video.

Concretely, the image hallucination network $H$ is split into a foreground model $\tilde{\mathbf{h}}_{F,t}=H_F(\mathbf{s}_{t-L}^t)$ and a background model $\tilde{\mathbf{h}}_{B,t}=H_B(\tilde{\mathbf{x}}_{t-L}^{t-1}, \mathbf{s}_{t-L}^t)$, and Eq. (4) becomes
$$F\left(\tilde{\mathbf{x}}_{t-L}^{t-1}, \mathbf{s}_{t-L}^t\right)=\left(1-\tilde{\mathbf{m}}_t\right)\odot \tilde{\mathbf{w}}_{t-1}\left(\tilde{\mathbf{x}}_{t-1}\right)+\tilde{\mathbf{m}}_t\odot \left(\left(1-\mathbf{m}_{B,t}\right)\odot \tilde{\mathbf{h}}_{F,t}+\mathbf{m}_{B,t}\odot \tilde{\mathbf{h}}_{B,t}\right) \qquad (9)$$
where $\mathbf{m}_{B,t}$ is computed from the ground-truth segmentation mask $\mathbf{s}_t$.
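A minimal NumPy sketch of Eq. (9) follows. It assumes that $\mathbf{m}_{B,t}$ equals 1 on background pixels and that it is obtained by checking each pixel's class id against a set of background classes; the function names and the class ids are hypothetical.

```python
import numpy as np

def background_mask(seg_labels, background_ids):
    """Derive m_{B,t} from a label map: 1.0 where the class is background."""
    return np.isin(seg_labels, background_ids).astype(float)[..., None]

def compose_with_prior(warped_prev, mask, h_fg, h_bg, bg_mask):
    """Eq. (9): blend the warped frame with a foreground/background hallucination.

    mask is the occlusion mask m~_t; bg_mask is m_{B,t}. Both have shape (H, W, 1).
    """
    hallucinated = (1.0 - bg_mask) * h_fg + bg_mask * h_bg
    return (1.0 - mask) * warped_prev + mask * hallucinated

# Toy 2x2 example: class 0 is treated as background, classes 1 and 2 as foreground.
seg = np.array([[0, 1],
                [2, 0]])
bg_mask = background_mask(seg, background_ids=[0])

warped = np.zeros((2, 2, 3))
h_fg = np.full((2, 2, 3), 5.0)
h_bg = np.full((2, 2, 3), 7.0)

all_new = compose_with_prior(warped, np.ones((2, 2, 1)), h_fg, h_bg, bg_mask)
all_warped = compose_with_prior(warped, np.zeros((2, 2, 1)), h_fg, h_bg, bg_mask)
```

With the occlusion mask set to 1 everywhere, background pixels come from the background model and the rest from the foreground model; with the mask set to 0, the result reduces to the warped previous frame.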

The foreground-background prior greatly improves the visual quality of the generated video, at the cost of only minor flickering in the video.

Multimodal synthesis

Randomness is introduced in feature space, which makes it possible to generate multiple different videos.

4 Experiments

Three types of video generation are carried out:

Semantic manipulation (see Figure 2)
Sketch-to-video synthesis for face swapping (see Figure 5)
Pose-to-video synthesis for human motion transfer (see Figure 6)

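[Summary]
This paper proposes a method for video-to-video synthesis, which amounts to extending pix2pix to video. Training requires two video sequences that correspond frame by frame (semantic segmentation mask -> video, sketch -> video, pose -> video). At test time, taking dance video as an example: given a video, its pose sequence is extracted and a realistic video is then generated from it, which effectively replaces the person in the video.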