Onion-Peel Networks for Deep Video Completion

Paper link

GitHub link

Contents

Overview
Network architecture
Training
Inference
Experiments

Overview

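This ICCV 2019 paper proposes the Onion-Peel Network (OPN) for video inpainting. OPN handles video completion and can also handle image completion. Below, the image to be filled is called the target, and the images used to guide the filling are called references.
For video completion, the input is a sequence of masked video frames. A set of frames is sampled as references, and the holes that the mask leaves in the target frame are filled either with content taken from the reference frames or with synthesized content that is consistent with them.
For image completion, the input is a target image plus additional reference images. With a group of images, unwanted objects can be removed from the target image without destroying the original content. This can be viewed as a special case of video completion with only a few frames. Earlier video completion methods based on optical flow do not suit this task, because computing optical flow between frames that are far apart is difficult (a group of images is effectively a set of non-consecutive frames). Optical flow describes instantaneous motion, so the estimate is only reliable when the time gap is small, e.g. between two consecutive video frames.
The name onion-peel is apt: the hole in the target is filled layer by layer, like peeling an onion. In each step the network consults the valid regions of the reference images and fills only the peel region of the target hole (the boundary of the hole), so the whole hole is filled over several iterations. This way the missing region gets richer context at every step (the layer filled in the previous step also provides information for later steps), and given enough iterations even a large hole can be filled completely.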
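To make the layer-by-layer idea concrete, here is a minimal NumPy sketch of the peeling loop only. It uses the peel definition from the inference description quoted later in this post (hole pixels within Euclidean distance p of the nearest non-hole pixel). The function names `peel_region` and `count_peel_steps` are my own, the distance computation is brute force for a toy mask, and the "filling" is simulated by simply marking the peel as filled; no network is involved.

```python
import numpy as np

def peel_region(hole: np.ndarray, p: float) -> np.ndarray:
    """Hole pixels within Euclidean distance p of the nearest non-hole pixel."""
    hole = hole.astype(bool)
    hole_yx = np.argwhere(hole)      # coordinates of hole pixels
    valid_yx = np.argwhere(~hole)    # coordinates of non-hole pixels
    peel = np.zeros_like(hole)
    if len(hole_yx) == 0 or len(valid_yx) == 0:
        return peel
    # Brute-force pairwise distances: fine for a toy mask, too slow for real frames.
    diff = hole_yx[:, None, :] - valid_yx[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    selected = hole_yx[nearest <= p]
    peel[selected[:, 0], selected[:, 1]] = True
    return peel

def count_peel_steps(hole: np.ndarray, p: float) -> int:
    """Number of iterations needed when each step fills exactly one peel."""
    hole = hole.astype(bool).copy()
    steps = 0
    while hole.any():
        peel = peel_region(hole, p)
        if not peel.any():           # no non-hole pixel to start from
            break
        hole &= ~peel                # stand-in for "the network filled this layer"
        steps += 1
    return steps
```

For a 7×7 hole, a peel width of 1 needs four passes and a width of 2 needs two, which shows how p trades the number of iterations against how much is synthesized per step.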

Network architecture

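First, the target image and the reference images go through the Encoder to produce key and value feature maps. As the names suggest, the keys are used to find corresponding pixels between the target and the references: in the Asymmetric Attention Block, each key feature in the layer being filled is matched against every valid (non-hole) key feature in the references. The result of the matching is a spatio-temporal attention map, which tells which pixel in which frame is important for filling each pixel of the peel region. According to this attention map, the value features of the reference images are retrieved and added to the value features of the peel region. The Decoder then takes the updated target value features and the peel region mask and fills the peel region of the target.
The overall structure is clean and simple, with three main parts: the Encoder, the Asymmetric Attention Block, and the Decoder.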

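Encoder: the inputs are the RGB image (3 channels, with the hole region painted gray), the hole mask (1 channel), and the validity mask (1 channel), concatenated into a 5-channel image. The encoder downsamples by a factor of 4 and outputs the key features and the value features in parallel. All images share the same Encoder.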
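As a small illustration of how the 5-channel encoder input could be assembled, here is a NumPy sketch. The function name `make_encoder_input`, the channel order, and the gray level 0.5 (for images scaled to [0, 1]) are my assumptions, not details taken from the paper or its code.

```python
import numpy as np

def make_encoder_input(rgb: np.ndarray, hole: np.ndarray, valid: np.ndarray,
                       gray: float = 0.5) -> np.ndarray:
    """Build an (H, W, 5) input from an (H, W, 3) image and two (H, W) masks.

    gray=0.5 is an assumed fill value for hole pixels.
    """
    hole_c = hole.astype(rgb.dtype)[..., None]     # (H, W, 1)
    valid_c = valid.astype(rgb.dtype)[..., None]   # (H, W, 1)
    # Paint the hole gray so the original content there cannot leak in.
    masked_rgb = rgb * (1.0 - hole_c) + gray * hole_c
    return np.concatenate([masked_rgb, hole_c, valid_c], axis=-1)
```

For a fresh target the validity mask is just the complement of the hole mask; the two differ once part of the hole has been filled, because filled pixels are no longer holes but are not genuine pixels either.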

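Asymmetric Attention Block: this is probably the key part of the network. It decides how the peel region of the target is filled from the valid regions of the references in the current iteration. Its inputs are the key and value features of the target and of the references, and it proceeds as follows.

The target feature maps have size h×w×128 (h and w are one quarter of the input height and width, and 128 is the feature dimension), and the reference feature maps have size n×h×w×128 (n is the number of references). The target pixels in the peel region to be filled in this iteration are gathered by indexing with P, giving a key matrix r and a value matrix q, each of size c×128 (c is the number of pixels in the peel region). The references are indexed in the same way with their validity map V, giving k and v, each of size m×128 (m is the total number of valid pixels across all references). This simplifies the computation that follows.

Multiplying r by the transpose of k gives a c×m matrix. Its (i, j) entry is the inner product of the i-th key feature in P and the j-th key feature in V, two 128-dimensional vectors. This acts as a similarity score (it equals the cosine similarity only if the key vectors are L2-normalized) and indicates how much the j-th valid pixel contributes to recovering the i-th peel pixel. A softmax over the m reference pixels normalizes each row of this matrix, giving the matching score map s. My impression is that this lets every pixel in P be expressed as a weighted average of the pixels in V.

Multiplying s (c×m) by v (m×128) gives a c×128 matrix u. This is the retrieval step: the value feature at each peel pixel is estimated as a weighted sum of the value features of the matching valid pixels. The retrieved u is added to the original q to give the final value features, which are written back to the corresponding positions of the original target value feature map.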
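The steps above can be sketched in a few lines of NumPy. This is only an illustration of the matrix shapes and the data flow, using the symbols r, q, k, v, s, u from the description. The function name `asymmetric_attention` is hypothetical, the similarity is a plain dot product with no scaling or normalization (an assumption), and the sketch assumes at least one valid reference pixel.

```python
import numpy as np

def asymmetric_attention(tgt_key, tgt_val, peel, ref_key, ref_val, ref_valid):
    """
    tgt_key, tgt_val: (h, w, d) target key / value maps
    peel:             (h, w) bool, peel region P of the target
    ref_key, ref_val: (n, h, w, d) reference key / value maps
    ref_valid:        (n, h, w) bool, validity maps V of the references
    Returns the updated target value map and the score matrix s.
    """
    r = tgt_key[peel]            # (c, d) keys of peel pixels
    q = tgt_val[peel]            # (c, d) values of peel pixels
    k = ref_key[ref_valid]       # (m, d) keys of valid reference pixels
    v = ref_val[ref_valid]       # (m, d) values of valid reference pixels

    scores = r @ k.T                                   # (c, m) dot-product similarity
    scores = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    s = np.exp(scores)
    s = s / s.sum(axis=1, keepdims=True)               # softmax over reference pixels

    u = s @ v                    # (c, d) retrieved values, a weighted average of v
    out = tgt_val.copy()
    out[peel] = q + u            # write back only at peel positions
    return out, s
```

The block is asymmetric in the sense that queries come only from the peel pixels of the target while keys and values come only from valid reference pixels, so the score matrix is c×m rather than a full (h·w)×(n·h·w) map.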

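Decoder: it takes the h×w×128 value feature map produced by the previous block and reconstructs the peel region. This is a plain decoding step, so the design roughly mirrors the encoder. Nearest-neighbor upsampling is used to enlarge the feature maps. The peel region is then cut out of the decoder's raw output image and merged with the valid region of the input target image, and that is the result of this iteration.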
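Two small pieces of this step are easy to show in NumPy: nearest-neighbor upsampling and the final merge. This is a sketch only; the function names are mine, and the convolution layers that sit between upsampling steps in the real decoder are omitted.

```python
import numpy as np

def upsample_nearest(feat: np.ndarray, factor: int = 2) -> np.ndarray:
    """Nearest-neighbor upsampling of an (h, w, c) map by an integer factor."""
    return np.repeat(np.repeat(feat, factor, axis=0), factor, axis=1)

def merge_peel(decoded: np.ndarray, target: np.ndarray,
               hole: np.ndarray, peel: np.ndarray):
    """Paste decoded pixels into the peel region and keep the target elsewhere.

    decoded, target: (H, W, 3); hole, peel: (H, W) bool.
    Returns the merged image and the hole mask left for the next iteration.
    """
    merged = np.where(peel[..., None], decoded, target)
    remaining_hole = hole & ~peel
    return merged, remaining_hole
```

Merging this way guarantees that pixels outside the peel are passed through untouched, so each iteration can only change the layer it is responsible for.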

Training

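The loss functions are designed to enforce both per-pixel reconstruction accuracy and visual similarity. In every iteration, the L1 distance to the ground truth (GT) is minimized in pixel space and in deep feature space.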

Pixel loss: this one is simple, so I will just show the formula from the paper.

The losses on the peel region and on the valid region are computed separately so that weights can later be set to focus training more on recovering the peel area.
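A minimal sketch of such a split, region-weighted L1 loss is shown below. It is not the paper's exact formula: normalizing by the number of masked elements is my assumption, and the weights `w_peel` and `w_valid` are hypothetical parameters that the caller has to supply.

```python
import numpy as np

def masked_l1(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Mean absolute error over the pixels selected by an (H, W) mask."""
    m = mask.astype(pred.dtype)[..., None]          # (H, W, 1)
    denom = max(float(m.sum()) * pred.shape[-1], 1.0)  # avoid division by zero
    return float(np.abs(m * (pred - gt)).sum() / denom)

def pixel_loss(pred, gt, peel, valid, w_peel: float, w_valid: float) -> float:
    """Weighted sum of the peel-region and valid-region L1 terms.

    w_peel and w_valid are hypothetical weights, not values from the paper.
    """
    return w_peel * masked_l1(pred, gt, peel) + w_valid * masked_l1(pred, gt, valid)
```

Choosing `w_peel` larger than `w_valid` puts the emphasis on the newly synthesized layer, which is the motivation given above for keeping the two terms separate.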

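Perceptual loss: content similarity and style similarity, which reminds me of the style transfer work I did as an undergraduate.

Content similarity is computed on feature maps extracted with VGG-16. The Gram matrix used for style similarity was proposed by Gatys et al. for style transfer (Image style transfer using convolutional neural networks).
The pixel loss is computed on the raw decoder output, whereas the perceptual loss is computed on the output after merging with the target.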
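For readers who have not met the Gram matrix before, here is a short NumPy sketch of it and of an L1 style distance built on it. The feature maps would come from VGG-16 in practice; here they are just arrays. The normalization by h·w·c and the function names are my assumptions.

```python
import numpy as np

def gram_matrix(feat: np.ndarray) -> np.ndarray:
    """Channel-by-channel correlation of an (h, w, c) feature map, shape (c, c)."""
    h, w, c = feat.shape
    f = feat.reshape(h * w, c)          # one row per spatial position
    return f.T @ f / (h * w * c)        # assumed normalization

def style_l1(feat_pred: np.ndarray, feat_gt: np.ndarray) -> float:
    """Mean L1 distance between the Gram matrices of two feature maps."""
    return float(np.abs(gram_matrix(feat_pred) - gram_matrix(feat_gt)).mean())
```

Because the Gram matrix sums over all spatial positions, it keeps which channels fire together but discards where they fire, which is why it captures texture or style rather than layout.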

Total loss:

The last term is a parameter regularization term to prevent overfitting (this is my own guess).

Because video is processed frame by frame, the output can show temporal artifacts such as ghosting or flicker, so an extra temporal consistency network is added as post-processing. (an encoder-decoder network equipped with a convolutional GRU at the core is trained to balance between the temporal stability with the previous frame and the perceptual similarity with the current frame. We modified the original method to match our need which is to stabilize the inpainted contents.) (The paper says the details are in the supplementary materials, but I could not find them; I will come back to this when I read the code.)

Inference

I will quote the paper directly here instead of paraphrasing.
We present two applications of the proposed onion-peel network: reference-guided image completion and video completion. For the image completion, reference images are embedded into key and value feature maps through the encoder. Then, the whole network (encoder, asymmetric attention block, and decoder) is applied to the target image recursively to fill the peel region iteratively until the hole is completely filled. For the video completion, the procedure of the image inpainting is looped over every frame sequentially. In the case of video, a set of the reference images is sampled from the video. In our implementation, we sampled every 5-th frame as the references. The detailed procedure of the video inpainting is described in Alg. 1. Image completion is a specific case of the video completion with one target image and a set of reference images. Here, we define i-th video frame as Xi and its hole mask as Hi. The peel region P is defined as the hole pixels that are within the Euclidean distance of p to the nearest non-hole pixel. We set p to 8. The validity map Vi indicates genuine non-hole pixels that are not filled by the algorithm. Because the reference frames are not changed during inpainting a video, we run the encoder for the reference frames only once.
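The video procedure quoted above can be outlined as a loop. The sketch below is only a skeleton under my own assumptions: `encode` and `fill_peel` are hypothetical callables standing in for the encoder and for one full encoder, attention, decoder pass on the target, sampling starts at frame 0, and `max_iters` is a safety cap that I added. It does reflect two points from the quote: every 5-th frame is used as a reference, and the references are encoded only once.

```python
from typing import Any, Callable, List, Sequence, Tuple
import numpy as np

def complete_video(
    frames: Sequence[np.ndarray],
    holes: Sequence[np.ndarray],
    encode: Callable[[np.ndarray, np.ndarray], Any],
    fill_peel: Callable[[np.ndarray, np.ndarray, List[Any]],
                        Tuple[np.ndarray, np.ndarray]],
    stride: int = 5,
    max_iters: int = 1000,
) -> List[np.ndarray]:
    """Fill every frame peel by peel, reusing reference features across frames."""
    ref_idx = range(0, len(frames), stride)        # every stride-th frame
    # References do not change during inpainting, so encode them once.
    ref_feats = [encode(frames[i], holes[i]) for i in ref_idx]

    outputs = []
    for frame, hole in zip(frames, holes):
        for _ in range(max_iters):
            if not hole.any():                     # hole completely filled
                break
            # One recursion step: fills the current peel and shrinks the hole.
            frame, hole = fill_peel(frame, hole, ref_feats)
        outputs.append(frame)
    return outputs
```

Image completion is the same loop with a single target frame and the given reference images in place of the sampled frames.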

Experiments

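The paper compares OPN with other methods and shows that layer-by-layer reconstruction gives higher-quality results than filling the hole in one shot. It also shows that the post-processing step does make the temporal profile smoother, at the cost of a tendency to blur the frames.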