
  • deep image matting(2017)
  • Context-Aware Image Matting for Simultaneous Foreground and Alpha Estimation(2019)
  • Indices matter
  • Semantic Human Matting
  • Towards Real-Time Automatic Portrait Matting on Mobile Devices
  • A Late Fusion CNN for Digital Matting
  • Learning to Composite Context-Realistic Data for Image Matting
  • Real-time deep hair matting on mobile devices
  • Efficient Semantic Video Segmentation with Per-frame Inference
  • Soft Instance Segmentation
  • SOLOv2
  • SOLOv1

  • pymatting: this site is a very good resource

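$I = \alpha \times F + (1-\alpha) \times B$
The equation above says that an image $I$ is obtained by blending a foreground $F$ and a background $B$ through $\alpha$, called the alpha matte, which acts as a mask. Image matting is therefore usually called alpha matting: given an image, predict $\alpha$.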
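To make the compositing equation concrete, here is a minimal NumPy sketch. The function name `composite` and the toy 2x2 arrays are illustrative assumptions, not taken from any of the papers.

```python
import numpy as np


def composite(fg, bg, alpha):
    """Blend foreground over background: I = alpha * F + (1 - alpha) * B."""
    alpha = alpha[..., None]  # (H, W) -> (H, W, 1) so it broadcasts over RGB
    return alpha * fg + (1.0 - alpha) * bg


fg = np.ones((2, 2, 3))           # white foreground (toy data)
bg = np.zeros((2, 2, 3))          # black background (toy data)
alpha = np.array([[1.0, 0.5],
                  [0.25, 0.0]])   # per-pixel opacity in [0, 1]
image = composite(fg, bg, alpha)
```

Matting is the inverse direction: only `image` is observed, and `alpha` (and possibly `fg`) must be estimated.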

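  • But how do we get the foreground from $\alpha$? Is that yet another ill-posed problem? Is the foreground simply whatever lies where $\alpha \neq 0$? Apparently not either; I found no conclusive answer.
    $\alpha I = \alpha F$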


  • Bayesian matting
  • KNN matting
  • closed-form matting

  • I have not reproduced these algorithms, so I won't go into detail

  • Deep-learning-based methods

deep image matting(2017)

  • Architecture

  • loss

    The loss function is alpha loss + composition loss
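As a rough sketch of how the two terms could be combined in PyTorch: the alpha loss compares predicted and ground-truth alpha, and the composition loss compares the image re-composited with the predicted alpha against the input image. The smooth square-root form, the `eps` value and the 0.5 weighting follow my reading of the paper and should be treated as assumptions; the paper's restriction to the unknown trimap region is omitted here.

```python
import torch


def alpha_loss(alpha_pred, alpha_gt, eps=1e-6):
    # Smooth absolute difference: sqrt(diff^2 + eps^2)
    return torch.sqrt((alpha_pred - alpha_gt) ** 2 + eps ** 2).mean()


def composition_loss(alpha_pred, fg, bg, image, eps=1e-6):
    # Re-composite with the predicted alpha and compare with the real image
    pred_image = alpha_pred * fg + (1 - alpha_pred) * bg
    return torch.sqrt((pred_image - image) ** 2 + eps ** 2).mean()


def matting_loss(alpha_pred, alpha_gt, fg, bg, image, w=0.5):
    # w balances the two terms (assumed value)
    return (w * alpha_loss(alpha_pred, alpha_gt)
            + (1 - w) * composition_loss(alpha_pred, fg, bg, image))


# Toy tensors: alpha is (N, 1, H, W), colour images are (N, 3, H, W)
alpha_gt = torch.rand(1, 1, 4, 4)
fg, bg = torch.rand(1, 3, 4, 4), torch.rand(1, 3, 4, 4)
image = alpha_gt * fg + (1 - alpha_gt) * bg
loss = matting_loss(alpha_gt, alpha_gt, fg, bg, image)  # perfect prediction
```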

Context-Aware Image Matting for Simultaneous Foreground and Alpha Estimation(2019)


  • architecture

    matting encoder - local feature
    context encoder - global context information (larger down-sampling factor)
  • loss
    L_lap: decompose alpha into a Laplacian pyramid and compute the loss at multiple scales
    VGG16 perceptual loss
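A minimal sketch of a Laplacian pyramid loss on alpha, to illustrate the multi-scale idea. Assumptions: 2x average pooling stands in for the Gaussian blur, five levels are used, and level i is weighted by 2^i; the exact kernel and weights in the paper may differ.

```python
import torch
import torch.nn.functional as F


def laplacian_pyramid(x, levels=5):
    """Return `levels` band-pass images; the last entry is the low-res residual."""
    pyramid = []
    current = x
    for _ in range(levels - 1):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, size=current.shape[-2:],
                           mode="bilinear", align_corners=False)
        pyramid.append(current - up)  # detail lost by down-sampling
        current = down
    pyramid.append(current)
    return pyramid


def laplacian_loss(alpha_pred, alpha_gt, levels=5):
    pred_pyr = laplacian_pyramid(alpha_pred, levels)
    gt_pyr = laplacian_pyramid(alpha_gt, levels)
    # Coarser levels get larger weights (assumed 2^i)
    return sum((2 ** i) * F.l1_loss(p, g)
               for i, (p, g) in enumerate(zip(pred_pyr, gt_pyr)))


alpha_pred = torch.rand(1, 1, 64, 64)
alpha_gt = torch.rand(1, 1, 64, 64)
loss = laplacian_loss(alpha_pred, alpha_gt)
```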

Indices matter



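  • The IndexNet module can be plugged into any sampling operation; to me it looks similar to an attention operation

    The idea is that an index function is computed from the feature map as a piece of information, and is applied to the features during up-sampling or down-sampling so that they retain more information

There are also methods such as guided contextual attention that take image + trimap as input and output the alpha matte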


Semantic Human Matting

  • Architecture

  • fusion module
  • loss



Towards Real-Time Automatic Portrait Matting on Mobile Devices

  • Architecture

  • encoder
    depthwise separable ASPP module
  • EncoderBlock
    import torch
    import torch.nn as nn


    def multiply_depth(depth, depth_multiplier, min_depth=8, divisor=8):
        '''Scale depth by depth_multiplier, rounded to a multiple of divisor (at least min_depth).'''
        multiplied_depth = round(depth * depth_multiplier)
        divisible_depth = (multiplied_depth + divisor // 2) // divisor * divisor
        return max(min_depth, divisible_depth)


    class EncoderBlock(nn.Module):
        '''One branch per dilation rate:
        conv_1x1 + [stride=2: sep_conv] + sep_conv_dilation'''

        def __init__(self, input_depth, expanded_depth, output_depth, depth_multiplier, rates, stride, shortcut_depth=None):
            super(EncoderBlock, self).__init__()
            input_depth = multiply_depth(input_depth, depth_multiplier)
            if shortcut_depth is not None:
                # The shortcut is concatenated to the input, so its channels are added
                input_depth = input_depth + multiply_depth(shortcut_depth, depth_multiplier)
            expanded_depth = multiply_depth(expanded_depth, depth_multiplier)
            output_depth = multiply_depth(output_depth, depth_multiplier)
            self.op = nn.ModuleList()
            num_of_branch = 0
            for rate in rates:
                # 1x1 expansion
                conv = [nn.Conv2d(input_depth, expanded_depth, 1), nn.LeakyReLU(0.2, True)]
                if stride > 1:
                    # Strided depthwise 3x3 for down-sampling
                    conv += [nn.Conv2d(expanded_depth, expanded_depth, kernel_size=3, stride=stride, padding=1, groups=expanded_depth), nn.LeakyReLU(0.2, True)]
                # Dilated depthwise 3x3
                conv += [nn.Conv2d(expanded_depth, expanded_depth, kernel_size=3, padding=rate, dilation=rate, groups=expanded_depth), nn.LeakyReLU(0.2, True)]
                self.op.append(nn.Sequential(*conv))
                num_of_branch += 1
            # Fuse the concatenated branches with a 1x1 convolution
            self.out = nn.Conv2d(expanded_depth * num_of_branch, output_depth, 1)

        def forward(self, x):
            ops_out = [ops(x) for ops in self.op]
            ops_out = torch.cat(ops_out, dim=1)
            out = self.out(ops_out)
            return out
  • decoder
    1x1 convolution + bilinear interpolation
  • refinement
    depthwise + pointwise
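For reference, a depthwise + pointwise (depthwise separable) convolution can be sketched as below. The class name `SeparableConv` and the channel sizes are illustrative assumptions, not the paper's refinement block.

```python
import torch
import torch.nn as nn


class SeparableConv(nn.Module):
    """3x3 depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""

    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        # groups=in_ch makes the 3x3 convolution act on each channel separately
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=in_ch)
        # The 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


layer = SeparableConv(16, 32)
out = layer(torch.randn(1, 16, 24, 24))
n_params = sum(p.numel() for p in layer.parameters())
```

With 16 input and 32 output channels this uses 704 parameters, versus 4640 for a regular 3x3 convolution, which is why the design suits mobile devices.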

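  • loss

    First term: alpha loss (L1 loss)
    Second term: composition loss (L1 loss)
    Third term: cross-entropy loss
    Fifth term: not used for now; it is the cross-entropy between the output of the last encoder layer (i.e. #encoder10) and the down-sampled ground-truth alpha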

  • Data preprocessing
  1. Resize while preserving the aspect ratio
  2. After rotation, crop the valid region
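A small sketch of step 1, computing an output size that keeps the aspect ratio. Scaling the shorter side to a fixed target is an assumption made here for illustration; the notes do not say which side is fixed.

```python
def resize_keep_aspect(width, height, target_short_side):
    """Return (new_width, new_height) with the shorter side scaled to the target."""
    scale = target_short_side / min(width, height)
    return round(width * scale), round(height * scale)


landscape = resize_keep_aspect(640, 480, 240)
portrait = resize_keep_aspect(480, 640, 240)
```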

Next I plan to reproduce late fusion matting and will look at the details then; it is also an end-to-end matting paper

A Late Fusion CNN for Digital Matting

It uses two decoder branches + a fusion branch

This design provides more degrees of freedom than a single decoder branch for the network to obtain better alpha values during training

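The authors argue that letting two branches predict foreground and background separately gives the alpha matte more degrees of freedom. Is this similar to MMNet, whose final prediction also has two channels? MMNet, however, computes its loss on the softmax output. I need to read on.

  • No time to read further for now; I'll fill this in when I can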

Learning to Composite Context-Realistic Data for Image Matting


Real-time deep hair matting on mobile devices

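  1. Depthwise convolutions, unsurprisingly
  2. Skip connections use only 1x1 convolutions, and concatenation is replaced by addition
  3. Down-sampled by 32x with no dilated convolutions; perhaps that much down-sampling is already enough
  4. Inputs normalized to 0-1, Adadelta optimizer, lr=1?, weight_decay=2e-5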
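Points 2 and 4 can be sketched as follows. The module name `AddSkip` and the channel sizes are hypothetical; `lr=1.0` is PyTorch's default for Adadelta and matches the guess in the notes, but whether the paper used exactly this value is not confirmed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AddSkip(nn.Module):
    """Skip connection that projects encoder features with a 1x1 conv and adds them."""

    def __init__(self, skip_ch, dec_ch):
        super().__init__()
        self.project = nn.Conv2d(skip_ch, dec_ch, kernel_size=1)

    def forward(self, decoder_feat, skip_feat):
        # Up-sample the decoder features to the skip resolution, then add (no concat)
        up = F.interpolate(decoder_feat, size=skip_feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        return up + self.project(skip_feat)


block = AddSkip(skip_ch=24, dec_ch=64)
out = block(torch.randn(1, 64, 16, 16), torch.randn(1, 24, 32, 32))

# Optimizer settings from the notes (lr is marked as uncertain there)
optimizer = torch.optim.Adadelta(block.parameters(), lr=1.0, weight_decay=2e-5)
```

Adding instead of concatenating keeps the decoder channel count fixed, which saves computation in the following layers.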

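  • For video frames, however, temporal consistency also seems to be a very important part. How should segmentation/matting on video frames be done?
    A Google news post mentions that the previous frame's result can be used to guide the segmentation of the next frame, but this increases the amount of computation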

Efficient Semantic Video Segmentation with Per-frame Inference

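Temporal consistency is used during training

  • Previous work mainly uses two approaches to improve video segmentation performance
  1. Post-processing, or extra modules that exploit information across multiple frames
  2. A keyframe policy: spend more computation on keyframes and then propagate their information. This causes two problems: the computation is unbalanced across frames, and the propagated information becomes less accurate the farther a frame is from the keyframe
  • This paper uses a loss called the motion guided temporal loss
  • temporal consistency knowledge distillation
  • Architecture
    PF/MF refers to temporal consistency knowledge distillation
    temporal consistency loss refers to the temporal loss below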
  • Temporal loss

  • PF/MF

  • AT(attention) operator

  • PF

  • MF
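    Extract the features of the last layer before classification, compute the self-similarity map, then pass it through an LSTM to obtain an embedding

    The student (S) and teacher (T) networks share the LSTM parameters. The paper notes that the model can easily collapse when the LSTM weights become 0, so it applies weight clipping and enlarges E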
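A rough sketch of the MF pipeline described above: a per-frame self-similarity map, then an LSTM over the frame sequence. Cosine similarity, the flattening of each map into one vector, and the hidden size of 32 are assumptions for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def self_similarity_map(features):
    """features: (N, C, H, W) -> (N, H*W, H*W) cosine similarity between pixels."""
    n, c, h, w = features.shape
    flat = F.normalize(features.reshape(n, c, h * w), dim=1)  # unit vector per pixel
    return torch.bmm(flat.transpose(1, 2), flat)


# Three frames of toy features from the layer before the classifier
frames = [torch.randn(1, 8, 4, 4) for _ in range(3)]
sims = torch.stack([self_similarity_map(f).flatten(1) for f in frames], dim=1)

# One LSTM shared by student and teacher would consume this sequence
lstm = nn.LSTM(input_size=sims.shape[-1], hidden_size=32, batch_first=True)
_, (h_n, _) = lstm(sims)
embedding = h_n[-1]  # (1, 32) embedding of the whole clip
```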

  • optimization

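  1. Cross-entropy loss
  2. SF: single-frame knowledge distillation loss
  3. tl: temporal loss, a motion constraint between two frames
  4. PF: AT constraint between S and T on the same frame
  5. MF: distillation loss between the embeddings that S and T obtain through the LSTM over all frames of the video
  6. $\lambda = 0.1$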
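The temporal loss in item 3 can be sketched as follows: warp the previous frame's prediction to the current frame with optical flow, then penalize the difference on non-occluded pixels. The flow direction (current frame to previous frame, in pixels), the squared error, and the externally supplied occlusion mask are assumptions of this sketch; the flow itself would come from a separate estimator.

```python
import torch
import torch.nn.functional as F


def warp(x, flow):
    """Backward-warp x (N, C, H, W) using flow (N, 2, H, W) in pixels (dx, dy)."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=x.dtype, device=x.device),
        torch.arange(w, dtype=x.dtype, device=x.device),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # grid_sample expects coordinates normalized to [-1, 1]
    grid_x = 2.0 * grid_x / (w - 1) - 1.0
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(x, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)


def temporal_loss(pred_t, pred_prev, flow, occlusion_mask):
    """Mean squared difference between current and warped previous predictions."""
    warped_prev = warp(pred_prev, flow)
    return (occlusion_mask * (pred_t - warped_prev) ** 2).mean()


pred = torch.rand(1, 2, 6, 8)        # toy per-class probabilities
zero_flow = torch.zeros(1, 2, 6, 8)  # no motion
mask = torch.ones(1, 1, 6, 8)        # nothing occluded
loss = temporal_loss(pred, pred, zero_flow, mask)
```

Because the loss is only used in training, inference stays per-frame with no extra cost, which is the point of the paper's title.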

In practice, I only trained with the MMNet network architecture, plus the DeepLab from hair matting

Soft Instance Segmentation

soft instance segmentation - SOFI
It turns out that it not only improves segmentation accuracy but also achieves good results on matting

  • simple instance segmentation framework SOLO [30], termed Soft SOLO (SOSO)
  • Dataset
    The dataset contains a total of 200 images and 537 high-quality alpha mattes, termed as SOFI-200

  • Method
  • SOLO
    segment objects by location. It conceptually divides the input image into S × S grids. If the center of an object falls into a grid cell, that grid cell is responsible for predicting the object class as well as assigning the per-pixel location categories.
  • SOLOv2 [30]
    takes a step further by predicting the mask features and convolution kernels separately. Overall, they take an image as input and generate a set of object class probabilities and corresponding binary masks
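The grid assignment described for SOLO can be illustrated with a few lines of plain Python. The function names are hypothetical, and the channel index `k = i * S + j` follows my reading of the SOLO paper.

```python
def grid_cell_for_center(cx, cy, img_w, img_h, S):
    """Return (row, col) of the S x S grid cell containing the object center."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def location_category(row, col, S):
    """Index of the mask channel responsible for grid cell (row, col)."""
    return row * S + col


row, col = grid_cell_for_center(cx=320, cy=240, img_w=640, img_h=480, S=5)
k = location_category(row, col, S=5)
```

An object centered in a 640x480 image lands in cell (2, 2) of a 5x5 grid, so channel 12 is responsible for its mask.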

  • network architecture
    Based on the SOLOv2 framework
    backbone + FPN + prediction head (object category branch + mask kernel branch) [shared weights for different levels]
  • mask feature branch
  1. fuses the feature maps into a 1/4-scale map
  2. add Low-level Module (LLM)

  3. 1x1 convolution from 1/4 scale to 1/1 scale?
  4. The LLM output is expanded to 256 channels by a 1x1 convolution, added to the earlier 1/4 features, and the result is then convolved with the pred kernel
  5. What does the pred kernel refer to here?
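On the last question: in SOLOv2, as I understand it, the kernel branch predicts for each grid cell the weights of a small convolution, and these predicted weights are applied to the shared mask features to produce one mask per cell. A minimal sketch with 1x1 kernels; the shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F


def apply_predicted_kernels(mask_features, kernels):
    """mask_features: (1, C, H, W); kernels: (N, C), one 1x1 kernel per candidate.

    Returns (1, N, H, W): one mask logit map per predicted kernel.
    """
    n, c = kernels.shape
    weight = kernels.view(n, c, 1, 1)  # treat predictions as conv weights
    return F.conv2d(mask_features, weight)


mask_features = torch.randn(1, 256, 40, 40)  # shared mask feature map
kernels = torch.randn(3, 256)                # kernels predicted for 3 grid cells
masks = torch.sigmoid(apply_predicted_kernels(mask_features, kernels))
```

The convolution weights are therefore an output of the network rather than fixed parameters, which is what makes the mask head dynamic.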

  • Loss

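    $L_{cate}$ is the focal loss for classification
    $L_{matte}$ is the loss for soft matte prediction

    It combines mean squared error and the Dice coefficient; the later ablation study shows that neither can be dropped
  • Both cate and dice here are classification-style losses; cate can work on probabilities, but what about dice?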
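On the Dice question: the Dice coefficient is well defined for real-valued predictions and targets in [0, 1], so it can supervise soft mattes directly. A sketch of MSE + Dice, using the squared-sum denominator as in SOLO; the equal weighting of the two terms is an assumption.

```python
import torch


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss; pred and target are (N, H, W) with values in [0, 1]."""
    pred, target = pred.flatten(1), target.flatten(1)
    inter = (pred * target).sum(dim=1)
    denom = (pred ** 2).sum(dim=1) + (target ** 2).sum(dim=1)
    return (1 - 2 * inter / (denom + eps)).mean()


def matte_loss(pred, target):
    # Equal weighting of the two terms is assumed here
    return torch.mean((pred - target) ** 2) + dice_loss(pred, target)


soft_matte = torch.rand(2, 16, 16)
perfect = matte_loss(soft_matte, soft_matte)
worst = matte_loss(torch.zeros(2, 16, 16), torch.ones(2, 16, 16))
```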

  • training
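  1. During training the category branch weights are frozen, to preserve good object recognition ability
  2. A mixed training strategy is used: both matting and instance segmentation datasets are used, the former supervising the matte and the latter supervising the instances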

  • inference
    the prediction branches execute in parallel after the backbone network and FPN, producing category scores, predicted convolution kernels and mask features
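    I don't quite follow the sentence above: aren't the prediction branches just category and mask features?
  1. The category score p_i,j at grid (i, j) is thresholded at 0.1 to filter out low-confidence predictions??
  2. The mask features go through a sigmoid to produce the output
  3. Finally, Matrix NMS removes redundant predictions; what is this operation??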


SOLOv2

  • Dynamic
    dynamically learning the mask head of the object segmenter such that the mask head is conditioned on the location. Specifically, the mask branch is decoupled into a mask kernel branch and mask feature branch, which are responsible for learning the convolution kernel and the convolved features respectively
    The mask head of the object segmenter is learned dynamically, and the mask head is conditioned on location
    The mask branch is decoupled into a mask kernel branch and a mask feature branch, which learn the convolution kernel and the convolved features respectively
  • Matrix NMS (non-maximum suppression)
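As far as I understand it, Matrix NMS does not delete overlapping masks one by one. It computes all pairwise mask IoUs at once and decays each score according to its overlap with higher-scoring masks. The sketch below uses the Gaussian decay and ignores class labels (the original only compares masks of the same class); `sigma=2.0` is an assumed value.

```python
import torch


def matrix_nms(masks, scores, sigma=2.0):
    """masks: (N, H, W) binary, scores: (N,). Returns (decayed scores, sort order)."""
    order = torch.argsort(scores, descending=True)
    masks, scores = masks[order].float(), scores[order]
    n = masks.shape[0]
    flat = masks.reshape(n, -1)
    inter = flat @ flat.t()
    areas = flat.sum(dim=1)
    union = areas[:, None] + areas[None, :] - inter
    # iou[i, j] is kept only for i < j, i.e. mask i scores higher than mask j
    iou = (inter / union.clamp(min=1e-6)).triu(diagonal=1)
    # Largest overlap of each mask with any higher-scoring mask
    max_iou = iou.max(dim=0).values
    compensate = max_iou[:, None].expand(n, n)
    # Gaussian decay, compensated by how suppressed the suppressor itself is
    decay = torch.exp(-sigma * iou ** 2) / torch.exp(-sigma * compensate ** 2)
    decay = decay.min(dim=0).values
    return scores * decay, order


left = torch.zeros(4, 4)
left[:, :2] = 1
right = torch.zeros(4, 4)
right[:, 2:] = 1
masks = torch.stack([left, left, right])  # the first two masks are duplicates
scores = torch.tensor([0.9, 0.8, 0.7])
new_scores, order = matrix_nms(masks, scores)
```

The duplicate mask keeps only a small fraction of its score, while the disjoint mask is untouched; a final score threshold then removes the redundant prediction.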

It improves on SOLOv1's mask learning and mask NMS
I can't follow it yet, so let's look at SOLOv1 first


SOLOv1

mainstream approaches either follow the “detect-then-segment” strategy, as used by, e.g., Mask R-CNN, or predict embedding vectors first then use clustering techniques to group pixels into individual instances

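In short: mainstream methods either detect first and then segment (e.g. Mask R-CNN), or predict embedding vectors and then cluster pixels into instances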

We view the task of instance segmentation from a completely new perspective by introducing the notion of “instance categories”, which assigns categories to each pixel within an instance according to the instance’s location and size, thus nicely converting instance mask segmentation into a classification-solvable problem

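In other words, the paper approaches instance segmentation through the notion of "instance categories": each pixel is classified according to the instance's location and size, which turns instance mask segmentation into a problem that classification can solve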



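  • introduction
  • motivation
    In the COCO dataset, 98.3% of instance pairs have a center distance greater than 30 pixels, and of the remaining 1.7%, 40.5% have a size ratio greater than 1.5x. So can we assume that different instances have either different center locations or different sizes? Based on this observation, can we distinguish instances directly by center location and object size?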


  • SOLO: Segment objects by locations
  • Locations
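    Divide an image into S x S grid cells, which gives S squared center-location classes
    So the category branch output is no longer H×W×C but S×S×C? That doesn't seem quite right. Why classify at all? Couldn't we just predict one mask per class? But then it would be the same as semantic segmentation. Is the category branch what distinguishes instances? How does this relate to multi-object detection? What is the output of object detection?
    Meanwhile, each output channel map predicts the instance mask that belongs to the current category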


