Uformer: A General U-Shaped Transformer for Image Restoration
Contents
(1) Encoder
(2) Bottleneck stage (Figure 1, the two LeWin Transformer blocks at the bottom)
(3) Decoder
LeWin Transformer Block
1. Window-based Multi-head Self-Attention (W-MSA)
2. Locally-enhanced Feed-Forward Network (LeFF)
3.3 Variants of Skip-Connection
1. Concatenation-based Skip-connection (Concat-Skip)
2. Cross-attention as Skip-connection (Cross-Skip)
3. Concatenation-based Cross-attention as Skip-connection (ConcatCross-Skip)
5 Conclusions
Limitation and Broader Impacts
(1) Encoder
For example, given the input feature maps X0 ∈ R^{C×H×W}, the l-th stage of the encoder produces the feature maps Xl ∈ R^{2^l·C × (H/2^l) × (W/2^l)}.
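The stage-by-stage feature sizes can be sketched with a shape-only trace (a minimal sketch, assuming each down-sampling layer halves the spatial resolution and doubles the channels; the function name is illustrative):

```python
def encoder_shapes(c, h, w, num_stages=4):
    """Feature-map shapes (C, H, W) after the input projection (X0) and after
    each encoder stage, assuming every down-sampling layer halves the spatial
    size and doubles the channel count."""
    shapes = [(c, h, w)]  # X0 from the input projection
    for l in range(1, num_stages + 1):
        shapes.append((c * 2 ** l, h // 2 ** l, w // 2 ** l))
    return shapes

print(encoder_shapes(32, 256, 256))
# [(32, 256, 256), (64, 128, 128), (128, 64, 64), (256, 32, 32), (512, 16, 16)]
```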
(2) Bottleneck stage (Figure 1, the two LeWin Transformer blocks at the bottom)
4. Then, a bottleneck stage with a stack of LeWin Transformer blocks is added at the end of the encoder. In this stage, thanks to the hierarchical structure, the Transformer blocks capture longer-range (even global, when the window size equals the feature-map size) dependencies.
(3) Decoder
2. After that, the features fed into the LeWin Transformer blocks are the up-sampled features together with the corresponding features from the encoder delivered through skip-connections. Next, the LeWin Transformer blocks learn to restore the image.
3. After the K decoder stages, we reshape the flattened features to 2D feature maps and apply a 3×3 convolution layer to obtain a residual image R ∈ R^{3×H×W}.
4. Finally, the restored image is obtained as I′ = I + R.
Remark: In our experiments, we empirically set K = 4, and each stage contains two LeWin Transformer blocks. We train Uformer using the Charbonnier loss.
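The Charbonnier loss mentioned here is a smooth, differentiable variant of L1. A minimal pure-Python sketch (the eps default of 1e-3 is a commonly used value, not taken from the text; a real implementation would operate on tensors):

```python
import math

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss: mean of sqrt((p - t)^2 + eps^2) over all values.
    pred and target are flat lists of pixel values of equal length."""
    assert len(pred) == len(target)
    return sum(math.sqrt((p - t) ** 2 + eps ** 2)
               for p, t in zip(pred, target)) / len(pred)
```

With eps = 0 this reduces to the mean absolute error; the small eps keeps the gradient well-defined at zero residual.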
LeWin Transformer Block
X′_l = W-MSA(LN(X_{l−1})) + X_{l−1}
X_l = LeFF(LN(X′_l)) + X′_l
- where X′_l and X_l are the outputs of the W-MSA module and the LeFF module, respectively.
- LN represents layer normalization.
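The pre-norm residual structure described here can be traced with placeholder callables (a minimal sketch; the arguments stand in for the real LN, W-MSA, and LeFF modules):

```python
def lewin_block(x, ln1, w_msa, ln2, leff):
    """One LeWin Transformer block: pre-norm residual W-MSA,
    followed by pre-norm residual LeFF."""
    x = w_msa(ln1(x)) + x   # X'_l = W-MSA(LN(X_{l-1})) + X_{l-1}
    x = leff(ln2(x)) + x    # X_l  = LeFF(LN(X'_l)) + X'_l
    return x
```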
Explanation:
Here is how the LeWin Transformer block works. A conventional Transformer computes self-attention over all tokens: if the sequence length is N (i.e., there are N tokens) and the embedding dimension is d, the computational complexity is O(N²·d). For classification, images are typically 224×224; splitting them into 16×16 patches yields (224/16)² = 196 patches, i.e., 196 tokens.
Image restoration, however, usually deals with high-resolution images, e.g., 1048×1048, which produce far more patches. Computing attention among all tokens then becomes prohibitively expensive. So the first conclusion is: computing attention among all tokens is not appropriate.
In addition, local context matters a great deal for image restoration. What is local context? It is the information in the patches surrounding a given patch (above, below, left, and right). Why does it matter? Because the pixels neighboring a degraded pixel help reconstruct that pixel. So the second conclusion is: the local context of each patch is important.
To address these issues, the authors propose the locally-enhanced window (LeWin) Transformer block, shown in Figure 2(b). It benefits both from self-attention's ability to model long-range dependencies and from convolution's ability to capture local information.
1. Window-based Multi-head Self-Attention (W-MSA)
4. Then the outputs for all heads {1, 2, …, k} are concatenated and then linearly projected to get the final result.
This module is the first step of the LeWin Transformer block. Specifically, instead of computing self-attention over all tokens, it computes attention within non-overlapping local windows, which reduces the computational cost.
Q: How much computation does Window-based Multi-head Self-Attention (W-MSA) save? With M×M windows on an H×W feature map with C channels, the attention cost drops from O(H²W²C) for global self-attention to O(M²HWC), a reduction by a factor of HW/M².
Benefits:
- W-MSA significantly reduces the computational cost compared with global self-attention.
- On low-resolution feature maps, window-based attention operates over larger relative receptive fields and is sufficient to learn long-range dependencies.
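To make the savings concrete, here is a back-of-the-envelope count of the attention-map cost (only the O(N²·d) attention term; projection layers are ignored, and the example numbers are illustrative):

```python
def global_msa_cost(h, w, c):
    # Global self-attention over all H*W tokens: O((HW)^2 * C)
    return (h * w) ** 2 * c

def window_msa_cost(h, w, c, m):
    # W-MSA: HW / M^2 windows, each attending over M^2 tokens: O(M^2 * HW * C)
    num_windows = (h * w) // (m * m)
    return num_windows * (m * m) ** 2 * c

# Example: a 256x256 feature map, C = 32 channels, window size M = 8
ratio = global_msa_cost(256, 256, 32) / window_msa_cost(256, 256, 32, 8)
print(ratio)  # 1024.0, i.e. exactly HW / M^2
```

When the window size equals the feature-map size (as in the bottleneck stage), the two costs coincide and W-MSA degenerates to global attention.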
2. Locally-enhanced Feed-Forward Network (LeFF)
3. Then we flatten the features back to tokens and shrink the channels via another linear layer to match the dimension of the input channels.
4. We use GELU [52] as the activation function after each linear/convolution layer.
A plain Feed-Forward Network (FFN) has limited ability to exploit local context, yet the information from neighboring pixels is important for image restoration. Following prior work, the authors therefore add a depth-wise convolution to the FFN, as shown in the figure.
- 1. First, a fully-connected (FC) layer increases the feature dimension.
- 2. The sequential 1D features are then reshaped into 2D feature maps,
- 3. and a 3×3 depth-wise convolution models the local information.
- 4. The 2D features are reshaped back into sequential 1D features,
- 5. and an FC layer reduces the feature dimension back to its original value. A GELU activation follows each FC or depth-wise convolution layer.
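The five steps above can be checked with a shape-only trace (a sketch; the 4× expansion ratio is a hypothetical choice, and the depth-wise convolution is assumed to pad so that H and W are preserved):

```python
def leff_shape_trace(h, w, c, expand=4):
    """Tensor shapes through the five LeFF steps; no real layers,
    just the bookkeeping of token count n = H*W and channel width."""
    n = h * w  # number of tokens
    return [
        ("input tokens",           (n, c)),
        ("1. FC expand",           (n, c * expand)),
        ("2. reshape to 2D",       (h, w, c * expand)),
        ("3. 3x3 depth-wise conv", (h, w, c * expand)),  # padding preserves H, W
        ("4. flatten to tokens",   (n, c * expand)),
        ("5. FC shrink",           (n, c)),
    ]
```

Note that the output shape matches the input shape, which is what allows the residual connection around LeFF in the block equations.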
In summary, the locally-enhanced window (LeWin) Transformer block is described by the W-MSA and LeFF computations above.
3.3 Variants of Skip-Connection
To investigate how to deliver the learned low-level features from the encoder to the decoder, three variants of skip-connection are explored.
1. Concatenation-based Skip-connection (Concat-Skip).
1. First, we concatenate the flattened features El from the l-th encoder stage with the features Dl−1 from the (l−1)-th decoder stage.
2. Then, we feed the concatenated features to the W-MSA component of the first LeWin Transformer block in the decoder stage, as shown in Figure 2(c.1).
2. Cross-attention as Skip-connection (Cross-Skip)
2. The first self-attention module in this block (the shaded one) seeks pixel-wise self-similarity in the decoder features Dl−1, while the second attention module takes the features El from the encoder as the keys and values and uses the output of the first module as the queries.
3. Concatenation-based Cross-attention as Skip-connection (ConcatCross-Skip).
Combining the above two variants, we also design another skip-connection. As illustrated in Figure 2(c.3), we concatenate the features El from the encoder and Dl−1 from the decoder as the keys and values, while the queries come only from the decoder.
To study how to better deliver the low-level features in the encoder to the decoder, the authors explore three skip-connection schemes:
- Concatenation-based Skip-connection (Concat-Skip)
- Cross-attention as Skip-connection (Cross-Skip)
- Concatenation-based Cross-attention as Skip-connection (ConcatCross-Skip)
The names sound fancy, but the ideas are quite intuitive, as illustrated in Figure 2(c).
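A shape sketch of the queries/keys/values entering the decoder-side attention under each variant (token count n and channel width c are assumed equal for El and Dl−1; the concatenation axes are my reading of Figure 2(c) and may differ in detail):

```python
def skip_qkv_shapes(n, c, variant):
    """Query/key/value shapes under the three skip-connection variants."""
    if variant == "concat":          # Concat-Skip: channel-wise concat, then plain W-MSA
        q = k = v = (n, 2 * c)
    elif variant == "cross":         # Cross-Skip: queries from decoder, K/V from encoder
        q, k, v = (n, c), (n, c), (n, c)
    elif variant == "concat_cross":  # ConcatCross-Skip: K/V from concatenated E_l and D_{l-1}
        q, k, v = (n, c), (2 * n, c), (2 * n, c)
    else:
        raise ValueError(f"unknown variant: {variant}")
    return q, k, v
```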
5 Conclusions
Limitation and Broader Impacts.
Thanks to the proposed structure, Uformer achieves state-of-the-art performance on a variety of image restoration tasks (image denoising, deraining, deblurring, and demoireing). We have not tested Uformer on other vision tasks such as image-to-image translation and image super-resolution, and we look forward to investigating Uformer for more applications. Meanwhile, we note several potential negative impacts of abusing image restoration techniques. For example, restored surveillance images may raise privacy issues. The techniques may also destroy the original patterns used for camera identification and multimedia copyright [65], which hurts the authenticity of image forensics.
Summary
Uformer is a Transformer model for image restoration. Its distinguishing trait is that it looks much like U-Net (the classic model for medical image segmentation): it has an encoder, a decoder, and skip-connections between them. Each LeWin Transformer block contains two components: W-MSA, a window-based attention module, and LeFF, an FFN that incorporates convolution.