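Last month a colleague and I took part in the Kaggle Human Protein Atlas multi-label image classification competition and finished 5th. I am reposting the solution write-up my colleague wrote.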

https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77731

First of all, congratulations to all the winners! Thanks to Kaggle and the HPA team for hosting such an interesting competition, and thanks to TomomiMoriyama, Heng CherKeng, ManyFoldCV and Spytensor.

Here is a brief summary of our solution.

Dataset

Like most other competitors, we used both the official data (PNG and TIFF) and external data. To deal with class imbalance, we used PyTorch's WeightedRandomSampler during training, and MultilabelStratifiedShuffleSplit to split the data into training and validation sets. We built 10 cross-validation folds, each holding out 8% of the data for validation.
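
As a rough illustration of how per-class weights can drive WeightedRandomSampler, the sketch below turns each image's label set into a single sampling weight. The class weights are the ones listed under Training. Taking the maximum over an image's labels is an assumption (the write-up does not say how multi-label images were weighted), and `sample_weights` is a hypothetical helper name.

```python
# Per-class sampling weights as listed in the Training section (28 classes).
CLASS_WEIGHTS = [
    1.0, 5.97, 2.89, 5.75, 4.64, 4.27, 5.46, 3.2, 14.48, 14.84,
    15.14, 6.92, 6.86, 8.12, 6.32, 19.24, 8.48, 11.93, 7.32, 5.48,
    11.99, 2.39, 6.3, 3.0, 12.06, 1.0, 10.39, 16.5,
]


def sample_weights(label_sets, class_weights=CLASS_WEIGHTS):
    """Return one sampling weight per image.

    label_sets: list of lists of class indices, one list per image.
    Assumption: an image inherits the weight of its rarest (highest-weight) class.
    """
    return [max(class_weights[c] for c in labels) for labels in label_sets]


# Three toy images: class 0 only, classes 0 and 15, classes 25 and 8.
weights = sample_weights([[0], [0, 15], [25, 8]])
```

The resulting list could then be passed as `weights` to `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)`.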

Image Preprocessing

The HPA dataset provides four staining channels per sample (red, green, blue, yellow), each stored as an RGB image of its own, so we took a single channel from each image (r=r, g=g, b=b, y=b) to form a 4-channel input for training.
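
The channel selection can be sketched with NumPy as follows. This is only an illustration of the (r=r, g=g, b=b, y=b) mapping: it assumes each stain has already been loaded as an H x W x 3 RGB array, and `stack_four_channels` is a hypothetical helper name.

```python
import numpy as np


def stack_four_channels(red_img, green_img, blue_img, yellow_img):
    """Build an H x W x 4 array from four per-stain RGB images.

    Mapping follows the write-up: r=r, g=g, b=b, y=b.
    """
    r = red_img[..., 0]     # red channel of the red-stain image
    g = green_img[..., 1]   # green channel of the green-stain image
    b = blue_img[..., 2]    # blue channel of the blue-stain image
    y = yellow_img[..., 2]  # blue channel of the yellow-stain image (y=b)
    return np.stack([r, g, b, y], axis=-1)


# Toy 2x2 images with distinct constant values per stain.
toy = [np.full((2, 2, 3), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
four_channel = stack_four_channels(*toy)
```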

All PNG images were kept at their original 512×512 size, whereas the TIFF images were resized to 1024×1024.

Augmentation

Rotation, Flip, and Shear.

We didn't use random cropping. Instead, we trained 5 models using crop5 (a method in PyTorch) and found this to be more effective.
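
If "crop5" refers to a five-crop scheme like torchvision's `FiveCrop` (an assumption; the write-up does not name the exact function), the five fixed crops are the four corners plus the center. The sketch below only computes the crop boxes; `five_crop_boxes` is a hypothetical helper, and the crop size 448 is an illustrative value, not one taken from the solution.

```python
def five_crop_boxes(width, height, crop):
    """Return (left, top, right, bottom) boxes: four corners, then center."""
    if crop > width or crop > height:
        raise ValueError("crop size must not exceed the image size")
    cx = (width - crop) // 2
    cy = (height - crop) // 2
    return [
        (0, 0, crop, crop),                            # top-left
        (width - crop, 0, width, crop),                # top-right
        (0, height - crop, crop, height),              # bottom-left
        (width - crop, height - crop, width, height),  # bottom-right
        (cx, cy, cx + crop, cy + crop),                # center
    ]


boxes = five_crop_boxes(512, 512, 448)
```

Under this reading, each of the 5 models would see one fixed crop position rather than a random one.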

Models

For our base networks, we mainly used Inception-v3, Inception-v4, and Xception. We also tried DenseNet, SENet, and ResNet, but the results were suboptimal.

We used three different scales during training (512 for PNG images, and 650 and 800 for TIFF images), with different random seeds for the 10-fold CV.

Modifications

  1. Changed the last pooling layer to global pooling.
  2. Appended an additional fully connected layer with output dimension 128 after the global pooling.
  3. We also split training into two stages: the first stage used size 512 starting from ImageNet-pretrained weights, and the second stage used size 650 or 800 starting from the first-stage weights. We found this to be slightly better than training at a fixed size all the way.
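
Modifications 1 and 2 can be sketched in PyTorch roughly as below. Several details are assumptions, since the write-up does not give them: average pooling for the "global pooling", a ReLU after the 128-dimensional layer, and a final 28-way linear classifier. `MultiLabelHead` is a hypothetical name, and `in_channels` is whatever the chosen backbone emits.

```python
import torch
import torch.nn as nn


class MultiLabelHead(nn.Module):
    """Global pooling, then a 128-d fully connected layer, then class logits."""

    def __init__(self, in_channels, num_classes=28, hidden=128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # assumed: global average pooling
        self.fc = nn.Linear(in_channels, hidden)   # the extra 128-d layer
        self.out = nn.Linear(hidden, num_classes)  # assumed final classifier

    def forward(self, feats):
        x = self.pool(feats).flatten(1)  # (N, C, H, W) -> (N, C)
        x = torch.relu(self.fc(x))       # assumed activation
        return self.out(x)               # raw logits, one per class


head = MultiLabelHead(in_channels=2048)
small = head(torch.zeros(2, 2048, 8, 8))    # feature map from a smaller input
large = head(torch.zeros(2, 2048, 14, 14))  # feature map from a larger input
```

Because the pooling is global, the same head accepts feature maps of any spatial size, which is what makes it possible to reuse first-stage weights when moving from size 512 to 650 or 800.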

Training

  • loss: MultiLabelSoftMarginLoss
  • lr: 0.05 (for size 512, pretrained on ImageNet), 0.01 (for size 650 and 800, pretrained using size 512); lr scheduler: StepLR (gamma=0.1, step=6)
  • optimizer: SGD
  • epochs: 25, with early stopping when training at size 650 or 800 (around 15 epochs); models were selected based on loss (instead of F1 score)
  • sampling weights for different classes: [1.0, 5.97, 2.89, 5.75, 4.64, 4.27, 5.46, 3.2, 14.48, 14.84, 15.14, 6.92, 6.86, 8.12, 6.32, 19.24, 8.48, 11.93, 7.32, 5.48, 11.99, 2.39, 6.3, 3.0, 12.06, 1.0, 10.39, 16.5]
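
A minimal PyTorch sketch of the settings above for the size-512 stage, assuming plain SGD (no momentum or weight decay is stated in the write-up) and using a small linear layer and random tensors as stand-ins for the real network and data:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 28)  # stand-in for the real backbone; 28 output classes
criterion = nn.MultiLabelSoftMarginLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.1)

lrs = []
for epoch in range(25):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used this epoch
    inputs = torch.randn(4, 8)                   # dummy batch
    targets = torch.randint(0, 2, (4, 28)).float()  # dummy multi-hot labels
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()  # multiplies the learning rate by 0.1 every 6 epochs
```

With these settings the learning rate stays at 0.05 for epochs 0 to 5, drops to 0.005 for epochs 6 to 11, and keeps shrinking by a factor of 10 every 6 epochs after that.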

Multi-Thresholds

We used the validation sets to search for a threshold for each class by optimizing the F1 score, beginning with 0.15 for all classes.
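
One way such a search could look is sketched below in plain Python. It assumes each class's threshold is tuned independently on that class's own F1 over a fixed grid, and that a candidate replaces the 0.15 starting value only when it is strictly better. The grid and the helper names are assumptions, not details from the solution.

```python
def f1_binary(y_true, y_pred):
    """F1 score for one class from parallel lists of 0/1 (or bool) values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def search_thresholds(probs, labels, init=0.15, grid=None):
    """probs, labels: lists of per-sample rows, one column per class."""
    if grid is None:
        grid = [i / 100 for i in range(5, 96, 5)]  # 0.05, 0.10, ..., 0.95
    num_classes = len(probs[0])
    thresholds = [init] * num_classes
    for c in range(num_classes):
        y_true = [row[c] for row in labels]
        scores = [row[c] for row in probs]
        best = f1_binary(y_true, [s > init for s in scores])
        for t in grid:
            f1 = f1_binary(y_true, [s > t for s in scores])
            if f1 > best:  # keep the current threshold on ties
                best, thresholds[c] = f1, t
    return thresholds


val_probs = [[0.9, 0.2], [0.6, 0.1], [0.4, 0.3], [0.2, 0.05]]
val_labels = [[1, 1], [1, 0], [0, 1], [0, 0]]
found = search_thresholds(val_probs, val_labels)
```

In this toy case the first class moves to a higher threshold, while the second class keeps the initial 0.15 because no grid value improves on it.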

Test

(with multi-thresholds)

Ensembling

The final prediction is an ensemble of the following models: Inception-v3 at size 800 (10-fold); Inception-v4 at sizes 650 and 800 (10-fold); and Xception at size 800 (10-fold), size 650 (1-fold), and size 512 (5-fold). We used 5 folds instead of 10 there simply because we didn't have enough submissions to check the performance of all models, so we took the best ones.
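
The write-up does not say how the model outputs were combined. As one plausible reading, the sketch below averages the predicted probabilities across models and then applies the per-class thresholds. Both the averaging rule and the name `ensemble_predict` are assumptions.

```python
def ensemble_predict(model_probs, thresholds):
    """Average probabilities over models, then threshold each class.

    model_probs: list over models of per-sample rows of class probabilities.
    thresholds: one threshold per class.
    Returns, for each sample, the list of predicted class indices.
    """
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    n_classes = len(thresholds)
    predictions = []
    for i in range(n_samples):
        mean = [
            sum(m[i][c] for m in model_probs) / n_models
            for c in range(n_classes)
        ]
        predictions.append([c for c in range(n_classes) if mean[c] > thresholds[c]])
    return predictions


# Two toy models, one sample, three classes.
toy_probs = [[[0.9, 0.1, 0.3]], [[0.7, 0.3, 0.1]]]
toy_pred = ensemble_predict(toy_probs, thresholds=[0.5, 0.15, 0.25])
```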

Things that did not work for us

  • Training with larger input size (>= 1024), which forced us to reduce the batch size.
  • 3-channel input
  • focal loss
  • C3D
  • TTA: unlike for a lot of other competitors, test-time augmentation actually didn't work for us.
  • Other traditional machine learning methods such as DecisionTree, RandomForest, and SVM.
