This paper systematically surveys the classical loss functions for hierarchical multi-label classification (HMC), and extends the Hamming loss and the ranking loss to support class hierarchies.
Reading Difficulty: ⋆⋆
Creativity: ⋆⋆
Comprehensiveness: ⋆⋆⋆⋆⋆

Symbol System:

| Symbol | Meaning |
| --- | --- |
| $y_i \in \{0,1\}$ | The label for class $i$ |
| $\uparrow(i), \downarrow(i), \Uparrow(i), \Downarrow(i), \Leftrightarrow(i)$ | The parent, children, ancestors, descendants, and siblings of node $i$ |
| $\mathbf{y}_{\mathbf{i}} \in \{0,1\}^{|\mathbf{i}|}$ | The label vector for a set of classes $\mathbf{i}$ |
| $\mathcal{H} = \{0,\dots,N-1\}$ | The class hierarchy, where $N$ is the number of nodes |
| $I(x)$ | The indicator function: 1 when $x$ is true, 0 otherwise |
| $\mathcal{R}$ | The conditional risk |

Hierarchy Constraints
In HMC, if the label structure is a tree, we have:
$$y_i = 1 \Rightarrow y_{\uparrow(i)} = 1.$$

For DAG-type HMC, there are two interpretations (a quick consistency check is sketched below):

  1. AND-interpretation: $y_i = 1 \Rightarrow \mathbf{y}_{\uparrow(i)} = \mathbf{1}$, i.e., all parents of a positive node must be positive.

  2. OR-interpretation: $y_i = 1 \Rightarrow \exists\, j \in \uparrow(i): y_j = 1$, i.e., at least one parent of a positive node must be positive.
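To make the constraint concrete, here is a minimal Python sketch (the `parents`-dict representation and function name are my own, not from the paper) that checks whether a 0/1 label vector is consistent with the hierarchy under either interpretation:

```python
# Minimal sketch: check whether a 0/1 labeling respects the hierarchy constraint
# under the AND- or OR-interpretation. `parents[i]` lists the parent node ids of i.

def is_consistent(y, parents, mode="AND"):
    """y: dict mapping node id -> 0/1 label."""
    for i, ps in parents.items():
        if y[i] == 1 and ps:  # roots have no parents and impose no constraint
            parent_labels = [y[p] for p in ps]
            if mode == "AND" and not all(parent_labels):
                return False
            if mode == "OR" and not any(parent_labels):
                return False
    return True

# Example: node 3 has two parents (1 and 2) in a DAG.
parents = {0: [], 1: [0], 2: [0], 3: [1, 2]}
y = {0: 1, 1: 1, 2: 0, 3: 1}
print(is_consistent(y, parents, mode="AND"))  # False: parent 2 is negative
print(is_consistent(y, parents, mode="OR"))   # True: parent 1 is positive
```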

Loss functions for Flat and Hierarchical Classification
This section is a review of standard (flat) loss functions.

Zero-one loss:
$$\ell_{0/1}(\hat{\mathbf{y}}, \mathbf{y}) = I(\hat{\mathbf{y}}\neq \mathbf{y})$$

Hamming loss:
$$\ell_{\text{hamming}}(\hat{\mathbf{y}},\mathbf{y}) = \sum_{i \in \mathcal{H}} I(\hat{y}_i \neq y_i)$$

Top-$k$ precision:
Consider the $k$ most-confident predicted positive labels for each sample.
$$\text{top-}k\text{-precision}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{\text{number of true positives among the top-}k\text{ labels of } \hat{\mathbf{y}}}{k}$$
The corresponding loss is
$$\ell_{\text{top-}k} = 1 - \text{top-}k\text{-precision}$$

Ranking loss:
$$\ell_{\text{rank}} = \sum_{(i,j):\, y_i > y_j} \left( I(\hat{y}_i < \hat{y}_j) + \frac{1}{2} I(\hat{y}_i = \hat{y}_j) \right)$$
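As a concrete reference, here is a small Python sketch of the flat Hamming and ranking losses above (function names and the 0/1 encoding are my own):

```python
# Minimal sketch of the flat losses. y is a 0/1 vector; for the ranking loss,
# y_hat may be real-valued confidence scores.

def hamming_loss(y_hat, y):
    return sum(int(a != b) for a, b in zip(y_hat, y))

def ranking_loss(y_hat, y):
    # Sum over pairs (i, j) with y_i > y_j (i.e. i positive, j negative):
    # full penalty if i is ranked below j, half penalty for ties.
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                if y_hat[i] < y_hat[j]:
                    loss += 1.0
                elif y_hat[i] == y_hat[j]:
                    loss += 0.5
    return loss

y     = [1, 0, 1, 0]
score = [0.9, 0.8, 0.3, 0.1]
print(hamming_loss([1, 1, 0, 0], y))  # 2
print(ranking_loss(score, y))         # 1.0: positive label 2 is ranked below negative label 1
```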

Hierarchical Multi-class Classification
This is also a review.
Note: in this setting, only a single path in the hierarchy can be predicted positive.

Cai and Hofmann:
$$\ell = \sum_{i \in \mathcal{H}} c_i I(\hat{y}_i \neq y_i)$$
where $c_i$ is the cost for node $i$.

Dekel et al. :
This loss appears more complicated, but the paper seems to treat it as similar in spirit to the cost-weighted loss above.

Hierarchical multi-label classification

H-Loss:
$$\ell_H = \alpha \sum_{i:y_i=1,\hat{y}_i=0} c_i\, I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)}) + \beta \sum_{i:y_i=0,\hat{y}_i = 1} c_i\, I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)})$$
where $\alpha$ and $\beta$ are the weights for false negatives (FN) and false positives (FP), respectively.

Often, misclassifications at upper levels of the hierarchy are considered more expensive than those at lower levels.
A common cost assignment is therefore
$$c_i = \begin{cases} 1, & i = 0,\\ \dfrac{c_{\uparrow(i)}}{n_{\Leftrightarrow(i)}}, & i > 0, \end{cases}$$
where $n_{\Leftrightarrow(i)}$ is the number of siblings of $i$ (including $i$ itself).
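A minimal sketch of this top-down cost assignment for a tree, assuming the hierarchy is given as a `children` dict with node 0 as the root (the representation is mine, not the paper's):

```python
# Minimal sketch: assign the sibling-normalized costs c_i over a tree by a
# top-down traversal. `children[i]` lists the children of node i; node 0 is
# the root with cost 1.

def tree_costs(children, root=0):
    cost = {root: 1.0}
    stack = [root]
    while stack:
        p = stack.pop()
        n_children = len(children.get(p, []))
        for ch in children.get(p, []):
            # each child gets the parent's cost divided by the number of
            # siblings (including itself), i.e. by the parent's child count
            cost[ch] = cost[p] / n_children
            stack.append(ch)
    return cost

children = {0: [1, 2], 1: [3, 4, 5], 2: []}
print(tree_costs(children))  # c1 = c2 = 0.5; nodes 3, 4, 5 each get 0.5 / 3
```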

Matching Loss:
$$\ell_{\text{match}} = \alpha \sum_{i:y_i=1}\phi(i, \hat{\mathbf{y}}) + \beta \sum_{i:\hat{y}_i = 1} \phi(i, \mathbf{y}),$$
where
$$\phi(i,\mathbf{y}) = \min_{j:y_j=1} \text{cost}(j\rightarrow i)$$
and $\text{cost}(j\rightarrow i)$ is the cost of traversing from node $j$ to node $i$ in the hierarchy, e.g., the path length or a weighted path length.
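For illustration, here is a minimal sketch of $\phi(i,\mathbf{y})$ when $\text{cost}(j\rightarrow i)$ is taken to be the unweighted path length in the hierarchy (this choice and the graph representation are assumptions of mine, not the paper's):

```python
# Minimal sketch: phi(i, y) = min_{j: y_j = 1} cost(j -> i), with cost taken
# as the unweighted path length, treating hierarchy edges as undirected.
from collections import deque

def path_lengths_from(i, parents, children):
    """BFS distances from node i over the undirected hierarchy graph."""
    dist = {i: 0}
    q = deque([i])
    while q:
        u = q.popleft()
        for v in parents.get(u, []) + children.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def phi(i, y, parents, children):
    dist = path_lengths_from(i, parents, children)
    return min(dist[j] for j in dist if y[j] == 1)

parents  = {0: [], 1: [0], 2: [0], 3: [1]}
children = {0: [1, 2], 1: [3], 2: [], 3: []}
y = {0: 1, 1: 1, 2: 0, 3: 0}
print(phi(3, y, parents, children))  # 1: the nearest positive node is node 1
```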

Verspoor et al.: hierarchical versions of precision, recall, and F-score; these are more expensive to compute.

Condensing Sort and Selection Algorithm for HMC
This is also a review.
The algorithm can be used on both tree and DAG hierarchies.

It solves the following optimization objective via a greedy algorithm called the condensing sort and selection algorithm:
$$\begin{aligned} \max_{\{\psi_i\}_{i \in \mathcal{H}}}\ & \sum_{i \in \mathcal{H}} \psi_i \widetilde{y}_i \\ \text{s.t.}\quad & \psi_i \leq \psi_{\uparrow(i)},\ \forall i \in \mathcal{H}\setminus \{0\},\\ & \psi_0 = 1,\ \psi_i \in \{0, 1\}, \\ & \sum_{i=0}^{N-1} \psi_i = L \end{aligned}$$

where $\psi_i = 1$ indicates that node $i$ is predicted positive in $\hat{\mathbf{y}}$, and $\psi_i = 0$ otherwise.

When the label hierarchy is a DAG, the first constraint above is replaced by
$$\psi_i \leq \psi_j,\ \forall i \in \mathcal{H} \setminus \{0\},\ \forall j \in\ \Uparrow(i).$$
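The greedy CSSA/CSSAG procedure itself is not reproduced here, but the objective is easy to sanity-check by brute force on a tiny hierarchy (the code below and the `y_tilde` scores are illustrative assumptions only):

```python
# Brute-force reference for the selection objective above; only usable for tiny
# hierarchies, and not the paper's greedy algorithm. `y_tilde[i]` plays the
# role of the per-node score, and node 0 is the root, which is always selected.
from itertools import combinations

def best_subtree_of_size(y_tilde, parents, L, root=0):
    nodes = [i for i in y_tilde if i != root]
    best_val, best_set = float("-inf"), None
    for extra in combinations(nodes, L - 1):
        chosen = {root} | set(extra)
        # feasibility: every chosen node's parents must also be chosen
        # (for a DAG, requiring all parents recursively covers all ancestors)
        if all(all(p in chosen for p in parents[i]) for i in chosen):
            val = sum(y_tilde[i] for i in chosen)
            if val > best_val:
                best_val, best_set = val, chosen
    return best_set, best_val

parents = {0: [], 1: [0], 2: [0], 3: [1]}
y_tilde = {0: 0.9, 1: 0.2, 2: 0.6, 3: 0.8}
print(best_subtree_of_size(y_tilde, parents, L=3))  # {0, 1, 3}: node 3 is worth
# including even though its parent 1 scores lower than node 2
```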

Extending Flat Losses
This paper extends the Hamming loss and the ranking loss to support class hierarchies.

For hierarchical hamming loss:
$$\ell_{\text{H-hamming}} = \alpha \sum_{i: y_i = 1 \wedge \hat{y}_i = 0} c_i + \beta \sum_{i: y_i = 0 \wedge \hat{y}_i = 1} c_i$$

For a DAG class hierarchy, the cost becomes
$$c_i = \begin{cases} 1, & i = 0, \\ \displaystyle\sum_{j \in\ \uparrow(i)} \frac{c_j}{n_{\downarrow(j)}}, & i > 0, \end{cases}$$
where $n_{\downarrow(j)}$ is the number of children of node $j$.

The original paper also handles some special cases, but they are straightforward and not discussed here.
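A minimal sketch of this DAG cost assignment, propagating costs in topological order so that each node sums $c_j / n_{\downarrow(j)}$ over its parents $j$ (the `graphlib`-based representation is my own choice):

```python
# Minimal sketch: DAG cost recursion. Costs are computed in topological order,
# each node accumulating c_parent / (number of children of that parent) over
# all of its parents.
from graphlib import TopologicalSorter  # Python 3.9+

def dag_costs(parents, children, root=0):
    order = TopologicalSorter(parents).static_order()  # parents come first
    cost = {}
    for i in order:
        if i == root:
            cost[i] = 1.0
        else:
            cost[i] = sum(cost[j] / len(children[j]) for j in parents[i])
    return cost

parents  = {0: [], 1: [0], 2: [0], 3: [1, 2]}
children = {0: [1, 2], 1: [3], 2: [3]}
print(dag_costs(parents, children))  # c0 = 1.0, c1 = c2 = 0.5, c3 = 1.0
```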

For hierarchical ranking loss:
$$\ell_{\text{H-rank}} = \sum_{(i,j):\, y_i > y_j} c_{ij} \left( I(\hat{y}_i < \hat{y}_j) + \frac{1}{2}I(\hat{y}_i = \hat{y}_j) \right),$$

where $c_{ij} = c_i c_j$ ensures a high penalty when an upper-level positive label is ranked below a lower-level negative label.
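Putting the two hierarchical losses together, here is a minimal sketch with $\alpha=\beta=1$ and node costs $c_i$ computed as in the earlier snippets (names and encoding are assumptions of mine):

```python
# Minimal sketch of the hierarchical Hamming and ranking losses above.
# `c` maps node id -> cost; y is 0/1; for the ranking loss, scores may be real.

def h_hamming_loss(y_hat, y, c, alpha=1.0, beta=1.0):
    fn = sum(c[i] for i in y if y[i] == 1 and y_hat[i] == 0)
    fp = sum(c[i] for i in y if y[i] == 0 and y_hat[i] == 1)
    return alpha * fn + beta * fp

def h_ranking_loss(scores, y, c):
    loss = 0.0
    for i in y:
        for j in y:
            if y[i] > y[j]:                      # i positive, j negative
                c_ij = c[i] * c[j]
                if scores[i] < scores[j]:
                    loss += c_ij
                elif scores[i] == scores[j]:
                    loss += 0.5 * c_ij
    return loss

c      = {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.25}
y      = {0: 1, 1: 1, 2: 0, 3: 0}
y_hat  = {0: 1, 1: 0, 2: 1, 3: 0}
scores = {0: 0.9, 1: 0.2, 2: 0.6, 3: 0.1}
print(h_hamming_loss(y_hat, y, c))   # 0.5 (FN on node 1) + 0.5 (FP on node 2) = 1.0
print(h_ranking_loss(scores, y, c))  # 0.25: positive node 1 scored below negative node 2
```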

Minimizing the risk
The conditional risk (or simply the risk) $\mathcal{R}(\hat{\mathbf{y}})$ of predicting the multilabel vector $\hat{\mathbf{y}}$ is the expectation of $\ell(\hat{\mathbf{y}},\mathbf{y})$ over all possible ground-truth $\mathbf{y}$'s, i.e.,
$$\mathcal{R}(\hat{\mathbf{y}}) = \sum_{\mathbf{y}} \ell(\hat{\mathbf{y}}, \mathbf{y}) P(\mathbf{y} \mid \mathbf{x}),$$
and the Bayes-optimal prediction minimizes this expected risk: $\hat{\mathbf{y}}^\star = \argmin_{\hat{\mathbf{y}} \in \Omega} \mathcal{R}(\hat{\mathbf{y}})$ (expected risk minimization).

There are three issues to be addressed:
(1) Estimating $P(\mathbf{y}\mid\mathbf{x})$.
(2) Computing $\mathcal{R}(\hat{\mathbf{y}})$ without exhaustive search.
(3) Minimizing $\mathcal{R}(\hat{\mathbf{y}})$.

The paper computes the marginals $p_i = P(y_i = 1 \mid \mathbf{x})$ via the chain rule along the hierarchy, and the risk is rewritten into a different form for each loss.
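As an illustration of the chain rule for a tree hierarchy (this is my reading, not necessarily the paper's exact procedure), assume the model provides the conditionals $P(y_i = 1 \mid y_{\uparrow(i)} = 1, \mathbf{x})$; the marginal $p_i$ is then the product of the conditionals along the root-to-$i$ path:

```python
# Minimal sketch (an assumption on my part): for a tree hierarchy, if the model
# outputs cond[i] = P(y_i = 1 | y_parent(i) = 1, x), the chain rule gives the
# marginal p_i = P(y_i = 1 | x) as the product of conditionals from the root to i.

def marginals(cond, parent, root=0):
    p = {root: 1.0}  # the root is assumed to be always positive

    def p_of(i):
        if i not in p:
            p[i] = p_of(parent[i]) * cond[i]
        return p[i]

    return {i: p_of(i) for i in cond}

parent = {1: 0, 2: 0, 3: 1}
cond   = {1: 0.8, 2: 0.3, 3: 0.5}
print(marginals(cond, parent))  # {1: 0.8, 2: 0.3, 3: 0.4}
```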
The risk for matching loss:
$$\mathcal{R}_{\text{match}}(\hat{\mathbf{y}}) = \sum_{i:\hat{y}_i = 0} \phi(i, \hat{\mathbf{y}}) + \sum_{i: \hat{y}_i = 1} q_i$$

where $q_i = \sum_{j=0}^{d(i)-1}\sum_{l=j+1}^{d(i)} c_{\Uparrow_l(i)}\, P(\mathbf{y}_{\Uparrow_{0:j}(i)} = \mathbf{1}, y_{\Uparrow_{j+1}(i)} = 0 \mid \mathbf{x})$, $d(i)$ is the depth of node $i$, $\Uparrow_j(i)$ is $i$'s ancestor at depth $j$, and $\Uparrow_{0:j}(i) = \{\Uparrow_0(i), \dots, \Uparrow_j(i)\}$ is the set of $i$'s ancestors at depths 0 to $j$.

The risk for hierarchical hamming loss:
$$\mathcal{R}_{\text{H-hamming}}(\hat{\mathbf{y}}) = \alpha \sum_{i:\hat{y}_i = 0} c_i p_i + \beta \sum_{i:\hat{y}_i=1} c_i(1 - p_i)$$
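Given the marginals $p_i$ and costs $c_i$, this risk can be evaluated directly for any candidate $\hat{\mathbf{y}}$ without enumerating all $\mathbf{y}$'s; a minimal sketch:

```python
# Minimal sketch: evaluate the hierarchical-Hamming risk of a candidate
# prediction y_hat from the marginals p_i and costs c_i.

def h_hamming_risk(y_hat, p, c, alpha=1.0, beta=1.0):
    risk = 0.0
    for i in y_hat:
        if y_hat[i] == 0:
            risk += alpha * c[i] * p[i]          # expected false-negative cost
        else:
            risk += beta * c[i] * (1.0 - p[i])   # expected false-positive cost
    return risk

p = {0: 1.0, 1: 0.8, 2: 0.3, 3: 0.4}
c = {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.5}
print(h_hamming_risk({0: 1, 1: 1, 2: 0, 3: 0}, p, c))  # 0 + 0.1 + 0.15 + 0.2 = 0.45
```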

The risk for hierarchical ranking loss:
$$\mathcal{R}_{\text{H-rank}}(\hat{\mathbf{y}}) = \sum_{0 \leq i < j \leq N-1} c_{ij}\left(p_i I(\hat{y}_i \leq \hat{y}_j) + p_j I(\hat{y}_i \geq \hat{y}_j) + \frac{p_i+p_j}{2}I(\hat{y}_i = \hat{y}_j)\right) - C$$

Efficiently minimizing the risk:
$$\hat{\mathbf{y}} = \argmin_{L = 1,\dots,N} \mathcal{R}(\hat{\mathbf{y}}^\star_{(L)}),$$
where
$$\hat{\mathbf{y}}^\star_{(L)} = \argmin_{\hat{\mathbf{y}}\in \Omega:\ |\text{supp}(\hat{\mathbf{y}})| = L} \mathcal{R}(\hat{\mathbf{y}})$$
and $\text{supp}(f) := \{x \in X \mid f(x) \neq 0\}$ is the support of $f$.

This is actually a fairly simple and intuitive optimization strategy: optimize separately for each possible number of positive labels, i.e., for each different $L$.

For tree label hierarchies, the paper adopts CSSAG (the condensing sort and selection algorithm proposed by Bi et al.), which is a greedy strategy.

Conclusions

This paper extends the matching loss, Hamming loss, and ranking loss to support both tree-type and DAG-type class hierarchies.
The paper is easy to follow and not especially innovative, but it is well organized and very comprehensive, which is probably why it was published in TKDE.
