《Bayes-Optimal Hierarchical Multi-label Classification》 - TKDE
This paper systematically reviews the classical loss functions for hierarchical multi-label classification (HMC), and extends the Hamming loss and ranking loss to support class hierarchies.
Reading Difficulty: ★★
Creativity: ★★
Comprehensiveness: ★★★★★
Symbol System:

| Symbol | Meaning |
| --- | --- |
| $y_i \in \{0,1\}$ | The label for class $i$ |
| $\uparrow(i), \downarrow(i), \Uparrow(i), \Downarrow(i), \Leftrightarrow(i)$ | The parent, children, ancestors, descendants, and siblings of node $i$ |
| $\mathbf{y}_{\mathbf{i}} \in \{0,1\}^{|\mathbf{i}|}$ | The label vector for a set of classes $\mathbf{i}$ |
| $\mathcal{H} = \{0,\dots,N-1\}$ | The class hierarchy, where $N$ is the number of nodes |
| $I(x)$ | An indicator function that outputs 1 when $x$ is true and 0 otherwise |
| $\mathcal{R}$ | The conditional risk |
Hierarchy Constraints
In HMC, if the label structure is a tree, we have:
$$y_i = 1 \Rightarrow y_{\uparrow(i)} = 1.$$
For DAG-type HMC, where a node may have multiple parents, there are two interpretations:
AND-interpretation: $y_i = 1 \Rightarrow \mathbf{y}_{\uparrow(i)} = \mathbf{1}$, i.e., all parents of a positive node must be positive.
OR-interpretation: $y_i = 1 \Rightarrow \exists j \in \uparrow(i): y_j = 1$, i.e., at least one parent of a positive node must be positive.
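To make the two interpretations concrete, here is a minimal Python sketch that checks both constraints on a toy DAG (the hierarchy and labels are hypothetical):

```python
# Hypothetical DAG: node 0 is the root; `parents` maps each non-root node
# to its parent set. Node 3 has two parents, so AND and OR can differ.
parents = {1: [0], 2: [0], 3: [1, 2]}

def satisfies_and(y, parents):
    """AND-interpretation: every positive node needs ALL parents positive."""
    return all(all(y[p] == 1 for p in ps)
               for i, ps in parents.items() if y[i] == 1)

def satisfies_or(y, parents):
    """OR-interpretation: every positive node needs AT LEAST ONE positive parent."""
    return all(any(y[p] == 1 for p in ps)
               for i, ps in parents.items() if y[i] == 1)

y = {0: 1, 1: 1, 2: 0, 3: 1}
print(satisfies_and(y, parents))  # False: node 3's parent 2 is negative
print(satisfies_or(y, parents))   # True: node 3's parent 1 is positive
```

The same labeling can thus be valid under OR but invalid under AND, which is why the two interpretations must be distinguished for DAGs.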
Loss functions for Flat and Hierarchical Classification
This section is a review of standard flat losses.
Zero-one loss:
$$\ell_{0/1}(\hat{\mathbf{y}}, \mathbf{y}) = I(\hat{\mathbf{y}} \neq \mathbf{y})$$
Hamming loss:
$$\ell_{\text{hamming}}(\hat{\mathbf{y}}, \mathbf{y}) = \sum_{i \in \mathcal{H}} I(\hat{y}_i \neq y_i)$$
Top-$k$ precision:
It considers the $k$ most-confident predicted labels for each sample:
$$\text{top-}k\text{-precision}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{\text{number of true-positive predictions among the top-}k\text{ labels of } \hat{\mathbf{y}}}{k}$$
So the loss is
$$\ell_{\text{top-}k} = 1 - \text{top-}k\text{-precision}$$
Ranking loss:
$$\ell_{\text{rank}} = \sum_{(i,j): y_i > y_j} \left( I(\hat{y}_i < \hat{y}_j) + \frac{1}{2} I(\hat{y}_i = \hat{y}_j) \right)$$
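A minimal sketch of these flat losses on toy vectors (all values are hypothetical; `s` denotes predicted scores, used by the top-$k$ and ranking losses):

```python
def zero_one(yhat, y):
    """1 if the whole label vector is wrong anywhere, else 0."""
    return int(yhat != y)

def hamming(yhat, y):
    """Number of per-label mismatches."""
    return sum(int(a != b) for a, b in zip(yhat, y))

def top_k_loss(s, y, k):
    """1 minus the fraction of true positives among the k top-scored labels."""
    topk = sorted(range(len(s)), key=lambda i: -s[i])[:k]
    tp = sum(y[i] for i in topk)
    return 1 - tp / k

def ranking_loss(s, y):
    """Sum over pairs (i, j) with y_i > y_j; ties contribute 1/2."""
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                loss += 1.0 if s[i] < s[j] else (0.5 if s[i] == s[j] else 0.0)
    return loss

y = [1, 0, 1, 0]
yhat = [1, 0, 0, 0]
s = [0.9, 0.8, 0.3, 0.1]
print(zero_one(yhat, y))    # 1
print(hamming(yhat, y))     # 1
print(top_k_loss(s, y, 2))  # 0.5: only one of the top-2 labels is truly positive
print(ranking_loss(s, y))   # 1.0: positive label 2 is scored below negative label 1
```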
Hierarchical Multi-class Classification
A review.
Note: Only a single path can be predicted positive.
Cai and Hofmann:
$$\ell = \sum_{i \in \mathcal{H}} c_i I(\hat{y}_i \neq y_i)$$
where c i c_i ci is the cost for node i i i.
Dekel et al. :
This loss is more involved, but the paper treats it as essentially similar to the former loss.
Hierarchical multi-label classification
H-Loss:
$$\ell_H = \alpha \sum_{i: y_i = 1, \hat{y}_i = 0} c_i\, I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)}) + \beta \sum_{i: y_i = 0, \hat{y}_i = 1} c_i\, I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)})$$
where $\alpha$ and $\beta$ are the weights for false negatives (FN) and false positives (FP), respectively.
Often, misclassifications at upper levels of the class hierarchy are considered more expensive than those at lower levels.
Thus, a common cost-assignment approach is
$$c_i = \begin{cases} 1, & i = 0, \\ \dfrac{c_{\Uparrow(i)}}{n_{\Leftrightarrow(i)}}, & i > 0, \end{cases}$$
where $n_{\Leftrightarrow(i)}$ is the number of siblings of $i$ (including $i$).
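The cost recursion above can be sketched as follows; the toy tree and its `children` map are hypothetical:

```python
# Hypothetical tree: node 0 is the root with children 1 and 2;
# node 1 has children 3, 4, 5.
children = {0: [1, 2], 1: [3, 4, 5], 2: []}

def assign_costs(children, root=0):
    """Root gets cost 1; each child gets its parent's cost divided by
    the number of siblings (including itself), i.e. the parent's child count."""
    cost = {root: 1.0}
    stack = [root]
    while stack:
        p = stack.pop()
        for c in children.get(p, []):
            cost[c] = cost[p] / len(children[p])
            stack.append(c)
    return cost

costs = assign_costs(children)
print(costs[1], costs[2])  # 0.5 0.5: the root's cost is split between its 2 children
```

Each of node 1's three children then receives $0.5 / 3$, so costs decay with depth and with branching factor.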
Matching Loss:
$$\ell_{\text{match}} = \alpha \sum_{i: y_i = 1} \phi(i, \hat{\mathbf{y}}) + \beta \sum_{i: \hat{y}_i = 1} \phi(i, \mathbf{y}),$$
where
$$\phi(i, \mathbf{y}) = \min_{j: y_j = 1} \text{cost}(j \rightarrow i),$$
where $\text{cost}(j \rightarrow i)$ is the cost of traversing from node $j$ to node $i$ in the hierarchy, e.g., the path length or a weighted path length.
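A minimal sketch of $\phi(i, \mathbf{y})$, assuming $\text{cost}(j \rightarrow i)$ is the unweighted path length in the hierarchy (one of the choices mentioned above); the toy adjacency is hypothetical:

```python
from collections import deque

# Hypothetical tree as an undirected adjacency list: 0-1, 0-2, 1-3.
edges = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}

def path_len(src, dst, edges):
    """BFS shortest path length from src to dst; inf if unreachable."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            return dist[u]
        for v in edges[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return float("inf")

def phi(i, y, edges):
    """Distance from node i to the nearest positive node in y."""
    return min(path_len(j, i, edges) for j in range(len(y)) if y[j] == 1)

y = [1, 1, 0, 0]
print(phi(3, y, edges))  # 1: the nearest positive node is 1, one hop away
```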
Verspoor et al.: hierarchical versions of precision, recall, and F-score, but these are more expensive to compute.
Condensing Sort and Selection Algorithm for HMC
It is a review.
It can be used on both tree and DAG hierarchies.
It solves the following optimization objective via a greedy algorithm, the condensing sort and selection algorithm (CSSA):
$$\begin{aligned} \max_{\{\psi_i\}_{i \in \mathcal{H}}} \quad & \sum_{i \in \mathcal{H}} \psi_i \widetilde{y}_i \\ \text{s.t.} \quad & \psi_i \leq \psi_{\uparrow(i)}, \quad \forall i \in \mathcal{H} \setminus \{0\}, \\ & \psi_0 = 1, \quad \psi_i \in \{0, 1\}, \\ & \sum_{i=0}^{N-1} \psi_i = L, \end{aligned}$$
where $\psi_i = 1$ indicates that node $i$ is predicted positive in $\hat{\mathbf{y}}$, and 0 otherwise.
When the label hierarchy is a DAG, the first constraint of the above objective is replaced by
$$\psi_i \leq \psi_j, \quad \forall i \in \mathcal{H} \setminus \{0\}, \ \forall j \in \Uparrow(i).$$
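To make the objective concrete, here is a brute-force sketch on a tiny toy tree (CSSA solves this greedily; brute force is only for illustration, and all scores are hypothetical):

```python
from itertools import product

parent = {1: 0, 2: 0, 3: 1}          # toy tree edges, node 0 is the root
y_tilde = [0.9, 0.2, 0.7, 0.8]       # hypothetical per-node scores

def best_support(L):
    """Brute-force the integer program above for a fixed support size L."""
    best_val, best_psi = float("-inf"), None
    for bits in product([0, 1], repeat=3):
        psi = (1,) + bits                              # psi_0 = 1 is fixed
        if sum(psi) != L:
            continue                                   # sum psi_i = L
        if any(psi[i] > psi[pa] for i, pa in parent.items()):
            continue                                   # psi_i <= psi_parent(i)
        val = sum(a * b for a, b in zip(psi, y_tilde))
        if val > best_val:
            best_val, best_psi = val, psi
    return best_psi, best_val

print(best_support(2)[0])  # (1, 0, 1, 0): root plus node 2
```

Note that node 3, despite its high score 0.8, cannot be chosen with $L = 2$ because selecting it would require its ancestors 0 and 1 as well.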
Extending Flat Losses
This paper extends the Hamming loss and ranking loss to support class hierarchies.
For hierarchical hamming loss:
$$\ell_{\text{H-hamming}} = \alpha \sum_{i: y_i = 1 \wedge \hat{y}_i = 0} c_i + \beta \sum_{i: y_i = 0 \wedge \hat{y}_i = 1} c_i$$
For a DAG class hierarchy, the cost is defined as
$$c_i = \begin{cases} 1, & i = 0, \\ \displaystyle\sum_{j \in \Uparrow(i)} \frac{c_j}{n_{\downarrow(j)}}, & i > 0, \end{cases}$$
where $n_{\downarrow(j)}$ is the number of children of node $j$.
There are special cases in the original paper, but they are straightforward and not discussed here.
For hierarchical ranking loss:
$$\ell_{\text{H-rank}} = \sum_{(i,j): y_i > y_j} c_{ij} \left( I(\hat{y}_i < \hat{y}_j) + \frac{1}{2} I(\hat{y}_i = \hat{y}_j) \right),$$
where $c_{ij} = c_i c_j$ ensures a high penalty when an upper-level positive label is ranked after a lower-level negative label.
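A minimal sketch of both extended losses on toy values (the node costs, labels, and scores are hypothetical; in practice $c_i$ comes from the recursion above):

```python
def h_hamming(yhat, y, c, alpha=1.0, beta=1.0):
    """Cost-weighted Hamming loss: alpha weights FNs, beta weights FPs."""
    fn = sum(c[i] for i in range(len(y)) if y[i] == 1 and yhat[i] == 0)
    fp = sum(c[i] for i in range(len(y)) if y[i] == 0 and yhat[i] == 1)
    return alpha * fn + beta * fp

def h_rank(s, y, c):
    """Cost-weighted ranking loss with pair costs c_ij = c_i * c_j."""
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                cij = c[i] * c[j]
                if s[i] < s[j]:
                    loss += cij
                elif s[i] == s[j]:
                    loss += cij / 2
    return loss

c = [1.0, 0.5, 0.5, 0.25]     # hypothetical node costs, decaying with depth
y = [1, 1, 0, 0]
yhat = [1, 0, 0, 1]
s = [0.9, 0.2, 0.4, 0.6]
print(h_hamming(yhat, y, c))  # 0.75: FN at node 1 (0.5) + FP at node 3 (0.25)
print(h_rank(s, y, c))        # 0.375: positive node 1 is scored below both negatives
```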
Minimizing the risk
The conditional risk (or simply the risk) $\mathcal{R}(\hat{\mathbf{y}})$ of predicting multi-label $\hat{\mathbf{y}}$ is the expectation of $\ell(\hat{\mathbf{y}}, \mathbf{y})$ over all possible ground-truth $\mathbf{y}$'s, i.e.,
$$\mathcal{R}(\hat{\mathbf{y}}) = \sum_{\mathbf{y}} \ell(\hat{\mathbf{y}}, \mathbf{y}) P(\mathbf{y} \mid \mathbf{x}),$$
and the goal (expected risk minimization) is to find $\argmin_{\hat{\mathbf{y}} \in \Omega} \mathcal{R}(\hat{\mathbf{y}})$.
There are three issues to be addressed:
(1) Estimating $P(\mathbf{y}|\mathbf{x})$.
(2) Computing $\mathcal{R}(\hat{\mathbf{y}})$ without exhaustive search.
(3) Minimizing $\mathcal{R}(\hat{\mathbf{y}})$.
This paper computes the marginals $p_i = P(y_i = 1 \mid \mathbf{x})$ through the chain rule, and the risk is rewritten into different forms for the different losses.
The risk for matching loss:
$$\mathcal{R}_{\text{match}}(\hat{\mathbf{y}}) = \sum_{i: \hat{y}_i = 0} \phi(i, \hat{\mathbf{y}}) + \sum_{i: \hat{y}_i = 1} q_i$$
where $q_i = \sum_{j=0}^{d(i)-1} \sum_{l=j+1}^{d(i)} c_{\Uparrow_l(i)} P(\mathbf{y}_{\Uparrow_{0:j}(i)} = \mathbf{1}, y_{\Uparrow_{j+1}(i)} = 0 \mid \mathbf{x})$, $d(i)$ is the depth of node $i$, $\Uparrow_j(i)$ is $i$'s ancestor at depth $j$, and $\Uparrow_{0:j}(i) = \{\Uparrow_0(i), \dots, \Uparrow_j(i)\}$ is the set of $i$'s ancestors at depths 0 to $j$.
The risk for hierarchical hamming loss:
$$\mathcal{R}_{\text{H-hamming}}(\hat{\mathbf{y}}) = \alpha \sum_{i: \hat{y}_i = 0} c_i p_i + \beta \sum_{i: \hat{y}_i = 1} c_i (1 - p_i)$$
The risk for hierarchical ranking loss:
$$\mathcal{R}_{\text{H-rank}}(\hat{\mathbf{y}}) = \sum_{0 \leq i < j \leq N-1} c_{ij} \left( p_i I(\hat{y}_i \leq \hat{y}_j) + p_j I(\hat{y}_i \geq \hat{y}_j) - \frac{p_i + p_j}{2} I(\hat{y}_i = \hat{y}_j) \right) - C,$$
where $C$ is a constant independent of $\hat{\mathbf{y}}$; the tie term carries a minus sign so that a tied pair contributes $\frac{p_i + p_j}{2}$ in total, consistent with the $\frac{1}{2}$ tie penalty in the loss.
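The hierarchical-Hamming risk is cheap to evaluate once the marginals $p_i$ are known; a minimal sketch with hypothetical marginals and costs:

```python
def risk_h_hamming(yhat, p, c, alpha=1.0, beta=1.0):
    """Expected hierarchical-Hamming loss of a hard prediction yhat,
    computed from per-node marginals p_i = P(y_i = 1 | x)."""
    r = 0.0
    for i in range(len(p)):
        if yhat[i] == 0:
            r += alpha * c[i] * p[i]        # expected FN cost at node i
        else:
            r += beta * c[i] * (1 - p[i])   # expected FP cost at node i
    return r

p = [0.95, 0.6, 0.3, 0.1]     # hypothetical marginals
c = [1.0, 0.5, 0.5, 0.25]     # hypothetical node costs
r1 = risk_h_hamming([1, 1, 0, 0], p, c)
r0 = risk_h_hamming([0, 0, 0, 0], p, c)
print(r1 < r0)  # True: predicting the two likely nodes beats predicting nothing
```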
Efficiently minimizing the risk:
$$\hat{\mathbf{y}} = \argmin_{L = 1, \dots, N} \mathcal{R}(\hat{\mathbf{y}}^\star_{(L)}),$$
where
$$\hat{\mathbf{y}}^\star_{(L)} = \argmin_{\hat{\mathbf{y}} \in \Omega,\ |\text{supp}(\hat{\mathbf{y}})| = L} \mathcal{R}(\hat{\mathbf{y}}),$$
where $\text{supp}(f) := \{x \in X \mid f(x) \neq 0\}$ is the support of $f$.
This is actually a fairly simple and intuitive objective: optimize separately for each possible number of positive labels, i.e., for each $L$.
This paper adopts the CSSAG (the condensing sort and selection algorithm proposed by Bi et al.) for the tree label hierarchy, which is a greedy strategy.
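The two-level minimization can be sketched by brute force on a tiny tree, using the hierarchical-Hamming risk as $\mathcal{R}$ (the paper uses the greedy CSSA/CSSAG for the inner step; the marginals and costs here are hypothetical):

```python
from itertools import product

parent = {1: 0, 2: 0, 3: 1}    # toy tree, node 0 is the root
p = [0.95, 0.6, 0.3, 0.1]      # hypothetical marginals P(y_i = 1 | x)
c = [1.0, 0.5, 0.5, 0.25]      # hypothetical node costs

def risk(yhat):
    """Hierarchical-Hamming risk of a hard prediction."""
    return sum(c[i] * p[i] if yhat[i] == 0 else c[i] * (1 - p[i])
               for i in range(len(p)))

def feasible(yhat):
    """Root positive and every positive node has a positive parent."""
    return yhat[0] == 1 and all(yhat[i] <= yhat[pa] for i, pa in parent.items())

# Inner step: best hierarchy-consistent prediction for each support size L;
# outer step: keep the L with the lowest risk.
best_per_L = {}
for yh in product([0, 1], repeat=4):
    if not feasible(yh):
        continue
    L = sum(yh)
    if L not in best_per_L or risk(yh) < risk(best_per_L[L]):
        best_per_L[L] = yh
best = min(best_per_L.values(), key=risk)
print(best)  # (1, 1, 0, 0)
```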
Conclusions
This paper extends the matching loss, Hamming loss, and ranking loss to support both tree-type and DAG-type class hierarchies.
The paper is easy to follow and not especially novel, but it is well organized and highly comprehensive, which is likely why it was published in TKDE.