This paper systematically surveys the classical loss functions for hierarchical multi-label classification (HMC), and extends the Hamming loss and the ranking loss to support class hierarchies.
Reading Difficulty: ⋆⋆
Creativity: ⋆⋆
Comprehensiveness: ⋆⋆⋆⋆⋆

Symbol System:

| Symbol | Meaning |
| --- | --- |
| $y_i \in \{0,1\}$ | The label for class $i$ |
| $\uparrow(i), \downarrow(i), \Uparrow(i), \Downarrow(i), \Leftrightarrow(i)$ | The parent, children, ancestors, descendants, and siblings of node $i$ |
| $\mathbf{y}_{\mathbf{i}} \in \{0,1\}^{|\mathbf{i}|}$ | The label vector for a set of classes $\mathbf{i}$ |
| $\mathcal{H} = \{0,\dots,N-1\}$ | The class hierarchy, where $N$ is the number of nodes |
| $I(x)$ | The indicator function: 1 when $x$ is true, 0 otherwise |
| $\mathcal{R}$ | The conditional risk |

Hierarchy Constraints
In HMC, if the label structure is a tree, we have:
$$y_i = 1 \Rightarrow y_{\uparrow(i)} = 1.$$

For DAG-type HMC, there are two interpretations (a quick consistency check is sketched below):

  1. AND-interpretation: $y_i = 1 \Rightarrow \mathbf{y}_{\uparrow(i)} = \mathbf{1}$, i.e., all parents of a positive node must be positive.

  2. OR-interpretation: $y_i = 1 \Rightarrow \exists\, j \in \uparrow(i): y_j = 1$, i.e., at least one parent of a positive node must be positive.
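To make the constraint concrete, here is a minimal Python sketch (the `parents`-dict representation and function name are my own, not from the paper) that checks whether a 0/1 label vector is consistent with the hierarchy under either interpretation:

```python
# Minimal sketch: check whether a 0/1 labeling respects the hierarchy constraint
# under the AND- or OR-interpretation. `parents[i]` lists the parent node ids of i.

def is_consistent(y, parents, mode="AND"):
    """y: dict mapping node id -> 0/1 label."""
    for i, ps in parents.items():
        if y[i] == 1 and ps:  # roots have no parents and impose no constraint
            parent_labels = [y[p] for p in ps]
            if mode == "AND" and not all(parent_labels):
                return False
            if mode == "OR" and not any(parent_labels):
                return False
    return True

# Example: node 3 has two parents (1 and 2) in a DAG.
parents = {0: [], 1: [0], 2: [0], 3: [1, 2]}
y = {0: 1, 1: 1, 2: 0, 3: 1}
print(is_consistent(y, parents, mode="AND"))  # False: parent 2 is negative
print(is_consistent(y, parents, mode="OR"))   # True: parent 1 is positive
```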

Loss functions for Flat and Hierarchical Classification
This section is a review of standard (flat) loss functions.

Zero-one loss:
$$\ell_{0/1}(\hat{\mathbf{y}}, \mathbf{y}) = I(\hat{\mathbf{y}}\neq \mathbf{y})$$

Hamming loss:
$$\ell_{\text{hamming}}(\hat{\mathbf{y}},\mathbf{y}) = \sum_{i \in \mathcal{H}} I(\hat{y}_i \neq y_i)$$

Top-$k$ precision:
Consider the $k$ most-confident predicted positive labels for each sample.
$$\text{top-}k\text{-precision}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{\text{number of true positives among the top-}k\text{ labels of } \hat{\mathbf{y}}}{k}$$
The corresponding loss is
$$\ell_{\text{top-}k} = 1 - \text{top-}k\text{-precision}$$

Ranking loss:
$$\ell_{\text{rank}} = \sum_{(i,j):\, y_i > y_j} \left( I(\hat{y}_i < \hat{y}_j) + \frac{1}{2} I(\hat{y}_i = \hat{y}_j) \right)$$
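As a concrete reference, here is a small Python sketch of the flat Hamming and ranking losses above (function names and the 0/1 encoding are my own):

```python
# Minimal sketch of the flat losses. y is a 0/1 vector; for the ranking loss,
# y_hat may be real-valued confidence scores.

def hamming_loss(y_hat, y):
    return sum(int(a != b) for a, b in zip(y_hat, y))

def ranking_loss(y_hat, y):
    # Sum over pairs (i, j) with y_i > y_j (i.e. i positive, j negative):
    # full penalty if i is ranked below j, half penalty for ties.
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                if y_hat[i] < y_hat[j]:
                    loss += 1.0
                elif y_hat[i] == y_hat[j]:
                    loss += 0.5
    return loss

y     = [1, 0, 1, 0]
score = [0.9, 0.8, 0.3, 0.1]
print(hamming_loss([1, 1, 0, 0], y))  # 2
print(ranking_loss(score, y))         # 1.0: positive label 2 is ranked below negative label 1
```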

Hierarchical Multi-class Classification
This is also a review.
Note: in this setting, only a single path in the hierarchy can be predicted positive.

Cai and Hofmann:
$$\ell = \sum_{i \in \mathcal{H}} c_i I(\hat{y}_i \neq y_i)$$
where $c_i$ is the cost for node $i$.

Dekel et al. :
This loss appears more complicated, but the paper seems to treat it as similar in spirit to the cost-weighted loss above.

Hierarchical multi-label classification

H-Loss:
$$\ell_H = \alpha \sum_{i:y_i=1,\hat{y}_i=0} c_i\, I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)}) + \beta \sum_{i:y_i=0,\hat{y}_i = 1} c_i\, I(\hat{\mathbf{y}}_{\Uparrow(i)} = \mathbf{y}_{\Uparrow(i)})$$
where $\alpha$ and $\beta$ are the weights for false negatives (FN) and false positives (FP), respectively.

Often, misclassifications at upper levels of the hierarchy are considered more expensive than those at lower levels.
A common cost assignment is therefore
$$c_i = \begin{cases} 1, & i = 0,\\ \dfrac{c_{\uparrow(i)}}{n_{\Leftrightarrow(i)}}, & i > 0, \end{cases}$$
where $n_{\Leftrightarrow(i)}$ is the number of siblings of $i$ (including $i$ itself).
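A minimal sketch of this top-down cost assignment for a tree, assuming the hierarchy is given as a `children` dict with node 0 as the root (the representation is mine, not the paper's):

```python
# Minimal sketch: assign the sibling-normalized costs c_i over a tree by a
# top-down traversal. `children[i]` lists the children of node i; node 0 is
# the root with cost 1.

def tree_costs(children, root=0):
    cost = {root: 1.0}
    stack = [root]
    while stack:
        p = stack.pop()
        n_children = len(children.get(p, []))
        for ch in children.get(p, []):
            # each child gets the parent's cost divided by the number of
            # siblings (including itself), i.e. by the parent's child count
            cost[ch] = cost[p] / n_children
            stack.append(ch)
    return cost

children = {0: [1, 2], 1: [3, 4, 5], 2: []}
print(tree_costs(children))  # c1 = c2 = 0.5; nodes 3, 4, 5 each get 0.5 / 3
```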

Matching Loss:
$$\ell_{\text{match}} = \alpha \sum_{i:y_i=1}\phi(i, \hat{\mathbf{y}}) + \beta \sum_{i:\hat{y}_i = 1} \phi(i, \mathbf{y}),$$
where
$$\phi(i,\mathbf{y}) = \min_{j:y_j=1} \text{cost}(j\rightarrow i)$$
and $\text{cost}(j\rightarrow i)$ is the cost of traversing from node $j$ to node $i$ in the hierarchy, e.g., the path length or a weighted path length.
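For illustration, here is a minimal sketch of $\phi(i,\mathbf{y})$ when $\text{cost}(j\rightarrow i)$ is taken to be the unweighted path length in the hierarchy (this choice and the graph representation are assumptions of mine, not the paper's):

```python
# Minimal sketch: phi(i, y) = min_{j: y_j = 1} cost(j -> i), with cost taken
# as the unweighted path length, treating hierarchy edges as undirected.
from collections import deque

def path_lengths_from(i, parents, children):
    """BFS distances from node i over the undirected hierarchy graph."""
    dist = {i: 0}
    q = deque([i])
    while q:
        u = q.popleft()
        for v in parents.get(u, []) + children.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def phi(i, y, parents, children):
    dist = path_lengths_from(i, parents, children)
    return min(dist[j] for j in dist if y[j] == 1)

parents  = {0: [], 1: [0], 2: [0], 3: [1]}
children = {0: [1, 2], 1: [3], 2: [], 3: []}
y = {0: 1, 1: 1, 2: 0, 3: 0}
print(phi(3, y, parents, children))  # 1: the nearest positive node is node 1
```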

Verspoor et al.: hierarchical versions of precision, recall, and F-score; these are more expensive to compute.

Condensing Sort and Selection Algorithm for HMC
This is also a review.
The algorithm can be used on both tree and DAG hierarchies.

It solves the following optimization objective via a greedy algorithm called the condensing sort and selection algorithm:
$$\begin{aligned} \max_{\{\psi_i\}_{i \in \mathcal{H}}}\ & \sum_{i \in \mathcal{H}} \psi_i \widetilde{y}_i \\ \text{s.t.}\quad & \psi_i \leq \psi_{\uparrow(i)},\ \forall i \in \mathcal{H}\setminus \{0\},\\ & \psi_0 = 1,\ \psi_i \in \{0, 1\}, \\ & \sum_{i=0}^{N-1} \psi_i = L \end{aligned}$$

where $\psi_i = 1$ indicates that node $i$ is predicted positive in $\hat{\mathbf{y}}$, and $\psi_i = 0$ otherwise.

When the label hierarchy is a DAG, the first constraint above is replaced by
$$\psi_i \leq \psi_j,\ \forall i \in \mathcal{H} \setminus \{0\},\ \forall j \in\ \Uparrow(i).$$
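The greedy CSSA/CSSAG procedure itself is not reproduced here, but the objective is easy to sanity-check by brute force on a tiny hierarchy (the code below and the `y_tilde` scores are illustrative assumptions only):

```python
# Brute-force reference for the selection objective above; only usable for tiny
# hierarchies, and not the paper's greedy algorithm. `y_tilde[i]` plays the
# role of the per-node score, and node 0 is the root, which is always selected.
from itertools import combinations

def best_subtree_of_size(y_tilde, parents, L, root=0):
    nodes = [i for i in y_tilde if i != root]
    best_val, best_set = float("-inf"), None
    for extra in combinations(nodes, L - 1):
        chosen = {root} | set(extra)
        # feasibility: every chosen node's parents must also be chosen
        # (for a DAG, requiring all parents recursively covers all ancestors)
        if all(all(p in chosen for p in parents[i]) for i in chosen):
            val = sum(y_tilde[i] for i in chosen)
            if val > best_val:
                best_val, best_set = val, chosen
    return best_set, best_val

parents = {0: [], 1: [0], 2: [0], 3: [1]}
y_tilde = {0: 0.9, 1: 0.2, 2: 0.6, 3: 0.8}
print(best_subtree_of_size(y_tilde, parents, L=3))  # {0, 1, 3}: node 3 is worth
# including even though its parent 1 scores lower than node 2
```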

Extending Flat Losses
This paper extends the Hamming loss and the ranking loss to support class hierarchies.

For hierarchical hamming loss:
$$\ell_{\text{H-hamming}} = \alpha \sum_{i: y_i = 1 \wedge \hat{y}_i = 0} c_i + \beta \sum_{i: y_i = 0 \wedge \hat{y}_i = 1} c_i$$

For a DAG class hierarchy, the cost becomes
$$c_i = \begin{cases} 1, & i = 0, \\ \displaystyle\sum_{j \in\ \uparrow(i)} \frac{c_j}{n_{\downarrow(j)}}, & i > 0, \end{cases}$$
where $n_{\downarrow(j)}$ is the number of children of node $j$.

The original paper also handles some special cases, but they are straightforward and not discussed here.
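A minimal sketch of this DAG cost assignment, propagating costs in topological order so that each node sums $c_j / n_{\downarrow(j)}$ over its parents $j$ (the `graphlib`-based representation is my own choice):

```python
# Minimal sketch: DAG cost recursion. Costs are computed in topological order,
# each node accumulating c_parent / (number of children of that parent) over
# all of its parents.
from graphlib import TopologicalSorter  # Python 3.9+

def dag_costs(parents, children, root=0):
    order = TopologicalSorter(parents).static_order()  # parents come first
    cost = {}
    for i in order:
        if i == root:
            cost[i] = 1.0
        else:
            cost[i] = sum(cost[j] / len(children[j]) for j in parents[i])
    return cost

parents  = {0: [], 1: [0], 2: [0], 3: [1, 2]}
children = {0: [1, 2], 1: [3], 2: [3]}
print(dag_costs(parents, children))  # c0 = 1.0, c1 = c2 = 0.5, c3 = 1.0
```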

For hierarchical ranking loss:
$$\ell_{\text{H-rank}} = \sum_{(i,j):\, y_i > y_j} c_{ij} \left( I(\hat{y}_i < \hat{y}_j) + \frac{1}{2}I(\hat{y}_i = \hat{y}_j) \right),$$

where $c_{ij} = c_i c_j$ ensures a high penalty when an upper-level positive label is ranked below a lower-level negative label.
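Putting the two hierarchical losses together, here is a minimal sketch with $\alpha=\beta=1$ and node costs $c_i$ computed as in the earlier snippets (names and encoding are assumptions of mine):

```python
# Minimal sketch of the hierarchical Hamming and ranking losses above.
# `c` maps node id -> cost; y is 0/1; for the ranking loss, scores may be real.

def h_hamming_loss(y_hat, y, c, alpha=1.0, beta=1.0):
    fn = sum(c[i] for i in y if y[i] == 1 and y_hat[i] == 0)
    fp = sum(c[i] for i in y if y[i] == 0 and y_hat[i] == 1)
    return alpha * fn + beta * fp

def h_ranking_loss(scores, y, c):
    loss = 0.0
    for i in y:
        for j in y:
            if y[i] > y[j]:                      # i positive, j negative
                c_ij = c[i] * c[j]
                if scores[i] < scores[j]:
                    loss += c_ij
                elif scores[i] == scores[j]:
                    loss += 0.5 * c_ij
    return loss

c      = {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.25}
y      = {0: 1, 1: 1, 2: 0, 3: 0}
y_hat  = {0: 1, 1: 0, 2: 1, 3: 0}
scores = {0: 0.9, 1: 0.2, 2: 0.6, 3: 0.1}
print(h_hamming_loss(y_hat, y, c))   # 0.5 (FN on node 1) + 0.5 (FP on node 2) = 1.0
print(h_ranking_loss(scores, y, c))  # 0.25: positive node 1 scored below negative node 2
```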

Minimizing the risk
The conditional risk (or simply the risk) $\mathcal{R}(\hat{\mathbf{y}})$ of predicting the multilabel vector $\hat{\mathbf{y}}$ is the expectation of $\ell(\hat{\mathbf{y}},\mathbf{y})$ over all possible ground-truth $\mathbf{y}$'s, i.e.,
$$\mathcal{R}(\hat{\mathbf{y}}) = \sum_{\mathbf{y}} \ell(\hat{\mathbf{y}}, \mathbf{y}) P(\mathbf{y} \mid \mathbf{x}),$$
and the Bayes-optimal prediction minimizes this expected risk: $\hat{\mathbf{y}}^\star = \argmin_{\hat{\mathbf{y}} \in \Omega} \mathcal{R}(\hat{\mathbf{y}})$ (expected risk minimization).

There are three issues to be addressed:
(1) Estimating $P(\mathbf{y}\mid\mathbf{x})$.
(2) Computing $\mathcal{R}(\hat{\mathbf{y}})$ without exhaustive search.
(3) Minimizing $\mathcal{R}(\hat{\mathbf{y}})$.

The paper computes the marginals $p_i = P(y_i = 1 \mid \mathbf{x})$ via the chain rule along the hierarchy, and the risk is rewritten into a different form for each loss.
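As an illustration of the chain rule for a tree hierarchy (this is my reading, not necessarily the paper's exact procedure), assume the model provides the conditionals $P(y_i = 1 \mid y_{\uparrow(i)} = 1, \mathbf{x})$; the marginal $p_i$ is then the product of the conditionals along the root-to-$i$ path:

```python
# Minimal sketch (an assumption on my part): for a tree hierarchy, if the model
# outputs cond[i] = P(y_i = 1 | y_parent(i) = 1, x), the chain rule gives the
# marginal p_i = P(y_i = 1 | x) as the product of conditionals from the root to i.

def marginals(cond, parent, root=0):
    p = {root: 1.0}  # the root is assumed to be always positive

    def p_of(i):
        if i not in p:
            p[i] = p_of(parent[i]) * cond[i]
        return p[i]

    return {i: p_of(i) for i in cond}

parent = {1: 0, 2: 0, 3: 1}
cond   = {1: 0.8, 2: 0.3, 3: 0.5}
print(marginals(cond, parent))  # {1: 0.8, 2: 0.3, 3: 0.4}
```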
The risk for matching loss:
$$\mathcal{R}_{\text{match}}(\hat{\mathbf{y}}) = \sum_{i:\hat{y}_i = 0} \phi(i, \hat{\mathbf{y}}) + \sum_{i: \hat{y}_i = 1} q_i$$

where $q_i = \sum_{j=0}^{d(i)-1}\sum_{l=j+1}^{d(i)} c_{\Uparrow_l(i)}\, P(\mathbf{y}_{\Uparrow_{0:j}(i)} = \mathbf{1}, y_{\Uparrow_{j+1}(i)} = 0 \mid \mathbf{x})$, $d(i)$ is the depth of node $i$, $\Uparrow_j(i)$ is $i$'s ancestor at depth $j$, and $\Uparrow_{0:j}(i) = \{\Uparrow_0(i), \dots, \Uparrow_j(i)\}$ is the set of $i$'s ancestors at depths 0 to $j$.

The risk for hierarchical hamming loss:
$$\mathcal{R}_{\text{H-hamming}}(\hat{\mathbf{y}}) = \alpha \sum_{i:\hat{y}_i = 0} c_i p_i + \beta \sum_{i:\hat{y}_i=1} c_i(1 - p_i)$$
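Given the marginals $p_i$ and costs $c_i$, this risk can be evaluated directly for any candidate $\hat{\mathbf{y}}$ without enumerating all $\mathbf{y}$'s; a minimal sketch:

```python
# Minimal sketch: evaluate the hierarchical-Hamming risk of a candidate
# prediction y_hat from the marginals p_i and costs c_i.

def h_hamming_risk(y_hat, p, c, alpha=1.0, beta=1.0):
    risk = 0.0
    for i in y_hat:
        if y_hat[i] == 0:
            risk += alpha * c[i] * p[i]          # expected false-negative cost
        else:
            risk += beta * c[i] * (1.0 - p[i])   # expected false-positive cost
    return risk

p = {0: 1.0, 1: 0.8, 2: 0.3, 3: 0.4}
c = {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.5}
print(h_hamming_risk({0: 1, 1: 1, 2: 0, 3: 0}, p, c))  # 0 + 0.1 + 0.15 + 0.2 = 0.45
```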

The risk for hierarchical ranking loss:
$$\mathcal{R}_{\text{H-rank}}(\hat{\mathbf{y}}) = \sum_{0 \leq i < j \leq N-1} c_{ij}\left(p_i I(\hat{y}_i \leq \hat{y}_j) + p_j I(\hat{y}_i \geq \hat{y}_j) + \frac{p_i+p_j}{2}I(\hat{y}_i = \hat{y}_j)\right) - C$$

Efficiently minimizing the risk:
$$\hat{\mathbf{y}} = \argmin_{L = 1,\dots,N} \mathcal{R}(\hat{\mathbf{y}}^\star_{(L)}),$$
where
$$\hat{\mathbf{y}}^\star_{(L)} = \argmin_{\hat{\mathbf{y}}\in \Omega:\ |\text{supp}(\hat{\mathbf{y}})| = L} \mathcal{R}(\hat{\mathbf{y}})$$
and $\text{supp}(f) := \{x \in X \mid f(x) \neq 0\}$ is the support of $f$.

This is actually a fairly simple and intuitive optimization strategy: optimize separately for each possible number of positive labels, i.e., for each different $L$.

For tree label hierarchies, the paper adopts CSSAG (the condensing sort and selection algorithm proposed by Bi et al.), which is a greedy strategy.

Conclusions

This paper extends the matching loss, Hamming loss, and ranking loss to support both tree-type and DAG-type class hierarchies.
The paper is easy to follow and not especially innovative, but it is well organized and very comprehensive, which is probably why it was published in TKDE.
