
  • Paper Information
    • Problem
    • Solution
    • Method
    • Results
  • 1. Introduction
  • 2. Data Processing
  • 3. Proposed Architecture
    • 3.1 LASSO Shrinkage and Majority Voting
    • 3.2 CNN Architecture
      • 3.2.1 Training Schedule
  • 4. Results
    • 4.1 Summary Statistics
    • 4.2 Model Results
    • 4.3. Comparison of ML models
      • 4.3.1. Comparison with state-of-the-art ML models
      • 4.3.2. Our LASSO-CNN vs vanilla CNN
    • 3.3 Data Augmentation
      • 3.3.1 Data augmentation
      • 3.3.2 Data undersampling
      • 4.3.3 Data oversampling strategies
      • 4.3.4. Data undersampling strategies
    • 4.4. Validation on stroke data
    • 4.5 Notes on the resilience to data imbalance
  • 5. Conclusion, Limitations and Future Research
    • 5.1 Conclusion
    • 5.2 Limitation
    • 5.3 Future Research


Title: An Efficient Convolutional Neural Network for Coronary Heart Disease Prediction
Journal: Expert Systems with Applications
Year: 2020
Authors: Aniruddha Dutta, Tamal Batabyal, Meheli Basu, Scott T. Acton





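This study proposes a neural network model with convolutional layers for classifying class-imbalanced clinical data, specifically coronary heart disease (CHD).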


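  • Feature selection: feature weights are estimated with the least absolute shrinkage and selection operator (LASSO), and the important features are then identified by majority voting.

  • Model training: a fully connected layer is used to homogenize the important features, which is a key step before the layer's output is passed to the successive convolutional layers.

  • In addition, a per-epoch training schedule, similar to simulated annealing, is proposed to improve classification accuracy.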


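The NHANES dataset is highly class-imbalanced. Even so, the proposed CNN architecture correctly classifies the presence of CHD in 77% of cases and the absence of CHD in 81.8% of cases on the test data, which makes up 85.70% of the total dataset.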

1. Introduction

Our architecture is simple in design, elegant in concept, sophisticated in training schedule, effective in outcome with far-reaching applicability in problems with unbalanced datasets.


  • our model uses a variable elimination technique using LASSO and feature voting as preprocessing steps;
  • we leverage a shallow neural network with convolutional layers, which improves CHD prediction rates compared to existing models with comparable subjects (the ‘shallowness’ is dictated by the scarcity of class-specific data to prevent overfitting of the network during training);
  • in conjunction with the architecture, we propose a simulated annealing-like training schedule that is shown to minimize the generalization error between train and test losses.

2. Data Processing

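The dataset combines demographic, examination, laboratory, and questionnaire data from 37,079 participants (1,300 with CHD and 35,779 without CHD), as shown in Fig. 1.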

Fig. 1 Data compilation from the National Health and Nutrition Examination Survey (NHANES). The data were acquired from 1999 to 2016 in three categories: Demography, Examination, and Laboratory. Depending on the nature of the factors considered, the dataset contains both quantitative and qualitative variables.

In total, 30 continuous variables and 6 categorical variables are used to predict CHD.
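To make the degree of imbalance concrete, the class counts reported above can be turned into a prevalence and a majority-to-minority ratio. The snippet below is only illustrative arithmetic on those two counts; the variable names are ours, not the paper's.

```python
# Class counts as reported for the compiled NHANES dataset.
n_chd = 1_300
n_non_chd = 35_779

n_total = n_chd + n_non_chd
prevalence = n_chd / n_total          # fraction of participants with CHD
imbalance_ratio = n_non_chd / n_chd   # non-CHD records per CHD record

print(f"total participants: {n_total}")
print(f"CHD prevalence: {prevalence:.2%}")
print(f"non-CHD records per CHD record: {imbalance_ratio:.1f}")
```

Roughly 3.5% of participants have CHD, i.e. about 27.5 non-CHD records for every CHD record.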


3. Proposed Architecture

3.1 LASSO Shrinkage and Majority Voting

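LASSO, the least absolute shrinkage and selection operator, is a regression technique that performs variable selection and regularization in order to improve the prediction accuracy and interpretability of the resulting statistical model.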

LASSO can be posed as a quadratic programming problem whose goal is to minimize the following objective function:

$$\sum_{i=1}^{n}\Big(y_i - \sum_{j} x_{ij}\gamma_j\Big)^2 + \lambda \sum_{j=1}^{p} |\gamma_j|$$

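  • $\lambda$ is the tuning parameter for the amount of shrinkage; it controls the strength of the regularization penalty. With $\lambda = 0$, no parameters are eliminated. As $\lambda$ increases, more coefficients are set to zero and the corresponding variables are eliminated.
  • As $\lambda$ increases, bias increases; as $\lambda$ decreases, variance increases.
  • The $\gamma$ value of a variable (factor) can be interpreted as its importance, i.e. its contribution to the underlying variation in the data. Variables with $\gamma = 0$ are considered unimportant.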
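The effect of $\lambda$ described above can be seen on a small synthetic example. The sketch below minimizes the same objective (squared error plus $\lambda$ times the L1 norm of $\gamma$) with a plain coordinate-descent loop in NumPy. The solver, the synthetic data, and the chosen $\lambda$ values are our own assumptions for illustration; the paper does not say which solver it uses.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values with |rho| <= t become zero."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for sum((y - X @ gamma)**2) + lam * sum(|gamma|)."""
    n_features = X.shape[1]
    gamma = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)  # x_j . x_j for every column
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with the contribution of variable j added back in.
            r_j = y - X @ gamma + X[:, j] * gamma[j]
            rho = X[:, j] @ r_j
            # Per-coordinate minimizer: threshold is lam / 2 because the
            # squared-error term carries no 1/2 factor.
            gamma[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return gamma

# Synthetic data (an assumption): three informative and three irrelevant variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_gamma = np.array([3.0, -2.0, 0.0, 0.0, 0.5, 0.0])
y = X @ true_gamma + 0.1 * rng.normal(size=200)

nonzero_counts = []
for lam in [0.0, 50.0, 400.0, 5000.0]:
    gamma = lasso_cd(X, y, lam)
    nonzero_counts.append(int(np.count_nonzero(gamma)))
    print(f"lambda={lam:7.1f}  nonzero={nonzero_counts[-1]}  gamma={np.round(gamma, 3)}")
```

With $\lambda = 0$ the fit reduces to ordinary least squares and every coefficient stays nonzero; for a sufficiently large $\lambda$ all coefficients are driven to exactly zero.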

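To mitigate the effect of class imbalance, the dataset is randomly subsampled and LASSO is run repeatedly on the subsamples. Majority voting is then applied to the resulting set of $\gamma$ values to identify the variables that are nonzero in the majority of iterations. Suppose LASSO is run $N$ times on $N$ randomly subsampled datasets, where each instance contains equal numbers of CHD and non-CHD samples.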

The $i$-th LASSO instance yields $\gamma_i = [\gamma_{i,1}, \gamma_{i,2}, \ldots, \gamma_{i,45}]$.
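The subsample-then-vote procedure can be sketched as follows. This is a minimal illustration under several assumptions of ours: every balanced instance keeps all minority-class rows and draws an equal number of majority-class rows without replacement, LASSO is fit to the centered 0/1 label with a simple coordinate-descent solver, and a variable counts as selected when it is nonzero in more than half of the instances. The function names, the synthetic data with 8 variables, and the value of `lam` are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for sum((y - X @ gamma)**2) + lam * sum(|gamma|)."""
    gamma = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = y - X @ gamma + X[:, j] * gamma[j]
            rho = X[:, j] @ r_j
            gamma[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
    return gamma

def majority_vote_lasso(X, y, lam, n_instances, rng):
    """Run LASSO on balanced subsamples and vote on the nonzero coefficients."""
    pos = np.flatnonzero(y == 1)  # minority class (CHD in the paper's setting)
    neg = np.flatnonzero(y == 0)  # majority class
    votes = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_instances):
        # Balanced instance: all minority rows plus as many random majority rows.
        neg_sample = rng.choice(neg, size=pos.size, replace=False)
        idx = np.concatenate([pos, neg_sample])
        Xs = X[idx] - X[idx].mean(axis=0)  # center, since the objective has no intercept
        ys = y[idx] - y[idx].mean()
        gamma_i = lasso_cd(Xs, ys, lam)
        votes += (gamma_i != 0)
    selected = votes > n_instances / 2  # nonzero in the majority of instances
    return votes, selected

# Synthetic imbalanced data (an assumption): only variables 0 and 1 drive the label.
rng = np.random.default_rng(42)
X = rng.normal(size=(4000, 8))
score = 2.5 * X[:, 0] - 2.5 * X[:, 1] + rng.normal(size=4000)
y = (score > 6.0).astype(float)  # roughly 5% positives

votes, selected = majority_vote_lasso(X, y, lam=100.0, n_instances=25, rng=rng)
print("votes per variable:", votes)
print("selected variables:", np.flatnonzero(selected))
```

Voting across many balanced instances makes the selection less sensitive to which majority-class rows happen to be drawn in any single subsample.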


