Dynamic-OFA: Runtime DNN Architecture Switching for Performance Scaling on Heterogeneous Embedded Platforms

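This paper uses a LUT for run-time management. Offline, an accuracy predictor and a latency predictor estimate the latency and accuracy of each sub-network, and the results are stored in a LUT. Online, the runtime manager reads the LUT and picks a configuration that meets the requirements. A configuration is a model with a particular accuracy and latency; the paper calls these levels.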

In short, there are two steps: offline evaluation and online scheduling.
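A minimal sketch of the online lookup described above. The level names, the accuracy and latency numbers, and the `select_level` helper are all made up for illustration; the only thing taken from the paper is the idea of reading a pre-built LUT and picking a level that satisfies the requirements. It also assumes that the RTM prefers the most accurate feasible level.

```python
from typing import List, NamedTuple, Optional


class Level(NamedTuple):
    name: str
    accuracy: float    # predicted top-1 accuracy in %, placeholder value
    latency_ms: float  # predicted latency in ms, placeholder value


# Hypothetical LUT built offline; the numbers are not from the paper.
LUT = [
    Level("level1", 71.0, 20.0),
    Level("level2", 73.5, 30.0),
    Level("level3", 75.0, 42.0),
    Level("level4", 76.2, 55.0),
]


def select_level(lut: List[Level], max_latency_ms: float,
                 min_accuracy: float = 0.0) -> Optional[Level]:
    """Return the most accurate level that meets both requirements, or None."""
    feasible = [
        level for level in lut
        if level.latency_ms <= max_latency_ms and level.accuracy >= min_accuracy
    ]
    # Assumption: among feasible levels, prefer the highest accuracy.
    return max(feasible, key=lambda level: level.accuracy, default=None)
```

With a 45 ms budget this picks `level3` from the table above; with a 10 ms budget nothing is feasible and the function returns `None`.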

  • Pre-sample a family of sub-networks from a static OFA (once-for-all) network, and use a runtime manager to choose between those sub-networks under different runtime environments.

Advantages:

 1. Sub-networks are searched directly from the once-for-all (OFA) network, so no retraining is needed.
 2. The DNN architecture (width, depth, filter size and input resolution) is scaled for both GPU and CPU with one shared backbone. (CPU: more layers and fewer channels. GPU: more channels and fewer layers.)

Two steps:

  1. Offline step: the accuracy and latency of sub-networks are evaluated to find a family of efficient sub-networks on the Pareto front for both CPU and GPU.
  2. Runtime step: a run-time manager switches between the optimal sub-networks based on the application's runtime accuracy and latency requirements and on the resources available on the platform.


  1. A dynamic DNN built on OFA.
  2. A search algorithm that selects sub-networks by accuracy and latency.
  3. A runtime approach that switches sub-networks to meet performance requirements and hardware resource limits.
  • Backbone network: once-for-all (OFA)

    The approach identifies a family of efficient sub-networks on the Pareto front for each computation element in the heterogeneous platform and pre-calculates batch-norm parameters for those sub-networks offline.

    The search runs on a server and needs less time.

  • Optimal sub-network architecture search:

    Previous algorithms only find a model that satisfies the latency constraint.

    This paper proposes a search algorithm that finds sub-networks which meet a given latency constraint and also have better accuracy.

    Accuracy predictor: a three-layer NN trained on 5000 networks and their accuracies.

    Latency predictor: uses a LUT that records the execution time of each operation (e.g. convolution, pooling). CPU and GPU use different LUTs.


    Some randomly sampled sub-networks are sub-optimal. Only those on the red line of the paper's plot (the Pareto front) can be used for building Dynamic-OFA.
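A rough sketch of the two offline pieces mentioned above: a per-operation latency LUT and a filter that keeps only sub-networks on the Pareto front. The operation keys, the latency values, the accuracies and both helper functions are hypothetical. The filter is a plain dominance check, not the paper's search algorithm.

```python
# Hypothetical per-operation latency LUT (ms) for one device; values are made up.
# A real setup would keep one such table per device (CPU and GPU).
CPU_OP_LATENCY_MS = {
    ("conv3x3", 32): 1.5,
    ("conv3x3", 64): 2.5,
    ("conv5x5", 32): 3.0,
    ("pool", 32): 0.25,
}


def predict_latency(ops, op_lut):
    """Estimate sub-network latency as the sum of its operations' LUT entries."""
    return sum(op_lut[op] for op in ops)


def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both latency and accuracy.

    candidates: list of (name, latency_ms, accuracy) tuples.
    """
    front = []
    best_accuracy = float("-inf")
    # Walk from fastest to slowest; on equal latency, visit higher accuracy first.
    for name, latency, accuracy in sorted(candidates, key=lambda c: (c[1], -c[2])):
        # A slower candidate is only worth keeping if it is strictly more accurate.
        if accuracy > best_accuracy:
            front.append((name, latency, accuracy))
            best_accuracy = accuracy
    return front


# Three toy sub-networks described by their operations (hypothetical).
SUBNET_OPS = {
    "a": [("conv3x3", 32), ("pool", 32)],
    "b": [("conv3x3", 64), ("pool", 32)],
    "c": [("conv5x5", 32), ("pool", 32)],
}
# Placeholder accuracies, standing in for the accuracy predictor's output.
SUBNET_ACCURACY = {"a": 70.0, "b": 69.0, "c": 74.0}

CANDIDATES = [
    (name, predict_latency(ops, CPU_OP_LATENCY_MS), SUBNET_ACCURACY[name])
    for name, ops in SUBNET_OPS.items()
]
FRONT = pareto_front(CANDIDATES)
```

Here `b` is slower and less accurate than `a`, so it is dropped, and `FRONT` keeps only `a` and `c`.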

  • Batch-norm (BN) parameters are pre-calculated at design time.

  • Runtime architecture switching

    • A single Dynamic-OFA running alone: search the LUT directly.
    • Two workloads sharing the CPU/GPU: the LUT can't be used directly, but the RTM can change the sub-network gradually to trade off accuracy against latency.
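For the shared-workload case, where the LUT latencies no longer hold, one way to picture the gradual change is a controller that moves one level per measurement. This is only a sketch under assumptions: levels are numbered from 1, a higher level is slower and more accurate, and the `headroom` factor for stepping back up is invented here. None of the names come from the paper.

```python
def next_level(current, measured_ms, target_ms, num_levels, headroom=0.8):
    """Move at most one level per call, based on the measured latency.

    Assumptions: levels are 1..num_levels and a higher level is slower but
    more accurate; `headroom` is a hypothetical margin for stepping back up.
    """
    if measured_ms > target_ms and current > 1:
        return current - 1  # too slow: step down to a faster sub-network
    if measured_ms < headroom * target_ms and current < num_levels:
        return current + 1  # clear slack: step up to a more accurate one
    return current


def simulate(start_level, measurements_ms, target_ms, num_levels):
    """Apply next_level to a sequence of measured latencies; return visited levels."""
    level = start_level
    visited = []
    for measured in measurements_ms:
        level = next_level(level, measured, target_ms, num_levels)
        visited.append(level)
    return visited
```

For example, `simulate(4, [70.0, 62.0, 50.0], target_ms=55.0, num_levels=6)` steps down twice and then holds, giving `[3, 2, 2]`.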


Backbone: MobileNetV3; dataset: ImageNet

  • Dynamic-OFA has lower accuracy than OFA, because the static OFA is fine-tuned.

    Comparison against dynamic baselines on FLOPs vs. accuracy.


Run-time switching


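  • Dynamic-OFA running alone: different sub-networks are run according to the latency and accuracy requirements.


    The RTM measures latency every ten images; an RTM switch takes 15 ms.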
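A small sketch of the "every ten images" measurement, assuming it means averaging the per-image latency over each block of ten images. The notes do not say how the RTM aggregates the samples, so both the averaging and the `LatencyMonitor` class are assumptions made for illustration.

```python
class LatencyMonitor:
    """Collect per-image latencies and report a mean once per window of images."""

    def __init__(self, window=10):
        self.window = window
        self._samples = []

    def record(self, latency_ms):
        """Add one sample; return the window mean when the window is full, else None."""
        self._samples.append(latency_ms)
        if len(self._samples) < self.window:
            return None
        mean = sum(self._samples) / len(self._samples)
        self._samples.clear()  # start a fresh window for the next ten images
        return mean
```

The RTM would call `record` after each inference and only reconsider the level when the return value is not `None`.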

  • Multiple tasks running concurrently

    When running alongside a static DNN, Dynamic-OFA gradually steps down from level 4 -> level 3 -> level 2.



A constraint: 65 ms

B constraint: 55 ms

A starts at the highest level and B at level 5. A moves to level 5, but B is still too slow, so B then moves to level 4.


  • Pareto trade-off curve

