Table of Contents

  • Introduction to Fake Quant
  • References
  • Quantize weights
  • Quantize activation data

Introduction to Fake Quant

Why quantize a model: deep learning models are typically trained with floating-point data, but they can be quantized to integers for inference with little to no loss of performance (i.e., accuracy).

What gets quantized: quantizing a model means quantizing both the weights and the activation data (i.e., layer inputs/outputs).

How it is quantized: in this work, the floating-point weights/activation data are quantized to Qm.n format, where m and n are fixed within a layer but can vary across network layers.

Fake quant is called "fake" quantization because, although the weights/activations are quantized, it is not quantization in the strict sense: the variables remain floating point rather than integer, but their values have been passed through a quantize-then-dequantize operation. The appeal is simplicity and convenience, and the model can later be ported to C for a fixed-point implementation (see the Qm.n format).
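As a minimal sketch of this quantize-then-dequantize round trip (plain NumPy, with an arbitrary example value and Q2.5 format, i.e., 5 fractional bits):

import numpy as np

# fake-quantize a float to Q2.5: the result is still a float,
# but it can only take one of the 256 representable values
x = 2.71828
frac_bits = 5
q = np.clip(np.round(x * 2**frac_bits), -128, 127)  # integer code in [-128, 127]
x_fake_quant = q / 2**frac_bits                     # dequantize: 87 / 32 = 2.71875
print(x_fake_quant)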

References

[1] Arm: ML-KWS-for-MCU

GitHub: https://github.com/ARM-software/ML-KWS-for-MCU

Quant guide: https://github.com/ARM-software/ML-KWS-for-MCU/blob/master/Deployment/Quant_guide.md

[2] DeepXi

GitHub: https://github.com/anicolson/DeepXi

[3] tf.quantization

TensorFlow 2.3:
tf.quantization: https://tensorflow.google.cn/api_docs/python/tf/quantization

tf.quantization.fake_quant_with_min_max_args: https://tensorflow.google.cn/api_docs/python/tf/quantization/fake_quant_with_min_max_args

TensorFlow 1.15:
tf.quantization:
https://tensorflow.google.cn/versions/r1.15/api_docs/python/tf/quantization

[4] Inspecting per-layer inputs/outputs of a Keras model: https://editor.csdn.net/md/?articleId=110677379

Note: everything below is implemented with TensorFlow 2.3.

Quantize weights

Quantizing weights is fairly simple, as the weights are fixed after training and we know their min/max range. Using these ranges, the weights are quantized (discretized) to 256 levels. Here is a code snippet for quantizing the weights and biases to 8-bit integers:

min_wt = weight.min()
max_wt = weight.max()
# find number of integer bits to represent this range
int_bits = int(np.ceil(np.log2(max(abs(min_wt), abs(max_wt)))))
frac_bits = 7 - int_bits  # remaining bits are fractional bits (1 bit for sign)
# floating-point weights are scaled and rounded to [-128, 127], which are used in
# the fixed-point operations on the actual hardware (i.e., microcontroller)
quant_weight = np.round(weight * (2**frac_bits))
# to quantify the impact of quantized weights, scale them back to the
# original range to run inference using the quantized weights
weight = quant_weight / (2**frac_bits)
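To sanity-check the snippet, one can run it on a random weight matrix and confirm the rounding error is bounded by half a quantization step (a sketch with made-up shapes and ranges):

import numpy as np

np.random.seed(0)
weight = np.random.uniform(-1.5, 1.5, size=(64, 64))  # hypothetical layer weights

min_wt, max_wt = weight.min(), weight.max()
int_bits = int(np.ceil(np.log2(max(abs(min_wt), abs(max_wt)))))
frac_bits = 7 - int_bits
quant_weight = np.round(weight * (2**frac_bits))
dequant_weight = quant_weight / (2**frac_bits)

# the error is bounded by half a quantization step, i.e. 2**-(frac_bits + 1)
print(np.abs(weight - dequant_weight).max(), 2.0**-(frac_bits + 1))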

In practice:

Take the speech-enhancement model DeepXi as an example (https://github.com/anicolson/DeepXi):

# tensorflow 2.3
def quant_weights_and_biases(deepxi):
    model_variables = deepxi.model.variables
    for v in model_variables:
        min_value = tf.reduce_min(v)
        max_value = tf.reduce_max(v)
        int_bits = tf.cast(tf.math.ceil(tf.math.log(tf.math.maximum(tf.abs(min_value), tf.abs(max_value))) / tf.math.log(2.0)), dtype=tf.int32)
        dec_bits = tf.math.subtract(7, int_bits)
        new_v = tf.round(tf.math.multiply(v, tf.cast(tf.math.pow(2, dec_bits), dtype=tf.float32)))
        new_v = tf.math.divide(new_v, tf.cast(tf.math.pow(2, dec_bits), dtype=tf.float32))
        v.assign(new_v)
    # test: compare model_variables and quant_model_variables
    quant_model_variables = deepxi.model.variables
    return deepxi

Full code: this includes 1) building the model structure, 2) loading the pretrained model parameters, 3) testing the model, and 4) fake quantization with a before/after comparison.

## create model from args:
import os
from deepxi.args import get_args
import numpy as np
from deepxi.network.attention import MHANet, AttentionMaskV2
from deepxi.se_batch import Batch
from deepxi.model import DeepXi
from tensorflow import keras
# model test:
from tqdm import tqdm
from deepxi.utils import read_mat, save_wav
import tensorflow as tf
# class MHANETV2
from tensorflow.keras.layers import Activation, Add, \
    Conv1D, Layer, LayerNormalization, Masking, ReLU

def create_model():
    args = get_args()
    if args.causal:
        args.padding = "causal"
    else:
        args.padding = "same"
    args.model_path = args.model_path + '/' + args.ver  # model save path.
    if args.set_path != "set":
        args.data_path = args.data_path + '/' + args.set_path.rsplit('/', 1)[-1]  # data path.
    N_d = int(args.f_s * args.T_d * 0.001)  # window duration (samples).
    N_s = int(args.f_s * args.T_s * 0.001)  # window shift (samples).
    K = int(pow(2, np.ceil(np.log2(N_d))))  # number of DFT components.
    if True:
        test_x, test_x_len, _, test_x_base_names = Batch(args.test_x_path)
    deepxi = DeepXi(
        N_d=N_d,
        N_s=N_s,
        K=K,
        sample_dir=args.data_path,
        train_s_list=None,
        train_d_list=None,
        **vars(args)
    )
    keras.utils.plot_model(deepxi.model, args.ver + '-' + args.network_type + '.png', show_shapes=True)
    return deepxi

def load_variables_for_model(deepxi, savedmodel_path='../deepxi_data_model_out/saved_model/mhanet-1.0c/epoch-199/variables/variables'):
    # test
    model_variables = deepxi.model.variables
    # load variables
    deepxi.model.load_weights(savedmodel_path)
    # test
    pretrained_model_variables = deepxi.model.variables
    return deepxi

def fake_quant_weights_and_biases(deepxi):
    model_variables = deepxi.model.variables
    for v in model_variables:
        min_value = tf.reduce_min(v)
        max_value = tf.reduce_max(v)
        int_bits = tf.cast(tf.math.ceil(tf.math.log(tf.math.maximum(tf.abs(min_value), tf.abs(max_value))) / tf.math.log(2.0)), dtype=tf.int32)
        dec_bits = tf.math.subtract(7, int_bits)
        new_v = tf.round(tf.math.multiply(v, tf.cast(tf.math.pow(2, dec_bits), dtype=tf.float32)))
        new_v = tf.math.divide(new_v, tf.cast(tf.math.pow(2, dec_bits), dtype=tf.float32))
        v.assign(new_v)
    # test
    quant_model_variables = deepxi.model.variables
    return deepxi

def save_model(deepxi, savedmodel_path='./my_model'):
    deepxi.model.save(savedmodel_path)

def test_model(deepxi, denoise_path='./out'):
    args = get_args()
    out_type = args.out_type  # 'y'
    gain = args.gain          # 'mmse-lsa'
    e = args.max_epochs       # 200
    out_path = args.out_path + '/' + deepxi.ver + '/e' + str(e) + '/' + out_type + '/' + gain
    out_path = denoise_path + '/' + deepxi.ver + '/e' + str(e) + '/' + out_type + '/' + gain  # overrides the line above
    # mkdir
    if os.path.exists(out_path) == False:
        os.makedirs(out_path)
    test_x, test_x_len, _, test_x_base_names = Batch(args.test_x_path)
    print("Processing observations...")
    inp_batch, supplementary_batch, n_frames = deepxi.observation_batch(test_x, test_x_len)
    print("Performing inference...")
    tgt_hat_batch = deepxi.model.predict(inp_batch, batch_size=1, verbose=1)
    print("Saving outputs...")
    batch_size = len(test_x_len)
    for i in tqdm(range(batch_size)):
        base_name = test_x_base_names[i]
        inp = inp_batch[i, :n_frames[i], :]
        tgt_hat = tgt_hat_batch[i, :n_frames[i], :]
        # if tf.is_tensor(supplementary_batch):
        supplementary = supplementary_batch[i, :n_frames[i], :]
        saved_data_path = args.saved_data_path
        if args.saved_data_path is not None:
            saved_data = read_mat(saved_data_path + '/' + base_name + '.mat')
            supplementary = (supplementary, saved_data)
        if out_type == 'y':
            y = deepxi.inp_tgt.enhanced_speech(inp, supplementary, tgt_hat, gain).numpy()
            save_wav(out_path + '/' + base_name + '.wav', y, deepxi.inp_tgt.f_s)
            x = tf.cast(test_x[i, :test_x_len[i]] / 32768, tf.float32).numpy()
            numsamples = np.min((len(x), len(y)))
            xy = tf.stack((x[:numsamples], y[:numsamples]), axis=-1)
            save_wav(out_path + '/' + base_name + '.wav', xy.numpy(), deepxi.inp_tgt.f_s)
        else:
            raise ValueError('Invalid output type.')

if __name__ == '__main__':
    # Step 1: create and load pretrained model, and test
    deepxi = create_model()
    pretrained_deepxi = load_variables_for_model(deepxi)
    test_model(pretrained_deepxi, denoise_path='./out/pretrained_model')
    # Step 2: fake_quant and test model
    quant_deepxi = fake_quant_weights_and_biases(deepxi)
    test_model(pretrained_deepxi, denoise_path='./out/quant_pretrained_model')
    print('done')

##############################################################################
## args:
# --ver
# mhanet-1.0c
# --network
# MHANetV2
# --d_model
# 256
# --n_blocks
# 5
# --n_heads
# 8
# --warmup_steps
# 40000
# --causal
# 1
# --outp_act
# Sigmoid
# --loss_fnc
# BinaryCrossentropy
# --max_epochs
# 200
# --resume_epoch
# 0
# --test_epoch
# 200
# --mbatch_size
# 4
# --inp_tgt_type
# MagXi
# --map_type
# DBNormalCDF
# --sample_size
# 1000
# --f_s
# 16000
# --T_d
# 32
# --T_s
# 16
# --min_snr
# -10
# --max_snr
# 20
# --snr_inter
# 1
# --out_type
# y
# --save_model
# 1
# --log_iter
# 0
# --eval_example
# 1
# --gain
# mmse-lsa
# --train
# 0
# --infer
# 1
# --test
# 0
# --gpu
# 0
# --set_path
# ../deepxi_dataset/deep_xi_training_set
# --data_path
# ../deepxi_data_model_out/data
# --test_x_path
# ../deepxi_dataset/deepxi_test_set/test_noisy_speech_100
# --test_s_path
# /home/user/tmp/t/Deepxi_data_model_out/test_clean_speech
# --test_d_path
# /home/user/tmp/t/Deepxi_data_model_out/test_noise
# --out_path
# ../deepxi_data_model_out/out
# --model_path
# ../deepxi_data_model_out/saved_model

Quantize activation data

Using a representative dataset: one approach for quantizing the activation data is to run inference on some representative input samples (or, ideally, the entire dataset) and find the min/max range of each layer's input/output. Using these ranges, the activation data can be quantized in the same way as the weights in the code snippet above.

Limited by the finiteness of the dataset: any outliers in the dataset may inflate this range and hurt accuracy, so care must be taken with this approach.

Using fake_quant_with_min_max_args: another approach is to insert the TensorFlow op fake_quant_with_min_max_args after every operation (convolution, addition, multiplication, or concatenation) and find the optimal power-of-2 min/max ranges that maximize accuracy.
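For reference, a minimal standalone call to this op in TensorFlow 2.3 (the min/max values here are arbitrary placeholders):

import tensorflow as tf

x = tf.constant([-1.2, 0.3, 0.9, 2.5])
# quantize/dequantize to 256 levels over roughly [-2, 2); values outside are clipped
y = tf.quantization.fake_quant_with_min_max_args(x, min=-2.0, max=2.0 - 2.0 / 128.0, num_bits=8)
print(y.numpy())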

This op-insertion approach can also be used for quantizing the weights. Furthermore, the modified model with fake_quant_with_min_max_args ops and frozen min/max ranges can be used for retraining/fine-tuning, which may improve accuracy as the network adapts to the quantization.

Concretely, the procedure has two steps. Step 1: run the given dataset through the model and collect the min/max of each activation. Step 2: build a model with fake-quant layers created from the recorded activation maxima, inserted before each activation layer (e.g., before the input is fed to a ReLU layer).

An example follows.
Step 1: collect the activation maxima (maximum here meaning the larger of the absolute values of the min and the max), i.e., the maximum over each activation layer's input. (For inspecting per-layer inputs/outputs of a Keras model, see: https://editor.csdn.net/md/?articleId=110677379)

def generate_activation_max(deepxi, testing_path='../deepxi_dataset/deepxi_test_set/test_noisy_speech_100'):
    from tensorflow.keras import backend as K
    from deepxi.se_batch import Batch
    from tensorflow import keras
    import numpy as np
    import tensorflow as tf
    inp = deepxi.model.input  # model input
    # inputs/outputs of every activation layer
    inputs = [layer.input for layer in deepxi.model.layers
              if (isinstance(layer, keras.layers.ReLU) or isinstance(layer, keras.layers.Activation))]
    outputs = [layer.output for layer in deepxi.model.layers
               if (isinstance(layer, keras.layers.ReLU) or isinstance(layer, keras.layers.Activation))]
    functors_inp = [K.function([inp], [input]) for input in inputs]
    functors_outp = [K.function([inp], [output]) for output in outputs]
    # testing
    test_x, test_x_len, _, test_x_base_names = Batch(testing_path)
    print("Processing observations...")
    inp_batch, supplementary_batch, n_frames = deepxi.observation_batch(test_x, test_x_len)
    layer_ins = [func(inp_batch) for func in functors_inp]
    layer_outs = [func([inp_batch, 1.]) for func in functors_outp]
    act_max = np.zeros(shape=(len(layer_ins)), dtype=np.int)
    layer_id = 0
    for layer_in in layer_ins:
        min_value = tf.reduce_min(layer_in)
        max_value = tf.reduce_max(layer_in)
        # round the observed range up to the nearest power of 2
        act_max[layer_id] = tf.cast(tf.math.ceil(tf.math.log(tf.math.maximum(tf.abs(min_value), tf.abs(max_value))) / tf.math.log(2.0)), dtype=tf.int32)
        act_max[layer_id] = tf.cast(tf.math.pow(2, act_max[layer_id]), dtype=tf.int8)
        layer_id += 1
    return act_max
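Assuming the create_model/load_variables_for_model helpers from the weight-quantization script above, act_max can then be generated like this (a sketch):

deepxi = create_model()
deepxi = load_variables_for_model(deepxi)
act_max = generate_activation_max(deepxi)
print(act_max)  # one power-of-2 range per ReLU/Activation layer, in layer order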

Step 2: build the model with fake-quant layers based on the activation maxima (note that the code below uses tf.quantization.fake_quant_with_min_max_vars rather than fake_quant_with_min_max_args).

from tensorflow.python.keras.layers import Conv1D, ReLU, LayerNormalization, Add, Activation
from deepxi.network.attention import MultiHeadAttention, AttentionMask, AttentionMaskV2
import tensorflow as tf
import tensorflow_addons as tfa

class MHANet_QuantAct:
    """Multi-head attention network."""
    def __init__(
        self,
        inp,
        n_outp,
        d_model,
        n_blocks,
        n_heads,
        warmup_steps,
        causal,
        outp_act,
        # inp_all,
    ):
        """
        Argument/s:
            inp - input placeholder.
            n_outp - number of outputs.
            d_model - model size.
            n_blocks - number of blocks.
            n_heads - number of attention heads.
            warmup_steps - number of warmup steps.
            causal - causal flag.
            outp_act - output activation function.
        """
        self.n_outp = n_outp
        self.d_model = d_model
        self.n_blocks = n_blocks
        self.n_heads = n_heads
        self.d_ff = d_model * 4
        self.warmup_steps = warmup_steps
        self.d_k = self.d_model // self.n_heads
        # if self.inp_all is None:
        #     self.inp_all = tf.zeros(shape=(None, max_speech_len, 257))
        # else:
        #     self.inp
        att_mask, seq_mask = AttentionMask(causal, -1.0e9)(inp)
        x = Conv1D(self.d_model, 1, use_bias=False)(inp)
        x = LayerNormalization(axis=2, epsilon=1e-6, center=True, scale=True)(x)
        x = ReLU()(x)
        for _ in range(self.n_blocks):
            x = self.block(x, att_mask, seq_mask)
        self.outp = Conv1D(self.n_outp, 1, use_bias=True)(x)
        if outp_act == "Sigmoid":
            self.outp = Activation('sigmoid')(self.outp)
        elif outp_act == "ReLU":
            self.outp = ReLU()(self.outp)
        elif outp_act == "Linear":
            self.outp = self.outp
        else:
            raise ValueError("Invalid outp_act")

    def block(self, x, att_mask, seq_mask):
        """MHANet block.
        Argument/s:
            x - input.
            att_mask - attention mask.
            seq_mask - sequence mask.
        Returns:
            layer_2 - output of second layer.
        """
        layer_1 = MultiHeadAttention(d_model=self.d_model, n_heads=self.n_heads)(x, x, x, att_mask, seq_mask)
        layer_1 = Add()([x, layer_1])
        layer_1 = LayerNormalization(axis=2, epsilon=1e-6, center=True, scale=True)(layer_1)
        layer_2 = self.feed_forward_network(layer_1)
        layer_2 = Add()([layer_1, layer_2])
        layer_2 = LayerNormalization(axis=2, epsilon=1e-6, center=True, scale=True)(layer_2)
        return layer_2

    def feed_forward_network(self, x):
        """Feed forward network.
        Argument/s:
            inp - input placeholder.
        Returns:
            x - output of second feed forward layer.
        """
        x = Conv1D(self.d_ff, 1, use_bias=True)(x)
        x = ReLU()(x)
        x = Conv1D(self.d_model, 1, use_bias=True)(x)
        return x

class MHANetV2_QuantAct(MHANet_QuantAct):
    """Multi-head attention network implemented using tfa.layers.MultiHeadAttention."""
    def __init__(
        self,
        inp,
        n_outp,
        d_model,
        n_blocks,
        n_heads,
        warmup_steps,
        causal,
        outp_act,
        act_max,
    ):
        """
        Argument/s:
            inp - input placeholder.
            n_outp - number of outputs.
            d_model - model size.
            n_blocks - number of blocks.
            n_heads - number of attention heads.
            warmup_steps - number of warmup steps.
            causal - causal flag.
            outp_act - output activation function.
        """
        self.n_outp = n_outp
        self.d_model = d_model
        self.n_blocks = n_blocks
        self.n_heads = n_heads
        self.d_ff = d_model * 4
        self.warmup_steps = warmup_steps
        self.d_k = self.d_model // self.n_heads
        # add by zhaodeng: per-layer activation ranges
        self.act_max = act_max
        att_mask = AttentionMaskV2(causal)(inp)
        x = Conv1D(self.d_model, 1, use_bias=False)(inp)
        x = LayerNormalization(axis=2, epsilon=1e-6, center=True, scale=True)(x)
        # --- tf.quantization.fake_quant_with_min_max_vars ---
        relu_layer_no = 0
        if self.act_max[relu_layer_no] > 0:
            x = tf.quantization.fake_quant_with_min_max_vars(
                x,
                min=-self.act_max[relu_layer_no],
                max=self.act_max[relu_layer_no] - (self.act_max[relu_layer_no] / 128.0),
                num_bits=8)
        relu_layer_no += 1
        # ----------------------------------------------------
        x = ReLU()(x)
        for _ in range(self.n_blocks):
            x = self.block(x, att_mask, relu_layer_no)
            relu_layer_no += 1
        self.outp = Conv1D(self.n_outp, 1, use_bias=True)(x)
        # --- tf.quantization.fake_quant_with_min_max_vars ---
        if self.act_max[relu_layer_no] > 0:
            self.outp = tf.quantization.fake_quant_with_min_max_vars(
                self.outp,
                min=-self.act_max[relu_layer_no],
                max=self.act_max[relu_layer_no] - (self.act_max[relu_layer_no] / 128.0),
                num_bits=8)
        # ----------------------------------------------------
        if outp_act == "Sigmoid":
            self.outp = Activation('sigmoid')(self.outp)
        elif outp_act == "ReLU":
            self.outp = ReLU()(self.outp)
        elif outp_act == "Linear":
            self.outp = self.outp
        else:
            raise ValueError("Invalid outp_act")

    def block(self, x, att_mask, relu_layer_no):
        """MHANet block.
        Argument/s:
            x - input.
            att_mask - attention mask.
        Returns:
            layer_2 - output of second layer.
        """
        layer_1 = tfa.layers.MultiHeadAttention(
            head_size=self.d_k,
            num_heads=self.n_heads,
            output_size=self.d_model,
            dropout=0.0,
            use_projection_bias=False,
        )([x, x, x, att_mask])
        layer_1 = Add()([x, layer_1])
        layer_1 = LayerNormalization(axis=2, epsilon=1e-6, center=True, scale=True)(layer_1)
        layer_2 = self.feed_forward_network(layer_1, relu_layer_no)
        layer_2 = Add()([layer_1, layer_2])
        layer_2 = LayerNormalization(axis=2, epsilon=1e-6, center=True, scale=True)(layer_2)
        return layer_2

    def feed_forward_network(self, x, relu_layer_no):
        """Feed forward network.
        Argument/s:
            inp - input placeholder.
        Returns:
            x - output of second feed forward layer.
        """
        x = Conv1D(self.d_ff, 1, use_bias=True)(x)
        # --- tf.quantization.fake_quant_with_min_max_vars ---
        if self.act_max[relu_layer_no] > 0:
            x = tf.quantization.fake_quant_with_min_max_vars(
                x,
                min=-self.act_max[relu_layer_no],
                max=self.act_max[relu_layer_no] - (self.act_max[relu_layer_no] / 128.0),
                num_bits=8)
        # ----------------------------------------------------
        x = ReLU()(x)
        x = Conv1D(self.d_model, 1, use_bias=True)(x)
        return x

Finally, build the model from act_max. On top of this activation-quantized model, the weights can then be quantized as well.
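Putting both steps together, the overall flow might look like this (a sketch; it assumes MHANetV2_QuantAct has been wired into DeepXi's network construction so that create_model() builds it, which is not shown here):

# Step 1: collect activation ranges on the float pretrained model
deepxi = create_model()
deepxi = load_variables_for_model(deepxi)
act_max = generate_activation_max(deepxi)

# Step 2: rebuild the network with fake-quant ops inserted before each activation
# (hypothetical: create_model() here is assumed to build MHANetV2_QuantAct with act_max),
# reload the pretrained weights, then optionally fake-quantize the weights as well
quant_deepxi = create_model()
quant_deepxi = load_variables_for_model(quant_deepxi)
quant_deepxi = fake_quant_weights_and_biases(quant_deepxi)
test_model(quant_deepxi, denoise_path='./out/quant_act_and_weights')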
