TensorRT in Python

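What is TensorRT
- Basic workflow
Using TensorRT from Python via ONNX
- 1. Check the installed TensorRT version
- 2. Check which ONNX operators TensorRT supports
- 3. Convert the network to an ONNX model
  - 3.1 Choose the network
  - 3.2 Load and convert the model
- 4. Build phase -> convert the ONNX model to a TRT model
- 5. Run phase -> run inference with the TRT model
To be continued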

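What is TensorRT

A more detailed explanation: https://zhuanlan.zhihu.com/p/356072366
In short: TensorRT is NVIDIA's inference optimizer and runtime, written in C++ with Python bindings. It optimizes a trained network (layer fusion, kernel selection, reduced precision such as FP16) for the GPU of the machine the engine is built on.
Reference code: https://github.com/rod-unleashlive/Yolov7_tensorrt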

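Basic workflow

Build phase
1. Create the Builder (the engine builder)
2. Create the Network (the computation graph)
3. Generate the SerializedNetwork (TensorRT's internal representation of the network)
Run phase
1. Create the Engine and the Context
2. Prepare the buffers (host (CPU) side + device (GPU) side + copy operations)
3. Run inference (execute / execute_v2)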

Using TensorRT from Python via ONNX

1. Check the installed TensorRT version

dpkg -l | grep TensorRT
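The dpkg query only covers Debian/Ubuntu package installs. As a rough alternative, assuming the TensorRT Python bindings are importable, the version can also be read from the `tensorrt.__version__` attribute. The helper names below (`parse_version`, `installed_tensorrt_version`) are made up for this sketch.

```python
def parse_version(text):
    """Turn a dotted version string such as '8.4.1.5' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def installed_tensorrt_version():
    """Return the TensorRT version as a tuple, or None if the bindings are missing."""
    try:
        import tensorrt  # only present where the TensorRT Python package is installed
    except ImportError:
        return None
    return parse_version(tensorrt.__version__)
```

Tuples compare element by element, so a check such as `installed_tensorrt_version() >= (8, 4)` can gate version-specific code paths once the result is known not to be None.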

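2. Check which ONNX operators TensorRT supports

Not every ONNX operator is supported by TensorRT. Before converting a model, check that the operators it uses are supported by your TensorRT version; otherwise the ONNX -> TRT conversion fails with an error.
Operators supported by each TensorRT version: https://github.com/onnx/onnx-tensorrt/blob/main/docs/operators.md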
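One way to act on that list before building is to collect the operator types the model uses and diff them against the operators listed for your version. A minimal sketch: `SUPPORTED_OPS` is a tiny made-up subset for illustration, not the real support matrix, and `find_unsupported_ops` and `MyCustomOp` are hypothetical names. With the `onnx` package installed, the operator types of a real model are `{node.op_type for node in onnx.load(path).graph.node}`.

```python
# Illustrative subset only; fill this in from operators.md for your TensorRT version.
SUPPORTED_OPS = {"Conv", "Relu", "Sigmoid", "Mul", "Add", "Concat", "Resize", "MaxPool"}


def find_unsupported_ops(model_ops, supported_ops=SUPPORTED_OPS):
    """Return the operator types used by the model that are not in the supported set."""
    return sorted(set(model_ops) - set(supported_ops))


# Operator types as they would come out of an ONNX graph (hypothetical model).
model_ops = ["Conv", "Sigmoid", "Mul", "Conv", "MyCustomOp"]
print(find_unsupported_ops(model_ops))  # ['MyCustomOp']
```

An empty result does not guarantee a successful build, since support can also depend on attributes and data types, but a non-empty one is an early warning.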

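3. Convert the network to an ONNX model

If the model will later be deployed on different machines, first export it to an intermediate format such as ONNX, then build the machine-specific TRT engine on each target machine.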
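By default a serialized TRT engine is tied to the TensorRT version and GPU it was built with, which is why the portable artifact is the ONNX file. One hypothetical way to keep per-machine engines apart is to encode that information in the file name. The naming scheme and the `engine_filename` helper are assumptions of this sketch, not something TensorRT requires.

```python
import os
import re


def engine_filename(onnx_path, trt_version, gpu_name, precision="fp16"):
    """Derive an engine file name that records what the engine was built for."""
    stem = os.path.splitext(os.path.basename(onnx_path))[0]
    # Keep only characters that are safe in file names.
    gpu_tag = re.sub(r"[^A-Za-z0-9]+", "-", gpu_name).strip("-").lower()
    return "{}_trt{}_{}_{}.engine".format(stem, trt_version, gpu_tag, precision)


print(engine_filename("weights/yolov7x.onnx", "8.4.1.5", "NVIDIA GeForce RTX 3080"))
```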

3.1 Choose the network

We use YOLOv7 as the example: https://github.com/WongKinYiu/yolov7

3.2 Load and convert the model

Follow the steps below.
1. Download YOLOv7 and the weights

git clone https://github.com/WongKinYiu/yolov7.git

wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7x.pt

2. Export the ONNX model (without NMS)

python yolov7/export.py --weights yolov7x.pt --grid --simplify

Some notes on the export code (from yolov7/export.py):

1. Load and check the model:

# Load PyTorch model
device = select_device(opt.device)
model = attempt_load(opt.weights, map_location=device)  # load FP32 model
labels = model.names

# Checks
gs = int(max(model.stride))  # grid size (max stride)
opt.img_size = [check_img_size(x, gs) for x in opt.img_size]  # verify img_size are gs-multiples

# Input
img = torch.zeros(opt.batch_size, 3, *opt.img_size).to(device)  # image size(1,3,320,192) iDetection

# Update model
for k, m in model.named_modules():
    m._non_persistent_buffers_set = set()  # pytorch 1.6.0 compatibility
    if isinstance(m, models.common.Conv):  # assign export-friendly activations
        if isinstance(m.act, nn.Hardswish):
            m.act = Hardswish()
        elif isinstance(m.act, nn.SiLU):
            m.act = SiLU()
    # elif isinstance(m, models.yolo.Detect):
    #     m.forward = m.forward_export  # assign forward (optional)
model.model[-1].export = not opt.grid  # set Detect() layer grid export
y = model(img)  # dry run
if opt.include_nms:
    model.model[-1].include_nms = True
    y = None

2. Convert to an ONNX model

# ONNX export
try:
    import onnx

    # Print the installed onnx version
    print('\nStarting ONNX export with onnx %s...' % onnx.__version__)
    # Derive the output .onnx file name from the weights file name
    f = opt.weights.replace('.pt', '.onnx')  # filename
    # Switch to eval mode before exporting so layers such as BatchNorm and Dropout
    # behave as they do at inference time
    model.eval()
    # Output tensor names; they show up again in the ONNX file and in the TRT conversion
    output_names = ['classes', 'boxes'] if y is None else ['output']
    # Dynamic axes let the input/output shapes vary after export
    dynamic_axes = None
    # Dynamic image size
    if opt.dynamic:
        dynamic_axes = {'images': {0: 'batch', 2: 'height', 3: 'width'},  # size(1,3,640,640)
                        'output': {0: 'batch', 2: 'y', 3: 'x'}}
    # Dynamic batch size
    if opt.dynamic_batch:
        opt.batch_size = 'batch'
        dynamic_axes = {'images': {0: 'batch', }, }
        # Whether post-processing is included in the exported model
        if opt.end2end and opt.max_wh is None:
            output_axes = {
                'num_dets': {0: 'batch'},
                'det_boxes': {0: 'batch'},
                'det_scores': {0: 'batch'},
                'det_classes': {0: 'batch'},
            }
        else:
            output_axes = {'output': {0: 'batch'}, }
        dynamic_axes.update(output_axes)
    # Whether to export the grid of the Detect() layer
    if opt.grid:
        if opt.end2end:
            print('\nStarting export end2end onnx model for %s...'
                  % ('TensorRT' if opt.max_wh is None else 'onnxruntime'))
            model = End2End(model, opt.topk_all, opt.iou_thres, opt.conf_thres, opt.max_wh, device, len(labels))
            if opt.end2end and opt.max_wh is None:
                output_names = ['num_dets', 'det_boxes', 'det_scores', 'det_classes']
                shapes = [opt.batch_size, 1, opt.batch_size, opt.topk_all, 4,
                          opt.batch_size, opt.topk_all, opt.batch_size, opt.topk_all]
            else:
                output_names = ['output']
        else:
            model.model[-1].concat = True

    # Export the model
    torch.onnx.export(model,                      # the model
                      img,                        # example input
                      f,                          # output file name
                      verbose=False,              # whether to print a verbose log
                      opset_version=12,
                      input_names=['images'],     # input tensor names
                      output_names=output_names,  # output tensor names
                      dynamic_axes=dynamic_axes)  # dynamic input/output axes

    # Checks
    onnx_model = onnx.load(f)  # load onnx model
    onnx.checker.check_model(onnx_model)  # check onnx model

    if opt.end2end and opt.max_wh is None:
        for i in onnx_model.graph.output:
            for j in i.type.tensor_type.shape.dim:
                j.dim_param = str(shapes.pop(0))

    # print(onnx.helper.printable_graph(onnx_model.graph))  # print a human readable model

    if opt.simplify:
        try:
            import onnxsim

            print('\nStarting to simplify ONNX...')
            onnx_model, check = onnxsim.simplify(onnx_model)
            assert check, 'assert check failed'
        except Exception as e:
            print(f'Simplifier failure: {e}')

    # print(onnx.helper.printable_graph(onnx_model.graph))  # print a human readable model
    onnx.save(onnx_model, f)
    print('ONNX export success, saved as %s' % f)

    # Register NMS as an operation in the ONNX graph
    if opt.include_nms:
        print('Registering NMS plugin for ONNX...')
        mo = RegisterNMS(f)
        mo.register_nms()
        mo.save(f)

except Exception as e:
    print('ONNX export failure: %s' % e)

# Finish
print('\nExport complete (%.2fs). Visualize with https://github.com/lutzroeder/netron.' % (time.time() - t))

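If the script succeeds, we get an ONNX model. If NMS is registered as an operation in the ONNX graph (opt.include_nms), the NMS step does not have to be reimplemented after conversion: the network output has already been through NMS. That path is not finished in this article yet.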
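Without NMS in the graph, the exported model returns every raw candidate box. For a 640x640 input that is 25200 rows of 85 values each (4 box coordinates, 1 objectness score, 80 COCO class scores). The arithmetic as a small sketch: the function name is made up here, and the strides (8, 16, 32) and 3 anchors per grid cell are the usual YOLOv7 values, taken as assumptions.

```python
def num_candidates(img_size=(640, 640), strides=(8, 16, 32), anchors_per_cell=3):
    """Number of raw boxes a YOLO-style head emits before NMS."""
    height, width = img_size
    # One grid per stride: 80x80, 40x40 and 20x20 cells for a 640x640 input.
    cells = sum((height // s) * (width // s) for s in strides)
    return cells * anchors_per_cell


rows = num_candidates()
values_per_row = 4 + 1 + 80  # xywh + objectness + 80 COCO class scores
print(rows, values_per_row)  # 25200 85
```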

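4. Build phase -> convert the ONNX model to a TRT model

Converting a model with TensorRT generally takes the following steps:

1. Configure the ONNX parser, parse the ONNX file, and store the result in a network definition (create the Network, i.e. the computation graph).
2. Configure the TRT build, such as batch size and precision (the engine builder).
3. Build the engine and write it out: generate the SerializedNetwork (TensorRT's internal representation of the network).

Following this breakdown, the conversion looks like this: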

import os

import tensorrt as trt


# Note: max_workspace_size, STRICT_TYPES and build_engine are TensorRT 8-era APIs;
# later releases deprecate or remove them.
def build_engine(onnx_file_path, engine_file_path, precision_flop):
    """
    Build phase -> convert the ONNX model to a TRT model.
    Args:
        onnx_file_path  : path of the ONNX model to convert
        engine_file_path: path of the TRT engine to write
        precision_flop  : precision to build with ("fp32" or "fp16")
    Returns:
        True on success, False on failure
    """
    #---------------------------------#
    # Global setup
    #---------------------------------#
    # Logger verbosity used during the build.
    trt_logger = trt.Logger(trt.Logger.INFO)
    # Create the builder.
    builder = trt.Builder(trt_logger)

    #---------------------------------#
    # Step 1: read the ONNX file
    #---------------------------------#
    # 1-1. Network creation flag.
    # Unlike IMPLICIT_BATCH mode, EXPLICIT_BATCH makes the batch dimension an explicit part
    # of every tensor shape, which operations that need it (for example Layer Normalization)
    # rely on. In implicit mode the batch size has to be given at inference time instead.
    # EXPLICIT_BATCH is the mainstream mode today.
    network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    # 1-2. Create an empty network definition.
    network = builder.create_network(network_flags)
    # 1-3. Attach the empty network and the logger to a parser.
    parser = trt.OnnxParser(network, trt_logger)
    # 1-4. Read the ONNX file and parse it.
    # After parsing, the network definition holds the parsed model.
    onnx_file_path = os.path.realpath(onnx_file_path)  # absolute path, to avoid surprises
    if not os.path.isfile(onnx_file_path):
        print("ONNX file does not exist. Please check the path: {}".format(onnx_file_path))
        return False
    with open(onnx_file_path, 'rb') as model:
        parser_flag = parser.parse(model.read())
    if not parser_flag:
        print("ERROR: Failed to parse the onnx file {}.".format(onnx_file_path))
        # Print every parser error so the failing node can be located.
        print("Number of parser errors: {}".format(parser.num_errors))
        for error in range(parser.num_errors):
            print(parser.get_error(error))
        return False
    print("Completed parsing ONNX file.")
    # 1-5. Print the input and output shapes of the parsed network as a sanity check.
    inputs = [network.get_input(i) for i in range(network.num_inputs)]
    outputs = [network.get_output(i) for i in range(network.num_outputs)]
    print("Network Description")
    batch_size = 0
    for inp in inputs:
        # Batch size of the input as stored in the ONNX model.
        batch_size = inp.shape[0]
        print("Input '{}' with shape {} and dtype {}.".format(inp.name, inp.shape, inp.dtype))
    for outp in outputs:
        print("Output '{}' with shape {} and dtype {}.".format(outp.name, outp.shape, outp.dtype))
    # Make sure the input batch size is positive (a dynamic dimension is reported as -1).
    assert batch_size > 0, "Input batch size must be > 0; check the export arguments."

    #---------------------------------#
    # Step 2: convert to a TRT model
    #---------------------------------#
    # 2-1. Create the builder config. Build settings go here rather than on the
    # builder itself, which was the older API.
    config = builder.create_builder_config()
    # 2-2. Tactic sources, i.e. the libraries (cuBLAS, cuDNN, ...) TensorRT may pick
    # layer implementations from.
    config.set_tactic_sources(1 << int(trt.TacticSource.CUBLAS))
    # 2-3. Upper bound on the scratch memory any single layer may use: 2 << 30 bytes = 2 GiB.
    # Each layer is given only what it needs; this is a limit, not a fixed allocation.
    config.max_workspace_size = 2 << 30
    # 2-4. Precision of the converted model.
    config.set_flag(trt.BuilderFlag.STRICT_TYPES)
    if precision_flop == "fp16":
        if not builder.platform_has_fast_fp16:
            print("FP16 is not supported natively on this platform/device.")
        else:
            config.set_flag(trt.BuilderFlag.FP16)
    # 2-5. Build the engine.
    engine = builder.build_engine(network, config)
    if engine is None:
        print("ERROR: Failed to build the engine.")
        return False

    #---------------------------------#
    # Step 3: generate the SerializedNetwork
    #---------------------------------#
    # 3-1. Remove an existing engine file, if any.
    engine_file_path = os.path.realpath(engine_file_path)  # absolute path, to avoid surprises
    if os.path.isfile(engine_file_path):
        try:
            os.remove(engine_file_path)
        except Exception:
            print("Cannot remove existing file: {}".format(engine_file_path))
    print("Creating TensorRT Engine: {}".format(engine_file_path))
    # 3-2. Serialize the engine and write it to disk.
    with open(engine_file_path, "wb") as f:
        f.write(engine.serialize())
    print("ONNX -> TRT success. Serialized engine saved at: {}".format(engine_file_path))
    return True

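The `max_workspace_size` attribute used in the build code comes from older TensorRT 8 releases; TensorRT 8.4 added `config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, nbytes)` and deprecated the attribute. A hedged sketch of a helper that works with either: `set_workspace_limit` is a hypothetical name, and the pool type is passed in as an argument so the function itself does not import TensorRT.

```python
def set_workspace_limit(config, nbytes, workspace_pool=None):
    """Set the builder workspace limit using whichever API the config object offers.

    workspace_pool is expected to be trt.MemoryPoolType.WORKSPACE on TensorRT >= 8.4.
    Returns the name of the API that was used.
    """
    if workspace_pool is not None and hasattr(config, "set_memory_pool_limit"):
        config.set_memory_pool_limit(workspace_pool, nbytes)
        return "set_memory_pool_limit"
    config.max_workspace_size = nbytes  # older attribute-based API
    return "max_workspace_size"


GIB = 1 << 30  # 1 GiB in bytes; note that 2 << 30 is 2 GiB, not 2**30
```

Inside the build function this would be called as `set_workspace_limit(config, 2 * GIB, getattr(trt, "MemoryPoolType", None) and trt.MemoryPoolType.WORKSPACE)`, which falls back to the attribute when the enum does not exist.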
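5. Run phase -> run inference with the TRT model

The steps above converted the ONNX model into a TRT model. The next task is to run inference with that TRT model.

Run-phase steps
1. Create the Engine
2. Create the Context
3. Prepare the buffers (host (CPU) side + device (GPU) side)
4. Copy buffers from host to device
5. Run inference (execute / execute_v2)
6. Copy buffers from device to host
7. Clean up

Steps 1, 2 and 3 only need to run once, however many inferences the program performs; steps 4, 5, 6 and 7 run on every inference.

The code below follows these steps: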

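import os

import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt


def inference(engine_file_path, img):
    """
    Run phase -> run inference with the TRT model.
    Args:
        engine_file_path: path of the serialized TRT model
        img             : preprocessed image to run inference on
    Returns:
        list of output arrays on success, None on failure
    """
    #---------------------------------#
    # 1. Create the Engine
    # 2. Create the Context
    # 3. Prepare the buffers (host (CPU) side + device (GPU) side)
    # For a stream of frames, these three steps only need to run once.
    #---------------------------------#
    # 1. Create the Engine
    # 1-1. Logger verbosity at run time
    logger = trt.Logger(trt.Logger.WARNING)
    # 1-2. The Runtime is what deserializes a serialized ICudaEngine.
    runtime = trt.Runtime(logger)
    # 1-3. Deserialize the ICudaEngine
    engine_file_path = os.path.realpath(engine_file_path)  # absolute path, to avoid surprises
    if not os.path.isfile(engine_file_path):
        print("Engine file: {} does not exist!".format(engine_file_path))
        return None
    with open(engine_file_path, 'rb') as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    assert engine, "The deserialized engine is empty; check that the conversion succeeded."
    print("Loaded engine from {}.".format(engine_file_path))

    # 2. Create the Context, the execution context for inference with an ICudaEngine.
    # One ICudaEngine can have several IExecutionContext instances, which allows
    # several batches to run on the same engine at the same time.
    context = engine.create_execution_context()
    assert context, "The execution context is empty; check the previous steps."

    # 3. Prepare the buffers
    # The data currently lives on the CPU. To run on the GPU, memory of the right size
    # must be allocated on the GPU so the data can be copied over.
    # 3-1. Three empty lists that will hold the inputs, outputs and device pointers
    inputs, outputs, bindings = [], [], []
    # 3-2. CUDA stream used for the CPU <-> GPU copies and for inference
    stream = cuda.Stream()
    # 3-3. Allocate memory on the host and on the GPU
    for binding in engine:
        # Number of elements of this binding. Note: a count, not a size in bytes.
        size = trt.volume(engine.get_binding_shape(binding))
        # Element type (float32, int32, ...), which fixes the bytes per element
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # count * bytes per element = real size in bytes; page-locked host memory
        host_mem = cuda.pagelocked_empty(size, dtype)
        # Device memory of the same size
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # TensorRT expects the device pointers here, not the host buffers
        bindings.append(int(device_mem))
        # Keep input and output buffers apart
        if engine.binding_is_input(binding):
            inputs.append({'host': host_mem, 'device': device_mem})
        else:
            outputs.append({'host': host_mem, 'device': device_mem})

    # 4. Copy buffers from host to device
    # 4-1. Copy the image into the page-locked host buffer (still on the CPU)
    np.copyto(inputs[0]['host'], np.ravel(img))
    # 4-2. Host -> device
    for inp in inputs:
        cuda.memcpy_htod_async(inp['device'], inp['host'], stream)

    # 5. Run inference (execute / execute_v2)
    # execute_async_v2: enqueues inference on the given CUDA stream and returns immediately.
    # execute_v2      : runs inference synchronously and returns when it has finished.
    # Both only apply to execution contexts built from explicit-batch networks.
    # The asynchronous call does not start extra threads; the host simply continues while
    # the GPU works, and must synchronize the stream before reading the results.
    context.execute_async_v2(
        bindings=bindings,            # device pointers of all inputs and outputs
        stream_handle=stream.handle   # handle of the CUDA stream the kernels run on
    )

    # 6. Copy buffers from device to host
    for out in outputs:
        cuda.memcpy_dtoh_async(out['host'], out['device'], stream)
    # Wait until everything queued on the stream has finished
    stream.synchronize()
    data = [out['host'] for out in outputs]

    # 7. Clean up
    # 7-1. Post-process the output data as needed.
    # 7-2. Memory we allocated should be released when the program is done.
    # In C++ this is cudaFree(). With PyCUDA, a device allocation is released when its
    # DeviceAllocation object is garbage-collected, or explicitly via its free() method.
    return data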
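The allocation loop distinguishes the element count of a binding (`trt.volume(shape)`) from its size in bytes (count times the size of one element). A small worked example in plain Python: `buffer_nbytes` is a made-up helper, and the shapes are the ones this article assumes for a 640x640 YOLOv7 model in FP32.

```python
import math


def buffer_nbytes(shape, itemsize):
    """Bytes needed for a dense tensor: element count times bytes per element."""
    count = math.prod(shape)  # the same quantity trt.volume(shape) returns
    return count * itemsize


input_bytes = buffer_nbytes((1, 3, 640, 640), 4)   # float32 input image
output_bytes = buffer_nbytes((1, 25200, 85), 4)    # float32 raw predictions
print(input_bytes, output_bytes)  # 4915200 8568000
```

So one frame needs roughly 4.7 MiB of input buffer and 8.2 MiB of output buffer on each side, which is why these buffers are allocated once and reused.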

That is the overall flow from ONNX to TRT.

To make the code reusable, everything above is rewritten as a class:


import argparse
import os
import sys

import cv2
import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt
import torch


class TRT():
    def __init__(self, onnx_file_path, engine_file_path, precision_flop, end2end, conf_thres, nms_thres, score_thres) -> None:
        self.onnx_file_path     = onnx_file_path
        self.engine_file_path   = engine_file_path
        self.precision_flop     = precision_flop
        self.end2end            = end2end
        self.inputs             = []
        self.outputs            = []
        self.bindings           = []
        self.img_size           = (640, 640)
        self.conf_threshold     = conf_thres
        self.nms_threshold      = nms_thres
        self.score_threshold    = score_thres
        self.COCO = {
            "label": [
                'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
                'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
                'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
                'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle',
                'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
                'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed',
                'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
                'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'],
            "color": [
                [244, 67, 54], [233, 30, 99], [156, 39, 176], [103, 58, 183], [100, 30, 60], [63, 81, 181], [33, 150, 243], [3, 169, 244], [0, 188, 212], [20, 55, 200],
                [0, 150, 136], [76, 175, 80], [139, 195, 74], [205, 220, 57], [70, 25, 100], [255, 235, 59], [255, 193, 7], [255, 152, 0], [255, 87, 34], [90, 155, 50],
                [121, 85, 72], [158, 158, 158], [96, 125, 139], [15, 67, 34], [98, 55, 20], [21, 82, 172], [58, 128, 255], [196, 125, 39], [75, 27, 134], [90, 125, 120],
                [121, 82, 7], [158, 58, 8], [96, 25, 9], [115, 7, 234], [8, 155, 220], [221, 25, 72], [188, 58, 158], [56, 175, 19], [215, 67, 64], [198, 75, 20],
                [62, 185, 22], [108, 70, 58], [160, 225, 39], [95, 60, 144], [78, 155, 120], [101, 25, 142], [48, 198, 28], [96, 225, 200], [150, 167, 134], [18, 185, 90],
                [21, 145, 172], [98, 68, 78], [196, 105, 19], [215, 67, 84], [130, 115, 170], [255, 0, 255], [255, 255, 0], [196, 185, 10], [95, 167, 234], [18, 25, 190],
                [0, 255, 255], [255, 0, 0], [0, 255, 0], [0, 0, 255], [155, 0, 0], [0, 155, 0], [0, 0, 155], [46, 22, 130], [255, 0, 155], [155, 0, 255], [255, 155, 0],
                [155, 255, 0], [0, 155, 255], [0, 255, 155], [18, 5, 40], [120, 120, 255], [255, 58, 30], [60, 45, 60], [75, 27, 244], [128, 25, 70]],
        }
        self.Init_model()

    def Init_model(self):
        """
        Load the TRT model and set up everything that is shared across inferences.
        Case 1: the TRT model does not exist. Build it from the ONNX model, save it, then run inference.
        Case 2: the TRT model exists. Load it and run inference directly.
        """
        # 1. Logger verbosity
        self.logger = trt.Logger(trt.Logger.WARNING)
        # 2. Load the TRT model
        if os.path.isfile(self.engine_file_path):
            self.engine = self.readTrtFile(self.engine_file_path)
            assert self.engine, "The engine read from the TRT file is None!"
        else:
            self.engine = self.onnxToTRTModel(self.onnx_file_path, self.engine_file_path, self.precision_flop)
            assert self.engine, "The engine converted from the ONNX file is None!"
        # 3. Create the execution context used for inference
        self.context = self.engine.create_execution_context()
        assert self.context, "The execution context is empty; check the previous steps."
        # 4. Create the CUDA stream used to move data between CPU and GPU
        self.stream = cuda.Stream()
        # 5. Allocate memory on the CPU and on the GPU
        for binding in self.engine:
            # Number of elements of this binding. Note: a count, not a size in bytes.
            size = trt.volume(self.engine.get_binding_shape(binding))
            # Element type (float32, int32, ...), which fixes the bytes per element
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            # count * bytes per element = real size in bytes; page-locked host memory first
            host_mem = cuda.pagelocked_empty(size, dtype)
            # Then the GPU memory
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            print("size: {}, dtype: {}, device_mem: {}".format(size, dtype, device_mem))
            # Keep input and output buffers apart
            if self.engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def inference(self, img_path, mode="video"):
        """
        Run inference on the given path, depending on the mode.
        Args:
            img_path: path of the input video or image
            mode    : processing mode, one of ["video", "img"]. Default: "video".
        """
        img_path = os.path.realpath(img_path)
        if mode == "video":
            cap = cv2.VideoCapture(img_path)
            ret, frame = cap.read()
            if not ret:
                print("Failed to read the video. Input path: {}".format(img_path))
                sys.exit(-1)
            while ret:
                img, ratio = self.prepareImage(frame)
                engine_infer_output = self.infer_single_img(img)
                final_img = self.post_process(engine_infer_output, frame, ratio)
                cv2.imshow("TRT inference result", final_img)
                key = cv2.waitKey(1)  # poll the keyboard once per frame
                if key == ord('q') or key == 27:  # 27 is Esc
                    break
                ret, frame = cap.read()  # ret turns False at the end of the video
            cap.release()
            cv2.destroyAllWindows()
        else:
            if not os.path.isfile(img_path):
                print("Invalid image path, please check: {}".format(img_path))
                sys.exit(-1)
            frame = cv2.imread(img_path)
            img, ratio = self.prepareImage(frame)
            engine_infer_output = self.infer_single_img(img)
            final_img = self.post_process(engine_infer_output, frame, ratio)
            cv2.imshow("TRT inference result", final_img)
            cv2.waitKey(0)  # wait for any key
            cv2.destroyAllWindows()

    def infer_single_img(self, img):
        """
        Run inference on a single image.
        Args:
            img: preprocessed input image
        Returns:
            the TRT inference result
        """
        # 1. Copy the input into the page-locked host buffer (still on the CPU)
        np.copyto(self.inputs[0]['host'], np.ravel(img))
        # 2. Host -> device
        for inp in self.inputs:
            cuda.memcpy_htod_async(inp['device'], inp['host'], self.stream)
        # 3. Run inference (execute / execute_v2)
        # execute_async_v2: enqueues inference on the given CUDA stream and returns immediately.
        # execute_v2      : runs inference synchronously and returns when it has finished.
        # Both only apply to execution contexts built from explicit-batch networks.
        # The asynchronous call does not start extra threads; the host simply continues while
        # the GPU works, and must synchronize the stream before reading the results.
        self.context.execute_async_v2(
            bindings=self.bindings,            # device pointers of all inputs and outputs
            stream_handle=self.stream.handle   # handle of the CUDA stream the kernels run on
        )
        # 4. Device -> host
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)
        # 5. Wait until everything queued on the stream has finished
        self.stream.synchronize()
        # 6. Collect the outputs (copies, so the page-locked buffers can be reused)
        if self.end2end:
            # num_dets, det_boxes, det_scores, det_classes; this path is not finished yet
            return [out['host'].copy() for out in self.outputs]
        engine_infer_output = [np.reshape(out['host'], (-1, 85)).copy() for out in self.outputs]
        return np.concatenate(engine_infer_output, 0)

    def result_visual(self, img, boxes, scores, cls_ids, classes_and_colors):
        """
        Draw the detections on the image.
        Args:
            img                 : original input image
            boxes               : final detection boxes
            scores              : scores of the final boxes
            cls_ids             : class indices of the final boxes
            classes_and_colors  : COCO class names and colors
        Returns:
            the image with the boxes drawn on it
        """
        for i in range(len(boxes)):
            box = boxes[i]
            cls_id = int(cls_ids[i])
            score = scores[i]
            if score < self.conf_threshold:
                continue
            x0 = int(box[0])
            y0 = int(box[1])
            x1 = int(box[2])
            y1 = int(box[3])
            color = (classes_and_colors["color"][cls_id])
            text = '{}:{:.1f}%'.format(classes_and_colors["label"][cls_id], score * 100)
            font = cv2.FONT_HERSHEY_SIMPLEX
            txt_size = cv2.getTextSize(text, font, 0.6, 2)[0]
            cv2.rectangle(img, (x0, y0), (x1, y1), color, 2)
            cv2.rectangle(img, (x0, y0 + 1), (x0 + txt_size[0] + 1, y0 + int(1.5 * txt_size[1])), color, 1)
            cv2.putText(img, text, (x0, y0 + txt_size[1]), font, 0.6, color, thickness=2)
        return img

    # Pre-processing
    def prepareImage(self, org_img):
        """
        Pre-process the input image: aspect-ratio-preserving resize with padding,
        BGR -> RGB, scaling to [0, 1], and HWC -> CHW.
        Args:
            org_img: original image as read from disk or video
        Returns:
            the processed image and the scale ratio that was applied
        """
        netinput_size = self.img_size
        if len(org_img.shape) == 3:
            padded_img = np.ones((netinput_size[0], netinput_size[1], 3)) * 114.0
        else:
            padded_img = np.ones(netinput_size) * 114.0
        img = np.array(org_img)
        ratio = min(netinput_size[0] / img.shape[0], netinput_size[1] / img.shape[1])
        resized_img = cv2.resize(
            img,
            (int(img.shape[1] * ratio), int(img.shape[0] * ratio)),
            interpolation=cv2.INTER_LINEAR,
        ).astype(np.float32)
        padded_img[: int(img.shape[0] * ratio), : int(img.shape[1] * ratio)] = resized_img
        padded_img = padded_img[:, :, ::-1]  # BGR -> RGB
        padded_img /= 255.0
        padded_img = padded_img.transpose((2, 0, 1))  # HWC -> CHW
        padded_img = np.ascontiguousarray(padded_img, dtype=np.float32)
        return padded_img, ratio

    # Without end2end, NMS has to be applied to the raw network output here
    def post_process(self, engine_infer_output, origin_img, ratio):
        """
        Post-process the network output.
        Args:
            engine_infer_output : network output, shape (25200, 85)
            origin_img          : original image before pre-processing
            ratio               : network input size / original image size (the resize factor)
        Returns:
            the image with the final detection boxes drawn on it
        """
        # Without NMS the raw output has 25200 rows = (20*20 + 40*40 + 80*80) * 3 anchors
        # over three output heads, and each row holds 85 values.
        if self.end2end:
            num, final_boxes, final_scores, final_cls_inds = engine_infer_output
            final_boxes = np.reshape(final_boxes / ratio, (-1, 4))
            dets = np.concatenate([final_boxes[:num[0]],
                                   np.array(final_scores)[:num[0]].reshape(-1, 1),
                                   np.array(final_cls_inds)[:num[0]].reshape(-1, 1)], axis=-1)
        else:
            dets = self.non_max_suppression(engine_infer_output, ratio)
        if dets is not None:
            final_boxes, final_scores, final_cls_inds = dets[:, :4], dets[:, 4], dets[:, 5]
            origin_img = self.result_visual(origin_img, final_boxes, final_scores, final_cls_inds, self.COCO)
        return origin_img

    def non_max_suppression(self, prediction, ratio, num_classes=80):
        """
        Apply non-maximum suppression to the boxes produced by the detection head.
        Args:
            prediction  : all boxes from the detection head, shape (25200, 85)
            ratio       : factor the input image was scaled by
        Returns:
            the boxes kept after NMS, shape (n, 6); 6 -> [xyxy, conf, cls]; None if nothing is kept
        """
        # 0. Final result
        boxes_after_nms = []
        # 1. Drop boxes with a low objectness score
        mask = prediction[:, 4] > self.score_threshold
        prediction = prediction[mask]
        # 2. Score of each box per class: box_scores = obj_conf * cls_conf
        scores = prediction[:, 4:5] * prediction[:, 5:]
        # 3. Convert (center x, center y, width, height) to (x1, y1, x2, y2) and
        # rescale to the original image size
        boxes = self.xywh2xyxy(prediction[:, :4]) / ratio
        # 4. Run NMS class by class
        for class_i in range(num_classes):
            cls_scores = scores[:, class_i]
            cls_score_mask = cls_scores > self.score_threshold
            if cls_score_mask.sum() == 0:
                continue
            cls_scores = cls_scores[cls_score_mask]
            cls_boxes = boxes[cls_score_mask]
            keep = self.nms(cls_boxes, cls_scores, self.nms_threshold)
            if len(keep) > 0:
                cls_inds = np.ones((len(keep), 1)) * class_i
                dets = np.concatenate([cls_boxes[keep], cls_scores[keep, None], cls_inds], 1)
                boxes_after_nms.append(dets)
        if len(boxes_after_nms) == 0:
            return None
        return np.concatenate(boxes_after_nms, 0)

    def nms(self, boxes, scores, nms_thr):
        """Single class NMS implemented in Numpy."""
        x1 = boxes[:, 0]
        y1 = boxes[:, 1]
        x2 = boxes[:, 2]
        y2 = boxes[:, 3]
        areas = (x2 - x1 + 1) * (y2 - y1 + 1)
        order = scores.argsort()[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            w = np.maximum(0.0, xx2 - xx1 + 1)
            h = np.maximum(0.0, yy2 - yy1 + 1)
            inter = w * h
            ovr = inter / (areas[i] + areas[order[1:]] - inter)
            inds = np.where(ovr <= nms_thr)[0]
            order = order[inds + 1]
        return keep

    def xywh2xyxy(self, x):
        # Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right
        y = x.clone() if isinstance(x, torch.Tensor) else np.copy(x)
        y[:, 0] = x[:, 0] - x[:, 2] / 2  # top left x
        y[:, 1] = x[:, 1] - x[:, 3] / 2  # top left y
        y[:, 2] = x[:, 0] + x[:, 2] / 2  # bottom right x
        y[:, 3] = x[:, 1] + x[:, 3] / 2  # bottom right y
        return y

    def box_iou(self, boxes_preds, boxes_labels, box_format="midpoint"):
        """
        Compute the intersection over union (IoU) of two sets of boxes.
        Parameters:
            boxes_preds (tensor) : predicted box coordinates (BATCH_SIZE, 4)
            boxes_labels (tensor): ground-truth box coordinates (BATCH_SIZE, 4)
            box_format (str)     : midpoint/corners, for boxes given as (x,y,w,h) or (x1,y1,x2,y2)
        Returns:
            tensor: IoU of each pair of boxes
        """
        if box_format == "midpoint":
            box1_x1 = boxes_preds[..., 0:1] - boxes_preds[..., 2:3] / 2
            box1_y1 = boxes_preds[..., 1:2] - boxes_preds[..., 3:4] / 2
            box1_x2 = boxes_preds[..., 0:1] + boxes_preds[..., 2:3] / 2
            box1_y2 = boxes_preds[..., 1:2] + boxes_preds[..., 3:4] / 2
            box2_x1 = boxes_labels[..., 0:1] - boxes_labels[..., 2:3] / 2
            box2_y1 = boxes_labels[..., 1:2] - boxes_labels[..., 3:4] / 2
            box2_x2 = boxes_labels[..., 0:1] + boxes_labels[..., 2:3] / 2
            box2_y2 = boxes_labels[..., 1:2] + boxes_labels[..., 3:4] / 2
        if box_format == "corners":
            box1_x1 = boxes_preds[..., 0:1]
            box1_y1 = boxes_preds[..., 1:2]
            box1_x2 = boxes_preds[..., 2:3]
            box1_y2 = boxes_preds[..., 3:4]
            box2_x1 = boxes_labels[..., 0:1]
            box2_y1 = boxes_labels[..., 1:2]
            box2_x2 = boxes_labels[..., 2:3]
            box2_y2 = boxes_labels[..., 3:4]
        # Element-wise max/min, so this also works for more than one box
        x1 = torch.max(box1_x1, box2_x1)
        y1 = torch.max(box1_y1, box2_y1)
        x2 = torch.min(box1_x2, box2_x2)
        y2 = torch.min(box1_y2, box2_y2)
        # Clamp so the width and height of the intersection are never negative
        intersection = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        box1_area = abs((box1_x2 - box1_x1) * (box1_y2 - box1_y1))
        box2_area = abs((box2_x2 - box2_x1) * (box2_y2 - box2_y1))
        return intersection / (box1_area + box2_area - intersection + 1e-6)

    def readTrtFile(self, engine_file_path):
        """
        Load a TRT model from an existing file.
        Args:
            engine_file_path: path of the existing TRT model
        Returns:
            the loaded engine
        """
        engine_file_path = os.path.realpath(engine_file_path)
        print("Loading TRT file from: {}".format(engine_file_path))
        runtime = trt.Runtime(self.logger)
        with open(engine_file_path, 'rb') as f:
            engine = runtime.deserialize_cuda_engine(f.read())
        assert engine, "The deserialized engine is empty; check that the conversion succeeded."
        print("Loaded engine from {}.".format(engine_file_path))
        return engine

    def onnxToTRTModel(self, onnx_file_path, engine_file_path, precision_flop):
        """
        Build phase -> convert the ONNX model to a TRT model.
        Args:
            onnx_file_path  : path of the ONNX model to convert
            engine_file_path: path of the TRT engine to write
            precision_flop  : precision to build with
        Returns:
            the engine on success, None on failure
        """
        #---------------------------------#
        # Global setup
        #---------------------------------#
        # Create the builder
        builder = trt.Builder(self.logger)
        builder.max_batch_size = 1  # has no effect on explicit-batch networks

        #---------------------------------#
        # Step 1: read the ONNX file
        #---------------------------------#
        # 1-1. Network creation flag.
        # Unlike IMPLICIT_BATCH mode, EXPLICIT_BATCH makes the batch dimension an explicit part
        # of every tensor shape, which operations that need it (for example Layer Normalization)
        # rely on. In implicit mode the batch size has to be given at inference time instead.
        # EXPLICIT_BATCH is the mainstream mode today.
        network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        # 1-2. Create an empty network definition
        network = builder.create_network(network_flags)
        # 1-3. Attach the empty network and the logger to a parser
        parser = trt.OnnxParser(network, self.logger)
        # 1-4. Read the ONNX file and parse it.
        # After parsing, the network definition holds the parsed model.
        onnx_file_path = os.path.realpath(onnx_file_path)  # absolute path, to avoid surprises
        if not os.path.isfile(onnx_file_path):
            print("ONNX file does not exist. Please check the path: {}".format(onnx_file_path))
            return None
        with open(onnx_file_path, 'rb') as model:
            if not parser.parse(model.read()):
                print("ERROR: Failed to parse the onnx file {}.".format(onnx_file_path))
                # Print every parser error so the failing node can be located
                print("Number of parser errors: {}".format(parser.num_errors))
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        print("Completed parsing ONNX file.")
        # 1-5. Print the input and output shapes of the parsed network as a sanity check
        inputs = [network.get_input(i) for i in range(network.num_inputs)]
        outputs = [network.get_output(i) for i in range(network.num_outputs)]
        print("Network Description")
        batch_size = 0
        for inp in inputs:
            # Batch size of the input as stored in the ONNX model
            batch_size = inp.shape[0]
            print("Input '{}' with shape {} and dtype {}.".format(inp.name, inp.shape, inp.dtype))
        for outp in outputs:
            print("Output '{}' with shape {} and dtype {}.".format(outp.name, outp.shape, outp.dtype))
        # Make sure the input batch size is positive (a dynamic dimension is reported as -1)
        assert batch_size > 0, "Input batch size must be > 0; check the export arguments."

        #---------------------------------#
        # Step 2: convert to a TRT model
        #---------------------------------#
        # 2-1. Create the builder config. Build settings go here rather than on the
        # builder itself, which was the older API.
        config = builder.create_builder_config()
        # 2-2. Tactic sources, i.e. the libraries (cuBLAS, cuDNN, ...) TensorRT may pick
        # layer implementations from. Left at the default here.
        # config.set_tactic_sources(1 << int(trt.TacticSource.CUBLAS))
        # 2-3. Upper bound on the scratch memory any single layer may use: 1 << 30 bytes = 1 GiB.
        # Each layer is given only what it needs; this is a limit, not a fixed allocation.
        config.max_workspace_size = 1 << 30
        # 2-4. Precision of the converted model
        if precision_flop == "fp16":
            if builder.platform_has_fast_fp16:
                config.set_flag(trt.BuilderFlag.FP16)
            else:
                print("FP16 is not supported natively on this platform/device; building in FP32.")
        elif precision_flop == "int8":
            print("INT8 needs a calibrator, which this script does not provide; building in FP32.")
        # 2-5. Build the engine
        engine = builder.build_engine(network, config)
        if engine is None:
            print("ERROR: Failed to build the engine.")
            return None

        #---------------------------------#
        # Step 3: generate the SerializedNetwork
        #---------------------------------#
        # 3-1. Remove an existing engine file, if any
        engine_file_path = os.path.realpath(engine_file_path)  # absolute path, to avoid surprises
        if os.path.isfile(engine_file_path):
            try:
                os.remove(engine_file_path)
            except Exception:
                print("Cannot remove existing file: {}".format(engine_file_path))
        print("Creating TensorRT Engine: {}".format(engine_file_path))
        # 3-2. Serialize the engine and write it to disk
        with open(engine_file_path, "wb") as f:
            f.write(engine.serialize())
        print("ONNX -> TRT success. Serialized engine saved at: {}".format(engine_file_path))
        return engine


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-o", "--onnx", help="Input onnx model path.")
    parser.add_argument("-e", "--engine", help="Output TRT model path.")
    parser.add_argument("-p", "--precision", default="fp16", choices=["fp32", "fp16", "int8"],
                        help="The precision mode to build in, either 'fp32', 'fp16' or 'int8', default: 'fp16'")
    parser.add_argument("--end2end", default=False, action="store_true",
                        help="export the engine include nms plugin, default: False")
    parser.add_argument("--conf_thres", default=0.4, type=float,
                        help="The conf threshold for the nms, default: 0.4")
    parser.add_argument("--iou_thres", default=0.5, type=float,
                        help="The iou threshold for the nms, default: 0.5")
    parser.add_argument("--scores_thres", default=0.25, type=float,
                        help="The scores threshold for the nms, default: 0.25")
    parser.add_argument("-i", "--img_path", default="python/src/video1.mp4")
    parser.add_argument("-m", "--mode", default="video")
    args = parser.parse_args()
    print(args)
    if not all([args.onnx, args.engine]):
        parser.print_help()
        print("These arguments are required: --onnx and --engine")
        sys.exit(1)
    trt_model = TRT(args.onnx, args.engine, args.precision, args.end2end,
                    args.conf_thres, args.iou_thres, args.scores_thres)
    if args.img_path:
        trt_model.inference(args.img_path, mode=args.mode)

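The `ratio` returned by `prepareImage` is the factor the original image was scaled by to fit the network input (network size / original size, taking the smaller of the two axes). Because the resized image is placed in the top-left corner of the padded canvas, boxes predicted in network coordinates only need to be divided by `ratio` to land back on the original image. A worked sketch in plain Python, with hypothetical helper names:

```python
def letterbox_ratio(orig_hw, net_hw=(640, 640)):
    """Scale factor that fits the image inside the network input, keeping aspect ratio."""
    return min(net_hw[0] / orig_hw[0], net_hw[1] / orig_hw[1])


def to_original_coords(box_xyxy, ratio):
    """Map a box from network-input pixels back to original-image pixels."""
    return [coord / ratio for coord in box_xyxy]


ratio = letterbox_ratio((720, 1280))                   # a 1280x720 frame
resized_hw = (int(720 * ratio), int(1280 * ratio))
print(ratio, resized_hw)                               # 0.5 (360, 640)
print(to_original_coords([100, 50, 300, 200], ratio))  # [200.0, 100.0, 600.0, 400.0]
```

If the padding were centered instead of top-left aligned, an offset would have to be subtracted before dividing; that variant is not what the class above does.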
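Code for this article: https://github.com/chongchongchongya/Onnx-TensorRT-Python

To be continued

1. So far only the conversion of TRT models without the NMS module is done. An NMS plugin will be added later so that TensorRT can accelerate NMS as well.
2. Implement the same code in C++.
3. Cover more tasks, such as semantic segmentation.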
