Model Development with TensorRT

  • Model development with TensorRT
  • Perception architecture for industrial applications such as object detection and autonomous driving

Model development with TensorRT:

The life cycle of deep-learning model development has five steps: goal definition, task modeling and pipeline construction, data collection and annotation, model training, and model deployment. Deployment is a critical step, the final push before a model delivers value; a commonly cited estimate is that 95% of companies manage to train a model but stall at the deployment stage.

The deployment stage places five requirements on model inference:

Throughput
The volume of output within a given period. Often measured in inferences/second or samples/second, per-server throughput is critical to cost-effective scaling in data centers.
Efficiency
Amount of throughput delivered per unit-power, often expressed as performance/watt. Efficiency is another key factor to cost-effective data center scaling, since servers, server racks, and entire data centers must operate within fixed power budgets.
Latency
Time to execute an inference, usually measured in milliseconds. Low latency is critical to delivering rapidly growing, real-time inference-based services.
Accuracy
A trained neural network’s ability to deliver the correct answer. For image-classification use cases, the critical metric is expressed as a top-5 or top-1 percentage.
Memory Usage
The host and device memory that need to be reserved to do inference on a network depend on the algorithms used. This constrains what networks, and what combinations of networks, can run on a given inference platform. It is particularly important for systems where multiple networks are needed and memory resources are limited, such as cascading multi-class detection networks used in intelligent video analytics and multi-camera, multi-network autonomous driving systems.

If the deployment hardware is an NVIDIA product, all five of these can be optimized with TensorRT, which provides five capabilities:

Quantization
Most deep learning frameworks train neural networks in full 32-bit precision (FP32). Once the model is fully trained, inference computations can use half precision FP16 or even INT8 tensor operations, since gradient backpropagation is not required for inference. Using lower precision results in smaller model size, lower memory utilization and latency, and higher throughput.
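The core idea of INT8 quantization can be sketched in a few lines. The following is a simplified illustration of symmetric, per-tensor post-training quantization using NumPy; TensorRT's actual INT8 path additionally uses calibration data to choose dynamic ranges:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: map FP32 values onto INT8."""
    scale = np.abs(x).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale

weights = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q.nbytes / weights.nbytes)   # 0.25: INT8 needs 1/4 the memory of FP32
```

The maximum reconstruction error is bounded by half the scale, which is why inference (with no gradients to propagate) tolerates the precision loss while gaining smaller models and higher throughput.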
Kernel Auto Tuning
During the optimization phase TensorRT also chooses from hundreds of specialized kernels, many of them hand-tuned and optimized for a range of parameters and target platforms. As an example, there are several different algorithms to do convolutions. TensorRT will pick the implementation from a library of kernels that delivers the best performance for the target GPU, input data size, filter size, tensor layout, batch size and other parameters. This ensures that the deployed model is performance tuned for the specific deployment platform as well as for the specific neural network being deployed.
Elimination of Redundant Layers and Operations
Layers whose outputs are not used, and operations that are equivalent to no-ops, are removed from the graph.

Figure 1. TensorRT’s vertical and horizontal layer fusion and layer elimination optimizations simplify the GoogLeNet Inception module graph, reducing computation and memory overhead.
Layer & Tensor Fusion
TensorRT parses the network computational graph and looks for opportunities to perform graph optimizations. These graph optimizations do not change the underlying computation in the graph: instead, they look to restructure the graph to perform the operations much faster and more efficiently.
When a deep learning framework executes this graph during inference, it makes multiple function calls for each layer. Since each operation is performed on the GPU, this translates to multiple CUDA kernel launches. The kernel computation is often very fast relative to the kernel launch overhead and the cost of reading and writing the tensor data for each layer. This results in a memory bandwidth bottleneck and underutilization of available GPU resources.
TensorRT addresses this by vertically fusing kernels to perform the sequential operations together. This layer fusion reduces kernel launches and avoids writing into and reading from memory between layers. In the network on the left of Figure 1, the convolution, bias, and ReLU layers of various sizes can be combined into a single kernel called CBR, as the right side of Figure 1 shows. A simple analogy is making three separate trips to the supermarket to buy three items versus buying all three in a single trip.
TensorRT also recognizes layers that share the same input data and filter size, but have different weights. Instead of three separate kernels, TensorRT fuses them horizontally into a single wider kernel as shown for the 1×1 CBR layer in the right side of Figure 1.
TensorRT can also eliminate the concatenation layers in Figure 1 (“concat”) by preallocating output buffers and writing into them in a strided fashion.
Overall the result is a smaller, faster and more efficient graph with fewer layers and kernel launches, therefore reducing inference latency.
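The effect of vertical fusion can be illustrated with a toy example. This uses a matrix multiply as a stand-in for convolution (real fusion happens at the CUDA-kernel level, not in NumPy): three separate passes that each materialize an intermediate tensor versus one fused "CBR" pass, producing identical results.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16)).astype(np.float32)   # input activations
W = rng.standard_normal((16, 32)).astype(np.float32)  # "convolution" weights (matmul stand-in)
b = rng.standard_normal(32).astype(np.float32)        # bias

# Unfused: three ops, each writing an intermediate tensor to memory
# (analogous to three kernel launches with global-memory round trips).
t1 = x @ W
t2 = t1 + b
y_unfused = np.maximum(t2, 0.0)

# Fused "CBR": one pass, no intermediates materialized between stages.
def cbr(x, W, b):
    return np.maximum(x @ W + b, 0.0)

y_fused = cbr(x, W, b)
print(np.allclose(y_unfused, y_fused))  # True: fusion changes scheduling, not math
```

The numerical result is unchanged; what fusion saves is the per-layer kernel-launch overhead and the round trips through global memory for `t1` and `t2`.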
Dynamic Tensor Memory
TensorRT also reduces memory footprint and improves memory reuse by designating memory for each tensor only for the duration of its usage, avoiding memory allocation overhead for fast and efficient execution.
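Reserving memory only for a tensor's live range amounts to an interval-assignment problem. The following greedy sketch is a simplified illustration of the idea, not TensorRT's actual allocator:

```python
# Each tensor is live over [first_use, last_use] (layer indices).
# Greedily reuse a buffer whose previous occupant is no longer live.
def assign_buffers(live_ranges):
    buffers = []       # buffers[i] = layer index at which buffer i frees up
    assignment = {}    # tensor name -> buffer index
    for name, (start, end) in sorted(live_ranges.items(), key=lambda kv: kv[1][0]):
        for i, free_at in enumerate(buffers):
            if free_at < start:          # previous occupant is dead: reuse
                buffers[i] = end
                assignment[name] = i
                break
        else:                            # no reusable buffer: allocate a new one
            buffers.append(end)
            assignment[name] = len(buffers) - 1
    return assignment, len(buffers)

# Five intermediate tensors in a chain, but at most two are live at once,
# so the whole network runs in two buffers instead of five.
ranges = {"t0": (0, 1), "t1": (1, 2), "t2": (2, 3), "t3": (3, 4), "t4": (4, 5)}
assignment, n_buffers = assign_buffers(ranges)
print(n_buffers)  # 2
```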

Using TensorRT is a two-step process:

Build
The build phase needs to be run on the target deployment GPU platform. For example, if your application is going to run on a Jetson TX2, the build needs to be performed on a Jetson TX2, and likewise if your inference services will run in the cloud on AWS P3 instances with Tesla V100 GPUs, then the build phase needs to run on a system with a Tesla V100. This step is only performed once, so typical applications build one or many engines once, and then serialize them for later use.
We use TensorRT to parse a trained model and perform optimizations for specified parameters such as batch size, precision, and workspace memory for the target deployment GPU. The output of this step is an optimized inference execution engine, which we serialize to a file on disk called a plan file.
A plan file includes serialized data that the runtime engine uses to execute the network. It’s called a plan file because it includes not only the weights, but also the schedule for the kernels to execute the network. It also includes information about the network that the application can query in order to determine how to bind input and output buffers.
Deploy
This is the deployment step. We load and deserialize a saved plan file to create a TensorRT engine object, and use it to run inference on new data on the target deployment platform.
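The two phases can be sketched with the TensorRT Python API. This is a hedged sketch against the TensorRT 8.x API (names differ across versions); the ONNX input, plan path, and workspace size are illustrative, and both functions require an NVIDIA GPU with TensorRT installed to actually run:

```python
def build_plan(onnx_path: str, plan_path: str, workspace_bytes: int = 1 << 30):
    """Build phase: parse an ONNX model and serialize an optimized plan file.
    Must run on the target deployment GPU platform."""
    import tensorrt as trt  # requires an NVIDIA GPU with TensorRT installed

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_bytes)
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # enable reduced precision

    plan = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as f:
        f.write(plan)  # the serialized "plan file"

def load_engine(plan_path: str):
    """Deploy phase: deserialize the saved plan into a runnable engine."""
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(plan_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())
```

Note that the build is done once per target platform and the plan is reused, which is exactly why the plan file carries the kernel schedule and not just the weights.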

Perception architecture for industrial applications such as object detection and autonomous driving:

By introducing deep neural networks and other complex processing tasks into a stream-processing pipeline, video and other sensor data can be analyzed in near real time. Extracting meaningful insights from these sensors creates opportunities to improve operational efficiency and safety. Cameras, for example, are today's most widely used sensor for object recognition: they can be found in our homes, on streets, in parking lots, in shopping malls, warehouses, and factories, virtually everywhere. The potential uses of video analytics are vast: autonomous driving, access control, loss prevention, automated checkout, surveillance, security, automated inspection (QA), package sorting (smart logistics), traffic control/engineering, industrial automation, and more.


A general streaming-analytics architecture defines a scalable video-processing pipeline that can perform inference, object tracking, and reporting, and can even store results to the cloud. As the application analyzes each video frame, plugins extract information and store it as part of a cascaded metadata record, keeping the record associated with its source frame. The complete collection of metadata at the end of the pipeline represents the full set of information that the deep learning models and other analytics plugins extracted from the frame. The application can use this information for display, or transmit it externally as part of a message for further analysis or long-term archiving.

The metadata uses an extensible standard structure. The base metadata structure, NvDsBatchMeta, starts with batch-level metadata created inside the required Gst-nvstreammux plugin. Subsidiary metadata structures hold frame, object, classifier, and label data.
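A simplified stand-in for the batch → frame → object metadata hierarchy described above (plain Python dataclasses for illustration, not the actual DeepStream NvDsBatchMeta C structures):

```python
from dataclasses import dataclass, field

@dataclass
class ObjectMeta:                  # one detected object within a frame
    class_id: int
    confidence: float
    bbox: tuple                    # (left, top, width, height)
    labels: list = field(default_factory=list)   # classifier/label results

@dataclass
class FrameMeta:                   # per-frame metadata, tied to its source frame
    source_id: int                 # which camera/stream produced the frame
    frame_num: int
    objects: list = field(default_factory=list)

@dataclass
class BatchMeta:                   # batch-level root, created at the stream muxer
    frames: list = field(default_factory=list)

# Downstream plugins append to the cascaded record as the frame flows through.
batch = BatchMeta()
frame = FrameMeta(source_id=0, frame_num=42)
frame.objects.append(ObjectMeta(class_id=1, confidence=0.93, bbox=(10, 20, 64, 128)))
batch.frames.append(frame)
print(len(batch.frames), len(batch.frames[0].objects))  # 1 1
```

Because every object record hangs off its frame, and every frame off its batch, the complete metadata set at the end of the pipeline stays associated with the source frames it was extracted from.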
