Overview

While training a model with TensorFlow, I ran into the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node AllReduceGrads/NcclAllReduce (defined at /home/zw/anaconda3/envs/tf_models/lib/python3.7/site-packages/tensorpack/graph_builder/utils.py:160) with these attrs: [reduction="sum", shared_name="c0", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
  device='GPU'
  [[AllReduceGrads/NcclAllReduce]]
Errors may have originated from an input operation.
Input Source operations connected to node AllReduceGrads/NcclAllReduce:
  tower0/gradients/AddN_373 (defined at /home/zw/anaconda3/envs/tf_models/lib/python3.7/site-packages/tensorpack/train/tower.py:276)
terminate called without an active exception
terminate called recursively
terminate called recursively
*** Received signal 6 ***
*** BEGIN MANGLED STACK TRACE ***
Aborted (core dumped)

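The error occurs only when training in parallel on multiple GPUs. Training on a single GPU raises no error, and the selected GPU shows about 135 MB of GPU memory in use.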
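The "Registered devices" line in the error message is a useful first clue: it lists only CPU, XLA_CPU and XLA_GPU, with no plain GPU device. The snippet below is a minimal sketch of how that line could be checked automatically. The helper name `registered_devices` is hypothetical and not part of TensorFlow; it only parses the error text.

```python
import re


def registered_devices(error_text):
    """Return the device types on the 'Registered devices:' line, or []."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", error_text)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


# Line copied from the error message above.
error_text = "Registered devices: [CPU, XLA_CPU, XLA_GPU]"
devices = registered_devices(error_text)

# NcclAllReduce only has a kernel for device='GPU', so a missing "GPU"
# entry means the op cannot be placed anywhere.
print("GPU" in devices)  # False
```

If "GPU" is missing from this list, the next step is to look earlier in the log for the reason TensorFlow did not register the GPU devices.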

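Environment

  • OS: Ubuntu 16.04
  • CUDA version: 10.1
  • cuDNN version: 8.0.2
  • tensorflow-gpu: 1.14.0

Cause and solution

The error comes down to an environment configuration problem. Looking further up in the output, before the error shown above, I found the following messages: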

2020-08-14 13:58:07.324004: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324109: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324205: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324311: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324415: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324508: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324599: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2020-08-14 13:58:07.324614: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-08-14 13:58:07.324666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:

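These messages show that the libcu*.so.10.0 libraries (and libcudnn.so.7) cannot be found, so the error is clearly caused by a CUDA version mismatch. I had installed CUDA 10.1, whereas TensorFlow 1.14 requires CUDA 10.0. Because the GPU libraries fail to load, TensorFlow skips registering GPU devices, and the NcclAllReduce op, which only has a GPU kernel, has nowhere to run. The fix is to change either the CUDA version or the TensorFlow version. The official TensorFlow documentation lists which CUDA version each TensorFlow release requires:

Official documentation: https://www.tensorflow.org/install/source?hl=zh-cn

According to the version table on that page, tensorflow_gpu-1.14.0 should be paired with CUDA 10.0. In the end I switched the CUDA version, which solved the problem.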
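To avoid reading the log by eye, the missing library names can be pulled out with a regular expression. This is only a sketch: the function name `missing_libraries` is made up for illustration, and the pattern assumes the "Could not dlopen library '...'" wording seen in this TensorFlow 1.14 log.

```python
import re

# Matches the library name inside "Could not dlopen library '<name>'".
DLOPEN_FAILURE = re.compile(r"Could not dlopen library '([^']+)'")


def missing_libraries(log_text):
    """Return the libraries TensorFlow failed to dlopen, in order, without repeats."""
    seen = []
    for name in DLOPEN_FAILURE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Two lines shortened from the log above (timestamps and paths trimmed).
sample_log = (
    "Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: "
    "cannot open shared object file: No such file or directory\n"
    "Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: "
    "cannot open shared object file: No such file or directory\n"
)

for lib in missing_libraries(sample_log):
    print(lib)
```

The version suffixes in the output (10.0 for the CUDA libraries, 7 for cuDNN) are the versions this TensorFlow build was compiled against, which is what needs to be installed.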
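After changing the CUDA version, it can be worth confirming that the libraries are loadable before starting a long training run. The sketch below uses Python's standard `ctypes` module to try loading each library named in the log above. The list `REQUIRED_LIBS` and the helpers `can_load` and `unloadable` are illustrative names, and the check assumes a Linux system where the dynamic loader honours LD_LIBRARY_PATH.

```python
import ctypes

# Library names taken from the "Could not dlopen library" messages in the log.
REQUIRED_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def can_load(lib_name):
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        return False
    return True


def unloadable(lib_names):
    """Return the subset of lib_names that cannot be loaded."""
    return [name for name in lib_names if not can_load(name)]


for name in unloadable(REQUIRED_LIBS):
    print("cannot load:", name)
```

If nothing is printed, every library in the list could be opened, and TensorFlow 1.14 should be able to register the GPU devices.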
