

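With the latest 19.03.0 Beta release, you no longer need to spend time downloading the NVIDIA-DOCKER plugin or rely on the nvidia-wrapper to launch GPU containers. You can now pass the --gpus option to the docker run CLI to let a container use GPU devices seamlessly.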

New Docker CLI API Support for NVIDIA GPUs under Docker Engine 19.03.0 Pre-Release







$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

If you chose to install the CUDA samples and have not run any other program yet, you can build deviceQuery from the samples directory:

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery


./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla T4"
  CUDA Driver Version / Runtime Version          11.0 / 10.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 15110 MBytes (15843721216 bytes)
  (40) Multiprocessors, ( 64) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1590 MHz (1.59 GHz)
  Memory Clock rate:                             5001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS


CUDA driver version is insufficient for CUDA runtime version
Result = Fail
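
This failure means the CUDA version supported by the installed driver is lower than the CUDA runtime the program was built against. As a rough sketch of that comparison (version_ge is a hypothetical helper of mine, and it assumes GNU sort with the -V option), using the two numbers deviceQuery printed in the passing run:

```shell
#!/bin/sh
# version_ge A B: succeed when version A is greater than or equal to B.
# Hypothetical helper; relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Values as reported by deviceQuery in the passing run.
driver_cuda=11.0
runtime_cuda=10.0

if version_ge "$driver_cuda" "$runtime_cuda"; then
    echo "driver $driver_cuda supports runtime $runtime_cuda"
else
    echo "driver $driver_cuda is insufficient for runtime $runtime_cuda"
fi
```

If the second branch fires on your machine, either upgrade the driver or use a CUDA runtime (or image tag) that is no newer than what the driver supports.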


I suspect you somehow ended up with CUDA runtime libraries installed into your image from a host machine that are a mismatch with the driver version running on your current host. How did you generate the submarineas/centos:v0.1 image?
You can’t do this. The image must run with the host CUDA libraries injected into it. This is one of the primary functionalities that nvidia-docker provides. To fix the situation, you need to go into /usr/lib/x86_64-linux-gnu/ inside your container and remove any files of the form *.so. (e.g. libnvidia-ml.so.410.104) that don’t match the driver version on your host.
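
To find the stale files the reply talks about without deleting anything by accident, something like the following sketch could be run inside the container. The helper name list_mismatched_nvidia_libs is made up for illustration, and restricting the search to libnvidia-*.so.* and libcuda.so.* with a dotted version suffix is my own assumption about which files are driver-versioned; it only prints candidates.

```shell
#!/bin/sh
# list_mismatched_nvidia_libs DIR DRIVER_VERSION
# Print driver-versioned NVIDIA libraries in DIR whose version suffix differs
# from DRIVER_VERSION. Hypothetical helper: it only lists, it never deletes.
list_mismatched_nvidia_libs() {
    dir=$1
    driver=$2
    for f in "$dir"/libnvidia-*.so.* "$dir"/libcuda.so.*; do
        [ -e "$f" ] || [ -L "$f" ] || continue   # skip unmatched glob patterns
        suffix=${f##*.so.}
        case $suffix in
            *.*) ;;            # looks like a driver version, e.g. 410.104
            *) continue ;;     # SONAME links such as libcuda.so.1
        esac
        [ "$suffix" = "$driver" ] || printf '%s\n' "$f"
    done
}

# Usage inside the container, passing the host driver version by hand
# (450.51.06 is the version that appears later in this post):
#   list_mismatched_nvidia_libs /usr/lib/x86_64-linux-gnu 450.51.06
```

Review the printed list before removing anything; the cleaner fix is still to rebuild the image so that it does not carry host driver libraries at all.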

Check the Docker status with systemctl status docker.service:

Sep 08 21:37:02 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:02.895532061+08:00" level=info msg="ccResolverWrapper: sending update ...ule=grpc
Sep 08 21:37:02 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:02.895547307+08:00" level=info msg="ClientConn switching balancer to \...ule=grpc
Sep 08 21:37:02 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:02.904539499+08:00" level=info msg="[graphdriver] using prior storage ...verlay2"
Sep 08 21:37:03 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:03.235120765+08:00" level=info msg="Loading containers: start."
Sep 08 21:37:05 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:05.503083212+08:00" level=info msg="Default bridge (docker0) is assign...address"
Sep 08 21:37:06 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:06.347242198+08:00" level=info msg="Loading containers: done."
Sep 08 21:37:06 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:06.507743081+08:00" level=info msg="Docker daemon" commit=633a0ea grap...=19.03.5
Sep 08 21:37:06 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:06.507838124+08:00" level=info msg="Daemon has completed initialization"
Sep 08 21:37:06 iZwz9dnzb8iugujf36fuw9Z dockerd[2493]: time="2020-09-08T21:37:06.574231587+08:00" level=info msg="API listen on /var/run/docker.sock"


Sep 08 14:11:41 10-9-111-182 dockerd[1058]: time="2020-09-08T14:11:41.522856125+08:00" level=error msg="Handler for POST /v1.40/containers/a3d065de1ea9/restar>
Sep 08 15:57:46 10-9-111-182 dockerd[1058]: time="2020-09-08T15:57:46.154184854+08:00" level=error msg="stream copy error: reading from a closed fifo"
Sep 08 15:57:46 10-9-111-182 dockerd[1058]: time="2020-09-08T15:57:46.154205023+08:00" level=error msg="stream copy error: reading from a closed fifo"

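Don't ask me how I ended up with this; I was badly misled by all sorts of low-quality blog posts. If the Docker daemon logs errors like these, daemon.json was modified at some point and then updated or deleted incorrectly, which left the Docker daemon in a broken state, so Docker needs to be restarted. If instead the log comes from a container you started, change all of its ports and data volumes and recreate the container.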
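
Before restarting, it is worth checking that daemon.json at least parses. Below is a minimal sketch; daemon_json_ok is a hypothetical helper, it assumes python3 is installed on the host, and /etc/docker/daemon.json is the default location on Linux. Valid JSON does not guarantee that every option in the file is valid for dockerd.

```shell
#!/bin/sh
# daemon_json_ok FILE: succeed when FILE is absent (Docker then falls back to
# its defaults) or parses as JSON. Assumes python3 is installed on the host.
daemon_json_ok() {
    [ ! -e "$1" ] && return 0
    python3 -m json.tool "$1" > /dev/null 2>&1
}

conf=/etc/docker/daemon.json   # default location on Linux
if daemon_json_ok "$conf"; then
    echo "$conf looks fine; restart with: sudo systemctl restart docker"
else
    echo "$conf is not valid JSON; fix it before restarting Docker"
fi
```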




curl https://get.docker.com | sh
sudo systemctl start docker && sudo systemctl enable docker
# Set up the stable repository and the GPG key:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# To access experimental features such as CUDA on WSL or the new MIG capability on A100,
# you may want to add the experimental branch to the repository list.
# Optional:
curl -s -L https://nvidia.github.io/nvidia-container-runtime/experimental/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
# After updating the package listing, install the nvidia-docker2 package (and its dependencies):
sudo apt-get update
sudo apt-get install -y nvidia-docker2
# After setting the default runtime, restart the Docker daemon to complete the installation:
sudo systemctl restart docker


distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
# Enable the experimental repositories (optional):
sudo yum-config-manager --enable libnvidia-container-experimental
sudo yum-config-manager --enable nvidia-container-runtime-experimental
# Disable them again when they are no longer needed:
sudo yum-config-manager --disable libnvidia-container-experimental
sudo yum-config-manager --disable nvidia-container-runtime-experimental

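This works because nvidia-docker comes down to three core pieces: nvidia-container-runtime, libnvidia-container-experimental, and the CUDA toolkit. Still, where official documentation exists it is best to follow it; the two examples are: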





# nvidia-docker: installed via nvidia-container-toolkit
docker run --gpus=all --rm nvidia/cuda:10.0-base nvidia-smi
# nvidia-docker2
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --rm nvidia/cuda:10.0-base nvidia-smi
nvidia-docker run -e NVIDIA_VISIBLE_DEVICES=all --rm nvidia/cuda:10.0-base nvidia-smi


could not select device driver "" with capabilities: [[gpu]].

This problem does not occur with nvidia-docker2. In the words of one of the contributors on issue 1034:

If you didn’t already make sure you’ve installed the nvidia-container-toolkit.
If this doesn’t fix it for you, make sure you’ve restarted docker systemctl restart dockerd
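
A quick way to act on that advice is to check that the relevant binaries are on PATH before retrying. This is only a sketch: missing_tools is a hypothetical helper, and treating the presence of nvidia-container-cli (the tool used for the debug output later in this post) as a proxy for an installed nvidia-container-toolkit is my assumption.

```shell
#!/bin/sh
# missing_tools NAME...: print each NAME that is not found on PATH.
# Hypothetical helper for a quick sanity check.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# nvidia-container-cli is used here as a stand-in for the toolkit install.
missing=$(missing_tools docker nvidia-container-cli)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing
else
    echo "tools found; if --gpus still fails, run: sudo systemctl restart docker"
fi
```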


ldcache error: open failed: /sbin/ldconfig.real: no such file or directory\\n\""": unknown.



sudo ldconfig -v # list all links
ldconfig    # should report no errors
ln -s /sbin/ldconfig /sbin/ldconfig.real
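
The ldconfig.real error is worked around by linking /sbin/ldconfig.real to /sbin/ldconfig. If you script that step, a guarded version avoids clobbering a file that already exists. The sketch below uses a hypothetical helper, ensure_link, and leaves the real invocation as a comment because it needs root on the affected host.

```shell
#!/bin/sh
# ensure_link TARGET LINK: create LINK -> TARGET unless LINK already exists.
# Hypothetical helper; prints what it did.
ensure_link() {
    if [ -e "$2" ] || [ -L "$2" ]; then
        echo "$2 already exists, leaving it alone"
    else
        ln -s "$1" "$2" && echo "created $2 -> $1"
    fi
}

# On the affected host this would be run as root:
#   ensure_link /sbin/ldconfig /sbin/ldconfig.real
```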



 sudo ln -sf /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7.4.2 /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7


starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH"


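Going by the wording, the container could not find the executable on its path. If the message is also followed by something like cuda >=, you can be sure the CUDA version is wrong. Start by looking at Docker's volumes: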

$ nvidia-docker volume ls
local               f32bc4d3933b47c923b0e3e86222e2476e7131566950daad756790bc4129626d
nvidia-docker       nvidia_driver_450.51.06


docker volume create --driver=nvidia-docker --name=nvidia_driver_$(modinfo -F version nvidia)
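
The two steps above (list the volumes, then create one named after the current driver) can be tied together so the volume is only created when it is missing. This is a sketch for the nvidia-docker 1.0 volume plugin; has_driver_volume is a hypothetical helper that reads the volume listing on stdin and compares the second column.

```shell
#!/bin/sh
# has_driver_volume VERSION: read `docker volume ls` style output on stdin and
# succeed when a volume named nvidia_driver_VERSION is listed.
# Hypothetical helper; it compares the second column of each line.
has_driver_volume() {
    awk -v want="nvidia_driver_$1" '$2 == want { found = 1 } END { exit !found }'
}

# Usage on a host with the nvidia-docker 1.0 volume plugin:
#   nvidia-docker volume ls | has_driver_volume "$(modinfo -F version nvidia)" \
#       || docker volume create --driver=nvidia-docker \
#              --name="nvidia_driver_$(modinfo -F version nvidia)"
```

A volume left over from an older driver (a different version in the name) is a sign that the driver was upgraded after the volume was created.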


systemctl status docker.service      # check the Docker service status and recent logs
sudo systemctl start nvidia-docker.service      # start the nvidia-docker service


● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─override.conf
   Active: active (running) since Tue 2020-09-08 11:09:10 CST; 7min ago
     Docs: https://docs.docker.com
 Main PID: 30459 (dockerd)
    Tasks: 40
   Memory: 66.7M
   CGroup: /system.slice/docker.service
           ├─30459 /usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
           ├─30610 /usr/bin/docker-proxy -proto tcp -host-ip -host-port 3306 -container-ip -container-port 3306
           ├─30672 /usr/bin/docker-proxy -proto tcp -host-ip -host-port 15672 -container-ip -container-port 15672
           └─30688 /usr/bin/docker-proxy -proto tcp -host-ip -host-port 5672 -container-ip -container-port 5672

Sep 08 11:09:09 10-9-111-182 dockerd[30459]: time="2020-09-08T11:09:09.703146487+08:00" level=info msg="Loading containers: start."
Sep 08 11:09:09 10-9-111-182 dockerd[30459]: time="2020-09-08T11:09:09.817109186+08:00" level=info msg="Default bridge (docker0) is assigned with an IP address Daemon option --


$ yum search libcuda
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
Last metadata expiration check: 0:14:56 ago on Tue 08 Sep 2020 01:50:04 PM CST.
No matches found.


nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0908 06:06:06.277294 106114 nvc.c:282] initializing library context (version=1.3.0, build=af0220ff5c503d9ac6a1b5a491918229edbb37a4)
I0908 06:06:06.277332 106114 nvc.c:256] using root /
I0908 06:06:06.277337 106114 nvc.c:257] using ldcache /etc/ld.so.cache
I0908 06:06:06.277341 106114 nvc.c:258] using unprivileged user 65534:65534
I0908 06:06:06.277362 106114 nvc.c:299] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0908 06:06:06.277498 106114 nvc.c:301] dxcore initialization failed, continuing assuming a non-WSL environment
I0908 06:06:06.278499 106115 nvc.c:192] loading kernel module nvidia
I0908 06:06:06.278650 106115 nvc.c:204] loading kernel module nvidia_uvm
I0908 06:06:06.278713 106115 nvc.c:212] loading kernel module nvidia_modeset
.......
CUDA version:   11.0
Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Tesla
GPU UUID:       GPU-8546d1d2-7f12-2014-2498-6738e7ac1d2b
Bus Location:   00000000:00:03.0
Architecture:   7.5
I0908 08:22:40.155167 15854 nvc.c:337] shutting down library context
I0908 08:22:40.223031 15856 driver.c:156] terminating driver service
I0908 08:22:40.223527 15854 driver.c:196] driver service terminated successfully

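One thing that stands out here is the message dxcore initialization failed, continuing assuming a non-WSL environment. I have never used anything Windows-related, so it may have to do with libnvidia-container-experimental. If you have read the earlier nvidia-container-runtime article, and yum search libcuda (or the equivalent apt-get search) finds the package, you can ignore this message.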


nvidia-docker run --rm nvidia/cuda:10.0-devel sh -c 'echo $PATH'




