Training code for single-machine single-GPU, single-machine multi-GPU, and multi-machine multi-GPU

Single-machine multi-GPU ------> multi-machine multi-GPU: this amounts to turning single-process code into multi-process code.

Data parallelism

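PyTorch numbers GPUs starting from 0 and by default uses only GPU 0. To use a different GPU, or several GPUs, you have to configure it explicitly.
With nn.DataParallel, the batch size is the total batch for one step: a batch of 64 is split along the batch dimension across the participating GPUs (for example, 32 per card on two cards), so it is not 64 per card. With DDP, where each process builds its own dataloader, the batch size is per process.
DataParallel automatically splits the data and dispatches the work to model replicas on multiple GPUs. After each replica has finished its work, DataParallel gathers and merges the results before returning them to you.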
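
To make the splitting concrete, here is a minimal pure-Python sketch of how one batch could be divided among GPUs. It assumes a chunk-style split along the batch dimension (the behaviour of torch.chunk); split_batch is a hypothetical helper written for illustration, not a PyTorch API.

```python
import math


def split_batch(batch_size, num_gpus):
    """Return how many samples each GPU would receive.

    Assumption: chunk-style split, where every chunk holds
    ceil(batch_size / num_gpus) samples and the last chunk takes the rest.
    """
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(split_batch(64, 2))  # [32, 32]
print(split_batch(64, 3))  # [22, 22, 20]
```

Under this assumption the per-card share shrinks as cards are added, which is why the total batch size is often raised when more GPUs are used.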

Two ways to write GPU training code

References: https://blog.csdn.net/xys430381_1/article/details/106635977
https://blog.csdn.net/zqx951102/article/details/127946871
https://blog.csdn.net/leo0308/article/details/119721078
Method 1: using the CUDA_VISIBLE_DEVICES environment variable
Step 1: specify the GPU
Set it directly in the terminal:

export CUDA_VISIBLE_DEVICES=1

Set it in Python code:
1. Using a single card

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

2. Using multiple cards

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0,1'

Step 2: create the device
Purpose: narrow down the candidate GPUs and pick the one that is actually put to use.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
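# Note: torch.device("cuda") means the current CUDA device (by default the first GPU listed in CUDA_VISIBLE_DEVICES).
# It does not by itself use every visible GPU; that requires nn.DataParallel or DDP.
# "cuda:0", "cuda:1", etc. select one specific GPU among those listed in CUDA_VISIBLE_DEVICES.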
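
As a small illustration of the difference between "cuda" and "cuda:N", the sketch below only constructs device objects and inspects their index, so it should run even on a machine without a GPU. The variable names are chosen for this example.

```python
import torch

# No index: refers to whichever CUDA device is current when it is used.
default_dev = torch.device("cuda")
# Explicit index: the second GPU among the visible ones.
second_dev = torch.device("cuda:1")

print(default_dev.index)  # None
print(second_dev.index)   # 1
```

A device object names a single device. Spreading work over several GPUs is done by wrapping the model, not by the device string.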

Pay attention to the mapping between logical and physical GPU numbers. For example:

import os
import torch
# Make physical GPUs 2, 3, 4 and 5 the candidate GPUs.
# cuda:0 then refers to physical GPU 2.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3,4,5"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
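
The renumbering can be sketched in plain Python. visible_device_map is a hypothetical helper for illustration only, and it assumes the variable contains integer GPU ids (not UUIDs).

```python
def visible_device_map(cuda_visible_devices):
    """Map the logical index N of "cuda:N" to the physical GPU id.

    Assumption: the string is a comma-separated list of integer ids.
    """
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    return {logical: int(physical) for logical, physical in enumerate(ids)}


mapping = visible_device_map("2,3,4,5")
print(mapping)  # {0: 2, 1: 3, 2: 4, 3: 5}
```

So with this setting "cuda:3" would be physical GPU 5, and "cuda:4" would not exist.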

Step 3: move the data and the model onto the device

input = data.to(device)
model = MyModule(...).to(device)
Note: with multiple GPUs, the model needs one extra step (wrapping it for data parallelism).
The multi-GPU version of step 3 is:

input = data.to(device)
model = MyModule(...)
# wrap the model for data parallelism
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
model = model.to(device)
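
Putting the three steps together, a minimal end-to-end sketch might look like the following. TinyNet and build_model are made-up names for this example, and the code falls back to the CPU when no GPU is present, so it is only an illustration of the flow, not a full training script.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for MyModule: one linear layer."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc(x)


def build_model(device):
    model = TinyNet()
    # Wrap only when more than one GPU is visible.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model.to(device)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = build_model(device)
batch = torch.randn(64, 8).to(device)  # Tensor.to returns a new tensor
output = model(batch)
print(output.shape)
```

The output keeps the full batch dimension because DataParallel gathers the per-GPU results back onto one device.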

Method 2: set_device() + .cuda()

However, the official documentation recommends CUDA_VISIBLE_DEVICES and discourages the set_device function.
Step 1: set the device with set_device

import torch
gpu_id = [0, 1, 2]
torch.cuda.set_device(gpu_id)
# This raises an error: set_device accepts a single device (an int index or a torch.device), not a list, so it selects only one GPU.

Step 2: use the GPU via .cuda()

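data = data.cuda()  # for a tensor, .cuda() returns a copy, so keep the result
model.cuda()        # for a module, .cuda() moves the parameters in place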

Additional notes on method 2:
torch.cuda.set_device(gpu_id) can only select a single GPU.
If you only need one card, torch.cuda.set_device(1) selects the GPU by index
(this approach is not recommended).

torch.cuda.set_device(1)
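print(torch.cuda.device_count())  # number of available GPUs
(My machine has 4 cards, so this prints 4: torch.cuda.set_device(1) does not change which cards are visible.)
Afterwards you can still choose cards with torch.nn.DataParallel(model, device_ids=[1, 2]), but the list must include device 1, the one selected by set_device(1). The drawback is that some GPU memory on device:0 is still occupied.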

A quick trial:
https://github.com/rwightman/pytorch-image-models
Its training code follows the second method and can only train on one GPU.
Modification 1: simply add one nn.DataParallel line.
Original code:

model.cuda()

Changed to:

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model, device_ids=[0, 1])  # specify the GPUs
model.cuda()

Check GPU usage with nvidia-smi before and after the change.

Modification 2, more involved:

os.environ["CUDA_VISIBLE_DEVICES"] = '0,1'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
...
# torch.cuda.set_device(args.local_rank)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
# model.cuda()
model.to(device)

# train_loss_fn = train_loss_fn.cuda()
# validate_loss_fn = nn.CrossEntropyLoss().cuda()
train_loss_fn = train_loss_fn.to(device)
validate_loss_fn = nn.CrossEntropyLoss().to(device)

# input, target = input.cuda(), target.cuda()
input = input.to(device)
target = target.to(device)

# input = input.cuda()
# target = target.cuda()
input = input.to(device)
target = target.to(device)

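Modification 1 runs without error when distributed training is not used, but fails when launched with distributed training:
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
Modification 2 raises no error.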

Multi-machine training with DDP
https://copyfuture.com/blogs-details/20211120101612016u
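Distributed training involves three parameters:
WORLD_SIZE, RANK, LOCAL_RANK
On a single machine with multiple cards: WORLD_SIZE is the number of processes in use (one process drives one or more GPUs). RANK and LOCAL_RANK have the same value here, namely the index of the process among the WORLD_SIZE processes.
On multiple machines with multiple cards: WORLD_SIZE is the total number of processes across all machines (one process per GPU), RANK is the index of the process among all WORLD_SIZE processes, and LOCAL_RANK is the index of the process (GPU) on the current machine.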
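
The relationship between the three values can be sketched as follows. process_layout is a hypothetical helper, and the sketch assumes every machine runs the same number of processes and that ranks are assigned machine by machine (rank = node_rank × nproc_per_node + local_rank), which is the layout the torch.distributed.launch launcher uses.

```python
def process_layout(nnodes, nproc_per_node):
    """List (node_rank, local_rank, rank) for every process in the job."""
    layout = []
    for node_rank in range(nnodes):
        for local_rank in range(nproc_per_node):
            rank = node_rank * nproc_per_node + local_rank
            layout.append((node_rank, local_rank, rank))
    return layout


world = process_layout(nnodes=2, nproc_per_node=4)
print(len(world))  # 8  (WORLD_SIZE)
print(world[5])    # (1, 1, 5)
```

With a single machine, node_rank is always 0, so rank and local_rank coincide.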

The WORLD_SIZE parameter: the total number of processes. Whether a run is distributed is decided from this count.

if 'WORLD_SIZE' in os.environ:
    args.distributed = int(os.environ['WORLD_SIZE']) > 1

WORLD_SIZE = int(os.getenv('WORLD_SIZE', 1))

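os.getenv('WORLD_SIZE', 1) returns the default 1 when the variable is not set, i.e. in a non-distributed run.
int(os.environ['WORLD_SIZE']) is greater than 1 under distributed training, which is what distinguishes the two cases. Because os.environ[...] raises KeyError when the variable is unset, the check is guarded by 'WORLD_SIZE' in os.environ.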
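
A minimal sketch of reading these variables in one place is shown below. read_dist_env is a hypothetical helper, and the defaults (1 for WORLD_SIZE, 0 for RANK and LOCAL_RANK) are assumptions for this example; some projects use -1 as the "not distributed" value for RANK instead.

```python
import os


def read_dist_env(environ=None):
    """Collect the distributed settings, with single-process defaults."""
    environ = os.environ if environ is None else environ
    world_size = int(environ.get("WORLD_SIZE", 1))
    return {
        "distributed": world_size > 1,
        "world_size": world_size,
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
    }


# Plain "python train.py": nothing is set, so the run is not distributed.
print(read_dist_env({}))
# What the second of two launched processes would see.
print(read_dist_env({"WORLD_SIZE": "2", "RANK": "1", "LOCAL_RANK": "1"}))
```

Passing a dictionary instead of os.environ makes the logic easy to check without actually launching several processes.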

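The LOCAL_RANK parameter: the index of the process on this machine. Each process uses it to call torch.cuda.set_device and to initialize itself. Before calling any other DDP method, the process group must be initialized with torch.distributed.init_process_group().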

if 'LOCAL_RANK' in os.environ:
    args.local_rank = int(os.getenv('LOCAL_RANK'))

if args.distributed:
    if 'LOCAL_RANK' in os.environ:
        args.local_rank = int(os.getenv('LOCAL_RANK'))
    args.device = 'cuda:%d' % args.local_rank
    torch.cuda.set_device(args.local_rank)
    print("cuda:", args.local_rank)
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    args.world_size = torch.distributed.get_world_size()
    args.rank = torch.distributed.get_rank()

With 2 GPUs, the output is

cuda: 0
cuda: 1

Model creation: each process creates one model and must be given local_rank, i.e. the id of the current process.

from torch.nn.parallel import DistributedDataParallel as NativeDDP
try:
    # Apex is optional; has_apex records whether it is installed.
    from apex.parallel import DistributedDataParallel as ApexDDP
    has_apex = True
except ImportError:
    has_apex = False
# setup distributed training
if args.distributed:
    if has_apex and use_amp == 'apex':
        # Apex DDP preferred unless native amp is activated
        if args.local_rank == 0:
            _logger.info("Using NVIDIA APEX DistributedDataParallel.")
        model = ApexDDP(model, delay_allreduce=True)
    else:
        if args.local_rank == 0:
            _logger.info("Using native Torch DistributedDataParallel.")
        model = NativeDDP(model, device_ids=[args.local_rank], broadcast_buffers=not args.no_ddp_bb)
    # NOTE: EMA model does not need to be wrapped by DDP

# DDP mode
if cuda and RANK != -1:
    model = DDP(model, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK)

dataloader: each process creates one dataloader and must be given LOCAL_RANK, i.e. the id of the current process.

# Trainloader
train_loader, dataset = create_dataloader(train_path, imgsz, batch_size // WORLD_SIZE, gs, single_cls,
                                          hyp=hyp, augment=True, cache=None if opt.cache == 'val' else opt.cache,
                                          rect=opt.rect, rank=LOCAL_RANK, workers=workers,
                                          image_weights=opt.image_weights, quad=opt.quad,
                                          prefix=colorstr('train: '), shuffle=True)
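
The idea of one dataloader per process can be sketched in plain Python: each process takes batch_size // WORLD_SIZE samples per step from its own slice of the dataset. shard_indices is a hypothetical helper that mirrors the idea behind torch.utils.data.distributed.DistributedSampler (without shuffling); it is not that class's actual implementation and assumes the padding is no larger than the dataset.

```python
import math


def shard_indices(num_samples, world_size, rank):
    """Return the dataset indices handled by one process.

    The index list is padded by repeating its first entries so that
    every process receives the same number of samples.
    """
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]
    return indices[rank:total:world_size]


per_process_batch = 64 // 4  # batch_size // WORLD_SIZE
print(per_process_batch)     # 16

for rank in range(4):
    print(rank, shard_indices(10, 4, rank))
```

In this toy run every process gets 3 of the 10 samples, and samples 0 and 1 appear twice because of the padding.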

loss computation: under distributed training, loss = loss × number of processes (WORLD_SIZE)


with amp.autocast(enabled=cuda):
    pred = model(imgs)  # forward
    loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size
    if RANK != -1:
        loss *= WORLD_SIZE  # gradient averaged between devices in DDP mode

if args.distributed:
    reduced_loss = utils.reduce_tensor(loss.data, args.world_size)
    losses_m.update(reduced_loss.item(), input.size(0))
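
Why multiply by WORLD_SIZE? DDP averages gradients over the processes, so scaling each local loss by WORLD_SIZE turns that average back into a sum. The arithmetic can be checked with a toy sketch; ddp_average is a hypothetical stand-in for the all-reduce step, and the gradient values are made up.

```python
def ddp_average(per_rank_values):
    """Stand-in for DDP's gradient all-reduce: the mean over processes."""
    return sum(per_rank_values) / len(per_rank_values)


world_size = 2
local_grads = [0.5, 1.5]  # made-up gradient of the local loss on each process

plain = ddp_average(local_grads)
scaled = ddp_average([g * world_size for g in local_grads])

print(plain)   # 1.0
print(scaled)  # 2.0, the sum of the local gradients
```

The same averaging is what a reduce_tensor-style helper does for logging: it sums a value over all processes and divides by the world size, so every process reports the same loss.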

Printing: log from one process only

if args.local_rank == 0:
    _logger.info('Using NVIDIA APEX AMP. Training in mixed precision.')
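
An alternative to repeating the local_rank check before every log call is to configure the logger once per process. The sketch below uses only the standard logging module; make_logger is a hypothetical helper, and raising the level to WARNING on other processes is one possible design choice, not what the code above does.

```python
import logging


def make_logger(local_rank, name="train"):
    """Return a logger that emits INFO only on local rank 0."""
    logger = logging.getLogger(f"{name}.rank{local_rank}")
    # Other processes still report warnings and errors.
    logger.setLevel(logging.INFO if local_rank == 0 else logging.WARNING)
    return logger


main_logger = make_logger(0)
worker_logger = make_logger(1)
print(main_logger.isEnabledFor(logging.INFO))    # True
print(worker_logger.isEnabledFor(logging.INFO))  # False
```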

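Launching directly with python train.py starts a single process. That process can use one GPU or several, depending on the CUDA settings in the code; in other words, WORLD_SIZE=1 when launched with python train.py.
Launching with the distributed training command: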

export NCCL_IB_DISABLE=1; export NCCL_P2P_DISABLE=1; NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr "10.17.21.11" --master_port 12345 train.py

Launched this way, the number of processes is nproc_per_node × nnodes = 2, i.e. WORLD_SIZE=2.

export NCCL_IB_DISABLE=1; export NCCL_P2P_DISABLE=1; NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr "10.17.21.11" --master_port 12345 train.py

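Launched this way, the number of processes is nproc_per_node × nnodes = 8, i.e. WORLD_SIZE=8. The command shown is for the first machine (--node_rank 0); the second machine runs the same command with --node_rank 1.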

Summary: each process creates one dataloader and one DDP model, computes loss = loss × number of processes, and then runs backpropagation.