  • RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.

    Error message:

    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

    There can be many reasons for this error. I won't go over the remedies spelled out in the message itself, such as setting find_unused_parameters=True on torch.nn.parallel.DistributedDataParallel; the error message explains them clearly enough.

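    For reference, this is roughly what that flag looks like in use. The snippet below is a minimal sketch rather than code from this post: it assumes the process group has already been initialized (e.g. by the launcher calling torch.distributed.init_process_group), and local_rank and the stand-in linear model are placeholders.

    import torch
    import torch.nn as nn

    # Assumes torch.distributed.init_process_group(...) has already been called.
    local_rank = 0                                   # placeholder; normally provided by torchrun / the launcher
    model = nn.Linear(128, 10).cuda(local_rank)      # stand-in for the real model
    model = nn.parallel.DistributedDataParallel(
        model,
        device_ids=[local_rank],
        find_unused_parameters=True,                 # let DDP detect parameters that receive no gradient
    )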

    If changing a flag were enough to solve your problem, you would not have found this post ^^.

    One solution

    What is actually worth noting is the last part of the error message:

    If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

    But the first time you hit this problem, the official hint alone can still leave you in a fog, so here I will share my own understanding and how I resolved it.

    Put simply, it boils down to one sentence: make sure every output of every forward function is used in computing the loss.

    Note that this does not only mean the outputs of your model's forward function; your loss may itself be computed through a forward function. In other words, every output of the forward of every module that inherits from nn.Module (not just the model itself) must take part in computing the loss.

    The problem I ran into myself was in multi-task learning: the loss was computed by a module that inherits from nn.Module, but one task's loss was left out of the total loss returned by forward, which caused this error.

    
    import torch.nn as nn
    # ContrastiveLoss is a user-defined loss; its implementation is not shown in the original post.

    class multi_task_loss(nn.Module):
        def __init__(self, device, batch_size):
            super().__init__()
            self.ce_loss_func = nn.CrossEntropyLoss()
            self.l1_loss_func = nn.L1Loss()
            self.contra_loss_func = ContrastiveLoss(batch_size, device)

        def forward(self, rot_p, rot_t,
                    pert_p, pert_t,
                    emb_o, emb_h, emb_p,
                    original_imgs, rect_imgs):
            rot_loss = self.ce_loss_func(rot_p, rot_t)
            pert_loss = self.ce_loss_func(pert_p, pert_t)
            contra_loss = self.contra_loss_func(emb_o, emb_h) \
                          + self.contra_loss_func(emb_o, emb_p) \
                          + self.contra_loss_func(emb_p, emb_h)
            rect_loss = self.l1_loss_func(original_imgs, rect_imgs)

            # tol_loss = rot_loss + pert_loss + rect_loss              # contra_loss was missing, even though every loss was returned
            tol_loss = rot_loss + pert_loss + contra_loss + rect_loss  # adding it back removes the error

            return tol_loss, (rot_loss, pert_loss, contra_loss, rect_loss)

    Check your whole computation (not just the model itself) and make sure that every output of every forward function is used in computing the loss.

    Ref:

    https://discuss.pytorch.org/t/need-help-runtimeerror-expected-to-have-finished-reduction-in-the-prior-iteration-before-starting-a-new-one/119247

  • File "train.py", line 166, in <module> main(args_) File "train.py", line 118, in main f_clean_masked, f_occ_masked, fc, fc_occ = backbone(img1, img2) File "/home/user1/miniconda3/envs/py377...

    Error:

    Traceback (most recent call last):
      File "train.py", line 166, in <module>
        main(args_)
      File "train.py", line 118, in main
        f_clean_masked, f_occ_masked, fc, fc_occ = backbone(img1, img2)
      File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
        if self.reducer._rebuild_buckets():
    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
    
    

    Fix 1: Most posts online suggest adding the argument find_unused_parameters=True to the DistributedDataParallel call. I did not try this method here.

    Fix 2: Another explanation found online is that this can happen when part of the computation graph is left outside the module you wrap with DistributedDataParallel. For example, you construct the DistributedDataParallel wrapper first and then add a few more computation steps afterwards. In that case the fix is to move those steps up front and only then construct DistributedDataParallel. This was not my situation either.

    Fix 3: A network with multiple input paths. My network takes two inputs. I first ran the two forward passes one after the other, and then did the two backward passes, the parameter update (opt.step), and the gradient reset (zero_grad), at which point this error was raised. The eventual fix: do one forward pass, then its backward pass, parameter update, and zero_grad; then do the other forward pass, followed by its backward pass, parameter update, and zero_grad. After that the error no longer appeared.
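    In code, that reordering looks roughly like the sketch below. This is only an illustration of the schedule described above, not the original training loop; net, criterion, opt and loader are placeholder names, and net is assumed to already be wrapped in DistributedDataParallel.

    # Before (error): forward A, forward B, then backward / step / zero_grad for both.
    # After (works): finish one branch completely before starting the other.
    for input_a, target_a, input_b, target_b in loader:
        out_a = net(input_a)                     # forward for branch A
        criterion(out_a, target_a).backward()    # backward for branch A
        opt.step()
        opt.zero_grad()

        out_b = net(input_b)                     # forward for branch B, only after A's update is done
        criterion(out_b, target_b).backward()
        opt.step()
        opt.zero_grad()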


    A similar error also came up when cascading networks.

  • Expected to have finished reduction in the prior iteration before starting a new one: reproducing and fixing the error

    Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of forward). You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

    Reproducing the error:

    The original code:

    n_finetune_classes = 40

    model = torch.nn.DataParallel(model, device_ids=None)
    pretrain = torch.load("pretrain_path")
    model.load_state_dict(pretrain['state_dict'], strict=False)

    # The pretrained model still has its original fc layer; I need to finetune,
    # so the fc layer is replaced here
    model.module.fc = torch.nn.Linear(model.module.fc.in_features,
                                      n_finetune_classes)
    model.module.fc = model.module.fc.cuda()
    

    Later, after switching to distributed training, the code became:

    n_finetune_classes = 40

    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=None)
    pretrain = torch.load("pretrain_path")
    model.load_state_dict(pretrain['state_dict'], strict=False)

    # The pretrained model still has its original fc layer; I need to finetune,
    # so the fc layer is replaced here, after the model has already been wrapped with DDP
    model.module.fc = torch.nn.Linear(model.module.fc.in_features,
                                      n_finetune_classes)
    model.module.fc = model.module.fc.cuda()
    

    Analysis

    This produced the error above. Literally, it says that some of the model's parameters never accumulated a gradient, i.e. some parameters were not included among the parameters the module is tracking. A preliminary conclusion: under the distributed framework, some parameters are not updated in this process even though they might be updated in other processes, and judging from the message, DistributedDataParallel wants you to pass find_unused_parameters=True so that every process can skip the parameters that do not need updating.

    Locating the problem

    Since the error did not occur before the fc layer was replaced, we can basically conclude that the newly created fc layer is not covered by the module wrapped inside DistributedDataParallel. Checking the official documentation confirms this.
    Then why doesn't DataParallel have this problem? DP works with a master GPU that is responsible for scattering, i.e. everything on the other GPUs is distributed from the master. DDP, however, registers gradient-reduction hooks for the model's parameters at the moment you wrap the model, and those hooks are what sum and average the gradients across the GPUs.

    Fix

    n_finetune_classes = 40

    # Replace the fc layer BEFORE wrapping the model with DistributedDataParallel,
    # so that the new parameters are registered for DDP's gradient reduction
    model.fc = torch.nn.Linear(model.fc.in_features,
                               n_finetune_classes)
    model.fc = model.fc.cuda()

    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=None)
    pretrain = torch.load("pretrain_path")
    model.load_state_dict(pretrain['state_dict'], strict=False)
    
    
  • Two ways to fix the error in multi-GPU DDP training

    The problem occurs during multi-GPU DDP training.

    Two ways to fix it:

    1. https://blog.csdn.net/weixin_36670529/article/details/106729116

    i.e. set find_unused_parameters=True

    2. Go through the network layer by layer and remove the layers that are never actually used.

    The core problem is that the constructed network contains parts that never actually take part in the forward pass.
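    As a hypothetical illustration of what such an unused part looks like (this is not the network from the post), a layer created in __init__ but never called in forward is enough to trigger the error under DDP:

    import torch.nn as nn

    class ToyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.backbone = nn.Linear(32, 16)
            self.head = nn.Linear(16, 10)
            self.unused_head = nn.Linear(16, 10)   # defined but never called below

        def forward(self, x):
            # self.unused_head takes no part in the forward pass, so its parameters
            # never receive gradients; under DDP this raises the error unless
            # find_unused_parameters=True is set (fix 1) or the layer is deleted (fix 2)
            return self.head(self.backbone(x))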

    The difference in training time between the two fixes is not very noticeable.

