  • Debugging PyTorch at the C++ source level

    Reposted from: https://oldpan.me/archives/how-to-debug-pytorch-deeper

    Preface

    We usually debug PyTorch on the Python side, which is enough for ordinary model-building work. But if we want to modify PyTorch itself, or study how a machine/deep learning system is put together, digging deeper inevitably means going down to the C++ source level.

    Take torch.rand(3, 4) as an example: a Python-side debugger cannot step into its internal implementation, nor can it even find its definition, so there is no way to explore the implementation details from Python. To explore and debug PyTorch properly, we need to debug its C++ part.

    Preparation

    First we need PyCharm and VSCode (on Linux), along with a Python environment and gdb (usually preinstalled). Then we create a virtual environment and compile PyTorch from source.

    Since we are going to debug PyTorch's source code, we first have to compile it ourselves. Set the DEBUG environment variable so that a debug build is produced: DEBUG=1 python setup.py install. For the detailed build steps, see the following article; I will not repeat them here:

    https://oldpan.me/archives/pytorch-build-simple-instruction

    Once PyTorch is built, open the PyTorch directory in VSCode as a whole project, then click the debug icon on the left. The first time you start debugging, VSCode will prompt you to add a launch.json configuration file, which lives in the .vscode directory.

    Next we edit launch.json. The main change is the program entry, which must point to the Python interpreter; everything else can stay as it is:

    {
        // Use IntelliSense to learn about possible attributes.
        // Hover to view descriptions of existing attributes.
        // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
        "version": "0.2.0",
        "configurations": [
            {
                "name": "(gdb) Attach",
                "type": "cppdbg",
                "request": "attach",
                "program": "/home/darknet/miniconda3/envs/torch1101/bin/python",   // change this to the python that runs your pytorch code
                "processId": "${command:pickProcess}",
                "MIMode": "gdb",
                "setupCommands": [
                    {
                        "description": "Enable pretty-printing for gdb",
                        "text": "-enable-pretty-printing",
                        "ignoreFailures": true
                    }
                ]
            }
        ]
    }
    

    Start Debugging

    The principle behind this debugging setup is:

    1. Run the Python code and note the id of the running process.
    2. Attach gdb to that process id and debug the C++ code there.

    Because attaching takes a moment, the program might run past the C++ breakpoints before we manage to attach. To prevent this, add a sleep statement at the very beginning of the .py file you are going to run, so the program idles for a while; pick any duration you like, I usually use time.sleep(50). Also set your breakpoints in the C++ code ahead of time: in VSCode, click next to the line you want to break on.
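
    For reference, a minimal target script could look like the sketch below (the sleep duration and the torch.rand call are just placeholders for whatever code you actually want to step into):

    import time
    import torch

    time.sleep(50)          # leave yourself time to attach gdb before anything interesting runs
    x = torch.rand(3, 4)    # a C++ breakpoint (e.g. in THPVariable_rand) will fire here
    print(x)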

    Then run the PyTorch code in debug mode (click the debug button in PyCharm); in the console we can see that the process id is 28536.

    Click debug in VSCode, using the configuration we prepared earlier.

    Now enter the process id from the previous step to attach. The system may ask for root permission at this point; confirm with y.

    After a short wait (the duration of the time.sleep), the program stops at the breakpoint we set. At this point we have crossed from PyTorch's Python frontend into its C++ backend:

    The debugger output is as follows:

    GNU gdb (Ubuntu 8.2-0ubuntu1~16.04.1) 8.2
    Copyright (C) 2018 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Type "show copying" and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
        <http://www.gnu.org/software/gdb/documentation/>.
    
    For help, type "help".
    Type "apropos word" to search for commands related to "word".
    Warning: Debuggee TargetArchitecture not detected, assuming x86_64.
    =cmd-param-changed,param="pagination",value="off"
    [New LWP 15524]
    [New LWP 15525]
    [New LWP 15528]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    0x00007fe9efc605d3 in select () at ../sysdeps/unix/syscall-template.S:84
    Loaded '/lib/x86_64-linux-gnu/libpthread.so.0'. Symbols loaded.
    Loaded '/lib/x86_64-linux-gnu/libc.so.6'. Symbols loaded.
    Loaded '/lib/x86_64-linux-gnu/libdl.so.2'. Symbols loaded.
    Loaded '/lib/x86_64-linux-gnu/libutil.so.1'. Symbols loaded.
    Loaded '/lib/x86_64-linux-gnu/librt.so.1'. Symbols loaded.
    Loaded '/lib/x86_64-linux-gnu/libm.so.6'. Symbols loaded.
    Loaded '/lib64/ld-linux-x86-64.so.2'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_heapq.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_opcode.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/zlib.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/../../libz.so.1'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_bz2.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_lzma.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/../../liblzma.so.5'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/grp.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_posixsubprocess.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/select.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/math.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_hashlib.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/../../libcrypto.so.1.1'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_blake2.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_sha3.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_bisect.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_random.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_struct.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/binascii.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_socket.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_datetime.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_ssl.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/../../libssl.so.1.1'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/../../libffi.so.6'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/_mklinit.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/../../../libmkl_rt.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/../../../libiomp5.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/bin/../lib/libgcc_s.so.1'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/core/_multiarray_umath.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_pickle.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/core/_multiarray_tests.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/linalg/lapack_lite.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/linalg/_umath_linalg.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_decimal.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/fft/fftpack_lite.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/mkl_fft/_pydfti.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/random/mtrand.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/../../../libmkl_core.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/../../../libmkl_intel_thread.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/../../../libmkl_intel_lp64.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/../../../libmkl_avx512.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/../../../libmkl_vml_avx512.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_json.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/pyexpat.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/fcntl.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/termios.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/resource.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/array.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_multiprocessing.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_asyncio.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_sqlite3.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/../../libsqlite3.so.0'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/unicodedata.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/lib-dynload/_lsprof.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/libshm.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/libtorch_python.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/libstdc++.so.6'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/libtorch.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/lib/libc10.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/libmkl_gnu_thread.so'. Symbols loaded.
    Loaded '/home/prototype/anaconda3/envs/pytorch/lib/libgomp.so.1'. Symbols loaded.
    Loaded '/usr/lib/libmpi_cxx.so.1'. Symbols loaded.
    Loaded '/usr/lib/libmpi.so.12'. Symbols loaded.
    Loaded '/usr/lib/x86_64-linux-gnu/libnuma.so.1'. Symbols loaded.
    Loaded '/usr/lib/libibverbs.so.1'. Symbols loaded.
    Loaded '/usr/lib/libopen-rte.so.12'. Symbols loaded.
    Loaded '/usr/lib/libopen-pal.so.13'. Symbols loaded.
    Loaded '/usr/lib/x86_64-linux-gnu/libhwloc.so.5'. Symbols loaded.
    Loaded '/usr/lib/x86_64-linux-gnu/libltdl.so.7'. Symbols loaded.
    [Switching to thread 4 (Thread 0x7fe9de0e0700 (LWP 15528))](running)
    =thread-selected,id="4"
    
    Thread 1 "python" hit Breakpoint 4, torch::autograd::THPVariable_rand (self_=0x0, args=0x7fe9df2cff08, kwargs=0x0) at ../torch/csrc/autograd/generated/python_torch_functions.cpp:8134
    8134	  }, /*traceable=*/true);
    Execute debugger commands using "-exec <command>"
    
  • 3 Simple Tricks That Will Change the Way You Debug PyTorch



    Author: Adrian Wälchli

    Translated by: ronghuaiyang

    Overview

    Good tools and good working habits can dramatically improve productivity.

    Every deep learning project is different. No matter how much experience you have, you will always run into new challenges and unexpected behavior. The skills and mindset you bring to a project determine how quickly you find and fix the obstacles standing between you and success.

    From a practical point of view, a deep learning project starts with code. Organizing it is easy at the beginning, but as the project grows in complexity, more and more time goes into debugging and sanity checking. Surprisingly, much of this can be automated. In this post I will show you how to:

    • Find out why your training loss is not going down

    • Implement automatic model validation and anomaly detection

    • Save precious debugging time with PyTorch Lightning

    For the demonstration, we will use the example of a simple MNIST classifier, which contains a few bugs:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    from torch.optim.lr_scheduler import StepLR
    from torch.utils.data import DataLoader
    from torchvision import transforms
    from torchvision.datasets import MNIST
    
    
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(1, 32, 3, 1)
            self.conv2 = nn.Conv2d(32, 64, 3, 1)
            self.dropout1 = nn.Dropout(0.25)
            self.dropout2 = nn.Dropout(0.5)
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 10)
    
        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = F.max_pool2d(x, 2)
            x = self.dropout1(x)
            x = torch.flatten(x, 1)
            x = self.fc1(x)
            x = F.relu(x)
            x = self.dropout2(x)
            x = self.fc2(x)
            output = F.log_softmax(x, dim=0)
            return output
    
    
    def train(model, device, train_loader, optimizer, epoch):
        model.train()
        for batch_idx, (x, y) in enumerate(train_loader):
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            output = model(x)
            loss = F.nll_loss(output, y)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0:
                print(f'Epoch: {epoch} [{100. * batch_idx / len(train_loader):.0f}%]\tLoss: {loss.item():.6f}')
    
    
    def test(model, device, test_loader):
        model.eval()
        test_loss = 0
        correct = 0
        with torch.no_grad():
            for x, y in test_loader:
                x, y = x.to(device), y.to(device)
                output = model(x)
                test_loss += F.nll_loss(x, y, reduction='sum').item()  # sum up batch loss
                pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
                correct += pred.eq(y.view_as(pred)).sum().item()
    
        test_loss /= len(test_loader.dataset)
        print(
            f'\nTest set: Average loss: {test_loss:.4f},'
            f' Accuracy: {100. * correct / len(test_loader.dataset):.0f}%\n'
        )
    
    
    def main():
        use_cuda = torch.cuda.is_available()
        device = torch.device("cuda" if use_cuda else "cpu")
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(128., 1.),
        ])
        train_dataset = MNIST('./data', train=True, download=True, transform=transform)
        test_dataset = MNIST('./data', train=False, transform=transform)
        train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=1)
        test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False, num_workers=1)
    
        model = Net().to(device)
        optimizer = optim.Adadelta(model.parameters(), lr=1.0)
        scheduler = StepLR(optimizer, step_size=1, gamma=0.7)
    
        epochs = 14
        for epoch in range(1, epochs + 1):
            train(model, device, train_loader, optimizer, epoch)
            test(model, device, test_loader)
            scheduler.step()
    
    
    if __name__ == '__main__':
        main()
    

    This is the most vanilla MNIST PyTorch code, adapted from github.com/pytorch/examples. If you run it, you will find that the loss does not decrease, and after the first epoch the test loop crashes. What is going on?

    Trick 0: Organize your PyTorch code structure

    Before debugging this code, we organize it into the Lightning format. PyTorch Lightning automatically puts all the boilerplate/engineering code into a Trainer object and neatly collects all the actual research code in a LightningModule, so we can focus on the most important parts:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.utils.data import DataLoader
    from torchvision import transforms, datasets
    from torch.optim.lr_scheduler import StepLR
    import pytorch_lightning as pl
    from pytorch_lightning.metrics.functional.classification import accuracy
    
    
    class LitClassifier(pl.LightningModule):
    
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 32, 3, 1)
            self.conv2 = nn.Conv2d(32, 64, 3, 1)
            self.dropout1 = nn.Dropout2d(0.25)
            self.dropout2 = nn.Dropout2d(0.5)
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 10)
            self.example_input_array = torch.rand(5, 1, 28, 28)
    
        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = F.max_pool2d(x, 2)
            x = self.dropout1(x)
            x = torch.flatten(x, 1)
            x = self.fc1(x)
            x = F.relu(x)
            x = self.dropout2(x)
            x = self.fc2(x)
            output = F.log_softmax(x, dim=0)
            return output
    
        def dataloader(self, train=False):
            transform = transforms.Compose([
                transforms.ToTensor(),
                transforms.Normalize(128, 1)
            ])
            dataset = datasets.MNIST('data', train=train, download=True, transform=transform)
            dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, pin_memory=True, shuffle=True, num_workers=1)
            return dataloader
    
        def train_dataloader(self):
            return self.dataloader(train=True)
    
        def val_dataloader(self):
            return self.dataloader(train=False)
    
        def training_step(self, batch, batch_nb):
            x, y = batch
            output = self(x)
            loss = F.nll_loss(output, y)
            acc = accuracy(torch.max(output, dim=1)[1], y)
            self.log('train_loss', loss, on_step=True)
            self.log('train_acc', acc, on_step=True, prog_bar=True)
            return loss
    
        def validation_step(self, batch, batch_nb):
            x, y = batch
            output = self(x)
            loss = F.nll_loss(x, y)
            acc = accuracy(torch.max(output, dim=1)[1], y)
            self.log('val_loss', loss, on_epoch=True, reduce_fx=torch.mean)
            self.log('val_acc', acc, on_epoch=True, reduce_fx=torch.mean)
    
        def configure_optimizers(self):
            optimizer = torch.optim.Adadelta(self.parameters(), lr=1.0)
            scheduler = StepLR(optimizer, step_size=1, gamma=0.7)
            return [optimizer], [scheduler]
    
    
    if __name__ == "__main__":
        model = LitClassifier()
        trainer = pl.Trainer(gpus=1)
        trainer.fit(model)
    

    Can you find all the bugs in this code?

    Lightning takes care of many engineering patterns that often lead to mistakes: the training, validation and test loop logic, switching the model from train to eval mode and back, moving data to the right device, checkpointing, logging, and more.

    Trick 1: Sanity check the validation loop

    If we run the code above, we immediately get an error message complaining that the sizes do not match, in line 65 in the validation step.

    ...
    ---> 65         loss = F.nll_loss(x, y)
         66         acc = accuracy(torch.max(output, dim=1)[1], y)
         67         self.log('val_loss', loss, on_epoch=True, 
                    reduce_fx=torch.mean)
    ...RuntimeError: 1only batches of spatial targets supported (3D tensors) but got targets of size: : [64]
    

    If you paid attention, you noticed that Lightning ran two validation steps before training even started. This is not a bug, it is a feature (see the Lightning debugging docs: https://pytorch-lightning.readthedocs.io/en/stable/debugging.html)! It actually saves us a great deal of time that would otherwise be wasted if the error only occurred after a long training run. Because Lightning sanity checks the validation loop at the start, we can fix the error right away: clearly, line 65 should now read:

    loss = F.nll_loss(output, y)
    

    just as in the training step.

    This one is easy to fix, because the stack trace tells us where things went wrong and it is an obvious mistake. The corrected code now runs without errors, but if we look at the loss value in the progress bar, we notice that it is stuck at 2.3. That could have many causes: a wrong optimizer, a bad learning rate or schedule, a wrong loss function, a problem with the data, and so on.

    PyTorch Lightning comes with built-in TensorBoard support; in this example, neither the training loss nor the validation loss is decreasing.

    Trick 2: Log histograms of the training data

    It is important to regularly check the range of the input data. If the model weights and the data have very different magnitudes, it can cause slow or no learning progress, and in the extreme case numerical instability. This happens, for example, when data augmentations are applied in the wrong order or when a normalization step is forgotten. Is that the case in our example? We should be able to find out by printing the minimum and maximum values. But wait! That is not a good solution, because it needlessly pollutes the code and takes far too much effort to repeat whenever it is needed again. Much better: write a callback class that does it for us!

    class InputMonitor(pl.Callback):
    
        def on_train_batch_start(self, trainer, pl_module, batch, batch_idx, dataloader_idx):
            if (batch_idx + 1) % trainer.log_every_n_steps == 0:
                x, y = batch
                logger = trainer.logger
                logger.experiment.add_histogram("input", x, global_step=trainer.global_step)
                logger.experiment.add_histogram("target", y, global_step=trainer.global_step)
    
                
    # use the callback like this:
    model = LitClassifier()
    trainer = pl.Trainer(gpus=1, callbacks=[InputMonitor()])
    trainer.fit(model)
    
    A simple callback that logs histograms of the training data to TensorBoard.

    Callbacks in PyTorch Lightning hold arbitrary code that can be injected into the Trainer. This one computes a histogram of the input data before it enters the training step. Wrapping the functionality in a callback class has the following advantages:

    1. It is separate from your research code; there is no need to modify the LightningModule!

    2. It is portable, so it can be reused in future projects with only two lines of code changed: import the callback, then pass it to the Trainer.

    3. It can be extended by subclassing or combined with other callbacks.

    With the new callback in place, we can open TensorBoard and switch to the "Histograms" tab to inspect the distribution of the training data:

    The targets lie in the range [0, 9], which is correct because MNIST has 10 digit classes, but the image values lie between -130 and -127, which is wrong! We quickly find a problem in the normalization on line 41:

    transforms.Normalize(128, 1)  # wrong normalization
    

    These two numbers are supposed to be the mean and the standard deviation of the input data (in our case, the pixels of the images). To fix it, we plug in the true mean and standard deviation, and we also name the arguments to make the call clearer:

    transforms.Normalize(mean=0.1307, std=0.3081)
    

    We can simply look these values up because they are well known for MNIST. For your own dataset, you have to compute them yourself.
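
    A quick way to compute them yourself (a minimal sketch; it loads the whole MNIST training set in one batch, which is fine for a dataset this small):

    from torchvision import datasets, transforms
    from torch.utils.data import DataLoader

    dataset = datasets.MNIST('./data', train=True, download=True, transform=transforms.ToTensor())
    loader = DataLoader(dataset, batch_size=len(dataset))   # one batch containing everything
    images, _ = next(iter(loader))
    print(images.mean().item(), images.std().item())        # roughly 0.1307 and 0.3081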

    After this normalization, the pixels have mean 0 and standard deviation 1, just like the weights of the classifier. We can confirm this by looking at the TensorBoard histograms.

    Trick 3: Detect anomalies in the forward pass

    After fixing the normalization issue, we now also get the expected histograms in TensorBoard. Unfortunately, the loss is still not decreasing. Something is still wrong. Knowing that the data is correct, a good place to start looking for mistakes is the forward path of the network. A common source of bugs are operations that manipulate tensor shapes, such as permute, reshape, view, flatten, etc., or operations applied along a single dimension, such as softmax. When these functions are applied on the wrong dimension or in the wrong order, we usually get a shape mismatch error, but that is not always the case! These bugs are nasty to track down.

    Let's look at a technique that lets us detect such mistakes quickly.

    A quick test whether the model mixes data across the batch dimension.

    The idea is simple: if we perturb the n-th input sample, it should only affect the n-th output. If any other output i ≠ n changes as well, the model mixes data across the batch, and that's bad! A reliable way to implement this test is to compute the gradient of the n-th output with respect to all inputs. The gradient must be zero for every i ≠ n (red in the original animation) and nonzero for i = n (green). If these conditions are met, the model passes the test. Here is the implementation for n = 3:

    # examine the gradient of the n-th minibatch sample w.r.t. all inputs
    n = 3  
    
    # 1. require gradient on input batch
    example_input = torch.rand(5, 1, 28, 28, requires_grad=True)
    
    # 2. run batch through model
    output = model(example_input)
    
    # 3. compute a dummy loss on n-th output sample and back-propagate
    output[n].abs().sum().backward()
    
    # 4. check that gradient on samples i != n are zero!
    # sanity check: if this does not return 0, you have a bug!
    i = 0
    example_input.grad[i].abs().sum().item()
    

    Here is the same check as a Lightning Callback:

    class CheckBatchGradient(pl.Callback):
        
        def on_train_start(self, trainer, model):
            n = 0
    
            example_input = model.example_input_array.to(model.device)
            example_input.requires_grad = True
    
            model.zero_grad()
            output = model(example_input)
            output[n].abs().sum().backward()
            
            zero_grad_inds = list(range(example_input.size(0)))
            zero_grad_inds.pop(n)
            
            if example_input.grad[zero_grad_inds].abs().sum().item() > 0:
                raise RuntimeError("Your model mixes data across the batch dimension!")
                
                
    # use the callback like this:
    model = LitClassifier()
    trainer = pl.Trainer(gpus=1, callbacks=[CheckBatchGradient()])
    trainer.fit(model)
    

    Applying this test to the LitClassifier immediately reveals that it mixes data. Now that we know what we are looking for, we quickly find the mistake in the forward pass: the softmax in line 35 is applied to the wrong dimension:

    output = F.log_softmax(x, dim=0)
    

    It should be:

    output = F.log_softmax(x, dim=1)
    

    And now the classifier works! Training and validation loss drop quickly.


    Summary

    Writing good code starts with organization. PyTorch Lightning takes care of that part by removing the boilerplate around training loop engineering, checkpoint saving, logging, and so on. What remains is the actual research code: the model, the optimization, and the data loading. If something is not working the way we expect, a bug is most likely hiding in one of these three parts of the code. In this post we implemented two callbacks that help us 1) monitor the data that goes into the model, and 2) verify that the layers in our network do not mix data across the batch dimension. The callback concept is a very elegant way of adding arbitrary logic to an existing algorithm. Once implemented, it can be integrated into a new project by changing just two lines of code.


    Original English article: https://medium.com/@adrian.waelchli/3-simple-tricks-that-will-change-the-way-you-debug-pytorch-5c940aa68b03


  • RuntimeError: CUDA error: device-side assert triggered (THCTensorScatterGather.cu)

    Error message

    RuntimeError: CUDA error: device-side assert triggered
    /pytorch/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [31,0,0], thread: [100,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
    /pytorch/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [30,0,0], thread: [162,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
    /pytorch/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [32,0,0], thread: [290,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
    

    Solution

    1. Check whether the labels contain -1, or values greater than or equal to num_classes. Once the labels are corrected, the problem goes away (a quick check is sketched after point 2).

    2. As another debugging aid, try running with the flag below, which makes CUDA kernel launches synchronous so the assert points at the real failing line:

    CUDA_LAUNCH_BLOCKING=1 python train.py 
    
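    A quick way to verify point 1 before training (my own sketch; num_classes and the DataLoader are whatever your project uses):

    def check_labels(loader, num_classes):
        """Scan every batch once; fail fast if a label falls outside [0, num_classes - 1]."""
        for _, y in loader:
            assert y.min().item() >= 0, "found a negative label (e.g. -1)"
            assert y.max().item() < num_classes, "found a label >= num_classes"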

  • debug branch pytorch1.1.0

    2020-11-28 15:08:16
    How to debug branch pytorch1.1.0 using pycharm or vscode? We know this code uses distributed training. (This question comes from the open-source project HRNet/HRNet-Semantic-Segmentation.)
  • Debug RefineDet pytorch

    2019-05-15 17:58:01

    https://github.com/lzx1413/PytorchSSD#pytorch-41-is-suppoted-on-branch-04-now

    Environment setup problems:

    libstdc++.so.6: version `GLIBCXX_3.4.22' not found
    Solution:
    https://blog.csdn.net/pursuit_zhangyu/article/details/79450027
    https://blog.csdn.net/u011961856/article/details/79644342
    
    
    from torch._C import * 
    ImportError: numpy.core.multiarray failed to import 
    Solution:
    https://github.com/pytorch/pytorch/issues/2731
    Install numpy-1.13.1

    Runtime problems:

    UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
      targets = [Variable(anno.cuda(),volatile=True) for anno in targets]
    
    Solution:
            if args.cuda:
                images = Variable(images.cuda())
                # targets = [Variable(anno.cuda(),volatile=True) for anno in targets]
                targets = [Variable(anno.cuda()) for anno in targets]
            else:
                images = Variable(images)
                # targets = [Variable(anno, volatile=True) for anno in targets]
                targets = [Variable(anno) for anno in targets]
    /PytorchSSD/layers/modules/l2norm.py:17: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
      init.constant(self.weight,self.gamma)
    Solution:
        def reset_parameters(self):
            # init.constant(self.weight, self.gamma)
            init.constant_(self.weight, self.gamma)
    UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
      priors = Variable(priorbox.forward(), volatile=True)
    Solution:
    # priors = Variable(priorbox.forward(), volatile=True)
    with torch.no_grad():
        priors = Variable(priorbox.forward())
    RuntimeError:one of the variables needed for gradient computation has been modified by an inplace operation
    
    # https://github.com/lzx1413/PytorchSSD/issues/72
    # in layers/modules/l2norm.py      
    # Reason: 0.4.0 merged Variable and Tensor into a single Tensor. Inplace operations that used to work on Variable now raise errors on Tensor, so find every inplace operation in the model and rewrite it in non-inplace form.
        def forward(self, x):
            norm = x.pow(2).sum(dim=1, keepdim=True).sqrt()+self.eps
            # x /= norm
            x = x / norm
            out = self.weight.unsqueeze(0).unsqueeze(2).unsqueeze(3).expand_as(x) * x
            return out
      File "/media/M_fM__VM_0M_eM__JM__M_eM__MM_7/PytorchSSD/refinedet_train_test.py", line 306, in train
        mean_arm_loss_c += arm_loss_c.data[0]
    IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number
    
    Solution:
    # mean_arm_loss_c += arm_loss_c.data[0]
    # mean_arm_loss_l += arm_loss_l.data[0]
    # mean_odm_loss_c += odm_loss_c.data[0]
    # mean_odm_loss_l += odm_loss_l.data[0]
    # after the fix
    mean_arm_loss_c += arm_loss_c.item()
    mean_arm_loss_l += arm_loss_l.item()
    mean_odm_loss_c += odm_loss_c.item()
    mean_odm_loss_l += odm_loss_l.item()
      File "/media/M_fM__VM_0M_eM__JM__M_eM__MM_7/PytorchSSD/layers/modules/refine_multibox_loss.py", line 114, in forward
        loss_c[pos] = 0 # filter out pos boxes for now
    IndexError: The shape of the mask [1, 6375] at index 0 does not match the shape of the indexed tensor [6375, 1] at index 0
    Solution:
    # loss_c[pos] = 0 # filter out pos boxes for now
    # loss_c = loss_c.view(num, -1)
    # changed to
    loss_c = loss_c.view(num, -1)
    loss_c[pos] = 0 # filter out pos boxes for now

     

     

  • pytorch debug

    2020-05-18 18:02:24
    A bug when using torch.nn.DataParallel for multi-GPU training in PyTorch, now solved
  • debug: pytorch CTC_Loss is nan

    2020-04-20 23:12:42
    1. There are nan values in the features. This once happened when max_pool2d parameters were set up wrongly; you can check with print(feature.max()) ... PyTorch now ships a built-in CTCLoss; its usage: >>> T = 50 # Input sequence length >>> C = 20 ...
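
    To fill that out, here is a runnable sketch along the lines of the official torch.nn.CTCLoss docs example (the sizes are arbitrary demo values):

    import torch
    import torch.nn as nn

    T, C, N, S, S_min = 50, 20, 16, 30, 10   # input length, classes (incl. blank), batch, max/min target length
    input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
    target = torch.randint(low=1, high=C, size=(N, S), dtype=torch.long)        # class 0 is reserved for the blank
    input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)
    target_lengths = torch.randint(low=S_min, high=S, size=(N,), dtype=torch.long)

    ctc_loss = nn.CTCLoss()
    loss = ctc_loss(input, target, input_lengths, target_lengths)   # nan here usually means bad inputs or lengths
    loss.backward()
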
  • Pytorch Debug Log

    2019-03-31 18:18:56
    2019 March 31 pytorch Pytorch Debug Log 1. Model and parameter types do not match: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same TensorIterator expected type torch.cuda....
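
    The usual fix for that first error (a minimal sketch of my own, not from the original log) is to move the model and the inputs to the same device:

    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(10, 5).to(device)       # weights become torch.cuda.FloatTensor on GPU
    x = torch.randn(3, 10, device=device)     # create the input on the same device
    out = model(x)                            # input and weight types now match
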
  • pytorch Debug: as a dynamic-graph framework, pytorch pairs well with ipdb, which makes debugging convenient. For static-graph frameworks like tensorflow, the computation graph is defined through the python interface and the low-level operations are then executed by c++ code; nothing is computed while the graph is being defined, and at computation ...
  • pytorch debug

    2020-11-12 10:46:29
    Debugging pytorch can be baffling: when running on the GPU, the debugger points at the wrong location for errors. Take my word for it: when your code crashes, switch to the CPU to run it; only then will the error be located correctly.
  • Compiling pytorch in debug mode (blog)

    2020-07-15 14:00:01
    cmake pytorch debug blog: 1. use a proxy to clone; 2. go to the pytorch README and install from source; 3. add export DEBUG=1; 4. add export MAX_JOBS=4; 5. to clean up the mess, use python setup.py clean
  • Debug PyTorch code using PySnooper
  • pdb: simple python debugging (stepping through pytorch source) # Preparation: conda create -n pytorch python=3.7, conda activate pytorch # Preparation: conda install pytorch torchvision cpuonly -c pytorch, conda install pdb. Debugging ...
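
    As a quick illustration (my own minimal sketch, not from the snippet), pdb.set_trace() drops you into an interactive debugger at that exact line:

    import pdb
    import torch

    x = torch.rand(3, 4)
    pdb.set_trace()   # pauses here: `p x` prints the tensor, `n` steps, `c` continues
    y = x.sum()
    print(y)
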
  • As such, pytorch does not expose the from_numpy method directly; it has to be called through the _C namespace. Hence the solutions: 1. Use torch._C.from_numpy. Testing via ._C indeed passes. 2. sudo pip ...
  • When training a model with pytorch, memory usage is flat during training but grows rapidly during prediction until an out-of-memory error. The cause was a forgotten no_grad. Solution: wrap prediction in torch.no_grad(): with torch.no_grad(): for i,batch in ...
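
    Spelled out, the pattern looks roughly like this (a self-contained sketch; the linear layer and random batches stand in for a real model and DataLoader):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)
    model.eval()
    batches = [torch.randn(4, 10) for _ in range(3)]   # stand-in for a real DataLoader

    with torch.no_grad():              # no autograd graph is kept, so memory stays flat
        for i, batch in enumerate(batches):
            preds = model(batch)
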
  • SSD pytorch 运行debug

    2020-05-15 12:55:47
    For certain reasons I needed to test SSD on a dataset ... The source code is the high-star github repo: https://github.com/amdegroot/ssd.pytorch. To train and test on my own dataset, the first step is to make the modifications following this blog: ...
  • Pytorch日常debug记录

    2021-01-12 14:42:44
    (Only thought of keeping this log halfway through debugging......) 2021/1/12 1. ValueError: At least one stride in the given numpy array is negative, and tensors with negative strides are not currently supported. (You can ...
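
    The usual workaround (and, if memory serves, the one the full error message goes on to suggest) is to copy the array. A minimal sketch of the failure and the fix:

    import numpy as np
    import torch

    arr = np.arange(6).reshape(2, 3)[:, ::-1]   # a reversed view has a negative stride
    # torch.from_numpy(arr)                     # would raise the ValueError above
    t = torch.from_numpy(arr.copy())            # the copy is contiguous, so this works
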
  • Making PyTorch default to a CUDA device other than 0. Problem description: the lab server has 4 GPUs and a labmate is using GPU 0, whose memory is nearly full. Of the remaining 3 GPUs, the code only uses CUDA:3, yet cuda keeps complaining that memory is insufficient. Error: cuda runtime error (77): an illegal memory...
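
    One common way to pin a process to a specific physical GPU (my own sketch, not from the snippet; assumes a machine with at least 4 GPUs) is to hide the others via CUDA_VISIBLE_DEVICES before CUDA is initialized:

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "3"   # must be set before the first CUDA call

    import torch
    device = torch.device("cuda:0")            # "cuda:0" now maps to physical GPU 3
    x = torch.zeros(1, device=device)
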
  • pytorch P2B Debug

    2020-12-15 17:52:29
    Point-to-Box Network for 3D Object Tracking in Point Clouds: pitfalls. Setting up a conda virtual environment on the lab network took forever; created the virtual environment with conda, following the github repo... git+https://github.com/v-wewei/etw_pytorch_utils.git@v1.1.1#egg=e
  • pytorch-debug-1

    2020-08-03 18:58:52
    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation. The error means that an inplace operation has collided with gradient computation. 1. What is an inplace operation? An inplace operation updates a variable with the variable ...
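
    A minimal sketch of my own that reproduces the error and shows the out-of-place fix:

    import torch

    # inplace version: would trigger the RuntimeError
    w = torch.ones(3, requires_grad=True)
    y = w.exp()    # exp saves its output y for the backward pass
    y *= 2         # inplace op modifies y's memory and bumps its version counter
    # y.sum().backward()  # -> RuntimeError: ... modified by an inplace operation

    # out-of-place version: safe
    w2 = torch.ones(3, requires_grad=True)
    y2 = w2.exp() * 2      # allocates a new tensor; exp's saved output is untouched
    y2.sum().backward()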
