Description
🐛 Bug
When using CUDA, a non-descriptive runtime error is raised; it makes any future calls to CUDA fail and requires a runtime restart. When not using CUDA, the error is descriptive and does not break the runtime.
To Reproduce
Steps to reproduce the behavior:
import torch
prob = torch.tensor([1.01]).to('cuda:0')  # probability outside [0, 1]
torch.distributions.bernoulli.Bernoulli(probs=prob).sample([1])
This results in:
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1591914743399/work/aten/src/THC/THCReduceAll.cuh:327
Any subsequent call to CUDA (even with valid instructions) results in:
RuntimeError: CUDA error: device-side assert triggered
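For illustration, a minimal sketch of what this looks like in the same process (the tensor operation below is just a hypothetical stand-in for any otherwise valid CUDA call):

import torch
torch.ones(3, device='cuda:0') + 1  # a perfectly valid operation on its own
# RuntimeError: CUDA error: device-side assert triggered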
Expected behavior
A nice, descriptive error such as the one below, which is what one gets when not using CUDA:
RuntimeError: Expected p_in >= 0 && p_in <= 1 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
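For reference, a CPU-only version of the repro (a sketch, simply dropping the .to('cuda:0') from the snippet above) raises this error at the offending call and leaves the interpreter usable:

import torch
prob = torch.tensor([1.01])  # same out-of-range probability, kept on CPU
torch.distributions.bernoulli.Bernoulli(probs=prob).sample([1])
# RuntimeError: Expected p_in >= 0 && p_in <= 1 to be true, but got false.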
Environment
Collecting environment information...
PyTorch version: 1.5.1
Is debug build: No
CUDA used to build PyTorch: 9.2
OS: Debian GNU/Linux 10 (buster)
GCC version: (Debian 8.3.0-6) 8.3.0
CMake version: Could not collect
Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.5.1
[pip3] torchvision==0.6.0a0+35d732a
[conda] blas 1.0 mkl
[conda] cudatoolkit 9.2 0
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.1.0 py38h23d657b_0
[conda] mkl_random 1.1.1 py38h0573a6f_0
[conda] numpy 1.19.1 pypi_0 pypi
[conda] pytorch 1.5.1 py3.8_cuda9.2.148_cudnn7.6.3_0 pytorch
[conda] torchvision 0.6.1 py38_cu92 pytorch
Additional context
Not sure how much of an actual issue this is, since any additional checks on parameters would result in slower execution, and it takes a second to try the same code without CUDA to identify the issue. This has also been mentioned several times in other tickets; it just seems that CUDA errors aren't as descriptive. A possible workaround is sketched below.
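One possible workaround (a sketch, not necessarily the intended fix, and whether the extra check is cheap enough in practice is an assumption): pass validate_args=True so the constructor checks probs against the unit interval up front and raises a ValueError before any CUDA kernel is launched.

import torch
prob = torch.tensor([1.01]).to('cuda:0')
# With validation enabled, the constructor rejects probs outside [0, 1]
# with a ValueError instead of triggering a device-side assert later.
torch.distributions.bernoulli.Bernoulli(probs=prob, validate_args=True)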
cc @ngimel