Hi,
I ran into errors while debugging a hyperparameter sweep:
wandb: ERROR Run c3yfj87h errored: RuntimeError('cuDNN error: CUDNN_STATUS_INTERNAL_ERROR')
wandb: ERROR Run 542e421i errored: RuntimeError('false INTERNAL ASSERT FAILED at "../c10/cuda/CUDAGraphsC10Utils.h":73, please report a bug to PyTorch. Unknown CUDA graph CaptureStatus32522')
When I run the script directly from a terminal, the errors do not occur. They only show up when I run it in debug mode. Could someone give me some clues as to why?
According to a Stack Overflow post, the error RuntimeError('cuDNN error: CUDNN_STATUS_INTERNAL_ERROR') often indicates an out-of-memory problem.
This is likely an issue with PyTorch, but there may be useful information in the debug logs of the run. They should be located in the wandb folder, in the same directory the script was run from. The wandb folder contains subfolders named run-DATETIME-ID, each associated with a single run. Could you retrieve the debug.log and debug-internal.log files from the folder of one of the runs that is having issues?
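In case it helps, here is a minimal sketch for locating those files by run ID, using only the Python standard library. The helper name `find_run_logs` and the default `wandb` path are assumptions for illustration, not part of the wandb API. It searches each matching run folder recursively, since the logs may sit in a subfolder rather than at the top level.

```python
from pathlib import Path


def find_run_logs(run_id, wandb_dir="wandb"):
    """Return paths to debug.log and debug-internal.log for one run ID.

    Hypothetical helper: assumes run folders are named run-DATETIME-ID
    inside `wandb_dir`, as described above.
    """
    root = Path(wandb_dir)
    found = []
    # Match on the trailing ID, since the DATETIME part is unknown.
    for run_dir in sorted(root.glob(f"run-*-{run_id}")):
        if not run_dir.is_dir():
            continue
        # Search recursively in case the logs are in a subfolder.
        for name in ("debug.log", "debug-internal.log"):
            found.extend(sorted(run_dir.rglob(name)))
    return found


if __name__ == "__main__":
    # Run ID taken from the first error message in this thread.
    for path in find_run_logs("c3yfj87h"):
        print(path)
```

Run it from the directory where the sweep script was started; if nothing is printed, the wandb folder is probably somewhere else or that run's folder has been removed.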
Hi,
Sorry for the delay in replying. Unfortunately, I cannot remember which run was affected by this issue, and I have not seen the error for several days now. If the same problem occurs again, I will come back to this thread. Thanks for your help.