Every couple of hours I get this error during training, and from then on checkpoints silently fail to save. I’ve now lost >30 hrs of training to this issue. Does anyone know what’s causing it and how to fix it?
For future searchers, the error text is:
NVMLError_OperatingSystem: The operating system has blocked the request.
Hi @zaptrem, thank you for reporting this. We will need a bit more context on your setup, both hardware and environment. Are you running this in a WSL terminal? There might be some issues with NVML under WSL.
I also wanted to clarify: did it not log anything from epoch 1876 to 2392? Would you be interested in running the experiments in offline mode, so the data is written to your disk, and syncing to W&B afterwards?
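In case it helps anyone trying the offline suggestion, here is a rough sketch of what that could look like. `WANDB_MODE=offline` is the documented W&B setting for writing runs to local disk. The helper `log_then_checkpoint` and its arguments are hypothetical names made up for illustration, and the idea that a logging failure is what blocks the checkpoint in this thread is only an assumption, since the root cause was never confirmed.

```python
import logging
import os

# WANDB_MODE=offline makes W&B write run data to the local ./wandb directory
# instead of uploading it; the data can be uploaded later with `wandb sync`.
os.environ["WANDB_MODE"] = "offline"

logger = logging.getLogger("train")


def log_then_checkpoint(log_fn, save_fn, metrics, step):
    """Hypothetical helper: try to log metrics, then save a checkpoint regardless.

    log_fn(metrics, step) and save_fn(step) are caller-supplied callables.
    Returns True if logging succeeded, False if it raised.
    """
    logged = True
    try:
        log_fn(metrics, step)
    except Exception:
        # Record the failure with a traceback, but do not let it stop the save.
        logger.exception("metric logging failed at step %d", step)
        logged = False
    save_fn(step)  # always runs, even when logging raised
    return logged
```

In a real loop, `log_fn` could be something like `lambda m, s: wandb.log(m, step=s)` and `save_fn` whatever function writes your checkpoint. After training, the offline run directory under `./wandb` can be uploaded with the `wandb sync` command.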
Hey @zaptrem, I wanted to follow up with you on this issue. Could you provide the information requested above to help us debug it? Please let me know if I can be of further assistance or if your issue has been resolved.
Hi @zaptrem, as we haven’t heard back from you, we are going to close the ticket for now. Please feel free to message us here if the issue hasn’t been resolved for you, and we will be happy to keep investigating!