Remote I/O error at random points during PyTorch training


During a training run for a PyTorch model, wandb reports 'Remote I/O' errors at random points during training, usually within the first 5 minutes. This causes the run to crash and repeatedly prints

--- Logging error ---

to stdout.

The issue started after a restart of the machine and me changing wandb accounts (to an account using my institution email to be able to collaborate with my colleagues).

I have since forced wandb to log me in again with wandb login --relogin, switched accounts and completely purged and reinstalled wandb on the system (unless I missed a configuration directory that I am not aware of).

On my local machine using the same wandb account the script runs fine.
Since the runs are part of a hyperparameter sweep I cannot easily disable wandb.

Any help with this would be greatly appreciated! Thanks for the time and effort you put into wandb! It has been a great tool so far.

My working theory is that the issue occurs the first time logs are flushed to disk.

Relevant stack trace:

OSError: [Errno 121] Remote I/O error
Call stack:
File “/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/”, line 966, in _bootstrap
File “/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/”, line 1009, in _bootstrap_inner
File “/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/”, line 49, in run
File “/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/”, line 100, in _run
File “/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/”, line 279, in _process
File “/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/”, line 136, in handle
File “/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/”, line 144, in handle_request
logger.debug(f"handle_request: {request_type}")
Message: ‘handle_request: partial_history’
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
File “/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/logging/”, line 1102, in emit
File “/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/logging/”, line 1082, in flush

Hi @danielanthes, could you send the debug-internal.log file from the local run folder so I can take a look?

Also, there is login information stored at ~/.config/wandb/settings as well which may be worth deleting before you attempt to login again.

Hi Nate,
Thanks for your reply! The problem ended up being caused by a drive that did not get mounted correctly after rebooting the node. Would the error log still be helpful to you? In that case I’ll gladly send it.
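
For anyone who hits the same error: since the cause here was a drive that was not mounted correctly, a quick write test on the directory wandb logs to can rule that out before starting a sweep. Below is a minimal sketch, not an official wandb tool. The helper name check_writable is hypothetical, and falling back to the current working directory when WANDB_DIR is not set is an assumption about where the run folder ends up on your setup.

```python
import errno
import os
import tempfile


def check_writable(directory):
    """Try to write and fsync a small temporary file in `directory`.

    Returns (ok, message). The directory is not created if it is missing,
    so a missing path is reported as a failure.
    """
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix="write_probe_") as fh:
            fh.write(b"probe")
            fh.flush()
            # Force the data out to the underlying device, which is where a
            # badly mounted drive is most likely to raise an OSError.
            os.fsync(fh.fileno())
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{directory}: write failed ({name}): {exc}"
    return True, f"{directory}: OK"


# Probe WANDB_DIR if it is set, otherwise the current directory (assumption).
ok, message = check_writable(os.environ.get("WANDB_DIR", os.getcwd()))
print(message)
```

If the probe fails, checking the mount on the node is probably a better next step than reinstalling wandb.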

Hi @danielanthes,
Glad you were able to get this resolved. I think we are all set then and this makes sense to me. Let us know if you run into any other issues.

Thank you,

