Remote I/O error at random points during PyTorch training


During a training run for a PyTorch model, wandb reports 'Remote I/O' errors at random points during training, usually within the first 5 minutes. This causes the run to crash, and repeatedly prints

--- Logging error ---

to stdout.

The issue started after a restart of the machine and me changing wandb accounts (to an account using my institution email to be able to collaborate with my colleagues).

I have since forced wandb to log me in again with wandb login --relogin, switched accounts and completely purged and reinstalled wandb on the system (unless I missed a configuration directory that I am not aware of).

On my local machine using the same wandb account the script runs fine.
Since the runs are part of a hyperparameter sweep I cannot easily disable wandb.

Any help with this would be greatly appreciated! Thanks for the time and effort you put into wandb! It has been a great tool so far.

My working theory is that the issue occurs the first time logs are flushed to disk.
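One way to test this theory independently of wandb is to write a small file into the same directory and force it to disk. The following is only a minimal sketch: `probe_write` is a hypothetical helper, and it assumes wandb writes under the current working directory unless the `WANDB_DIR` environment variable is set.

```python
import errno
import os
import tempfile


def probe_write(directory):
    """Write a small file in `directory`, force it to disk, and report the result.

    Returns (True, None) on success or (False, OSError) on failure.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory, prefix="io_probe_") as f:
            f.write(b"x" * 4096)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        return True, None
    except OSError as exc:
        return False, exc


if __name__ == "__main__":
    # Assumption: wandb output goes to WANDB_DIR if set, else the working directory.
    target = os.environ.get("WANDB_DIR", os.getcwd())
    ok, err = probe_write(target)
    if ok:
        print(f"{target}: write + fsync succeeded")
    else:
        name = errno.errorcode.get(err.errno, "unknown")
        print(f"{target}: failed with errno {err.errno} ({name}): {err}")
```

If the probe fails with the same errno as in the trace, the problem is likely in the storage layer rather than in wandb itself.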

Relevant stack trace:

OSError: [Errno 121] Remote I/O error
Call stack:
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/", line 966, in _bootstrap
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/", line 1009, in _bootstrap_inner
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/", line 49, in run
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/", line 100, in _run
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/", line 279, in _process
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/", line 136, in handle
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/", line 144, in handle_request
logger.debug(f"handle_request: {request_type}")
Message: 'handle_request: partial_history'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/logging/", line 1102, in emit
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/logging/", line 1082, in flush

Hi @danielanthes, could you possibly send the debug-internal.log file from the local run folder to and I can take a look?

Also, login information is stored at ~/.config/wandb/settings, which may be worth deleting before you attempt to log in again.

Thank you,

Hi Nate,
Thanks for your reply! The problem ended up being caused by a drive that did not get mounted correctly after rebooting the node. Would the error log still be helpful to you? In that case I’ll gladly send it.
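For anyone hitting the same error, a pre-flight check before training starts might catch a missing mount early. This is a sketch only: `nearest_mount_point` and `assert_on_expected_mount` are hypothetical helpers, not part of wandb, and the expected mount point has to be supplied by whoever knows the node's layout.

```python
import os


def nearest_mount_point(path):
    """Return the closest ancestor of `path` (or `path` itself) that is a mount point."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


def assert_on_expected_mount(path, expected_mount):
    """Raise RuntimeError if `path` does not live on the mount point `expected_mount`."""
    actual = nearest_mount_point(path)
    if actual != os.path.realpath(expected_mount):
        raise RuntimeError(
            f"{path} is on mount {actual!r}, expected {expected_mount!r}; "
            "the drive may not be mounted"
        )
```

Calling `assert_on_expected_mount(os.getcwd(), "/data")` at the top of the training script, where `/data` is a placeholder for the real mount point, would stop the run before any logging begins if the drive is absent.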

All the best,

Hi @danielanthes,
Glad you were able to get this resolved. I think we are all set then and this makes sense to me. Let us know if you run into any other issues.

Thank you,
