Remote I/O error at random points during PyTorch training

Hello,

During a training run for a PyTorch model, wandb reports ‘Remote I/O’ errors at random points, usually within the first 5 minutes. This crashes the run and repeatedly prints

--- Logging error ---

to stdout.

The issue started after I restarted the machine and switched wandb accounts (to an account using my institution email, so that I can collaborate with my colleagues).

I have since forced wandb to log me in again with wandb login --relogin, switched accounts and completely purged and reinstalled wandb on the system (unless I missed a configuration directory that I am not aware of).

On my local machine using the same wandb account the script runs fine.
Since the runs are part of a hyperparameter sweep I cannot easily disable wandb.

Any help with this would be greatly appreciated! Thanks for the time and effort you put into wandb! It has been a great tool so far.

My working theory is that the error occurs the first time logs are flushed to disk.

Relevant stack trace:

OSError: [Errno 121] Remote I/O error
Call stack:
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/threading.py", line 966, in _bootstrap
self._bootstrap_inner()
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 49, in run
self._run()
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 100, in _run
self._process(record)
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/internal.py", line 279, in _process
self._hm.handle(record)
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 136, in handle
handler(record)
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 144, in handle_request
logger.debug(f"handle_request: {request_type}")
Message: 'handle_request: partial_history'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/logging/__init__.py", line 1102, in emit
self.flush()
File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/logging/__init__.py", line 1082, in flush
self.stream.flush()
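
The last frame in the trace is self.stream.flush() on a log file, so one way to test the flush-to-disk theory independently of wandb is to write, flush and fsync a small file in the directory the logs go to. The following is only a minimal sketch: it assumes the run directory lives under WANDB_DIR or, if that is unset, under the current working directory, and probe_write is a hypothetical helper name, not part of wandb.

```python
import errno
import os
import tempfile


def probe_write(directory):
    """Write, flush and fsync a small temporary file in `directory`.

    Returns None on success, or the OSError that was raised.
    """
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix="probe-") as handle:
            handle.write(b"probe")
            handle.flush()             # push Python's buffer to the OS
            os.fsync(handle.fileno())  # ask the OS to push it to the device
    except OSError as exc:
        return exc
    return None


if __name__ == "__main__":
    # Assumption: logs are written under WANDB_DIR if set, else under the cwd.
    target = os.environ.get("WANDB_DIR") or os.getcwd()
    error = probe_write(target)
    if error is None:
        print(f"write/flush/fsync succeeded in {target}")
    else:
        name = errno.errorcode.get(error.errno, "unknown")
        print(f"write probe failed in {target}: [{name}] {error}")
```

If the probe fails with the same Errno 121 outside of wandb, the problem is more likely in the filesystem or storage underneath that directory than in the wandb installation or account.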

Hi @danielanthes, could you send the debug-internal.log file from the local run folder to Nathan.kuneman@wandb.com so I can take a look?

Login information is also stored at ~/.config/wandb/settings, which may be worth deleting before you attempt to log in again.

Thank you,
Nate

Hi Nate,
Thanks for your reply! The problem ended up being caused by a drive that did not get mounted correctly after rebooting the node. Would the error log still be helpful to you? In that case I’ll gladly send it.
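
In case it helps others, a check along these lines before starting a sweep might catch a drive that is missing after a reboot. It is a rough sketch only: the paths /data/experiments and /data are hypothetical placeholders, and mount_point_of and is_on_expected_mount are illustrative helper names. It only detects a mount that is absent altogether; a mount that is present but unhealthy would still need an actual read or write attempt to detect.

```python
import os


def mount_point_of(path):
    """Return the mount point of the filesystem that contains `path`."""
    path = os.path.realpath(path)
    # Walk up the directory tree until we reach a mount point ("/" at the latest).
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


def is_on_expected_mount(path, expected_mount):
    """True if `path` lives on the filesystem mounted at `expected_mount`."""
    return mount_point_of(path) == os.path.realpath(expected_mount)


if __name__ == "__main__":
    # Hypothetical paths for illustration; substitute your own.
    data_dir = "/data/experiments"
    expected = "/data"
    if not is_on_expected_mount(data_dir, expected):
        actual = mount_point_of(data_dir)
        raise SystemExit(f"{data_dir} is on {actual}, expected {expected}")
    print(f"{data_dir} is on {expected}")
```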

All the best,
Daniel

Hi @danielanthes,
Glad you were able to get this resolved. That explanation makes sense to me, so I think we are all set. Let us know if you run into any other issues.

Thank you,
Nate
