Remote I/O error at random point during pytorch training

danielanthes · August 30, 2023, 8:57am

Hello,

During a training run for a PyTorch model, wandb reports 'Remote I/O' errors at random points, usually within the first 5 minutes. This causes the run to crash and repeatedly prints

--- Logging error ---

to stdout.

The issue started after a restart of the machine and me changing wandb accounts (to an account using my institution email to be able to collaborate with my colleagues).

I have since forced wandb to log me in again with wandb login --relogin, switched accounts and completely purged and reinstalled wandb on the system (unless I missed a configuration directory that I am not aware of).

On my local machine using the same wandb account the script runs fine.
Since the runs are part of a hyperparameter sweep I cannot easily disable wandb.

Any help with this would be greatly appreciated! Thanks for the time and effort you put into wandb! It has been a great tool so far.

My working theory is that the issue occurs the first time logs are flushed to disk.

Relevant stack trace:

OSError: [Errno 121] Remote I/O error
Call stack:
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/threading.py", line 966, in _bootstrap
    self._bootstrap_inner()
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 49, in run
    self._run()
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 100, in _run
    self._process(record)
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/internal.py", line 279, in _process
    self._hm.handle(record)
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 136, in handle
    handler(record)
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 144, in handle_request
    logger.debug(f"handle_request: {request_type}")
Message: 'handle_request: partial_history'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/logging/__init__.py", line 1102, in emit
    self.flush()
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/logging/__init__.py", line 1082, in flush
    self.stream.flush()
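
For context on the repeated "Logging error" banner: it comes from Python's standard logging module, not from wandb itself. When a handler's stream fails during flush, logging reports the failure on stderr and carries on. The sketch below reproduces that behaviour without any real disk; FailingStream and demo_logging_error are hypothetical names made up for illustration, and the errno value 121 is simply copied from the trace above.

```python
import contextlib
import io
import logging


class FailingStream(io.StringIO):
    """Hypothetical stand-in for a log file on a broken mount."""

    def flush(self):
        # 121 is the errno shown in the trace (Remote I/O error on Linux).
        raise OSError(121, "Remote I/O error")


def demo_logging_error():
    """Log one record through a handler whose flush fails; return stderr text."""
    logger = logging.getLogger("flush_demo")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the record away from the root logger
    handler = logging.StreamHandler(FailingStream())
    logger.addHandler(handler)

    captured = io.StringIO()
    try:
        # logging writes its error report to sys.stderr, so capture that.
        with contextlib.redirect_stderr(captured):
            logger.debug("handle_request: partial_history")
    finally:
        logger.removeHandler(handler)
    return captured.getvalue()


if __name__ == "__main__":
    print(demo_logging_error())
```

Every debug call that reaches the broken stream produces another report like this, which would be consistent with the message being printed over and over.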

nathank · September 5, 2023, 12:01am

Hi @danielanthes, could you possibly send the debug-internal.log file from the local run folder to Nathan.kuneman@wandb.com and I can take a look?

Also, login information is stored at ~/.config/wandb/settings, which may be worth deleting before you attempt to log in again.
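
If you would rather not delete the file outright, something like the following moves it aside so it can be restored later. This is only a sketch: set_aside_wandb_settings is a hypothetical helper (not part of wandb), the ".bak" suffix is an arbitrary choice, and the default path is the one mentioned above.

```python
import shutil
from pathlib import Path


def set_aside_wandb_settings(settings_path=None):
    """Rename the settings file to <name>.bak; return the backup path or None.

    Hypothetical helper for illustration. Defaults to ~/.config/wandb/settings.
    """
    if settings_path is None:
        path = Path.home() / ".config" / "wandb" / "settings"
    else:
        path = Path(settings_path)

    if not path.is_file():
        return None  # nothing to move

    backup = path.with_name(path.name + ".bak")
    shutil.move(str(path), str(backup))
    return backup


if __name__ == "__main__":
    moved = set_aside_wandb_settings()
    print(f"moved to {moved}" if moved else "no settings file found")
```

After that, wandb login --relogin should start from a clean state.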

Thank you,
Nate

danielanthes · September 7, 2023, 7:48pm

Hi Nate,
Thanks for your reply! The problem ended up being caused by a drive that did not get mounted correctly after rebooting the node. Would the error log still be helpful to you? In that case I’ll gladly send it.
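
In case it helps anyone who lands here with the same error, a pre-flight check along these lines might surface a bad mount before training starts instead of a few minutes in. It is a rough sketch, not something wandb provides: check_dir_writable is a hypothetical helper, and which directory to probe is an assumption (here WANDB_DIR if set, otherwise the current directory).

```python
import os
import tempfile


def check_dir_writable(directory):
    """Try to write and fsync a small temp file; return (ok, detail)."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory, prefix=".write_probe_") as probe:
            probe.write(b"probe")
            probe.flush()
            # fsync pushes the data to the device, where a broken
            # mount is most likely to raise an OSError.
            os.fsync(probe.fileno())
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


if __name__ == "__main__":
    # Assumption: logs go under WANDB_DIR if set, else the working directory.
    target = os.environ.get("WANDB_DIR", ".")
    ok, detail = check_dir_writable(target)
    if not ok:
        raise SystemExit(f"cannot write to {target!r}: {detail}")
    print(f"{target!r} is writable")
```

Running this at the top of the training script, or as a separate step in the sweep job, makes a drive problem fail fast with a clear message.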

All the best,
Daniel

nathank · September 12, 2023, 2:35pm

Hi @danielanthes,
Glad you were able to get this resolved. I think we are all set then and this makes sense to me. Let us know if you run into any other issues.

Thank you,
Nate

system · November 11, 2023, 2:36pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
