Hello,
During a training run for a PyTorch model, wandb reports 'Remote I/O' errors at random points, usually within the first 5 minutes. This crashes the run and repeatedly prints
--- Logging error ---
to stdout.
The issue started after I restarted the machine and switched wandb accounts (to an account using my institution email, so that I can collaborate with my colleagues).
I have since forced wandb to log me in again with wandb login --relogin, switched accounts and completely purged and reinstalled wandb on the system (unless I missed a configuration directory that I am not aware of).
On my local machine using the same wandb account the script runs fine.
Since the runs are part of a hyperparameter sweep I cannot easily disable wandb.
Any help with this would be greatly appreciated! Thanks for the time and effort you put into wandb! It has been a great tool so far.
My working theory is that the issue seems to occur the first time logs are flushed to disk.
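To check this theory independently of wandb, a minimal sketch like the one below could be run on the affected machine. It only writes, flushes and fsyncs a scratch file in the directory wandb writes to and reports any OSError. The helper name probe_flush is hypothetical, and the fallback to the current working directory (where wandb creates its ./wandb folder by default) is an assumption about where the run writes when WANDB_DIR is not set.

```python
import os
import tempfile


def probe_flush(directory, chunks=20, chunk_size=4096):
    """Write, flush and fsync a scratch file in `directory`.

    Returns None on success, or the OSError raised along the way.
    (Hypothetical helper, not part of wandb.)
    """
    try:
        # The scratch file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix="flush-probe-") as fh:
            for _ in range(chunks):
                fh.write(b"x" * chunk_size)
                fh.flush()             # push Python's buffer to the OS
                os.fsync(fh.fileno())  # ask the OS to push it to the device
    except OSError as exc:
        return exc
    return None


if __name__ == "__main__":
    # Assumption: the run writes under WANDB_DIR if set, else under the cwd.
    target = os.environ.get("WANDB_DIR", os.getcwd())
    result = probe_flush(target)
    print(target, "->", "ok" if result is None else repr(result))
```

If this also fails with Errno 121 on the training machine, the problem is more likely in the filesystem backing that directory than in wandb itself; if it succeeds, the theory is at least not confirmed by this test.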
Relevant stack trace:
OSError: [Errno 121] Remote I/O error
Call stack:
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/threading.py", line 966, in _bootstrap
    self._bootstrap_inner()
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 49, in run
    self._run()
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 100, in _run
    self._process(record)
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/internal.py", line 279, in _process
    self._hm.handle(record)
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 136, in handle
    handler(record)
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/site-packages/wandb/sdk/internal/handler.py", line 144, in handle_request
    logger.debug(f"handle_request: {request_type}")
Message: 'handle_request: partial_history'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/logging/__init__.py", line 1102, in emit
    self.flush()
  File "/home/staff/d/danthes/miniconda3/envs/avalanche-h100/lib/python3.10/logging/__init__.py", line 1082, in flush
    self.stream.flush()