My scripts sometimes halt on their own: they appear to deadlock rather than throw an error. For example, I was running a training script on my laptop, and because it was in debug mode I was able to pause it. It seemed to be stuck in some multiprocessing code, and that seemed to be related to wandb…
epoch_num=95: train_loss=1.861583555999555, train_acc=0.4987407624721527
epoch_num=95: val_loss=tensor(7.3504), val_acc=tensor(0.)
16% (96 of 600) | | Elapsed Time: 7:47:13 | ETA: 1 day, 16:52:54 | 175.7 s/it
epoch_num=96: train_loss=1.8501708821246499, train_acc=0.5018503069877625
epoch_num=96: val_loss=tensor(6.9187), val_acc=tensor(0.)
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/Users/brandomiranda/opt/anaconda3/envs/meta_learning/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/Users/brandomiranda/opt/anaconda3/envs/meta_learning/lib/python3.9/multiprocessing/synchronize.py", line 88, in _cleanup
    unregister(name, "semaphore")
  File "/Users/brandomiranda/opt/anaconda3/envs/meta_learning/lib/python3.9/multiprocessing/resource_tracker.py", line 151, in unregister
    self._send('UNREGISTER', name, rtype)
  File "/Users/brandomiranda/opt/anaconda3/envs/meta_learning/lib/python3.9/multiprocessing/resource_tracker.py", line 154, in _send
    self.ensure_running()
  File "/Users/brandomiranda/opt/anaconda3/envs/meta_learning/lib/python3.9/multiprocessing/resource_tracker.py", line 75, in ensure_running
    with self._lock:
KeyboardInterrupt:
Does wandb have a deadlock bug that shows up when it is run for a long time, e.g. during a really long experiment?
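
Debug mode is the only reason I could see where the script was stuck. Here is a minimal sketch of how I could get the same information from a normal run, assuming Python's standard `faulthandler` module. The function name `watchdog_demo` and the timeout values are made up for illustration, and the `time.sleep` call only stands in for whatever step actually hangs:

```python
import faulthandler
import tempfile
import time


def watchdog_demo(timeout_s: float = 0.2) -> str:
    """Arm a watchdog, simulate a hang, and return the dumped stack text."""
    with tempfile.TemporaryFile(mode="w+") as log:
        # If the process is still running after timeout_s seconds,
        # write the stack of every thread to `log` without exiting.
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=log)

        # Hypothetical stand-in for the training step that never returns.
        time.sleep(timeout_s * 5)

        # Disarm the watchdog once the step has finished.
        faulthandler.cancel_dump_traceback_later()

        log.seek(0)
        return log.read()


if __name__ == "__main__":
    print(watchdog_demo())
```

In a real training loop I would probably arm the watchdog with a timeout a few times longer than one epoch and pass `repeat=True`, so the log would show whether the process keeps sitting in the same multiprocessing or wandb frame.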