Does wandb have a limit on how long a run can last, and can it deadlock?

My scripts seem to halt on their own: they appear to deadlock without throwing an error. For example, I was running a training script on my laptop, and because it was in debug mode I was able to pause it. It seemed to be stuck in some multiprocessing code, and that seemed to be related to wandb:

epoch_num=95: train_loss=1.861583555999555, train_acc=0.4987407624721527
epoch_num=95: val_loss=tensor(7.3504), val_acc=tensor(0.)
 16% (96 of 600) | | Elapsed Time: 7:47:13 | ETA:  1 day, 16:52:54 | 175.7 s/it
epoch_num=96: train_loss=1.8501708821246499, train_acc=0.5018503069877625
epoch_num=96: val_loss=tensor(6.9187), val_acc=tensor(0.)
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/Users/brandomiranda/opt/anaconda3/envs/meta_learning/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/Users/brandomiranda/opt/anaconda3/envs/meta_learning/lib/python3.9/multiprocessing/synchronize.py", line 88, in _cleanup
    unregister(name, "semaphore")
  File "/Users/brandomiranda/opt/anaconda3/envs/meta_learning/lib/python3.9/multiprocessing/resource_tracker.py", line 151, in unregister
    self._send('UNREGISTER', name, rtype)
  File "/Users/brandomiranda/opt/anaconda3/envs/meta_learning/lib/python3.9/multiprocessing/resource_tracker.py", line 154, in _send
    self.ensure_running()
  File "/Users/brandomiranda/opt/anaconda3/envs/meta_learning/lib/python3.9/multiprocessing/resource_tracker.py", line 75, in ensure_running
    with self._lock:
KeyboardInterrupt: 

Does wandb have some deadlock bug if it is run for a really long experiment?
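One way to see where a stalled script is stuck, without having to be in debug mode at the right moment, is Python's standard `faulthandler` module. The following is a minimal sketch, not something taken from the training script above: the helper name `enable_hang_diagnostics` and the timeout value are hypothetical, and it only reports stack traces; it does not show whether wandb is the cause.

```python
import faulthandler
import sys


def enable_hang_diagnostics(timeout_s, log_file=None):
    """Periodically dump the stack of every thread to log_file.

    timeout_s: seconds between dumps (hypothetical value, tune to your epoch time).
    log_file: an open file with a real file descriptor; defaults to sys.stderr.
    """
    if log_file is None:
        log_file = sys.stderr
    # Also dump tracebacks on fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=log_file)
    # Every timeout_s seconds, write the current stack of all threads.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


# Example: call once at the top of the training script.
# enable_hang_diagnostics(timeout_s=1800)
```

If the script stalls, the most recent dump shows which call each thread is blocked in, which can be compared with the `resource_tracker` frames in the traceback above.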

Hey Brando,

There are no known deadlocks in our code as of now. Could you share the debug.log and debug-internal.log associated with this run? They can be found in the wandb folder inside your project folder.
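If it helps, the two files can be located with a short script. This is a sketch that assumes the logs sit somewhere under a `wandb` directory in the project folder, as described above; the exact subfolder layout is not stated in this thread, so the search is recursive, and the helper name `find_wandb_logs` is hypothetical.

```python
from pathlib import Path


def find_wandb_logs(project_dir="."):
    """Return paths of debug.log and debug-internal.log under <project_dir>/wandb."""
    wandb_dir = Path(project_dir) / "wandb"
    if not wandb_dir.is_dir():
        return []
    wanted = {"debug.log", "debug-internal.log"}
    # Search recursively, since the exact subfolder layout is an assumption.
    return sorted(p for p in wandb_dir.rglob("debug*.log") if p.name in wanted)


for path in find_wandb_logs("."):
    print(path)
```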

Additionally, could you share the version of wandb that you are using and the duration of time for which you were running the experiment?
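The installed version can be read from package metadata without importing wandb itself. A minimal sketch using only the standard library (the helper name `installed_version` is hypothetical):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="wandb"):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("wandb version:", installed_version())
```

For the run duration, the progress bar in the output above already reports the elapsed time (7:47:13 at epoch 96 of 600).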

Thanks,
Ramit

Hey Brando,

I wanted to follow up here since we haven’t heard back from you. Is this still an issue you are having trouble with? Please let us know if we can be of further assistance.

Thanks,
Ramit

Hi Brando, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!
