I just started using wandb, and I wanted to train two models over the weekend on one GPU, but after a while one of them crashed due to lack of memory. I reduced the validation batch size, then added resume=true to the call to wandb.init, and things started progressing again. Checking in over the weekend, I saw that only one run was marked “running” while the other was “crashed”. I went to look at the actual terminal session where I launched the jobs, and both were still running.
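For context, the init call in both training scripts is essentially just this (a simplified sketch; config is the per-job hyperparameter dict, and both jobs were launched from the same directory):

```python
import wandb

# Both training jobs do essentially this; only config and resume are passed to init.
# config holds the per-job hyperparameters.
run = wandb.init(config=config, resume=True)
```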
At this point I had two runs under my project, as I’d deleted all the previous failed attempts. I assumed I’d accidentally deleted the wrong run from the UI, but when I looked at the graphs I saw that training accuracy and loss both went backwards at one epoch.
This shouldn’t happen, so I went to look at the logs for the runs in MLflow (I’m still using it while I try out wandb), and the train accuracy for both runs was monotonically increasing. Looking closer at the actual values and the logs, I think both runs are submitting values to the same “run”. The graph said accuracy was 0.973 at epoch 6 and 0.9711 at epoch 7. Looking at my terminal logs for the most recent epoch of each run, I saw:
Scrolling up to the top of each log, I see both are using runs/ajydp67n. I’m guessing this is because I didn’t specify anything other than config when calling init. Does wandb not disambiguate runs based on the value of config?
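From what I can tell from the docs, the fix on my side would be to give each job its own explicit run ID instead of relying on auto-resume, something like this (a sketch, not tested):

```python
import wandb

# Give each training job its own id up front, so resuming can't land on the other job's run.
run_id = wandb.util.generate_id()   # or load a previously saved id for this job

run = wandb.init(
    id=run_id,          # unique per job
    resume="allow",     # resume this id if it already exists, otherwise start a new run
    config=config,      # per-job hyperparameters, as before
)
```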
Possibly unrelated: I also saw these logs about saving my model:
wandb: ERROR Can't save model in the h5py format. The model will be saved as W&B Artifacts in the SavedModel format.
WARNING:absl:Function `_wrapped_model` contains input name(s) args_0 with unsupported characters which will be renamed to args_0_2 in the SavedModel.
WARNING:absl:Found untraced functions such as _precision, _recall, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op while saving (showing 5 of 22). These functions will not be directly callable after loading.
wandb: Adding directory to artifact (/.../wandb/run-20220717_000101-ajydp67n/files/model-best)... Done. 0.1s
but when I look, there’s nothing called “model-best” in wandb/run-20220717_000101-ajydp67n/files. There are a bunch of logs there, but nothing that looks like a serialized model.
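If I’m reading the docs right, the SavedModel should still be retrievable through the Artifacts API rather than from the local files directory, roughly like this (the entity/project and artifact name here are my guesses; the real name should show up under the run’s Artifacts tab):

```python
import wandb

api = wandb.Api()
# Artifact path below is an assumption; check the run's Artifacts tab for the actual name/alias.
artifact = api.artifact("my-entity/my-project/model-best:latest", type="model")
local_dir = artifact.download()  # downloads the SavedModel directory locally
print(local_dir)
```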
Reading the docs, it seems like this is expected behaviour, though it’s still annoying. I feel like wandb should at least crash when trying to join a run that is already “running”; even better would be automatically discovering the run ID to resume based on config values.