Chart and artifact logging is very non-deterministic when the process is running in the cloud

I’m running finetuning jobs in the cloud on the runpod.io service, and I use wandb to log training and evaluation metrics and to save the model checkpoints at the end of the finetuning process.

The logging is highly non-deterministic - all charts and events might be logged fine in one run, and then be completely missing in the next.
In most runs, though, only a seemingly random subset of charts is logged. For example, if I have 3 metrics A, B and C, in run 1 only A will be logged, in run 2 A and C might be logged, in run 3 only B might be logged - and so on.
The same applies to the artifacts - sometimes they are logged, sometimes they aren’t. It’s pure guesswork.
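
To make the pattern above concrete, here is a minimal sketch of how the missing charts could be tallied per run. The run names and the `missing_metrics` helper are hypothetical, and the sketch assumes the logged metric keys for each run have already been collected by hand or exported from the run pages.

```python
from typing import Dict, Iterable, Set


def missing_metrics(
    expected: Iterable[str],
    logged_by_run: Dict[str, Iterable[str]],
) -> Dict[str, Set[str]]:
    """Return, for every run, the expected metric keys that were never logged."""
    expected_set = set(expected)
    return {run: expected_set - set(keys) for run, keys in logged_by_run.items()}


# Hypothetical data mirroring the A/B/C example: three expected metrics,
# with a different subset showing up in each run.
EXPECTED = ["A", "B", "C"]
LOGGED = {
    "run-1": ["A"],
    "run-2": ["A", "C"],
    "run-3": ["B"],
}

report = missing_metrics(EXPECTED, LOGGED)
for run, missing in sorted(report.items()):
    print(run, "is missing:", sorted(missing))
```

A table like this only shows whether the gaps follow a pattern (for example, the same callback's metrics always dropping together); it does not identify the cause.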

This makes relying on wandb in our projects really difficult. I would like to ask for your help in understanding what might be causing this and, if it’s a configuration issue, how I should configure the job properly.

I’m happy to share how my stack is configured and grant you access to the code, but TL;DR:

  • I’m using the Hugging Face transformers.Trainer for finetuning. It logs all training metrics for me. The logging frequency for those has been set.
  • I’m using custom (proprietary) Hugging Face transformers.TrainerCallback subclasses to log the evaluation metrics and the artifacts.
  • I’m running the training job in a Docker container on a pod in runpod.io.

Please let me know what details you would require to assist me with this problem.

Kind regards,
Piotr Trochim

Hi @ptrochim,
Could you share a link to a few runs where you are seeing different logging behavior, so I can take a look?

Also, if you are able to share a minimal reproduction of your code that we can use for testing, I can try to reproduce this on my side.

Thank you,
Nate

Hi @ptrochim, I wanted to follow up and see whether this is still an issue.

Hi Nathan,

It turns out this was caused by a bug in my code.
Thank you for following up; let’s close the thread.
Piotr
