I’m running finetuning jobs in the cloud on runpod.io, and I use wandb to log training and evaluation metrics as well as to save the model checkpoints at the end of the finetuning process.
The logging is highly non-deterministic: all charts and events might be logged fine in one run and then be completely missing in the next.
In most runs, though, only a random subset of the charts gets logged. For example, if I have 3 metrics A, B and C, run 1 might log only A, run 2 might log A and C, run 3 might log only B, and so on.
The same applies to the artifacts: sometimes they are logged, sometimes they aren’t. It’s pure guesswork.
This makes it really difficult to rely on wandb in our projects. I would like to ask for your help in understanding what might be causing this and, if it’s a configuration issue, how I should configure the job properly.
I’m happy to share how my stack is configured and grant you access to the code, but TL;DR:
- I’m using Huggingface transformers.Trainer for finetuning. It logs all training metrics for me, and I have set the logging frequency for those explicitly.
- I’m using my own custom Huggingface transformers.TrainerCallbacks to log the evaluation metrics and the artifacts.
- I’m running the training job inside a Docker container on a runpod.io pod.
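To make the setup above easier to discuss, here is a minimal sketch of what such a callback could look like. It is illustrative only and not the actual project code: the class name `EvalAndArtifactLogger`, the `checkpoint_dir` and `artifact_name` parameters, and the choice to store the trainer step under the `train/global_step` key are all assumptions. The `run` argument is assumed to be the object returned by `wandb.init()`. The sketch logs only from the main process, and it waits for the artifact upload before returning, on the assumption that a container which exits early could cut an upload short.

```python
from typing import Any, Dict, Optional

try:
    from transformers import TrainerCallback
except ImportError:  # fallback so the sketch can be run without transformers installed
    class TrainerCallback:  # type: ignore[no-redef]
        pass


class EvalAndArtifactLogger(TrainerCallback):
    """Hypothetical callback: logs eval metrics and uploads the final checkpoint."""

    def __init__(self, run: Any, checkpoint_dir: str,
                 artifact_name: str = "finetuned-model"):
        self.run = run  # assumed: the run object returned by wandb.init()
        self.checkpoint_dir = checkpoint_dir
        self.artifact_name = artifact_name

    def on_evaluate(self, args, state, control,
                    metrics: Optional[Dict[str, float]] = None, **kwargs):
        # Log from the main process only, so parallel workers do not
        # send competing copies of the same metrics.
        if not state.is_world_process_zero or not metrics:
            return
        # One log call per evaluation; the trainer step is stored as a
        # regular value rather than passed as wandb's `step` argument.
        payload = dict(metrics)
        payload["train/global_step"] = state.global_step
        self.run.log(payload)

    def on_train_end(self, args, state, control, **kwargs):
        if not state.is_world_process_zero:
            return
        # log_artifact accepts a local path plus a name and type.
        artifact = self.run.log_artifact(
            self.checkpoint_dir, name=self.artifact_name, type="model"
        )
        # Block until the upload has completed before the job moves on.
        artifact.wait()
```

With a setup like this, the callback would be passed to the trainer via `callbacks=[EvalAndArtifactLogger(run, "path/to/checkpoint")]` (the path is a placeholder), and `wandb.finish()` would be called as the last step of the script, before the container exits, so that any pending data gets flushed.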
Please let me know what details you would require to assist me with this problem.