Chart and artifact logging is very non-deterministic when the process is running in the cloud

ptrochim · November 20, 2023, 11:28am

I’m running finetuning jobs in the cloud using the runpod.io service, and I use wandb to log training and evaluation metrics as well as to save the model checkpoints at the end of the finetuning process.

The logging is highly non-deterministic - all charts and events might be logged fine in one run, and then be completely missing in the next.
In most runs, though, only a random subset of the charts is logged. For example, with 3 metrics A, B and C, run 1 might log only A, run 2 might log A and C, run 3 might log only B, and so on.
The same applies to the artifacts - sometimes they are logged, sometimes they aren’t. It’s real guesswork.

This makes relying on wandb in our projects really difficult. I would like to ask for your help in understanding what might be causing this and, if it’s a configuration issue, how I should configure the job properly.

I’m happy to share how my stack is configured and grant you access to the code, but TL;DR:

I’m using Hugging Face transformers.Trainer for finetuning. It logs all training metrics for me. The logging frequency for those has been set.
I’m using my own custom transformers.TrainerCallback subclasses to log evaluation metrics and the artifacts.
I’m running the training job in a Docker container on a pod in runpod.io.
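
For reference, a setup of this shape could look roughly like the sketch below. It is not the code from this project: the class name, the artifact name and the choice to upload `args.output_dir` are all assumptions made for illustration. It logs evaluation metrics from the main process only and records the step as an ordinary field instead of passing an explicit `step=` to the log call.

```python
try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback so the sketch can be read and exercised without transformers installed.
    TrainerCallback = object


class EvalAndArtifactCallback(TrainerCallback):
    """Hypothetical callback: logs eval metrics and a final model artifact to W&B."""

    def __init__(self, run, artifact_name="finetuned-model"):
        self.run = run  # the run object returned by wandb.init()
        self.artifact_name = artifact_name  # hypothetical artifact name

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # Log from the main process only, so parallel workers do not each log.
        if not state.is_world_process_zero or not metrics:
            return
        payload = dict(metrics)
        # Keep the training step as a regular field rather than using step=...
        payload["train/global_step"] = state.global_step
        self.run.log(payload)

    def on_train_end(self, args, state, control, **kwargs):
        if not state.is_world_process_zero:
            return
        import wandb  # imported lazily; only needed to build the artifact

        # Assumes the final checkpoint files were saved under args.output_dir.
        artifact = wandb.Artifact(self.artifact_name, type="model")
        artifact.add_dir(args.output_dir)
        self.run.log_artifact(artifact)
```

A callback like this would be passed as `Trainer(..., callbacks=[EvalAndArtifactCallback(run)])`, with `wandb.finish()` called after `trainer.train()` returns so that pending uploads complete before the pod shuts down.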

Please let me know what details you would require to assist me with this problem.

Kind regards,
Piotr Trochim

nathank · November 22, 2023, 4:21pm

Hi @ptrochim,
Could you share a link to a few runs where you are seeing different logging behavior so I can take a look?

Also, if you are able to share a minimal reproduction of your code that we can test with, I can try to reproduce this on my side.

Thank you,
Nate

nathank · December 4, 2023, 11:00pm

Hi @ptrochim, I wanted to follow up and see if this is still an issue.

ptrochim · December 4, 2023, 11:16pm

Hi Nathan,

Turns out this was caused by a bug in my code.
Thank you for following up, let’s close the thread.
Piotr

system · February 2, 2024, 11:16pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
