I use wandb to track my experiments.
The experiments run on a GKE cluster using MLflow Projects.
However, starting several weeks ago (I suspect around the release of wandb 0.15.0), my training job no longer exits right after training finishes. Instead, the process stays alive for almost 24 hours before it finally exits.
I didn’t make any breaking changes in my code, so I suspect something changed on the wandb side.
For that reason, I have pinned wandb to 0.14.2 instead of using 0.15.0.
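In case it helps with reproducing this, here is a minimal sketch of how the pinned version could be checked when the job starts. The helper names (`installed_version`, `is_pinned`) are hypothetical and not part of wandb; the sketch only relies on Python's standard `importlib.metadata` and assumes wandb was installed as the `wandb` package.

```python
from importlib.metadata import PackageNotFoundError, version

# Version I pinned to; the slow exit showed up for me around 0.15.0.
PINNED_WANDB = "0.14.2"


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_pinned(found, expected=PINNED_WANDB):
    """True only when the found version exactly matches the expected pin."""
    return found == expected


if __name__ == "__main__":
    found = installed_version("wandb")
    if is_pinned(found):
        print(f"wandb {found} matches the pin")
    else:
        print(f"wandb version is {found}, expected {PINNED_WANDB}")
```

Running this inside the training image before the job starts would show whether the container really picked up 0.14.2 or silently resolved to a newer release.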
Could I get some help with this?
Here are the last log lines from the running process:
wandb: Waiting for W&B process to finish… (success).
wandb: Network error (ReadTimeout), entering retry loop.