Hi there!
In my use case I’m running a training loop and storing a model at a regular interval. During training, I also want to evaluate metrics such as the CIDEr score for image captioning. The problem is, computing these metrics takes a lot of time (~40 minutes), and the training is running on a cluster where I can’t evaluate the metrics for several reasons.
So my plan is to load the stored models on a separate machine after every update, and evaluate the metrics there. Once done, I would like to log the metrics to the ongoing training runs, with a step parameter set to the time when the model was stored. So by the time the evaluation is finished, the training runs will have progressed in steps.
Is this possible using the wandb api, without getting concurrency problems?
Thanks!