I’m logging several metrics which, by virtue of my training data/procedure, are occasionally undefined during training and produce NaNs.
The problem is as follows. With on_step logging, I can see, for each step, whether a number or a NaN was logged. With on_epoch logging, however, the metrics are always reported as NaN. I suspect this is because the epoch-level value is the mean of the recorded step values, and a single NaN makes the whole mean NaN.
Is there some way to specify a nanmean aggregation operation when computing epoch-level metrics?
Are you using the W&B PyTorch Lightning integration? If so, wandb simply records the values that PTL hands it. You are correct that when on_epoch logging is enabled, PTL logs an aggregation of the step values, so a single NaN step makes every epoch-level value NaN. To log a different aggregation, you will have to define your own custom reduction, as explained in PTL's documentation; a sketch is below.
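For reference, here is a minimal sketch of what that might look like. It assumes a PyTorch version that has `torch.nanmean` (1.10+) and that your Lightning version accepts a callable for `self.log`'s `reduce_fx` argument; the module itself is a toy stand-in, not your actual model.

```python
import torch
import pytorch_lightning as pl


class NanSafeModule(pl.LightningModule):
    """Toy module that logs a metric which is sometimes NaN
    without poisoning the epoch-level mean."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)

        # Stand-in for a metric that is occasionally undefined.
        metric = loss.detach() if batch_idx % 2 == 0 else torch.tensor(float("nan"))

        # Replace the default torch.mean with torch.nanmean so that
        # NaN steps are ignored when Lightning aggregates over the epoch.
        self.log("metric", metric, on_step=True, on_epoch=True,
                 reduce_fx=torch.nanmean)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)
```

An even simpler workaround, if it suits your use case, is to only call `self.log` on steps where the metric is finite (e.g. guard with `if not torch.isnan(metric):`), so the default mean never sees a NaN in the first place.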
Hi Jonathan, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!