I’m logging several metrics which, by virtue of my training data/procedure, are occasionally undefined during training and produce NaNs.
The problem is as follows. For `on_step`-level logging, I can see each step where a number or a NaN was logged. However, for `on_epoch`-level logging, the metrics are always reported as NaN. I think this is because the `on_epoch`-level logging takes the mean of the recorded values over the epoch, and some of those values are NaN.

Is there some way to specify a `nanmean` aggregation when computing epoch-level metrics?
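To illustrate what I mean (the metric values here are made up), a single NaN poisons a plain mean over the epoch, while a nanmean-style aggregation would just skip the undefined entries:

```python
import math
from statistics import fmean

# Per-step metric values over one epoch; one step was undefined.
values = [0.12, float("nan"), 0.08, 0.10]

# The default epoch-level aggregation (a plain mean) becomes NaN:
epoch_mean = fmean(values)
assert math.isnan(epoch_mean)

# A nanmean-style aggregation drops the undefined entries instead:
finite = [v for v in values if not math.isnan(v)]
nan_mean = fmean(finite)  # 0.10
```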
Hello @jonathanking!
Are you using the W&B PyTorch Lightning integration? If so, `wandb` simply takes the values that PTL provides it. You are correct that PTL computes an aggregation when you enable `on_epoch` logging, which is why a NaN gets logged for every epoch. You will have to define your own custom reduction, as explained in PTL's documentation, in order to log a different aggregation.
Hi Jonathan, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!