RuntimeError: max must be larger than min SCALER

Hi all,
I'm getting this weird RuntimeError during training, at epoch 129.

Traceback (most recent call last):
  File "/home/anton/Documents/GitHub/horse2depth_Pix2Pix/train_depth_loss.py", line 715, in <module>
  File "/home/anton/Documents/GitHub/horse2depth_Pix2Pix/train_depth_loss.py", line 630, in main
  File "/home/anton/Documents/GitHub/horse2depth_Pix2Pix/train_depth_loss.py", line 315, in train_fn
    # g_scaler.scale(G_loss).backward()
  File "/usr/anaconda3/envs/CGAN/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/anaconda3/envs/CGAN/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
  File "/usr/anaconda3/envs/CGAN/lib/python3.10/site-packages/wandb/wandb_torch.py", line 264, in <lambda>
    handle = var.register_hook(lambda grad: _callback(grad, log_track))
  File "/usr/anaconda3/envs/CGAN/lib/python3.10/site-packages/wandb/wandb_torch.py", line 262, in _callback
    self.log_tensor_stats(grad.data, name)
  File "/usr/anaconda3/envs/CGAN/lib/python3.10/site-packages/wandb/wandb_torch.py", line 213, in log_tensor_stats
    tensor = flat.histc(bins=self._num_bins, min=tmin, max=tmax)
RuntimeError: max must be larger than min

This is the first time it has happened.

Any help?

Thanks

Hi @aa_technion ,

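It sounds to me like you may be encountering exploding or vanishing gradients, which could be leading to overflow/underflow issues. The traceback shows the error being raised inside the gradient hook that wandb.watch registers, while it builds a histogram of the gradient values. Here are some debugging steps I can suggest.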

  • Ensure that you’re calling optimizer.zero_grad() before each batch
  • Try normalizing the weights and inputs
  • Try implementing gradient clipping.
  • Set log=None in your wandb.watch() call to turn off gradient logging. If your training loss then becomes NaN, that should be addressed by normalizing the data.
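
As a rough illustration of the first and third suggestions, here is a minimal sketch of a training step that zeroes the gradients, clips their norm, and skips the update when a gradient is not finite. The names train_step and grads_are_finite and the default max_norm=1.0 are made up for this example and are not taken from your script. It is written for plain full-precision training so that it runs on CPU.

```python
import torch
from torch import nn


def grads_are_finite(model: nn.Module) -> bool:
    """Return True if every existing gradient holds only finite values."""
    return all(
        torch.isfinite(p.grad).all().item()
        for p in model.parameters()
        if p.grad is not None
    )


def train_step(model, optimizer, loss_fn, inputs, targets, max_norm=1.0):
    """One training step with zeroed gradients and gradient-norm clipping.

    Hypothetical helper; max_norm=1.0 is an arbitrary starting value.
    Returns the loss and the total gradient norm measured before clipping.
    """
    optimizer.zero_grad()  # clear gradients left over from the previous batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Rescale gradients in place so their total L2 norm is at most max_norm.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if grads_are_finite(model):
        optimizer.step()  # skip the update when any gradient is NaN or inf
    return loss.item(), float(total_norm)
```

If you train with mixed precision through a GradScaler, as the g_scaler line in your traceback suggests, the gradients are still scaled after backward(), so you would call g_scaler.unscale_(optimizer) before clipping and then use g_scaler.step(optimizer) and g_scaler.update() in place of optimizer.step(). Logging the returned norm over a few epochs is a cheap way to see whether the gradients are growing before the crash.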

Please let me know if any of these work for you. If they don’t:

  • Provide a code example in the form of a Colab notebook so that we can attempt to reproduce your specific issue.
  • Additionally, include the run debug logs (debug.log and debug-internal.log) for the runs that error out. They are located in wandb/run-DATETIME-ID/logs relative to your working directory.
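
If it helps with collecting those files, here is a small sketch that lists them. It assumes the wandb/run-DATETIME-ID/logs layout described above, and find_wandb_debug_logs is a hypothetical helper name, not part of the wandb package.

```python
from pathlib import Path


def find_wandb_debug_logs(workdir="."):
    """List debug.log and debug-internal.log under wandb/run-*/logs.

    Assumes the directory layout described above; returns sorted paths.
    """
    root = Path(workdir) / "wandb"
    if not root.is_dir():
        return []
    found = []
    for run_dir in root.glob("run-*"):
        for name in ("debug.log", "debug-internal.log"):
            candidate = run_dir / "logs" / name
            if candidate.is_file():
                found.append(candidate)
    return sorted(found)


if __name__ == "__main__":
    # Print every debug log found below the current working directory.
    for log_path in find_wandb_debug_logs():
        print(log_path)
```

Run it from the directory where you launch training, then attach the files for the run that failed.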

Thank you,

Hi @aa_technion, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!