Why are min and max causing errors when logging gradients for biases in my model?

Why is this error happening when wandb is logging the grad.data field?

  File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 360, in <module>
    main(args)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 333, in main
    meta_train_fixed_iterations_full_epoch_possible(args)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/meta_learning/training/meta_training.py", line 216, in meta_train_fixed_iterations_full_epoch_possible
    log_train_val_stats(args, args.it, train_loss, train_acc, valid=meta_eval, bar=bar_it,
  File "/home/miranda9/automl-meta-learning/automl-proj-src/meta_learning/training/meta_training.py", line 129, in log_train_val_stats
    val_loss, val_acc = valid(args, save_val_ckpt=save_val_ckpt)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/meta_learning/training/meta_training.py", line 274, in meta_eval
    eval_loss, eval_acc = args.meta_learner(spt_x, spt_y, qry_x, qry_y)
  File "/home/miranda9/miniconda3/envs/metalearning_cpu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/meta_learning/meta_learners/maml_meta_learner.py", line 159, in forward
    (qry_loss_t / meta_batch_size).backward()  # note this is more memory efficient (as it removes intermediate data that used to be needed since backward has already been called)
  File "/home/miranda9/miniconda3/envs/metalearning_cpu/lib/python3.9/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/miranda9/miniconda3/envs/metalearning_cpu/lib/python3.9/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/home/miranda9/miniconda3/envs/metalearning_cpu/lib/python3.9/site-packages/wandb/wandb_torch.py", line 285, in <lambda>
    handle = var.register_hook(lambda grad: _callback(grad, log_track))
  File "/home/miranda9/miniconda3/envs/metalearning_cpu/lib/python3.9/site-packages/wandb/wandb_torch.py", line 283, in _callback
    self.log_tensor_stats(grad.data, name)
  File "/home/miranda9/miniconda3/envs/metalearning_cpu/lib/python3.9/site-packages/wandb/wandb_torch.py", line 235, in log_tensor_stats
    tensor = flat.histc(bins=self._num_bins, min=tmin, max=tmax)
RuntimeError: max must be larger than min

I am not doing anything unusual myself, so I am unsure how I can fix this…
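For reference, the failing call is wandb's log_tensor_stats on grad.data, which passes a per-tensor tmin/tmax to torch.Tensor.histc, and as far as I can tell histc raises exactly this error whenever the max it receives is not strictly greater than the min. A tiny standalone sketch of that constraint (made-up values, nothing from my actual training code):

import torch

t = torch.randn(100)

# Fine: max is strictly greater than min.
print(t.histc(bins=64, min=t.min().item(), max=t.max().item()).shape)  # torch.Size([64])

# A degenerate/inverted range reproduces the same RuntimeError.
try:
    t.histc(bins=64, min=1.0, max=-1.0)
except RuntimeError as e:
    print(e)  # max must be larger than min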

Hmm, this appears to be a new bug. We fixed a similar RuntimeError a while back, so it’s surprising to me that this is happening. For now, turning off gradient logging in watch (i.e., passing log=None to wandb.watch) will prevent this from triggering.
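Roughly, assuming you call wandb.watch(model) somewhere in your setup, the change would look like this (the tiny model and project name below are just placeholders, not your actual code):

import torch.nn as nn
import wandb

model = nn.Linear(4, 2)           # placeholder for your actual meta-learner

wandb.init(project="my-project")  # placeholder project name

# log=None turns off the gradient/parameter histograms, i.e. the code path
# in wandb_torch.py that registers the hook calling histc.
wandb.watch(model, log=None)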

I suggest you raise this issue with W&B Support via the small gray bubble in the bottom-right of the screen on https://wandb.ai. If you don’t see a bubble, turn off any ad-blockers or email support@wandb.com.

I get the same error message.

DcnPool553Model(
  (encoder): DcnPool553ModelEncoder(
    (conv1): Sequential(
      (0): Conv3d(1, 32, kernel_size=(5, 5, 5), stride=(1, 1, 1))
      (1): LeakyReLU(negative_slope=0.01)
    )
    (conv2): Sequential(
      (0): Conv3d(32, 64, kernel_size=(5, 5, 5), stride=(1, 1, 1))
      (1): LeakyReLU(negative_slope=0.01)
    )
    (pool1): MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=0, dilation=1, ceil_mode=False)
    (conv3): Sequential(
      (0): DeformConv3d(
        (zero_padding): ConstantPad3d(padding=(0, 0, 0, 0, 0, 0), value=0)
        (conv_kernel): Conv3d(1728, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1))
        (offset_conv_kernel): Conv3d(64, 81, kernel_size=(3, 3, 3), stride=(1, 1, 1))
      )
      (1): LeakyReLU(negative_slope=0.01)
    )
    (pool2): MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=0, dilation=1, ceil_mode=False)
    (fc1): Linear(in_features=3456, out_features=128, bias=True)
    (relu): LeakyReLU(negative_slope=0.01)
    (norm): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (drop): Dropout(p=0.15, inplace=False)
    (fc2): Linear(in_features=128, out_features=128, bias=True)
  )
  (decoder): DcnPool553ModelDecoder(
    (fc2): Linear(in_features=128, out_features=128, bias=True)
    (drop): Dropout(p=0.15, inplace=False)
    (norm): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): LeakyReLU(negative_slope=0.01)
    (fc1): Linear(in_features=128, out_features=3456, bias=True)
    (pool2): MaxUnpool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 0, 0))
    (conv3): Sequential(
      (0): LeakyReLU(negative_slope=0.01)
      (1): DeformConv3d(
        (zero_padding): ConstantPad3d(padding=(2, 2, 2, 2, 2, 2), value=0)
        (conv_kernel): Conv3d(3456, 64, kernel_size=(1, 1, 1), stride=(1, 1, 1))
        (offset_conv_kernel): Conv3d(128, 81, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(2, 2, 2))
      )
    )
    (pool1): MaxUnpool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 0, 0))
    (conv2): Sequential(
      (0): LeakyReLU(negative_slope=0.01)
      (1): ConvTranspose3d(64, 32, kernel_size=(5, 5, 5), stride=(1, 1, 1))
    )
    (conv1): Sequential(
      (0): LeakyReLU(negative_slope=0.01)
      (1): ConvTranspose3d(32, 1, kernel_size=(5, 5, 5), stride=(1, 1, 1))
    )
  )
)
Device used for training: cuda:0
  0% 0/100 [00:00<?, ?it/s]

wandb: Waiting for W&B process to finish, PID 859... (failed 1). Press ctrl-c to abort syncing.
wandb:                                                                                
wandb: Run history:
wandb:   train_epoch ▁▁▁
wandb:    train_loss ▁▁█
wandb: 
wandb: Run summary:
wandb:   train_epoch 0
wandb:    train_loss 607629.75
wandb: 
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Synced true-forest-14: https://wandb.ai/dezzardhd/large_dataset/runs/3qfwvzdm
wandb: Find logs at: ./wandb/run-20220212_210203-3qfwvzdm/logs/debug.log
wandb: 
Traceback (most recent call last):
  File "main.py", line 47, in <module>
    train_setups.start_training_sessions(project=project)
  File "/content/drive/MyDrive/Workspace/large_dataset_0/train_setups.py", line 15, in start_training_sessions
    model_pipeline(config, project=project)
  File "/content/drive/MyDrive/Workspace/large_dataset_0/learning.py", line 76, in model_pipeline
    train(model, train_loader, validation_loader, criterion, optimizer, scheduler, config)
  File "/content/drive/MyDrive/Workspace/large_dataset_0/learning.py", line 149, in train
    loss = train_batch(batch, model, optimizer, criterion)
  File "/content/drive/MyDrive/Workspace/large_dataset_0/learning.py", line 176, in train_batch
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
  File "/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py", line 282, in <lambda>
    handle = var.register_hook(lambda grad: _callback(grad, log_track))
  File "/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py", line 280, in _callback
    self.log_tensor_stats(grad.data, name)
  File "/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py", line 232, in log_tensor_stats
    tensor = flat.histc(bins=self._num_bins, min=tmin, max=tmax)
RuntimeError: max must be larger than min

Thanks for the tip!

After setting wandb.watch(log=None), the RuntimeError disappeared.
My train loss then became NaN; I managed to fix that by normalizing the data (roughly as sketched below).
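In case it is useful to anyone else, the normalization was nothing fancy; this is a rough sketch of the idea (the helper and the shapes are just for illustration, not my actual pipeline):

import torch

def standardize(volume: torch.Tensor) -> torch.Tensor:
    # Zero-mean / unit-variance scaling of an input volume.
    mean = volume.mean()
    std = volume.std().clamp_min(1e-8)  # guard against constant inputs
    return (volume - mean) / std

# Fake 1-channel 3D volume with large raw intensities, standing in for my data.
x = torch.rand(2, 1, 32, 32, 32) * 1e4
x = standardize(x)
print(x.mean().item(), x.std().item())  # roughly 0 and 1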

Maybe this error occurs because of very large values in the loss… I’m not sure.

Hi @dezzardhd, this sounds like an issue with overflow/underflow.
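A quick way to confirm would be a fail-fast check right after backward(), before anything gets logged. This is just a sketch; assert_finite is a made-up helper, not part of wandb or your code:

import torch
import torch.nn as nn

def assert_finite(loss: torch.Tensor, model: nn.Module) -> None:
    # Fail fast if the loss or any gradient is NaN/inf, instead of letting
    # wandb's histogram call fail later inside histc.
    if not torch.isfinite(loss).all():
        raise RuntimeError(f"non-finite loss: {loss.item()}")
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            raise RuntimeError(f"non-finite gradient in parameter {name!r}")

# Tiny usage example with a throwaway model standing in for the real one.
model = nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
assert_finite(loss, model)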
