wandb.watch(model) causing CUDA OOM

I am trying to use wandb's gradient visualization to debug the gradient flow in my neural net on Google Colab. Without wandb logging, training runs without error, using 11 GB of the 16 GB on the P100 GPU. However, adding this line, wandb.watch(model, log='all', log_freq=3), causes a CUDA out-of-memory error. How does wandb logging create extra GPU memory overhead? Is there some way to reduce the overhead? Thank you for your help.


Hello and welcome to the forums @ambrose! :wave:

Please do introduce yourself in the #start-here category if you’d like to!

Please allow me to replicate this issue and ask the team for help.
I'll get back once I'm able to reproduce it. Thanks for the question! :slight_smile:

Hi @bhutanisanyam1,

Thank you for your reply and welcome! I am quite excited to use WandB and join the community.

Ambrose

Hmm, I think wandb is creating extra copies of the gradients during logging. In case it helps, here is the error traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-13de83557b55> in <module>()
     60         get_ipython().system("nvidia-smi | grep MiB | awk '{print $9 $10 $11}'")
     61 
---> 62         loss.backward()
     63 
     64         print('check 10')

4 frames
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    253                 create_graph=create_graph,
    254                 inputs=inputs)
--> 255         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    256 
    257     def register_hook(self, hook):

/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    147     Variable._execution_engine.run_backward(
    148         tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 149         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    150 
    151 

/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in <lambda>(grad)
    283             self.log_tensor_stats(grad.data, name)
    284 
--> 285         handle = var.register_hook(lambda grad: _callback(grad, log_track))
    286         self._hook_handles[name] = handle
    287         return handle

/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in _callback(grad, log_track)
    281             if not log_track_update(log_track):
    282                 return
--> 283             self.log_tensor_stats(grad.data, name)
    284 
    285         handle = var.register_hook(lambda grad: _callback(grad, log_track))

/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in log_tensor_stats(self, tensor, name)
    219         # Remove nans from tensor. There's no good way to represent that in histograms.
    220         flat = flat[~torch.isnan(flat)]
--> 221         flat = flat[~torch.isinf(flat)]
    222         if flat.shape == torch.Size([0]):
    223             # Often the whole tensor is nan or inf. Just don't log it in that case.

RuntimeError: CUDA out of memory. Tried to allocate 4.65 GiB (GPU 0; 15.90 GiB total capacity; 10.10 GiB already allocated; 717.75 MiB free; 14.27 GiB reserved in total by PyTorch)

Indeed, commenting out the offending line flat = flat[~torch.isinf(flat)] lets the wandb logging step just barely fit into GPU memory, but that is not a great solution.
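A less destructive workaround might be to do the filtering on the CPU instead, so the boolean masks and filtered copies never touch GPU memory. A rough sketch of what I mean (the function name is mine, not wandb's actual API):

```python
import torch

def log_tensor_stats_cpu_safe(tensor):
    """Illustrative stand-in for wandb's histogram prep: filter out
    nan/inf on the CPU so no extra GPU memory is allocated."""
    # detach() + cpu() copies the gradient to host RAM once;
    # every temporary below then lives off the GPU.
    flat = tensor.detach().flatten().cpu()
    # torch.isfinite rejects both nan and inf with a single mask,
    # instead of separate isnan/isinf passes and two filtered copies.
    flat = flat[torch.isfinite(flat)]
    if flat.numel() == 0:
        # The whole tensor was nan/inf; nothing sensible to histogram.
        return None
    return flat
```

On a CUDA tensor this trades one device-to-host copy per logged gradient for all of the extra GPU allocations, which seems like an acceptable cost at log_freq=3.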