wandb.watch(model) causing CUDA OOM

I am trying to use wandb's gradient visualization to debug the gradient flow in my neural net on Google Colab. Without wandb logging, training runs without error, using 11 GB of the 16 GB on the P100 GPU. However, adding this line, wandb.watch(model, log='all', log_freq=3), causes a CUDA out-of-memory error. How does wandb logging create extra GPU memory overhead? Is there some way to reduce the overhead? Thank you for your help.


Hello and welcome to the forums @ambrose! :wave:

Please do introduce yourself in the #start-here category if you’d like to!

Please allow me to replicate this issue and ask the team for help.
I'll get back once I'm able to reproduce it. Thanks for the question! :slight_smile:

Hi @bhutanisanyam1,

Thank you for your reply and welcome! I am quite excited to use WandB and join the community.

Ambrose

Hmm, I think wandb is creating extra copies of the gradients during logging. In case it helps, here is the error traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-13de83557b55> in <module>()
     60         get_ipython().system("nvidia-smi | grep MiB | awk '{print $9 $10 $11}'")
     61 
---> 62         loss.backward()
     63 
     64         print('check 10')

4 frames
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    253                 create_graph=create_graph,
    254                 inputs=inputs)
--> 255         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    256 
    257     def register_hook(self, hook):

/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    147     Variable._execution_engine.run_backward(
    148         tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 149         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    150 
    151 

/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in <lambda>(grad)
    283             self.log_tensor_stats(grad.data, name)
    284 
--> 285         handle = var.register_hook(lambda grad: _callback(grad, log_track))
    286         self._hook_handles[name] = handle
    287         return handle

/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in _callback(grad, log_track)
    281             if not log_track_update(log_track):
    282                 return
--> 283             self.log_tensor_stats(grad.data, name)
    284 
    285         handle = var.register_hook(lambda grad: _callback(grad, log_track))

/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in log_tensor_stats(self, tensor, name)
    219         # Remove nans from tensor. There's no good way to represent that in histograms.
    220         flat = flat[~torch.isnan(flat)]
--> 221         flat = flat[~torch.isinf(flat)]
    222         if flat.shape == torch.Size([0]):
    223             # Often the whole tensor is nan or inf. Just don't log it in that case.

RuntimeError: CUDA out of memory. Tried to allocate 4.65 GiB (GPU 0; 15.90 GiB total capacity; 10.10 GiB already allocated; 717.75 MiB free; 14.27 GiB reserved in total by PyTorch)

Indeed, commenting out the offending line flat = flat[~torch.isinf(flat)] lets the wandb logging step just barely fit into GPU memory, but that is not a great solution.
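A less destructive workaround might be to do the filtering on the CPU instead, so the boolean masks and filtered copies never touch GPU memory. A rough sketch of what I mean (the function name is mine, not wandb's actual API):

```python
import torch

def log_tensor_stats_cpu_safe(tensor):
    """Illustrative stand-in for wandb's histogram prep: filter out
    nan/inf on the CPU so no extra GPU memory is allocated."""
    # detach() + cpu() copies the gradient to host RAM once;
    # every temporary below then lives off the GPU.
    flat = tensor.detach().flatten().cpu()
    # torch.isfinite rejects both nan and inf with a single mask,
    # instead of separate isnan/isinf passes and two filtered copies.
    flat = flat[torch.isfinite(flat)]
    if flat.numel() == 0:
        # The whole tensor was nan/inf; nothing sensible to histogram.
        return None
    return flat
```

On a CUDA tensor this trades one device-to-host copy per logged gradient for all of the extra GPU allocations, which seems like an acceptable cost at log_freq=3.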