Hi, I’m using the Hugging Face Trainer framework to train my model and logging everything to W&B. The W&B gradient histogram logger shows me the following pictures:
The out_proj layer is the last one in my architecture, so it was very unusual to see such large gradient magnitudes there. I took a look inside and found that the gradients are actually small; they only look huge in the intermediate step where AMP's GradScaler has multiplied them by the loss scale but not yet unscaled them. So my guess is that W&B records the gradients at backward time, before the grad scaler unscales them (a minimal repro is sketched after the dumps below).
Before unscaling:
model_ref.classifier.out_proj.weight.grad
tensor([[ 354.6875, -1280.5000, 538.1250, ..., 1188.5000, -150.5625,
1870.5000],
[ 93.0205, -1208.0000, 390.3750, ..., 534.8750, -38.1250,
738.2500],
[ -5.9375, 2244.5000, -1384.5000, ..., -503.1875, -11.0625,
-1256.3125],
[ 211.8281, 488.5000, 37.3750, ..., -1205.5000, 494.1250,
-1308.5000],
[ -664.7500, -67.0000, -156.6250, ..., 185.0000, -59.9531,
923.7500],
[ 11.0000, -177.6250, 574.6250, ..., -199.1250, -234.6953,
-966.3750]], device='cuda:0')
After unscaling:
model_ref.classifier.out_proj.weight.grad
tensor([[ 5.4121e-03, -1.9539e-02, 8.2111e-03, ..., 1.8135e-02,
-2.2974e-03, 2.8542e-02],
[ 1.4194e-03, -1.8433e-02, 5.9566e-03, ..., 8.1615e-03,
-5.8174e-04, 1.1265e-02],
[-9.0599e-05, 3.4248e-02, -2.1126e-02, ..., -7.6780e-03,
-1.6880e-04, -1.9170e-02],
[ 3.2322e-03, 7.4539e-03, 5.7030e-04, ..., -1.8394e-02,
7.5397e-03, -1.9966e-02],
[-1.0143e-02, -1.0223e-03, -2.3899e-03, ..., 2.8229e-03,
-9.1481e-04, 1.4095e-02],
[ 1.6785e-04, -2.7103e-03, 8.7681e-03, ..., -3.0384e-03,
-3.5812e-03, -1.4746e-02]], device='cuda:0')
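For context, the dumps above came from inspecting model_ref.classifier.out_proj.weight.grad around the optimizer step in a debugger. Here is a minimal standalone sketch that reproduces the same before/after gap with a plain GradScaler; the toy linear layer and shapes are stand-ins for my real model, not my actual code:

import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(16, 4).cuda()  # toy stand-in for the real out_proj
optimizer = torch.optim.AdamW(model.parameters())
scaler = GradScaler()  # default initial loss scale is 2**16

x = torch.randn(8, 16, device="cuda")
with autocast():
    loss = model(x).square().mean()

scaler.scale(loss).backward()
# Hooks that fire during backward (like W&B's) see the scaled gradients here:
print("before unscaling:", model.weight.grad.abs().max().item())

scaler.unscale_(optimizer)  # divides the grads in-place by the current loss scale
print("after unscaling: ", model.weight.grad.abs().max().item())

scaler.step(optimizer)
scaler.update()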
If that is correct, I think this should be fixed (or at least documented) to avoid misunderstanding (e.g. I initially thought something was wrong with my model).
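Until then, a possible workaround (a sketch only; this assumes a hand-written AMP loop, since the HF Trainer drives the scaler internally and would need a callback or subclass instead): call scaler.unscale_(optimizer) yourself and log the histogram afterwards with wandb.log, rather than relying on the automatic gradient hooks:

import wandb

# ...inside the training loop from the sketch above...
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # grads now hold their true, unscaled values
wandb.log({
    "gradients/out_proj.weight": wandb.Histogram(
        # model_ref.classifier.out_proj.weight.grad in my real setup
        model.weight.grad.detach().float().cpu().numpy()
    )
})
scaler.step(optimizer)  # step() skips unscaling when unscale_ was already called
scaler.update()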