Yes. That’s correct!
Sure! Join me for our live coding session next week and I will make that happen.
Hello Aman, the video link above somehow isn't working. Could you check whether it is publicly accessible?
Good catch - since we didn't live stream this session to YouTube, we're uploading the recording later today. I've removed the link for now and will add the new one once it's up.
The link to the paper is https://arxiv.org/pdf/1512.03385.pdf
The link in the first post points to MDETR instead.
This is a good one for more clarity.
Can anybody explain what the vanishing gradient problem is? I understand that the gradients become too small by the time they reach the early layers, so the updates there are almost negligible and training effectively stops. But why does the gradient become so small while traversing back to the early layers? Why are the derivatives generally less than one? That this is the cause of vanishing gradients is understandable, but it's hard to build an intuition for.
Hi @durgaamma2005 ,
Really great question!
I think the reason is as follows, but I would love it if others could correct me where I am wrong. Firstly, I can tell you have already understood that the chain rule of derivatives (multiplication, basically) causes small numbers to get multiplied together, so that by the time the gradient reaches the earlier layers, the update becomes negligible.
But what you want to understand is why the derivatives themselves have less than unit magnitude, right? If so, it is because of the nature of the activation functions. There is a term in backprop in which we have to compute the gradient of the activations with respect to their inputs.
Consider the graph of the sigmoid function and that of its derivative:
Irrespective of what x is, the derivative is always less than one (in fact, the sigmoid's derivative never exceeds 0.25). Now imagine a deep network with many sigmoid activations and you'll understand why the compounding happens.
Also, you can look at tanh activation below.
Here, you may say that in roughly the range [0, 0.5] the derivative actually exceeds the original value, which is true. However, the network ultimately tends to saturate the neurons toward +1 or -1, where the derivative is smaller than the output and its magnitude is still less than unity. Repeatedly multiplying numbers in (0, 1) leads to smaller and smaller gradients backpropagating to the earlier layers.
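To make the compounding concrete, here's a small numeric sketch of my own (pure Python, not from any post above): it checks the peak value of the sigmoid's derivative and then shows what even the best-case per-layer factor does over 20 layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    # derivative of sigmoid: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 with value 0.25 and decays toward 0 elsewhere.
peak = max(d_sigmoid(x / 100.0) for x in range(-1000, 1001))
print(peak)          # 0.25

# Best case: 20 sigmoid layers each contribute at most 0.25 to the chain rule.
print(0.25 ** 20)    # ~9.1e-13 -- effectively zero by the early layers
```

So even if every neuron sat exactly at its sweet spot, a 20-layer sigmoid stack would scale the gradient by about 1e-12 before it reaches the first layer.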
This is essentially the depiction of a skip connection (right) and the residual path (left). Now, the skip connection is supposed to be an identity layer. But why is it supposed to be an identity layer or identity mapping? I mean, why is the concept introduced such that we need to pass the input as-is to successive layers using these jump/skip connections?
I was going through #fastbook chapter 14 on ResNets and saw that when we code ResBlocks, we have learnable parameters in the identity/skip connection.
This is done in order to change the shape of x so that it can be added to conv2(conv1(x)); otherwise there would be a channel mismatch. We also deliberately include a pooling operation in the case of larger strides, to make sure that the spatial dimension of F(x) is the same as that of x.
So, to begin with, the skip connection was never an identity mapping for any stride other than 1, or for any configuration other than ni = nf, i.e. no change in the number of feature maps (which would be bad anyway, because then we're not compressing information in any way as we go from input to output). The moment stride >= 2 or ni != nf, we have to do pooling and/or 1x1 convolutions to make the skip connection work.
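To make the shape mechanics concrete, here's a minimal PyTorch sketch of my own, loosely following the fastbook-style ResBlock described above (the names `ResBlock`, `idconv`, and `pool` are just illustrative, not the book's exact code):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1):
        super().__init__()
        # Residual path F(x): two 3x3 convs, the first one doing any striding
        self.convs = nn.Sequential(
            nn.Conv2d(ni, nf, 3, stride=stride, padding=1),
            nn.ReLU(),
            nn.Conv2d(nf, nf, 3, padding=1),
        )
        # Skip path: a true identity only when shapes already match;
        # otherwise a 1x1 conv fixes channels and pooling fixes spatial dims
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, 1)
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x):
        return torch.relu(self.convs(x) + self.idconv(self.pool(x)))

x = torch.randn(1, 64, 32, 32)
print(ResBlock(64, 64)(x).shape)             # [1, 64, 32, 32] -- true identity skip
print(ResBlock(64, 128, stride=2)(x).shape)  # [1, 128, 16, 16] -- skip needs 1x1 conv + pool
```

With stride=1 and ni == nf, the skip path really is x unchanged; in the downsampling case, F(x) and the transformed x both come out as 128 x 16 x 16, so the addition is legal.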
So, I don't get the logic behind saying the skip connection should ideally learn an identity mapping, while the residual path learns how far that mapping is off from the original input…
Can someone help me understand the intention of the operation?
This is my understanding - check the first video if you have time, and there are also some good resources at the bottom of the notebook (the Andrew Ng sessions are great): https://niyazikemer.com/fastbook/2021/10/24/resnet-live.html
I have implemented ResNet from scratch in PyTorch and posted it as a Kaggle kernel.
It currently achieves 76% accuracy; I will be tuning it further.
Thanks @amanarora for your video series.
As nets get deeper, they show more training error than their shallow counterparts, which doesn't make sense, because a deeper network can be thought of as a shallow one plus identity layers (which just pass the input through as output).
But learning such a mapping is hard. So we introduce these connections, with which the identity is easy: the network just has to drive the residual weights close to zero, and we end up with an identity mapping.
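To see why the identity becomes "easy" with a skip connection, here's a quick PyTorch sketch of my own (not from the post above): with the residual path's weights and biases zeroed out, F(x) + x reduces exactly to x.

```python
import torch
import torch.nn as nn

# A two-conv residual path F(x) with all parameters forced to zero
conv1 = nn.Conv2d(8, 8, 3, padding=1)
conv2 = nn.Conv2d(8, 8, 3, padding=1)
for p in list(conv1.parameters()) + list(conv2.parameters()):
    nn.init.zeros_(p)

x = torch.randn(2, 8, 16, 16)
out = torch.relu(conv2(torch.relu(conv1(x)))) + x   # F(x) + x with F(x) == 0
print(torch.equal(out, x))   # True -- the block collapses to the identity
```

Without the skip connection, the layers would have to learn weights that reproduce x exactly, which is a much harder optimization target than "output zero".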