ResNets - Deep Residual Learning for Image Recognition

amanarora · September 9, 2021, 12:50am

Yes. That’s correct!

amanarora · September 9, 2021, 12:51am

Sure! Join me for our live coding session next week and I will make that happen.

nikem · September 9, 2021, 6:03am

Hello Aman, the video link above is somehow not working. Could you check that it has public access?

lessing · September 9, 2021, 8:34am

Hi Nikem,

Good catch - as we didn’t live stream this session to YouTube we’re uploading the recording later today. Removed the link for now - will add the new one once it’s up

durgaamma2005 · September 10, 2021, 7:42am

link to paper is https://arxiv.org/pdf/1512.03385.pdf
link in first post is directing to MDETR

lessing · September 10, 2021, 9:11am

Paper link updated!
@nikem the recording is now up on Youtube

durgaamma2005 · September 10, 2021, 11:24am

this is good one for more clarity

durgaamma2005 · September 13, 2021, 1:21pm

Can anybody explain what the vanishing gradient problem is? I understand that the gradients are too small by the time they reach the early layers, so the update is almost negligible and training stalls. But why does the gradient become so small on its way back to the early layers? Why are all the derivatives with respect to the weights generally less than one? I can follow that this is the reason for vanishing gradients, but it is hard to build intuition for.

vinayak_nayak · September 13, 2021, 3:19pm

Hi @durgaamma2005 ,

Really great question!

I think the reason is as follows, but I would love it if others could correct me where I am wrong. Firstly, I gather you have understood that the chain rule (multiplication, basically) causes small numbers to get multiplied together, so by the time the gradient reaches the earlier layers the update becomes negligible.

But what you want to understand is why the derivatives themselves are less than one in magnitude, right? If so, it is because of the nature of the activation functions. During backprop there is a term in which we have to compute the gradient of the activations with respect to their inputs.

Consider the graph of the sigmoid function and that of its derivative:

Whatever x is, the derivative is always less than one (it peaks at 0.25, at x = 0). Now imagine a deep network with many sigmoid activations and you’ll understand why the compounding happens.
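To put rough numbers on the compounding, here is a minimal sketch in plain Python. It only tracks the activation-derivative factor of the chain rule and ignores the weight terms, so treat it as an illustration rather than a full backprop calculation. The helper names (`sigmoid_grad`, `upper_bound_after`) are made up for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative is largest at x = 0, where it equals 0.25.
peak = sigmoid_grad(0.0)

def upper_bound_after(n_layers):
    # Best case: every layer sits exactly at the peak of the derivative.
    # Real pre-activations are rarely all zero, so the true factor is smaller.
    return peak ** n_layers

for n in (1, 5, 10, 20):
    print(n, upper_bound_after(n))
```

Even in this best case, ten stacked sigmoids scale the gradient by less than one millionth, which is why the early layers barely move.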

Also, you can look at tanh activation below.

Here, you may say that in the range [0, 0.5], or a little beyond 0.5, the derivative actually exceeds the value of tanh itself, which is true. However, the network ultimately tends to saturate the neurons towards +1 or -1, where the derivative is smaller than the output, and its magnitude never exceeds one (it equals one only at x = 0). So repeatedly multiplying numbers in (0, 1) leads to smaller and smaller values backpropagating to earlier layers.
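The same kind of check for tanh, again as a small plain-Python sketch (the helper name `tanh_grad` is just for this example). It uses the identity tanh'(x) = 1 - tanh(x)^2 and prints the output next to the derivative, so you can see where the derivative is the larger of the two and how quickly it collapses once the neuron saturates.

```python
import math

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    t = math.tanh(x)
    return 1.0 - t * t

for x in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(f"x={x:3.1f}  tanh={math.tanh(x):.4f}  tanh'={tanh_grad(x):.4f}")
```

Near zero the derivative is close to one, so tanh is gentler on gradients than sigmoid, but in the saturated region the factor is tiny in both cases.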

vinayak_nayak · September 14, 2021, 3:36pm

Hey guys,

Good Evening!

This is essentially the depiction of a skip connection (right) and the residual path (left). Now, the skip connection is supposed to be an identity layer. But why is it supposed to be an identity layer or identity mapping? I mean, why is the concept introduced such that we need to pass the input as-is to successive layers through these jump/skip connections?

I was going through #fastbook chapter 14 on resnets and saw that when we code resblocks, we have learnable parameters in the identity/skip connection.

This is done in order to change the shape of x so that it can be added to F(x), i.e. conv2(conv1(x)). Otherwise there would be a channel mismatch. We also deliberately include a pooling operation on the skip path for larger strides, to make sure its spatial dimensions match those of F(x).

So, to begin with, the skip connection was never an identity mapping for any stride other than 1 or for any configuration other than ni = nf, i.e. no change in the number of feature maps (which is bad, because then we’re not compressing information in any way when we go from input to output). The moment stride becomes >= 2 or ni != nf, we have to do pooling and/or 1 x 1 convolutions to make the skip connection work.
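For reference, this is roughly what such a block looks like in plain PyTorch. It is a sketch along the lines described above, not the fastbook code itself: `ResBlockSketch` is a made-up name, and the exact layer ordering and the use of the stride as the pooling kernel size are assumptions for this example.

```python
import torch
from torch import nn

class ResBlockSketch(nn.Module):
    """Hypothetical residual block: relu(F(x) + shortcut(x))."""

    def __init__(self, ni, nf, stride=1):
        super().__init__()
        # Residual path F(x): two 3x3 convs, the first one carries the stride.
        self.convs = nn.Sequential(
            nn.Conv2d(ni, nf, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(nf),
            nn.ReLU(),
            nn.Conv2d(nf, nf, 3, padding=1, bias=False),
            nn.BatchNorm2d(nf),
        )
        # Shortcut: a true identity only when stride == 1 and ni == nf.
        # Pooling fixes the spatial size, the 1x1 conv fixes the channel count.
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(stride, ceil_mode=True)
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, 1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.convs(x) + self.idconv(self.pool(x)))

x = torch.randn(2, 16, 8, 8)
same = ResBlockSketch(16, 16)            # shortcut has no parameters
down = ResBlockSketch(16, 32, stride=2)  # shortcut pools and projects
print(same(x).shape, down(x).shape)
```

In the first block the shortcut really is parameter-free; only the second, shape-changing block needs the learnable 1 x 1 convolution on the skip path.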

So, I don’t get what’s the logic behind saying the skip connection should ideally learn an identity mapping and the residual path would learn how off that mapping is from the original input…

Can someone help me understand the intention of the operation?

Thanks !

nikem · October 24, 2021, 1:03am

This is my understanding, check the first video if you have time and also there are some good resources at the bottom of the notebook. ( Andrew Ng sessions are great) https://niyazikemer.com/fastbook/2021/10/24/resnet-live.html

sarat · October 26, 2021, 8:26am

I have implemented ResNet from scratch in PyTorch and posted it as a Kaggle kernel.

It currently achieves 76% accuracy, and I will be tuning it further.

RestNet from Scratch in PyTorch | Kaggle

Thanks @amanarora for your video series.

sarat · October 26, 2021, 8:31am

As the nets get deeper, the deeper nets end up with more training error than their shallower counterparts, which doesn’t make sense, as the deeper network can be thought of as the shallow one plus identity layers (just pass the input through as the output).

But learning such an identity mapping is hard. So we introduce these skip connections, with which identity is easy: the network basically just needs close-to-zero weights in the residual path and we end up with an identity mapping.
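A toy way to see this, as a plain-Python sketch with a single linear layer and no nonlinearity (the function names are made up for this example): with all-zero weights a plain layer wipes out its input, while a residual layer passes it through untouched.

```python
def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def plain_layer(W, x):
    # y = W x: identity requires W to be exactly the identity matrix.
    return matvec(W, x)

def residual_layer(W, x):
    # y = W x + x: identity only requires W to be zero.
    return [fx + xi for fx, xi in zip(matvec(W, x), x)]

x = [1.5, -2.0, 0.25]
zeros = [[0.0] * 3 for _ in range(3)]

print(plain_layer(zeros, x))     # the input is lost
print(residual_layer(zeros, x))  # the input comes back unchanged
```

So for the residual layer, "do nothing" sits at zero weights, which is close to where small initial weights and weight decay already keep the parameters, instead of being a specific non-trivial matrix the optimizer has to find.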
