#3 PyTorch Book Thread: Sunday, 12th Sept 8AM PT

Added these resources:


How should I experiment with different activation functions? I understand them mathematically, but how do I get a feel for their essence in practice?

I would suggest taking any simple architecture, say ResNet-18, replacing ReLU with tanh or sigmoid, and comparing the training performance, losses, etc. You will get the gist of them.
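One way to try this is sketched below. The helper name `replace_activation` and the tiny stand-in network are made up for illustration. The same helper should work on a ResNet-18 instance, assuming its activations are registered as `nn.ReLU` modules rather than called through `torch.nn.functional`.

```python
import torch
import torch.nn as nn


def replace_activation(module: nn.Module, old_cls, new_factory) -> nn.Module:
    """Recursively swap every child of type old_cls for a fresh new_factory()."""
    for name, child in list(module.named_children()):
        if isinstance(child, old_cls):
            setattr(module, name, new_factory())
        else:
            replace_activation(child, old_cls, new_factory)
    return module


def make_model() -> nn.Module:
    # Tiny stand-in network (hypothetical); a larger model could be passed in instead.
    return nn.Sequential(
        nn.Linear(4, 8), nn.ReLU(),
        nn.Linear(8, 8), nn.ReLU(),
        nn.Linear(8, 1),
    )


tanh_model = replace_activation(make_model(), nn.ReLU, nn.Tanh)
sigmoid_model = replace_activation(make_model(), nn.ReLU, nn.Sigmoid)

x = torch.randn(2, 4)
print(tanh_model)              # the ReLU layers now show up as Tanh
print(sigmoid_model(x).shape)  # the output shape is unchanged by the swap
```

Training each variant with the same data, optimizer, and seed, then plotting the loss curves side by side, is what makes the differences visible.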

Recently WandB’s official YouTube channel had a video about torch.nn as well:

Also Abhishek Thakur covered the topic: 9. Understanding torch.nn - YouTube

As Jeremy once said, and Sanyam said a while ago, the more you experiment with code and get your hands dirty, the stronger your intuitions will become.


OrderedDicts are really helpful when you have a big architecture.

They also make it easy to tinker with individual layers whenever you want, without much trouble.
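A minimal sketch of that idea, with layer names chosen here purely for illustration: passing an `OrderedDict` to `nn.Sequential` gives every layer a readable name, so a layer can be inspected or swapped by name.

```python
from collections import OrderedDict

import torch
import torch.nn as nn

# Layer names below are illustrative, not taken from the thread.
model = nn.Sequential(OrderedDict([
    ("hidden_linear", nn.Linear(1, 8)),
    ("hidden_activation", nn.Tanh()),
    ("output_linear", nn.Linear(8, 1)),
]))

# Layers are reachable as attributes, not only by integer index.
print(model.hidden_linear.weight.shape)

# Swapping a layer keeps its position in the model.
model.hidden_activation = nn.ReLU()

# Parameter names now carry the layer names, which helps when debugging.
param_names = [name for name, _ in model.named_parameters()]
print(param_names)
```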

(left at 2147 IST)


Why do we call super? Is that required by PyTorch?

Since the class we create inherits from nn.Module, we call super to initialise the components (variables, etc.) of that base class.
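A small sketch of what that looks like in practice. The class names `TinyNet` and `BrokenNet` are hypothetical. The second class omits the `super().__init__()` call on purpose to show that assigning a submodule then fails.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    def __init__(self, n_hidden: int = 13):
        super().__init__()  # lets nn.Module set up its internal bookkeeping first
        self.hidden = nn.Linear(1, n_hidden)
        self.act = nn.Tanh()
        self.out = nn.Linear(n_hidden, 1)

    def forward(self, x):
        return self.out(self.act(self.hidden(x)))


class BrokenNet(nn.Module):
    def __init__(self):
        # super().__init__() deliberately omitted
        self.hidden = nn.Linear(1, 13)


net = TinyNet()
n_tensors = len(list(net.parameters()))  # weight and bias of each Linear layer
print(n_tensors)

try:
    BrokenNet()
    error_message = None
except AttributeError as err:
    error_message = str(err)
print(error_message)
```

Without the base-class initialisation, nn.Module has nowhere to register the submodule, so the parameters would never reach the optimizer.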

Suggested Homework:

  • Check out different loss functions in torch

  • Try new activation functions

  • Play around with NN hyperparameters

  • Try a new dataset from torchvision

  • Try torchvision transforms
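As a possible starting point for the first homework item, here is a small sketch that evaluates a few built-in loss functions on the same toy prediction. The tensors are made-up values, and the three losses are just one selection among those `torch.nn` offers.

```python
import torch
import torch.nn as nn

# Toy values chosen for illustration: only the last prediction is off, by 2.
pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.0, 2.0, 5.0])

loss_fns = {
    "MSELoss": nn.MSELoss(),            # mean of squared errors
    "L1Loss": nn.L1Loss(),              # mean of absolute errors
    "SmoothL1Loss": nn.SmoothL1Loss(),  # quadratic near zero, linear for large errors
}

results = {name: fn(pred, target).item() for name, fn in loss_fns.items()}
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Changing the size of the single error shows how strongly each loss penalises outliers.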


Can you please talk about the hangout event once more?

You can find more info at the following link:


Hi!! I was working on the SGD optimizer and noticed that the loss becomes NaN when I use t_u (not normalized); the authors use t_un = 0.1 * t_u. Why does this happen?

If I understand it correctly, is this the problem of exploding gradients? And is normalization therefore necessary?

@dhruvashist Welcome to the community! :slight_smile:

Yes, you’re right about gradient explosion.
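The effect can be reproduced without any library. The sketch below runs plain gradient descent on a linear model `w * x + b` with a mean squared error loss. The data values, the learning rate of 1e-2, and the helper name `train` are assumptions for illustration, modeled on the thermometer example under discussion.

```python
import math

# Illustrative data (assumed): targets in Celsius and raw readings in unknown units.
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_un = [0.1 * x for x in t_u]  # rescaled inputs, as in t_un = 0.1 * t_u


def train(xs, ys, lr=1e-2, n_epochs=5000):
    """Gradient descent on w * x + b with mean squared error."""
    w, b = 1.0, 0.0
    n = len(xs)
    loss = float("nan")
    for _ in range(n_epochs):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errs) / n
        grad_w = sum(2 * e * x for e, x in zip(errs, xs)) / n
        grad_b = sum(2 * e for e in errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss


_, _, loss_raw = train(t_u, t_c)
w_n, b_n, loss_norm = train(t_un, t_c)

print("raw inputs, loss finite?", math.isfinite(loss_raw))
print("scaled inputs, loss finite?", math.isfinite(loss_norm))
```

With raw inputs the gradient for `w` is scaled by the large input values, so each step overshoots by more than the last until the numbers overflow. Rescaling the inputs, or using a much smaller learning rate, keeps the updates in a stable range.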

I was working on the validation set loss and found the loss to decrease, then increase, and then stabilize. It seems something is wrong, but I don't know what to look for. Any advice/help is appreciated.

Also, the authors have a small difference between the training and validation loss. Mine seems quite large. Is this OK or do I need to rework it?

The loss was reduced from 8.0 to 3.0 when training again and again with different splits. Looks like this is why cross-validation is important.
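To make the "different splits" idea systematic, here is a rough k-fold sketch in plain Python. The helper name `k_fold_indices`, the sample count of 11, and k=3 are arbitrary choices for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, val_indices) pairs for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal slices
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=11, k=3))
for train_idx, val_idx in splits:
    print(len(train_idx), "train /", len(val_idx), "validation")
```

Training once per split and averaging the validation losses gives an estimate that depends less on which samples happened to land in the validation set.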

When using a sequential model (13 neurons), these are the shapes of the parameters for both layers:

[torch.Size([13, 1]), torch.Size([13]), torch.Size([1, 13]), torch.Size([1])]

Why do biases in the first layer have shape [13] and not [13,1] and vice-versa?
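A short sketch that reproduces these shapes, assuming the model is `nn.Linear(1, 13)`, a Tanh, then `nn.Linear(13, 1)`. `nn.Linear` stores its weight as (out_features, in_features) and keeps one bias per output unit, so the bias is a 1-D tensor of length out_features that is broadcast across the batch.

```python
import torch
import torch.nn as nn

# Assumed architecture: one input feature, 13 hidden units, one output.
model = nn.Sequential(nn.Linear(1, 13), nn.Tanh(), nn.Linear(13, 1))

shapes = [tuple(p.shape) for p in model.parameters()]
print(shapes)  # [(13, 1), (13,), (1, 13), (1,)]

x = torch.ones(5, 1)   # batch of 5 samples, 1 feature each
hidden = model[0](x)   # computes x @ weight.T + bias
print(hidden.shape)    # torch.Size([5, 13])
```

Each of the 13 hidden units adds a single offset to its own output, so 13 numbers are enough for the first bias; the second layer has one output unit and therefore one bias value.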

I prepared a small notebook for a CNN, but it’s with TensorFlow. Any feedback would be really appreciated, thanks!

I had read Chapter 5, "The mechanics of learning", earlier, but I recently read it again, and it's basically simple high-school stuff: differentiating functions and finding slopes, or, as they say here, gradients.
But the whole point is that the developer has to come down from a high level of dealing with complex problems to basic statistics, make the reader understand the loss function, then differentiate it with respect to the weights and biases, and then optimize it all, with code as clear as pen and paper. It makes us realize, as users of torch.nn or even sklearn, how close we get to the truth and yet how far we are from it with our implementations…
The chapter starts with a simple "mx + c" and beautifully fits and optimizes everything in our world, and you can only appreciate the calmness if you truly try to forget everything that you have learnt about ML or DL… The only way to enjoy this chapter is to know that the derivative of x^2 with respect to x is 2x, and nothing more!
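For reference, the pen-and-paper step described above can be written out as follows, assuming a mean squared error loss over N samples; the symbols y_i (targets), N, and the learning rate \eta are notation introduced here, not taken from the post.

$$
L(m, c) = \frac{1}{N} \sum_{i=1}^{N} (m x_i + c - y_i)^2
$$

$$
\frac{\partial L}{\partial m} = \frac{2}{N} \sum_{i=1}^{N} (m x_i + c - y_i)\, x_i,
\qquad
\frac{\partial L}{\partial c} = \frac{2}{N} \sum_{i=1}^{N} (m x_i + c - y_i)
$$

$$
m \leftarrow m - \eta \frac{\partial L}{\partial m},
\qquad
c \leftarrow c - \eta \frac{\partial L}{\partial c}
$$

Both derivatives follow from the chain rule together with the single fact that the derivative of x^2 is 2x.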
