ML Sprint: Transformers Wiki!

Hi Everybody!
As part of our first community hangout, we’re excited to be hosting a few sprints. This is one of them:

The plan with ML Sprints is to run week-long activities where our community will contribute to projects.

This is one of three wikis that we’re inviting you to contribute to! This wiki is meant to serve as a collection of the best resources for learning about Transformer models and their applications.

This is a wiki! That means all of you can edit it, so please do!

Papers:
Attention Is All You Need (2017)
End-to-End Object Detection with Transformers (2020)
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2021)

Blog posts:
The Annotated Transformer
Transformer Deep Dive
The Illustrated Transformer

Explanation videos:
Attention Is All You Need by Yannic Kilcher
GPT-2 by Yannic Kilcher
BERT by Yannic Kilcher
RoBERTa by Yannic Kilcher

Kaggle Notebooks:
Utilizing Transformer Representations Efficiently
On Stability of Few-Sample Transformer Fine-Tuning
Speeding up Transformer w/ Optimization Strategies


If you just want to get the hang of transformers from a single post, it should definitely be this one from Jay Alammar: The Illustrated Transformer.
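The core mechanism that post walks through, scaled dot-product attention, fits in a few lines of NumPy. This is a minimal illustrative sketch (function names and shapes are our own, not from the linked post), implementing Attention(Q, K, V) = softmax(QKᵀ/√d_k)V:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (num_queries, d_k), K: (num_keys, d_k), V: (num_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted mix of values per query

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients — the paper discusses exactly this point.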


This playlist from Ms. Coffee Bean is informative as well.

The Transformer explained by Ms. Coffee Bean

I specifically like this diagram from Chris McCormick on what you need to know to understand transformers:

  1. A great lecture by Dr. Rachel Thomas on the fundamental idea behind Transformers. YouTube link
  2. “Attention is All You Need” paper read through by Yannic Kilcher. YouTube link

Everyone should go through the paper walk-throughs listed under Explanation videos above at least once.

These kernels are good for learning from the application point of view:

  1. Different ways to utilize transformer representations: Utilizing Transformer Representations Efficiently
  2. Stabilizing training of transformer models: On Stability of Few-Sample Transformer Fine-Tuning
  3. Speeding up transformer training: Speeding up Transformer w/ Optimization Strategies
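One recurring idea in kernels like the first one is that the single [CLS] vector is not the only usable sentence representation — pooling over all token states often works better for downstream tasks. Here is a minimal NumPy sketch of masked mean pooling (the function name, shapes, and values are illustrative, not taken from the notebooks):

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    # hidden_states: (seq_len, hidden_dim) per-token representations
    # attention_mask: (seq_len,) 1 for real tokens, 0 for padding
    mask = attention_mask[:, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=0)   # sum only real tokens
    count = np.maximum(mask.sum(), 1e-9)          # avoid divide-by-zero
    return summed / count

hidden = np.arange(12, dtype=float).reshape(4, 3)  # 4 tokens, hidden_dim 3
mask = np.array([1, 1, 1, 0])                      # last token is padding
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [3. 4. 5.]
```

Masking before averaging matters: naively averaging over the full sequence would let padding tokens dilute the representation, which is one of the pitfalls such notebooks point out.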