MDETR — Modulated Detection for End-to-End Multi-Modal Understanding

lessing · August 20, 2021, 12:49pm

Register: wandb.me/prg
MDETR Paper: arxiv.org/abs/2104.12763
YouTube stream:

amanarora · August 24, 2021, 11:08pm

Test - I am going to make this 20 chars.

bhutanisanyam1 · August 24, 2021, 11:08pm

I’m tuning in just to learn from Aishwarya, and not Aman

bhutanisanyam1 · August 24, 2021, 11:13pm

Thanks for joining us today, Aishwarya, I’ve a “broad” question:
I always struggle with setting up baseline models when trying out a research idea or problem. I’m curious how you approach this for new ideas?

bhutanisanyam1 · August 24, 2021, 11:22pm

Q by Ramesh Sampath: If there’s no fixed number of classes in the output, are the predicted boxes doing a soft distribution over the input tokens, like attention?

Is the ground truth also captured as pairs of boxes and the words they correspond to?

Edit: Aishwarya answered it in the chat:

Yup, a soft distribution supervised with a soft cross entropy.
It’s a many-to-many mapping.
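
A minimal sketch of what a soft cross entropy over caption tokens could look like for one predicted box. This is not the MDETR code: the function names, the example caption, and the choice of a uniform target over the tokens a box refers to are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of token logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def uniform_target(num_tokens, token_ids):
    # Assumed target: equal mass on every token the box refers to.
    target = [0.0] * num_tokens
    for i in token_ids:
        target[i] = 1.0 / len(token_ids)
    return target

def soft_cross_entropy(logits, target):
    # Cross entropy against a soft (non one-hot) target distribution.
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_probs))

# Hypothetical caption tokens: "a", "black", "cat", "on", "a", "mat".
# The box for the cat is assumed to refer to tokens 1 and 2.
target = uniform_target(6, [1, 2])
logits = [0.0, 2.0, 2.0, 0.0, 0.0, 0.0]  # one box's scores over the tokens
loss = soft_cross_entropy(logits, target)
uninformed_loss = soft_cross_entropy([0.0] * 6, target)
```

Since one box can point at several tokens and one token can belong to several boxes, each box gets its own target row, which is one way to read the "many-to-many" remark.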

bhutanisanyam1 · August 24, 2021, 11:31pm

Q from chat: Can you go over the flattening and concatenating of image features and text features? I didn’t quite understand what the final shape is expected to be before it is passed to the transformer.

Edit: This was answered in the video. Please watch the last AMA bit for the answer.
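
A rough shape-only sketch of the flatten-and-concatenate step the question asks about. It assumes both modalities have already been projected to the same width `d`, leaves out the batch dimension, and uses made-up sizes and helper names; it does not reproduce the answer given in the video.

```python
def flatten_feature_map(fmap):
    # fmap is an H x W grid of d-dimensional vectors (nested lists).
    # Flattening turns it into a sequence of H*W vectors.
    return [vec for row in fmap for vec in row]

def concat_sequences(image_seq, text_seq):
    # Concatenate along the sequence dimension; widths must match.
    d = len(image_seq[0])
    assert all(len(v) == d for v in image_seq + text_seq)
    return image_seq + text_seq

# Hypothetical sizes: a 2 x 3 feature map, 5 text tokens, width 4.
H, W, T, d = 2, 3, 5, 4
fmap = [[[0.0] * d for _ in range(W)] for _ in range(H)]
text = [[1.0] * d for _ in range(T)]

seq = concat_sequences(flatten_feature_map(fmap), text)
shape = (len(seq), len(seq[0]))  # (H*W + T, d)
```

Under these assumptions the transformer input is a single sequence of length `H*W + T`, with the image positions first and the text tokens after them.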

ramesh · August 24, 2021, 11:39pm

I am still not very clear on the contrastive loss / alignment. Is it like a triplet loss that minimizes the (cosine?) distance between similar things? How do you handle negative examples? Do you sample them?

ramesh · August 24, 2021, 11:44pm

What downstream tasks can we use MDETR for? Can it be used for image-only downstream tasks, or does it need both image and text?

In the GQA downstream task, do the labels need to be as detailed as in the initial pre-training? How much data is required for a downstream task?

ramesh · August 24, 2021, 11:45pm

What changes were there in the training process for end-to-end training? Any specific data augmentation on images / text?

I am also a little unclear on the pre-training reference in the talk. Are the initial CNN and RoBERTa models frozen, or are they also tuned?

amanarora · August 25, 2021, 2:27am

I am so sorry I missed this question!

bhutanisanyam1 · August 25, 2021, 5:52am

No problems! I’ll try to ask it when we invite Aishwarya for a talk the next time
