MDETR — Modulated Detection for End-to-End Multi-Modal Understanding

MDETR Paper:
YouTube stream:



I’m tuning in just to learn from Aishwarya, and not Aman :grin:

Thanks for joining us today, Aishwarya. I have a “broad” question:
I always struggle with setting up baseline models when trying out a research idea or problem. How do you approach this for new ideas?

Q by Ramesh Sampath: If there’s no fixed number of classes in the output, is each predicted box producing a soft distribution over the input tokens, like attention?

Is the ground truth also captured as pairs of boxes and the words they correspond to?

Edit: Aishwarya answered it in the chat:

Yup, a soft distribution supervised with a soft cross-entropy.
It’s a many-to-many mapping.
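
For anyone who wants to see the answer above in code, here is a minimal sketch of a soft cross-entropy for a single predicted box. It assumes the target is a uniform distribution over the caption tokens that the box refers to. The function names, the example caption, and the logits are all made up for illustration, and this is not taken from the MDETR codebase.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def uniform_target(num_tokens, token_positions):
    """Soft target: equal mass on every token the box refers to (assumption)."""
    positions = set(token_positions)
    weight = 1.0 / len(positions)
    return [weight if i in positions else 0.0 for i in range(num_tokens)]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and the softmax of the logits."""
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_probs))


# Hypothetical caption tokens: ["a", "black", "cat", "on", "the", "sofa"]
# One box matches the phrase "black cat", i.e. token positions 1 and 2.
target = uniform_target(6, [1, 2])
logits = [0.0, 2.0, 2.0, 0.0, 0.0, 0.0]  # made-up scores for this box
loss = soft_cross_entropy(logits, target)
```

Because one box can point at several tokens and one token can belong to several boxes, each box gets its own target row, which is one way to read the "many-to-many" remark.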

Q from chat: Can you go over the flattening and concatenating of image features and text features? I didn’t quite understand what the final shape is expected to be before it is passed to the transformer.

Edit: This was answered in the video, please watch the last AMA bit for the answer :tea:
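
In the meantime, here is a shape-only sketch of the flatten-and-concatenate step asked about above. It assumes the image features and the text features have already been projected to the same embedding size, and it uses plain nested lists instead of tensors. The helper names and the toy sizes (`C`, `H`, `W`, `L`) are hypothetical and chosen only to make the shapes easy to follow.

```python
def flatten_feature_map(feature_map):
    """Turn a (C, H, W) nested list into H*W vectors of length C."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def concat_sequences(image_tokens, text_tokens):
    """Concatenate along the sequence axis; both must share the feature size."""
    if len(image_tokens[0]) != len(text_tokens[0]):
        raise ValueError("image and text features must have the same size")
    return image_tokens + text_tokens


# Toy sizes (assumptions): C channels, an H x W feature map, L text tokens.
C, H, W, L = 4, 2, 3, 5
feature_map = [[[float(c) for _ in range(W)] for _ in range(H)] for c in range(C)]
text_tokens = [[0.0] * C for _ in range(L)]

image_tokens = flatten_feature_map(feature_map)          # H*W vectors of size C
sequence = concat_sequences(image_tokens, text_tokens)   # H*W + L vectors
shape = (len(sequence), len(sequence[0]))                # (H*W + L, C)
```

Under these assumptions the transformer input is a single sequence of length `H*W + L`, with one vector per image location followed by one vector per text token.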

I am still not very clear on the contrastive loss / alignment. Is it like the triplet loss, minimizing the cosine distance between similar things? How do you handle negative examples? Do you sample them?
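
Not an authoritative answer, but here is a minimal sketch of an InfoNCE-style alignment loss, which is one common way to set this up and may differ in detail from what MDETR does. It assumes that negatives are not sampled: for each object, every token in the same caption that is not a positive acts as a negative through the softmax denominator. The function names, the temperature value, and the toy embeddings are assumptions for illustration, and every object is assumed to have at least one positive token.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def alignment_loss(object_embs, token_embs, positives, temperature=0.07):
    """Object-to-token contrastive loss, averaged over objects.

    positives[i] lists the token indices that object i should align with.
    All other tokens in the caption serve as negatives.
    """
    objs = [normalize(o) for o in object_embs]
    toks = [normalize(t) for t in token_embs]
    total = 0.0
    for i, obj in enumerate(objs):
        logits = [dot(obj, t) / temperature for t in toks]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        pos = positives[i]
        # Negative log-probability of each positive token, averaged.
        total += sum(log_z - logits[j] for j in pos) / len(pos)
    return total / len(objs)


# Toy example: 2 objects, 3 tokens, 2-dimensional embeddings (all made up).
object_embs = [[1.0, 0.0], [0.0, 1.0]]
token_embs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = alignment_loss(object_embs, token_embs, positives=[[0], [1, 2]])
misaligned = alignment_loss(object_embs, token_embs, positives=[[1, 2], [0]])
```

Compared with a triplet loss, there is no explicit margin or mined negative here. The loss is lower when each object embedding is closer to its own tokens than to the rest of the caption, so `aligned` comes out smaller than `misaligned`.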

What downstream tasks can we use MDETR for? Can it be used for image-only downstream tasks, or do they need to be image / text tasks?

In the GQA downstream task, do the labels need to be as detailed as in the initial pre-training? How much data is required for a downstream task?

What changes were made to the training process for end-to-end training? Was there any specific data augmentation on images / text?

I am also a little unclear on the pre-training reference in the talk. Are the initial CNN and RoBERTa models frozen, or are they also tuned?

I am so sorry I missed this question!

No problem! I’ll try to ask it the next time we invite Aishwarya for a talk :smiley: