Multimodal Vision-Language Learning

What:

Talk

Part of:

Day 10 | Large Language Models & Multimodal Grounding

When:

1:30 PM, Friday 14 Jun 2024 EDT (1 hour 30 minutes)

Theme:

Large Language Models & Multimodal Grounding

Over the last decade, multimodal vision-language (VL) research has seen impressive progress. We can now automatically caption images in natural language, answer natural language questions about images, retrieve images using complex natural language queries, and even generate images from natural language descriptions. Despite this progress, current VL research faces several challenges that limit the applicability of state-of-the-art VL systems. Even large VL systems based on multimodal large language models (LLMs), such as GPT-4V, struggle to count objects in images and to identify fine-grained differences between similar images, and they lack sufficient visual grounding (i.e., they make up visual facts). In this talk, I will first present our work on building a parameter-efficient multimodal LLM. I will then present our more recent work studying and tackling the following outstanding challenges in VL research: visio-linguistic compositional reasoning, robust automatic evaluation, and geo-diverse cultural understanding.

References

Zhang, L., Awal, R., Agrawal, A. 2024. Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Manas, O., Krojer, B., Agrawal, A. 2024. Improving Automatic VQA Evaluation Using Large Language Models. In the 38th Annual AAAI Conference on Artificial Intelligence.

Ahmadi, S., Agrawal, A. 2024. An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics. In the Findings of the Association for Computational Linguistics: EACL 2024.

Manas, O., Rodriguez, P., Ahmadi, S., Nematzadeh, A., Goyal, Y., Agrawal, A. 2023. MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting. In the European Chapter of the Association for Computational Linguistics (EACL).

Aishwarya Agrawal

Speaker
