Multimodal Vision-Language Learning

What:
Talk
When:
1:30 PM, Friday 14 Jun 2024 EDT (1 hour 30 minutes)
Theme:
Large Language Models & Multimodal Grounding
Over the last decade, multimodal vision-language (VL) research has seen impressive progress. We can now automatically caption images in natural language, answer natural language questions about images, retrieve images using complex natural language queries, and even generate images from natural language descriptions.

Despite this tremendous progress, current VL research faces several challenges that limit the applicability of state-of-the-art VL systems. Even large VL systems based on multimodal large language models (LLMs), such as GPT-4V, struggle to count objects in images and to identify fine-grained differences between similar images, and they lack sufficient visual grounding (i.e., they make up visual facts). In this talk, I will first present our work on building a parameter-efficient multimodal LLM. I will then present our more recent work studying and tackling the following outstanding challenges in VL research: visio-linguistic compositional reasoning, robust automatic evaluation, and geo-diverse cultural understanding.


References

Zhang, L., Awal, R., Agrawal, A. 2024. Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Manas, O., Krojer, B., Agrawal, A. 2024. Improving Automatic VQA Evaluation Using Large Language Models. In the 38th Annual AAAI Conference on Artificial Intelligence.

Ahmadi, S., Agrawal, A. 2024. An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics. In the Findings of the Association for Computational Linguistics: EACL 2024.

Manas, O., Rodriguez, P., Ahmadi, S., Nematzadeh, A., Goyal, Y., Agrawal, A. 2023. MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting. In the European Chapter of the Association for Computational Linguistics (EACL).
