Multimodal Vision-Language Learning

What:
Talk
Part of:
When:
1:30 PM, Friday 14 June 2024 EDT (1 hour 30 minutes)
Theme:
Large Language Models & Multimodal Grounding
Over the last decade, multimodal vision-language (VL) research has seen impressive progress. We can now automatically caption images in natural language, answer natural language questions about images, retrieve images using complex natural language queries, and even generate images from natural language descriptions.

Despite this progress, current VL research faces several challenges that limit the applicability of state-of-the-art VL systems. Even large VL systems based on multimodal large language models (LLMs), such as GPT-4V, struggle to count objects in images and to identify fine-grained differences between similar images, and they lack sufficient visual grounding (i.e., they make up visual facts). In this talk, I will first present our work on building a parameter-efficient multimodal LLM. I will then present our more recent work studying and tackling the following outstanding challenges in VL research: visio-linguistic compositional reasoning, robust automatic evaluation, and geo-diverse cultural understanding.

 

References

Zhang, L., Awal, R., Agrawal, A. 2024. Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Manas, O., Krojer, B., Agrawal, A. 2024. Improving Automatic VQA Evaluation Using Large Language Models. In the 38th Annual AAAI Conference on Artificial Intelligence.

Ahmadi, S., Agrawal, A. 2024. An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics. In the Findings of the Association for Computational Linguistics: EACL 2024.

Manas, O., Rodriguez, P., Ahmadi, S., Nematzadeh, A., Goyal, Y., Agrawal, A. 2023. MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting. In the European Chapter of the Association for Computational Linguistics (EACL).

Aishwarya Agrawal

Speaker
