Multimodal Vision-Language Learning

What:

Talk

Part of:

Day 10

When:

1:30 PM, Friday 14 June 2024 EDT (1 hour 30 minutes)

Theme:

Large Language Models & Multimodal Grounding

Over the last decade, multimodal vision-language (VL) research has seen impressive progress. We can now automatically caption images in natural language, answer natural language questions about images, retrieve images using complex natural language queries, and even generate images from natural language descriptions. Despite such tremendous progress, current VL research faces several challenges that limit the applicability of state-of-the-art VL systems. Even large VL systems based on multimodal large language models (LLMs), such as GPT-4V, struggle to count objects in images and to identify fine-grained differences between similar images, and they lack sufficient visual grounding (i.e., they make up visual facts). In this talk, I will first present our work on building a parameter-efficient multimodal LLM. Then, I will present our more recent work studying and tackling the following outstanding challenges in VL research: visio-linguistic compositional reasoning, robust automatic evaluation, and geo-diverse cultural understanding.

References

Zhang, L., Awal, R., Agrawal, A. 2024. Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Manas, O., Krojer, B., Agrawal, A. 2024. Improving Automatic VQA Evaluation Using Large Language Models. In the 38th Annual AAAI Conference on Artificial Intelligence.

Ahmadi, S., Agrawal, A. 2024. An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics. In the Findings of the Association for Computational Linguistics: EACL 2024.

Manas, O., Rodriguez, P., Ahmadi, S., Nematzadeh, A., Goyal, Y., Agrawal, A. 2023. MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting. In the European Chapter of the Association for Computational Linguistics (EACL).

Aishwarya Agrawal

Speaker
