Over the last decade, multimodal vision-language (VL) research has seen impressive progress. We can now automatically caption images in natural language, answer natural language questions about images, retrieve images using complex natural language queries, and even generate images from natural language descriptions. Despite such tremendous progress, current VL research faces several challenges that limit the applicability of state-of-the-art VL systems. Even large VL systems based on multimodal large language models (LLMs), such as GPT-4V, struggle with counting objects in images and identifying fine-grained differences between similar images, and they lack sufficient visual grounding (i.e., they make up visual facts). In this talk, I will first present our work on building a parameter-efficient multimodal LLM. I will then present our more recent work studying and tackling the following outstanding challenges in VL research: visio-linguistic compositional reasoning, robust automatic evaluation, and geo-diverse cultural understanding.
References
Zhang, L., Awal, R., Agrawal, A. 2024. Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Manas, O., Krojer, B., Agrawal, A. 2024. Improving Automatic VQA Evaluation Using Large Language Models. In the 38th Annual AAAI Conference on Artificial Intelligence.
Ahmadi, S., Agrawal, A. 2024. An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics. In the Findings of the Association for Computational Linguistics: EACL 2024.
Manas, O., Rodriguez, P., Ahmadi, S., Nematzadeh, A., Goyal, Y., Agrawal, A. 2023. MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting. In the Conference of the European Chapter of the Association for Computational Linguistics (EACL).