Benchmarking and Evaluation in NLP: How Do We Know What LLMs Can Do?
What:
Talk
When:
11:00 AM, Wednesday 12 Jun 2024 EDT
(1 hour 30 minutes)
Theme:
Large Language Models: Applications, Ethics & Risks
Conflicting claims that large language models (LLMs) “can do X”, “have property Y”, or even “know Z” have been made in recent literature in natural language processing (NLP) and related fields, as well as in popular media. However, unclear and often inconsistent standards for how to infer these conclusions from experimental results bring the validity of such claims into question. In this lecture, I focus on the crucial role that benchmarking and evaluation methodology in NLP plays in assessing LLMs’ capabilities. I review common practices in the evaluation of NLP systems, including types of evaluation metrics, assumptions underlying these evaluations, and the contexts in which they are applied. I then present case studies showing how less-than-careful application of current practices can result in invalid claims about model capabilities. Finally, I present our current efforts to encourage more structured reflection during the process of benchmark design and creation by introducing a novel framework, Evidence-Centred Benchmark Design, inspired by work in educational assessment.
References
Porada, I., Zou, X., & Cheung, J. C. K. (2024). A Controlled Reevaluation of Coreference Resolution Models. arXiv preprint arXiv:2404.00727.
Liu, Y. L., Cao, M., Blodgett, S. L., Cheung, J. C. K., Olteanu, A., & Trischler, A. (2023). Responsible AI Considerations in Text Summarization Research: A Review of Current Practices. arXiv preprint arXiv:2311.11103.