Benchmarking and Evaluation in NLP: How Do We Know What LLMs Can Do?

What:

Talk

Part of:

Day 8

When:

11:00 AM, Wednesday 12 Jun 2024 EDT (1 hour 30 minutes)

Theme:

Large Language Models: Applications, Ethics & Risks

Conflicting claims that large language models (LLMs) “can do X”, “have property Y”, or even “know Z” have been made in recent literature in natural language processing (NLP) and related fields, as well as in popular media. However, unclear and often inconsistent standards for inferring these conclusions from experimental results bring the validity of such claims into question. In this lecture, I focus on the crucial role that benchmarking and evaluation methodology in NLP plays in assessing LLMs’ capabilities. I review common practices in the evaluation of NLP systems, including types of evaluation metrics, the assumptions underlying these evaluations, and the contexts in which they are applied. I then present case studies showing how less-than-careful application of current practices may result in invalid claims about model capabilities. Finally, I present our current efforts to encourage more structured reflection during benchmark design and creation by introducing a novel framework, Evidence-Centred Benchmark Design, inspired by work in educational assessment.

References

Porada, I., Zou, X., & Cheung, J. C. K. (2024). A Controlled Reevaluation of Coreference Resolution Models. arXiv preprint arXiv:2404.00727.

Liu, Y. L., Cao, M., Blodgett, S. L., Cheung, J. C. K., Olteanu, A., & Trischler, A. (2023). Responsible AI Considerations in Text Summarization Research: A Review of Current Practices. arXiv preprint arXiv:2311.11103.

Jackie Chit Kit Cheung

Speaker
