Lessons from the Trenches on Reproducible Evaluation of Language Models

In this paper, the authors address the challenges of evaluating language models in NLP, highlighting issues such as models' sensitivity to evaluation setup, the difficulty of making fair comparisons across methods, and the lack of reproducibility and transparency. Drawing on three years of experience evaluating language models, they offer guidance and best practices for researchers in the field. They also present the Language Model Evaluation Harness (lm-eval), an open-source library designed to make evaluation more reproducible and transparent, and explain how it helps mitigate the methodological concerns they identify.
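
As a rough illustration of the kind of workflow lm-eval supports, the sketch below evaluates a small Hugging Face model on a benchmark task through the library's Python entry point. This is a minimal sketch based on the library's documented usage, not an example from the paper; the model name, task choice, and exact argument names are assumptions and may differ across lm-eval versions.

    import lm_eval

    # Evaluate a small Hugging Face model on HellaSwag, zero-shot.
    # The model, task, and batch size are illustrative choices only.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=EleutherAI/pythia-160m",
        tasks=["hellaswag"],
        num_fewshot=0,
        batch_size=8,
    )

    # Per-task metrics (e.g. accuracy) are reported under results["results"].
    print(results["results"]["hellaswag"])

The library also exposes an equivalent lm_eval command-line interface for the same workflow, so a run can be specified and shared as a single command.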

https://arxiv.org/abs/2405.14782
