AlpacaEval is an LLM-based automatic evaluation tool that is fast and reliable. It is built on the AlpacaFarm evaluation set, which measures how well models follow general user instructions. For each instruction, a model's response is compared against a reference response from text-davinci-003 by an auto-annotator such as GPT-4, Claude, or ChatGPT. These auto-annotations agree with human annotations at a high rate, and the resulting leaderboard rankings correlate strongly with rankings produced by human annotators. New contributions from the community are welcome, both new models for the leaderboard and new evaluators or evaluation sets. Note, however, that AlpacaEval has limitations: its auto-annotators can be biased towards longer responses, and it does not evaluate model safety.
https://tatsu-lab.github.io/alpaca_eval/
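The core mechanic is a pairwise comparison: for each instruction in the evaluation set, the auto-annotator is shown the candidate model's response and the reference response and asked which better follows the instruction, and the candidate's win rate over the reference is the leaderboard metric. The sketch below illustrates that flow in plain Python; the `judge` callable standing in for a GPT-4/Claude/ChatGPT auto-annotator and the data shapes are illustrative assumptions, not the actual alpaca_eval API (see the repository for the real CLI and Python entry points).

```python
from typing import Callable, Dict, List


def pairwise_win_rate(
    eval_set: List[Dict[str, str]],         # each item: {"instruction", "model_output", "reference_output"} (assumed shape)
    judge: Callable[[str, str, str], str],  # hypothetical auto-annotator; returns "model" or "reference"
) -> float:
    """Fraction of instructions on which the judge prefers the model over the reference."""
    wins = 0
    for example in eval_set:
        preferred = judge(
            example["instruction"],
            example["model_output"],
            example["reference_output"],
        )
        wins += preferred == "model"
    return wins / len(eval_set)


# Illustrative usage with a trivial stand-in judge; a real setup would call
# GPT-4 / Claude / ChatGPT with a comparison prompt and parse its preference.
if __name__ == "__main__":
    toy_set = [
        {
            "instruction": "Name three primary colors.",
            "model_output": "Red, yellow, and blue.",
            "reference_output": "Red and blue.",
        }
    ]
    longer_wins = lambda _i, m, r: "model" if len(m) >= len(r) else "reference"
    print(f"win rate: {pairwise_win_rate(toy_set, judge=longer_wins):.2f}")
```

The toy judge that simply prefers the longer answer also mirrors the length-bias caveat noted above: LLM auto-annotators tend to favor longer responses, which is one of the stated limitations of this style of evaluation.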