Asking 60 LLMs a set of 20 questions

The author discusses their experience with benchmarks like HellaSwag, finding them too abstract to gauge real-world performance. To address this, they wrote a script that tests reasoning, instruction following, and creativity across roughly 60 models, collecting the results in a SQLite database. The author shares selected prompt results, including arguments against the Münchhausen trilemma and solutions to math problems, mentions plans to improve the script, and invites feedback and suggestions. They also briefly note their work on an observability tool for AI developers.
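For a rough idea of what such a script might look like, here is a minimal Python sketch that loops a handful of prompts over a list of models through an OpenAI-compatible chat endpoint and records every answer in SQLite. The endpoint URL, model identifiers, prompts, and environment variable are illustrative assumptions, not the author's actual implementation.

```python
# Sketch: run the same prompts against several models and store answers in SQLite.
# The API gateway, model names, and env var below are assumptions for illustration.
import os
import sqlite3
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"  # assumed OpenAI-compatible gateway
API_KEY = os.environ["LLM_API_KEY"]                        # hypothetical environment variable

MODELS = ["openai/gpt-4", "anthropic/claude-2", "meta-llama/llama-2-70b-chat"]  # illustrative list
PROMPTS = [
    "Give two concise arguments against the Münchhausen trilemma.",  # example in the spirit of the post
    "What is 4 + 4 * 2? Show your reasoning.",                       # illustrative math prompt
]

db = sqlite3.connect("results.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS answers ("
    "model TEXT, prompt TEXT, answer TEXT, PRIMARY KEY (model, prompt))"
)

def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return the text of its reply."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

for model in MODELS:
    for prompt in PROMPTS:
        answer = ask(model, prompt)
        # Upsert so the script can be re-run without duplicating rows.
        db.execute(
            "INSERT OR REPLACE INTO answers VALUES (?, ?, ?)",
            (model, prompt, answer),
        )
        db.commit()

db.close()
```

With the results in a single table, comparing models is a matter of querying, e.g. `SELECT model, answer FROM answers WHERE prompt LIKE '%trilemma%'`.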

https://benchmarks.llmonitor.com