This web content surveys practical patterns for integrating large language models (LLMs) into systems and products. The author argues that evaluations (evals) are essential for measuring LLM performance and catching regressions. The post reviews benchmarks and metrics commonly used in language modeling, such as BLEU, ROUGE, BERTScore, and MoverScore, while noting their limitations and often poor correlation with human judgments. It also examines using LLMs themselves as reference-free evaluators and the potential benefits of automated evaluation, recommending that teams collect task-specific evals and track useful metrics to guide the development of LLM-based systems. Finally, the author stresses mitigating biases in LLM-as-judge setups and treating human evaluation as a complementary, valuable tool.
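To ground the discussion of metrics like ROUGE: a minimal sketch (not the post's own code) of ROUGE-1 F1, which scores unigram overlap between a candidate and a reference. Its reliance on surface word overlap illustrates why such metrics can correlate poorly with human judgments of meaning.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Overlap counts each shared unigram up to its min frequency in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Paraphrases with different wording score well below 1.0 despite equal meaning.
print(round(rouge1_f1("the cat sat on the mat", "the cat is on the mat"), 3))  # → 0.833
```

In practice, libraries such as Google's `rouge_score` package add stemming and longest-common-subsequence variants (ROUGE-L), but the core idea is the same token-overlap computation shown here.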
https://eugeneyan.com/writing/llm-patterns/