Internal representations of LLMs encode information about truthfulness

Large language models (LLMs) produce errors, commonly called “hallucinations”, ranging from factual inaccuracies to biases. New research shows that LLMs’ internal states encode information about the truthfulness of their outputs, and that this signal can be used to detect errors. The truthfulness information is concentrated in specific tokens (the exact answer tokens), and exploiting this concentration improves error detection. However, detectors trained on one dataset fail to generalize to others, suggesting that truthfulness is not encoded in a single, universal way. The internal representations can also predict which types of errors a model is likely to make, enabling tailored mitigation strategies. Strikingly, a model may internally encode the correct answer yet still generate an incorrect one. Together, these findings offer a deeper, inside-out view of LLM errors and can guide future work on error analysis and mitigation.
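To make the probing idea concrete, here is a minimal sketch (not the paper’s exact setup): train a simple linear classifier on an LLM’s hidden state at the answer tokens to predict whether a generated answer is correct. The model name, layer index, and toy examples are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # assumption: any causal LM works here
LAYER = 16                                         # assumption: a middle layer is probed

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def answer_token_state(question: str, answer: str) -> torch.Tensor:
    """Return the hidden state at the last token of the answer span."""
    text = f"Q: {question}\nA: {answer}"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states: tuple of (num_layers + 1) tensors, each (1, seq_len, dim)
    return out.hidden_states[LAYER][0, -1]

# Toy labeled examples: (question, model answer, is_correct) -- illustrative only
examples = [
    ("What is the capital of France?", "Paris", 1),
    ("What is the capital of France?", "Lyon", 0),
    ("Who wrote 'Hamlet'?", "William Shakespeare", 1),
    ("Who wrote 'Hamlet'?", "Charles Dickens", 0),
]

X = torch.stack([answer_token_state(q, a) for q, a, _ in examples]).float().numpy()
y = [label for _, _, label in examples]

# Linear probe: if it separates correct from incorrect answers, the hidden
# state at the answer token carries truthfulness information.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe training accuracy:", probe.score(X, y))
```

In practice such a probe would be trained on many labeled question–answer pairs and evaluated on held-out data; the cross-dataset generalization failures noted above show up when the held-out data comes from a different task or domain.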

https://arxiv.org/abs/2410.02707