Needle in a Needlestack is a challenging benchmark that tests how well LLMs can pay attention to specific information in a context window filled with thousands of limericks: the model is asked a question that only one buried limerick can answer. GPT-4 Turbo and Claude-3 Sonnet struggled with this test, but GPT-4o has recently made a breakthrough and performs almost perfectly. Mistral’s models, on the other hand, had a hard time, with the 8x22B model answering correctly only 50% of the time. Shorter prompts yield better results, and, surprisingly, repeating the relevant information in the prompt significantly boosts performance, as seen with GPT-3.5 Turbo.
http://nian.llmonpy.ai/
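The summary implies a simple harness: pack thousands of limericks into one prompt, bury a target limerick (the needle), and ask a question only that limerick answers. Below is a minimal Python sketch of how such a prompt might be assembled. The filler text, the limerick count, and the `repeats` knob for duplicating the needle are illustrative assumptions, not the benchmark's actual code.

```python
# Minimal sketch of a needle-in-a-needlestack-style prompt builder.
# Everything here (filler text, counts, the repeat trick) is illustrative;
# it is not the NIAN project's actual harness.

FILLER_LIMERICK = (
    "There once was a coder named Lou,\n"
    "whose tests were perpetually blue..."
)

NEEDLE = (
    "A baker from Leeds, rather spry,\n"
    "hid walnuts in every pie..."
)

QUESTION = "In the limerick about the baker from Leeds, what did the baker hide?"


def build_prompt(num_limericks: int, needle_position: float, repeats: int = 1) -> str:
    """Pack num_limericks filler limericks into one prompt, inserting the
    needle at a relative position (0.0 = start, 1.0 = end). Setting
    repeats > 1 duplicates the needle, one plausible reading of the
    "repeating information" trick that reportedly helped GPT-3.5 Turbo."""
    limericks = [FILLER_LIMERICK] * num_limericks
    index = int(needle_position * num_limericks)
    for _ in range(repeats):
        limericks.insert(index, NEEDLE)
    haystack = "\n\n".join(limericks)
    return f"{haystack}\n\nQuestion: {QUESTION}\nAnswer:"


if __name__ == "__main__":
    # Shorter prompts do better, so num_limericks is the lever a harness
    # would sweep; 2,500 here is an arbitrary stand-in for "thousands".
    prompt = build_prompt(num_limericks=2500, needle_position=0.5, repeats=2)
    print(f"Prompt length: {len(prompt):,} characters")
```

Scoring would then compare the model's answer against the expected one; the real benchmark presumably sweeps the needle's position and the prompt length to fit each model's context window.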