Benchmarking GPT-4 Turbo – A Cautionary Tale

Mentat, a popular AI code-editing tool, has used GPT-4 as its default model. The new GPT-4 Turbo promises code edits of comparable quality at a lower cost, so the Mentat team benchmarked the two models against each other on Exercism programming exercises. GPT-4 solved 70% of the JavaScript exercises, while GPT-4 Turbo solved 68.8%.

A closer look revealed a more interesting difference: GPT-4 succeeded far more often on its first attempt, even when the exercise instructions were unclear, which suggests it had memorized the exercises from its training data. GPT-4 Turbo, by contrast, often struggled on the first attempt with those same unclear instructions. This suggests GPT-4 Turbo may have lost some of that memorization, plausibly as a side effect of being distilled into a smaller model. Follow-up runs that withheld the instructions entirely confirmed the gap: GPT-4 solved many more exercises from memory than GPT-4 Turbo did. The post concludes that benchmarks derived from training data can be useful for comparing fine-tuned variants of the same model, but they may not accurately reflect the performance of models trained on different data, or of distilled models like GPT-4 Turbo, and it stresses the need for better, contamination-free benchmarks.
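
To make the "withheld instructions" memorization probe concrete, here is a minimal sketch of how such a check might look. The exercise directory layout, stub file naming, the use of the openai Python client, and running the bundled jest tests via npx are all assumptions for illustration, not details from the post.

```python
# Hypothetical sketch of a memorization probe: ask the model to solve an
# Exercism exercise given only its name and stub file, with no instructions.
# A passing test suite under these conditions strongly suggests memorization.
import subprocess
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def attempt_without_instructions(exercise_dir: Path, model: str) -> bool:
    # Assumed layout: exercises/<name>/<name>.js plus a jest test file.
    stub_path = exercise_dir / f"{exercise_dir.name}.js"
    prompt = (
        f"Complete the Exercism exercise '{exercise_dir.name}'.\n"
        f"Here is the stub file:\n\n{stub_path.read_text()}\n\n"
        "Reply with only the full contents of the solved file."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    code = response.choices[0].message.content
    # Models often wrap code in ``` fences; strip them if present.
    if code.startswith("```"):
        code = code.split("\n", 1)[1].rsplit("```", 1)[0]
    stub_path.write_text(code)
    # Run the exercise's bundled jest tests; exit code 0 means all passed.
    result = subprocess.run(["npx", "jest"], cwd=exercise_dir, capture_output=True)
    return result.returncode == 0


# Example (hypothetical paths/model names):
# attempt_without_instructions(Path("exercises/two-fer"), "gpt-4")
# attempt_without_instructions(Path("exercises/two-fer"), "gpt-4-1106-preview")
```

Comparing the pass rates of the two models under this setup is what would separate genuine problem-solving ability from recall of memorized training data.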

https://blog.mentat.ai/benchmarking-gpt-4-turbo-a-cautionary-tale
