Gzip beats BERT? Part 2: dataset issues, improved speed, and results

In this blog post, the author updates their analysis of a paper that proposed a text-classification method based on gzip compression and k-nearest neighbors (kNN). The paper claimed that this “simple” method outperformed strong baselines, including language models like BERT. However, the author identifies several problems with the paper’s methodology, including contaminated benchmark datasets and an unfair comparison against the baseline results. They also show how to substantially speed up the gzip-based method. Overall, the author concludes that many of the paper’s key results do not hold, while acknowledging that text compression remains an interesting direction for classification research.
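For context, the gzip+kNN approach discussed in the post works by using compressed lengths to approximate similarity: if two texts share a lot of content, compressing their concatenation costs little more than compressing the longer one alone. The sketch below is a minimal illustration of that idea (normalized compression distance with a kNN vote), not the paper's or the blog author's exact implementation; the function names and the toy data are made up for the example.

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance approximated with gzip.

    Smaller values mean the two strings compress well together,
    i.e. they are more similar.
    """
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(text: str, train: list[tuple[str, str]], k: int = 2) -> str:
    """Predict a label by majority vote among the k nearest
    training examples under NCD (illustrative sketch only)."""
    dists = sorted((ncd(text, t), label) for t, label in train)
    top = [label for _, label in dists[:k]]
    return max(set(top), key=top.count)

# Toy usage: two repetitive "documents" per class.
train = [
    ("free money win now " * 5, "spam"),
    ("meeting agenda notes " * 5, "ham"),
]
print(classify("free money win " * 5, train, k=1))
```

Note that repeated compression of pairwise concatenations is exactly why the naive method is slow; the blog post discusses caching the compressor state for each training document to avoid redundant work.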

https://kenschutte.com/gzip-knn-paper2/
