TLDR: I used a combination of LLMs, embedding models, XGBoost, and LinearRegressors to classify the SafeDocs dataset, creating pretty graphs in the process. The dataset contains 8.4 million PDFs totaling 8 TB, making it the largest PDF dataset online. I classified the PDFs into various labels using their metadata, and generated 100k labels using an LLM prompt. Of the models I experimented with, XGBoost was the most accurate, followed by a LinearRegressor ensemble using TF-IDF. Despite some setbacks with deep learning models, XGBoost on embeddings reached the highest accuracy, 85.26%, after a hyperparameter sweep. I visualized the predictions using PCA and UMAP and released the datasets for further exploration.
https://snats.xyz/pages/articles/classifying_a_bunch_of_pdfs.html
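
To make the TF-IDF part of the summary above concrete, here is a minimal sketch of classifying short text by TF-IDF features. It is not the code from the article: the toy documents, the labels, and every function name are hypothetical, the smoothed IDF formula is just one common variant, and a simple nearest-centroid rule stands in for the LinearRegressor ensemble and XGBoost so that the example runs with the standard library alone.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would do more cleaning.
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in docs."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def fit_centroids(docs, labels, idf):
    """Average the TF-IDF vectors of each class (stand-in for a trained model)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for t, v in tfidf(doc, idf).items():
            sums[label][t] += v
    return {
        label: {t: v / counts[label] for t, v in vec.items()}
        for label, vec in sums.items()
    }


def predict(doc, idf, centroids):
    """Return the label whose centroid has the largest dot product with doc."""
    vec = tfidf(doc, idf)

    def score(label):
        return sum(v * centroids[label].get(t, 0.0) for t, v in vec.items())

    return max(centroids, key=score)


# Hypothetical toy data standing in for PDF metadata and generated labels.
train_docs = [
    "annual budget report fiscal revenue",
    "quarterly revenue and budget forecast",
    "patient clinical trial treatment results",
    "hospital treatment guidelines for patient care",
]
train_labels = ["finance", "finance", "medicine", "medicine"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
prediction = predict("budget forecast next fiscal year", idf, centroids)
print(prediction)
```

In a setup closer to the one summarized above, the sparse TF-IDF vectors (or dense embeddings) would presumably be fed to a trained classifier instead of being compared against class centroids.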