Ahmed Nassar, Andres Marafioti, Matteo Omenetti, and colleagues present SmolDocling, a compact vision-language model for end-to-end document conversion. The model emits DocTags, a new markup format designed to capture every page element, including code listings, tables, and equations, together with its spatial location on the page. Where other approaches rely on large foundation models or ensembles of specialized components, SmolDocling performs the full conversion with just 256M parameters, reportedly matching models up to 27 times larger while substantially reducing compute requirements. The model is available now, with the accompanying datasets to be publicly released.
https://arxiv.org/abs/2503.11576
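To make the DocTags idea concrete: each recognized page element is tagged with its type and a set of location tokens encoding its bounding box. The snippet below is an illustrative sketch only; the tag names and token layout here are assumptions for demonstration, not the model's exact vocabulary, so consult the paper for the real format.

```python
import re

# Hypothetical DocTags-style output: each element carries a type tag and
# four <loc_*> tokens giving its bounding box (x1, y1, x2, y2).
# Tag names and layout are illustrative assumptions, not the paper's spec.
doctags = (
    "<text><loc_10><loc_20><loc_300><loc_40>Introduction</text>"
    "<code><loc_10><loc_50><loc_300><loc_120>print('hi')</code>"
)

pattern = re.compile(
    r"<(?P<tag>\w+)>"
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)><loc_(?P<x2>\d+)><loc_(?P<y2>\d+)>"
    r"(?P<content>.*?)"
    r"</(?P=tag)>"
)

# Parse the markup back into structured elements with type, box, and content.
elements = [
    {
        "type": m["tag"],
        "bbox": (int(m["x1"]), int(m["y1"]), int(m["x2"]), int(m["y2"])),
        "content": m["content"],
    }
    for m in pattern.finditer(doctags)
]

for el in elements:
    print(el["type"], el["bbox"], el["content"])
```

Pairing element types with location tokens in a single sequence is what lets one small model replace a pipeline of separate layout-detection and recognition components.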