Kreuzberg – Modern async Python library for document text extraction

Kreuzberg is a Python library for extracting text from PDFs, images, and office documents. It offers a clean, simple API, processes documents locally rather than through external services, and is lightweight. It supports a wide range of formats and is optimized for modern async applications, serverless functions, and dockerized setups. OCR and performance settings are configurable, and both async and sync APIs are available for single-item and batch processing. For PDFs, it uses automatic OCR fallback when direct text extraction is not enough. It also emphasizes robust error handling. The project is MIT-licensed, and contributions are welcome.
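
The OCR fallback and the paired async/sync, single/batch entry points described above can be illustrated with a small standard-library sketch. This is not Kreuzberg's actual API: every name here (`ExtractionResult`, `extract_with_fallback`, `extract_async`, `batch_extract_async`, `batch_extract_sync`) is hypothetical, and the `min_chars` threshold is an assumed heuristic, since the summary does not say how the library decides that OCR is needed.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A job pairs the text already embedded in a document with a callable that
# runs OCR on it. Both are stand-ins for real PDF parsing and a real OCR engine.
Job = Tuple[str, Callable[[], str]]


@dataclass
class ExtractionResult:
    """Hypothetical result container, not Kreuzberg's own type."""
    content: str
    used_ocr: bool


def extract_with_fallback(
    embedded_text: str,
    run_ocr: Callable[[], str],
    min_chars: int = 20,  # assumed threshold for "usable" embedded text
) -> ExtractionResult:
    """Keep the embedded text if it looks usable, otherwise fall back to OCR."""
    text = embedded_text.strip()
    if len(text) >= min_chars:
        return ExtractionResult(content=text, used_ocr=False)
    return ExtractionResult(content=run_ocr().strip(), used_ocr=True)


async def extract_async(job: Job) -> ExtractionResult:
    """Async single-item entry point: run the blocking work in a worker thread."""
    embedded_text, run_ocr = job
    return await asyncio.to_thread(extract_with_fallback, embedded_text, run_ocr)


async def batch_extract_async(jobs: Sequence[Job]) -> List[ExtractionResult]:
    """Async batch entry point: process all jobs concurrently, preserving order."""
    return list(await asyncio.gather(*(extract_async(job) for job in jobs)))


def batch_extract_sync(jobs: Sequence[Job]) -> List[ExtractionResult]:
    """Sync wrapper for callers that are not running inside an event loop."""
    return asyncio.run(batch_extract_async(jobs))


if __name__ == "__main__":
    demo_jobs: List[Job] = [
        ("A searchable PDF with a real text layer.", lambda: "unused"),
        ("", lambda: "Text recovered by OCR."),
    ]
    for result in batch_extract_sync(demo_jobs):
        print(result.used_ocr, result.content)
```

The sync wrapper simply drives the async implementation with `asyncio.run`, which is one common way to offer both calling styles from a single code path. A real library may structure this differently.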

https://github.com/Goldziher/kreuzberg