Wikipedia search-by-vibes through millions of pages offline

Lee Butterman has built a browser-based search engine for Wikipedia that answers natural-language queries. The search engine downloads its Wikipedia database to the browser and runs every search offline. Four pieces make real-time search over millions of documents possible: sentence transformers embed the documents, product quantization compresses the embeddings, pq.js computes distances, and transformers.js processes the query. The results update continuously and are shown as a top-10 ranking based on the information density of the pages. The database is small enough for casual use, and it is stored in Arrow rather than JSON for efficiency. The search engine currently uses ONNX models running on WebAssembly without GPU acceleration. The post gives step-by-step instructions for embedding Wikipedia, compressing the embeddings, hand-writing ONNX, and exporting the data to Arrow format.

https://www.leebutterman.com/2023/06/01/offline-realtime-embedding-search.html
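
To make the compression and distance steps concrete, here is a minimal sketch of product quantization with table-based distance lookup, written in pure Python. It is not the code from the post and not the pq.js API: every function name is hypothetical, the vectors are random toy data rather than Wikipedia embeddings, and the codebooks are sampled from the data instead of being trained with k-means as a real system would likely do.

```python
import heapq
import random

def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(vec, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    return [
        min(range(len(book)), key=lambda k: sq_dist(sub, book[k]))
        for sub, book in zip(split(vec, len(codebooks)), codebooks)
    ]

def distance_tables(query, codebooks):
    """Precompute query-to-centroid distances, one table per sub-space."""
    return [
        [sq_dist(sub, centroid) for centroid in book]
        for sub, book in zip(split(query, len(codebooks)), codebooks)
    ]

def search(query, codes, codebooks, k=10):
    """Return the k (approximate distance, doc_id) pairs closest to the query."""
    tables = distance_tables(query, codebooks)
    scored = (
        (sum(table[c] for table, c in zip(tables, code)), doc_id)
        for doc_id, code in enumerate(codes)
    )
    return heapq.nsmallest(k, scored)

# Toy data (assumption): 200 random 8-dimensional "embeddings".
random.seed(0)
DIM, M, K = 8, 4, 16  # vector size, sub-spaces, centroids per sub-space
docs = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]

# Toy codebooks (assumption): centroids sampled from the data, not k-means.
codebooks = [[split(d, M)[j] for d in random.sample(docs, K)] for j in range(M)]

codes = [encode(d, codebooks) for d in docs]      # M small integers per document
top = search(docs[0], codes, codebooks, k=10)     # top-10 approximate neighbours
```

Each document is stored as `M` small integers instead of `DIM` floats, and scoring a document costs `M` table lookups and additions. That trade of some accuracy for a much smaller database and cheaper distance computation is the general idea behind searching millions of compressed embeddings in a browser.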
