The Llama2 Embedding Server is a service for obtaining text embeddings from various open-source LLMs. To avoid redundant computation, it caches every embedding it produces in SQLite, so repeated requests for the same text and model are served without re-running the model.

The server exposes several endpoints: computing semantic similarity between two text strings, performing semantic search over cached embeddings using FAISS vector search, and generating embeddings for plaintext or PDF files. It also supports token-level embeddings, which capture more fine-grained information at the cost of greater compute and storage, and introduces combined feature vectors, derived from those token-level embeddings, for comparing strings of different lengths.

Built on FastAPI, the server offers scalability, concurrency, and interactive API documentation. It is straightforward to configure and includes comprehensive logging and exception handling, along with performance optimizations such as asynchronous programming, database tuning, RAM-disk utilization, model caching, and parallel inference. A Docker setup is included for easy deployment.
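To illustrate the caching idea, here is a minimal sketch, not the project's actual code: each embedding is keyed by a hash of the input text and model name, so a cache hit skips the model call entirely. The table layout, key format, and `compute_fn` callback are hypothetical.

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect("embeddings.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS embeddings ("
    "  cache_key TEXT PRIMARY KEY,"
    "  vector TEXT NOT NULL)"  # JSON-encoded list of floats
)

def cache_key(text: str, model: str) -> str:
    # Key on both the text and the model so different models never collide.
    return hashlib.sha256(f"{model}:{text}".encode()).hexdigest()

def get_embedding(text: str, model: str, compute_fn) -> list[float]:
    key = cache_key(text, model)
    row = conn.execute(
        "SELECT vector FROM embeddings WHERE cache_key = ?", (key,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cache hit: no model inference needed
    vector = compute_fn(text)      # cache miss: run the (expensive) model
    conn.execute(
        "INSERT INTO embeddings (cache_key, vector) VALUES (?, ?)",
        (key, json.dumps(vector)),
    )
    conn.commit()
    return vector
```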
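The FAISS-based semantic search can be sketched roughly as below; the index type, embedding width, and cosine normalization are assumptions for illustration rather than the repo's exact choices. Normalizing the vectors and using an inner-product index makes the returned scores equivalent to cosine similarity.

```python
import faiss
import numpy as np

dim = 4096  # assumed embedding width; depends on the chosen model
index = faiss.IndexFlatIP(dim)  # exact inner-product search

# Stand-in for the cached corpus embeddings loaded from SQLite.
corpus_vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(corpus_vectors)  # unit-norm so IP == cosine similarity
index.add(corpus_vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar cached strings
print(ids[0], scores[0])
```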
https://github.com/Dicklesworthstone/llama_embeddings_fastapi_service
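Since the server is plain FastAPI, clients interact with it over ordinary HTTP. The endpoint name, port, and request schema below are placeholders; the server's interactive docs (the FastAPI `/docs` page) list the real ones.

```python
import requests

base = "http://localhost:8000"  # assumed host/port; check your config

# Hypothetical similarity request: field names are illustrative only.
resp = requests.post(
    f"{base}/compute_similarity",
    json={
        "text1": "The cat sat on the mat.",
        "text2": "A feline rested on the rug.",
    },
)
resp.raise_for_status()
print(resp.json())
```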