S-LoRA: Serving Concurrent LoRA Adapters

S-LoRA is a system for scalable serving of many LoRA adapters derived from a single base model with Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method. S-LoRA stores all adapters in main memory and fetches only the adapters needed by the currently running queries into GPU memory. To use GPU memory efficiently and reduce fragmentation, it introduces Unified Paging, which manages dynamic adapter weights of different ranks and KV cache tensors of varying sequence lengths in a single unified memory pool. S-LoRA also employs tensor parallelism and optimized custom CUDA kernels for efficient batched inference over heterogeneous adapters. Compared to libraries such as HuggingFace PEFT and vLLM with naive LoRA support, S-LoRA significantly improves throughput and can serve thousands of LoRA adapters on a single GPU. Installation requires a CUDA 11.8-compatible GPU and a PyTorch version between 1.13 and 2.0.1.
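The sketch below illustrates the Unified Paging idea in a hedged, simplified form: a single page-based GPU memory pool hands out fixed-size pages to both KV cache blocks and adapter weight slices, so neither resource fragments the other. The names (`UnifiedPagedPool`, `alloc`, `free`) are illustrative assumptions, not part of the S-LoRA API.

```python
# Minimal sketch of a unified paged memory pool, assuming fixed-size pages
# shared between KV cache entries and LoRA adapter weights. Hypothetical
# names; not the actual S-LoRA implementation.
import torch


class UnifiedPagedPool:
    def __init__(self, num_pages: int, page_size: int, hidden: int,
                 dtype=torch.float16, device="cpu"):
        # One contiguous buffer of fixed-size pages; pages are handed out to
        # either KV cache blocks or adapter weight slices as requests arrive.
        self.pages = torch.empty(num_pages, page_size, hidden,
                                 dtype=dtype, device=device)
        self.free_pages = list(range(num_pages))
        self.owner = {}  # page index -> ("kv", request_id) or ("adapter", adapter_id)

    def alloc(self, n_pages: int, owner):
        if len(self.free_pages) < n_pages:
            raise MemoryError("unified pool exhausted")
        idxs = [self.free_pages.pop() for _ in range(n_pages)]
        for i in idxs:
            self.owner[i] = owner
        return idxs  # caller reads/writes self.pages[idxs]

    def free(self, idxs):
        for i in idxs:
            self.owner.pop(i, None)
            self.free_pages.append(i)


# Usage: adapter weights paged in on demand and a per-request KV cache
# share the same pool, then pages are reclaimed when the request finishes.
pool = UnifiedPagedPool(num_pages=1024, page_size=16, hidden=4096)
adapter_pages = pool.alloc(2, ("adapter", "adapter_0"))
kv_pages = pool.alloc(8, ("kv", "request_42"))
pool.free(kv_pages)
```

Because both resources draw from the same pool of uniform pages, adapters of different ranks and sequences of different lengths can be packed together without reserving separate, fragment-prone regions for each.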

https://github.com/S-LoRA/S-LoRA