Neural Magic introduces Sparse Llama 3.1 8B, the first sparse foundation model built on Meta’s Llama 3.1 8B, achieving 98% recovery on Open LLM Leaderboard v1 and full recovery across fine-tuning tasks such as math, coding, and chat. The model features hardware-accelerated sparsity optimized for NVIDIA Ampere GPUs, delivering higher throughput and lower latency. It is also quantization-compatible, supporting advanced 4-bit quantization methods for further inference efficiency. Sparse Llama aims to reduce the size and cost of large language models while maintaining accuracy. Its performance across benchmarks and inference scenarios demonstrates its robustness and efficiency, paving the way for more accessible and scalable AI advancements.
https://neuralmagic.com/blog/24-sparse-llama-smaller-models-for-efficient-gpu-inference/
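As a rough illustration of how such a sparse, quantization-compatible checkpoint might be served, here is a minimal Python sketch using vLLM. The model identifier and the assumption that the serving stack handles sparse/quantized kernel selection from the checkpoint's configuration are illustrative, not taken from the post:

```python
from vllm import LLM, SamplingParams

# Hypothetical Hugging Face repo name for the sparse model; the actual
# identifier published by Neural Magic may differ.
llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-2of4")

params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = ["Explain 2:4 structured sparsity in one paragraph."]

# Generate completions; whether sparse or 4-bit quantized kernels are used
# is assumed to depend on the checkpoint's config and the available GPU.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```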