Our newly launched torchao, a PyTorch-native library, applies techniques such as low-bit dtypes, quantization, and sparsity to make models faster and smaller. Notably, these techniques show minimal accuracy loss when benchmarked on popular GenAI models: results range from a 97% inference speedup for Llama 3 to a 50% reduction in VRAM for CogVideoX. torchao offers quantization algorithms for both inference and training, including Quantization Aware Training (QAT) to recover accuracy lost to quantization. The APIs are composable, so techniques can be combined for further performance gains. If you're interested in optimizing your models, try torchao now.
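To give a flavor of what weight-only quantization does under the hood, here is a minimal, self-contained sketch of symmetric int8 quantization in plain Python. This is an illustration of the general technique, not torchao's actual implementation or API; the function names are ours.

```python
# Illustrative sketch of symmetric int8 weight-only quantization.
# Function names are hypothetical; torchao's real implementation is
# fused into PyTorch kernels and far more sophisticated.

def quantize_int8(weights):
    """Map float weights to int8 values plus a single scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.0, 1.0]
q, s = quantize_int8(w)
print(q)                  # int8-range integers
print(dequantize(q, s))   # approximately the original weights
```

Storing int8 values plus one float scale per tensor (or per channel) is what cuts memory roughly 4x relative to float32 weights, at the cost of a small rounding error that the dequantize step cannot undo.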
https://pytorch.org/blog/pytorch-native-architecture-optimization/