Sesame CSM: A Conversational Speech Generation Model

Sesame is releasing the 1B variant of CSM, its Conversational Speech Model, which generates audio codes from text and audio inputs. The model uses a Llama backbone and a smaller audio decoder to produce Mimi audio codes. A fine-tuned variant powers an interactive voice demo, and a hosted Hugging Face Space is available for testing audio generation. Running the model requires a CUDA-compatible GPU and access to specific Hugging Face models. CSM performs best when it is given context. It does not generate text, and it handles languages other than English poorly. Using the model for impersonation, fraud, misinformation, or illegal activities is strictly prohibited.
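
As an illustration of what "context" can mean for a conversational speech model, the sketch below represents earlier turns as speaker, text, and audio records and keeps only the most recent turns that fit an audio budget. Everything in it is hypothetical: `Turn`, `recent_context`, and the 24 kHz sample rate are assumptions chosen for the example, not names or values taken from the CSM repository.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed sample rate for illustration only; check the model's actual value.
SAMPLE_RATE_HZ = 24_000


@dataclass
class Turn:
    """One earlier utterance in the conversation (hypothetical structure)."""
    speaker: int
    text: str
    # Mono samples; a plain list stands in for an audio tensor.
    audio: List[float] = field(default_factory=list)

    def duration_ms(self) -> float:
        """Length of this turn's audio in milliseconds."""
        return 1000.0 * len(self.audio) / SAMPLE_RATE_HZ


def recent_context(turns: List[Turn], max_ms: float) -> List[Turn]:
    """Keep the most recent turns whose combined audio fits within max_ms."""
    kept: List[Turn] = []
    total = 0.0
    # Walk backwards from the newest turn and stop at the first one that overflows.
    for turn in reversed(turns):
        duration = turn.duration_ms()
        if total + duration > max_ms:
            break
        kept.append(turn)
        total += duration
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = [
        Turn(0, "Hi", [0.0] * 24_000),
        Turn(1, "Hello", [0.0] * 48_000),
        Turn(0, "How are you?", [0.0] * 12_000),
    ]
    for turn in recent_context(history, max_ms=2500):
        print(turn.speaker, turn.text, turn.duration_ms())
```

The retained turns would then be passed to the generator alongside the new text to speak. How the real model accepts them is defined by the repository linked below, not by this sketch.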

https://github.com/SesameAILabs/csm