Moshi: A speech-text foundation model for real-time dialogue

Moshi is a speech-text foundation model and spoken dialogue framework built on Mimi, a streaming neural audio codec. Moshi predicts the text tokens corresponding to its own speech as an "inner monologue", which improves the quality of its generated audio. The model has a theoretical latency of 160 ms (an 80 ms Mimi frame plus 80 ms of acoustic delay), with practical overall latency as low as 200 ms on an L4 GPU. Mimi is trained with a distillation loss that matches a self-supervised representation from WavLM, so a single tokenizer captures both semantic and acoustic information; despite its low bitrate, it reaches strong subjective quality through adversarial training with feature matching. The repository provides Python (PyTorch), MLX, and Rust implementations, each able to run the released Moshiko and Moshika models (variants fine-tuned on a male and a female synthetic voice, respectively). Windows support is limited, but macOS users on Apple silicon can run the MLX version.

https://github.com/kyutai-labs/moshi
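
As a concrete illustration of the Mimi codec described above, the sketch below round-trips a waveform through Mimi using the repository's Python package. The loader helpers (loaders.DEFAULT_REPO, loaders.MIMI_NAME, loaders.get_mimi) reflect the package's documented loading API at the time of writing; check the current README in case names or signatures have changed.

```python
# Minimal sketch: encode and decode audio with the Mimi codec from the
# `moshi` Python package (install with `pip install moshi`).
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

# Fetch the Mimi weights from the Hugging Face Hub.
mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # Mimi supports more codebooks; Moshi uses 8.

# Ten seconds of dummy 24 kHz mono audio, shaped [batch, channels, samples].
wav = torch.randn(1, 1, 24000 * 10)
with torch.no_grad():
    codes = mimi.encode(wav)        # discrete tokens, shaped [batch, codebooks, frames]
    reconstructed = mimi.decode(codes)  # back to a 24 kHz waveform
```

The same encode/decode pair is what Moshi streams over in real time: audio frames become a few discrete tokens per step, which the language model consumes and produces alongside its text tokens.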
