SoundStorm: Efficient Parallel Audio Generation

Google Research has developed SoundStorm, a model for efficient, non-autoregressive audio generation. The model takes the semantic tokens of AudioLM as input and uses bidirectional attention with confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to AudioLM's autoregressive acoustic generation, SoundStorm is much faster while producing audio of the same quality and with greater consistency in voice and acoustic conditions; it can generate 30 seconds of audio in just 0.5 seconds on a TPU-v4. The model also scales to longer sequences, synthesizing high-quality, natural dialogue segments from a transcript annotated with speaker turns and a short prompt of the speakers' voices.
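
The core idea behind confidence-based parallel decoding is MaskGIT-style iterative unmasking: at each step the model proposes tokens for all masked positions at once, then commits only the most confident proposals and re-predicts the rest. The sketch below is a minimal, illustrative version of that loop, not the actual SoundStorm implementation; the random stand-in network, the sequence length, vocabulary size, iteration count, and cosine unmasking schedule are all assumptions for illustration.

```python
import numpy as np

MASK = -1        # sentinel id for masked codec-token positions (hypothetical)
SEQ_LEN = 128    # codec-token positions in one pass (hypothetical)
NUM_ITERS = 8    # decoding iterations; fewer iterations means faster generation
VOCAB = 1024     # codec codebook size (hypothetical)

def predict_logits(tokens):
    """Stand-in for the bidirectional model; returns per-position logits.
    In SoundStorm, such a network would also condition on AudioLM's
    semantic tokens -- here it is just random for illustration."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(tokens), VOCAB))

def parallel_decode():
    tokens = np.full(SEQ_LEN, MASK)
    for it in range(NUM_ITERS):
        logits = predict_logits(tokens)
        # Softmax per position.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        proposals = probs.argmax(axis=-1)      # parallel proposal for every position
        confidence = probs.max(axis=-1)        # model confidence in each proposal
        confidence[tokens != MASK] = -np.inf   # never re-select committed positions
        # Cosine schedule (an assumption): how many positions should be
        # committed in total after this iteration.
        target = int(np.ceil(SEQ_LEN * (1 - np.cos(np.pi / 2 * (it + 1) / NUM_ITERS))))
        n_new = max(target - int(np.sum(tokens != MASK)), 1)
        n_new = min(n_new, int(np.sum(tokens == MASK)))
        top = np.argsort(-confidence)[:n_new]  # most confident masked positions
        tokens[top] = proposals[top]           # commit them; the rest stay masked
    return tokens

print(parallel_decode()[:16])
```

Because every iteration fills many positions in parallel instead of one token at a time, the number of forward passes is fixed by the schedule rather than by the sequence length, which is what makes this style of decoding so much faster than autoregressive generation.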

https://google-research.github.io/seanet/soundstorm/examples/
