In their paper, “StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models,” authors Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, and Nima Mesgarani introduce StyleTTS 2, a text-to-speech (TTS) model that combines style diffusion with adversarial training against large speech language models to achieve human-level synthesis. The model treats speaking styles as latent random variables and samples them with a diffusion model, so it can generate a style suited to the input text without requiring reference speech, while keeping latent diffusion efficient and producing diverse speech. The authors also employ large pre-trained speech language models as discriminators, which improves the naturalness of the synthesized speech. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset, and when trained on the LibriTTS dataset it outperforms previous publicly available models for zero-shot speaker adaptation. The results demonstrate the potential of style diffusion and adversarial training with large speech language models for achieving human-level TTS synthesis on both single-speaker and multispeaker datasets.
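To make the style-diffusion idea more concrete, below is a minimal, illustrative sketch: a toy denoising-diffusion sampler that draws a fixed-size style vector from Gaussian noise, conditioned on a text embedding. The network shapes, names, and the plain DDPM-style ancestral sampler are assumptions chosen for illustration; this is not the authors' implementation, which is available in the repository linked below.

```python
# Illustrative sketch of "style diffusion": sample a style vector from noise,
# conditioned on a text embedding, using a small denoising network.
# All sizes, names, and the DDPM-style sampler are hypothetical choices,
# NOT the StyleTTS 2 code.
import torch
import torch.nn as nn

STYLE_DIM, TEXT_DIM, T = 128, 512, 50  # hypothetical dimensions / step count

class StyleDenoiser(nn.Module):
    """Predicts the noise added to a style vector, given the noisy style,
    the diffusion timestep, and a text embedding as conditioning."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STYLE_DIM + TEXT_DIM + 1, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, STYLE_DIM),
        )

    def forward(self, noisy_style, t, text_emb):
        t_feat = t.float().unsqueeze(-1) / T  # crude timestep encoding
        return self.net(torch.cat([noisy_style, text_emb, t_feat], dim=-1))

@torch.no_grad()
def sample_style(model, text_emb, steps=T):
    """Ancestral DDPM sampling: start from pure noise and iteratively denoise
    to obtain a style vector suited to the given text."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    style = torch.randn(text_emb.shape[0], STYLE_DIM)  # start from noise
    for t in reversed(range(steps)):
        t_batch = torch.full((text_emb.shape[0],), t)
        eps = model(style, t_batch, text_emb)  # predicted noise at step t
        # Standard DDPM reverse-step mean, then add noise except at t = 0.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (style - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(style) if t > 0 else torch.zeros_like(style)
        style = mean + torch.sqrt(betas[t]) * noise
    return style

# Usage: a random vector stands in for a real text encoder's output.
model = StyleDenoiser()
text_emb = torch.randn(1, TEXT_DIM)
print(sample_style(model, text_emb).shape)  # torch.Size([1, 128])
```

Because sampling starts from fresh noise each time, running the sampler repeatedly on the same text yields different but plausible styles, which is the source of the diverse synthesis the paper describes, and it removes the need for a reference utterance at inference time.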
Audio samples and the paper are available at the links provided. The authors also share instructions for using the model in the GitHub repository:
https://github.com/yl4579/StyleTTS2