Images that Sound: Generating spectrograms that are also images

This paper explores generating images that can also be played as sound using diffusion models. Spectrograms are visual representations of audio, but they typically look nothing like natural images. The authors show that it is possible to generate spectrograms that simultaneously look like natural images and sound like natural audio. They achieve this by denoising a shared noisy latent with both an image diffusion model and an audio diffusion model, so the result satisfies both priors at once. The method is zero-shot: it requires no training or fine-tuning. Examples of these "images that sound" are available in the project gallery.
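The core idea is compositional: because both models denoise the same latent, their noise predictions can be combined at every reverse-diffusion step so the sample remains plausible under both the image prior and the audio prior. Below is a minimal sketch of such a composed denoising loop, assuming a DDPM-style schedule and an equal-weight average of the two noise estimates; the `DummyDenoiser` stubs, the weight `w`, and the schedule values are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch: composing an image diffusion model and an audio diffusion
# model by averaging their noise predictions on a shared latent each step.
# The stubs below stand in for pretrained noise-prediction (epsilon) models.

import torch
import torch.nn as nn


class DummyDenoiser(nn.Module):
    """Hypothetical stand-in for a pretrained epsilon-prediction network."""

    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, t):
        # A real model would also condition on the timestep t and a text prompt.
        return self.net(x)


def composed_denoise(image_net, audio_net, steps=50, shape=(1, 4, 64, 64), w=0.5):
    """DDPM-style reverse process where the noise estimate at each step is a
    weighted average of the image model's and the audio model's predictions."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        with torch.no_grad():
            eps_img = image_net(x, t)
            eps_aud = audio_net(x, t)
        eps = w * eps_img + (1.0 - w) * eps_aud  # compose the two priors

        # Standard DDPM posterior mean, then add noise except at the last step.
        mean = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # a latent interpretable as both a spectrogram and an image


if __name__ == "__main__":
    latent = composed_denoise(DummyDenoiser(), DummyDenoiser())
    print(latent.shape)  # torch.Size([1, 4, 64, 64])
```

In practice the weight `w` trades off visual quality against audio quality; the sketch fixes it at 0.5 purely for illustration.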

https://ificl.github.io/images-that-sound/