This paper introduces GaussianAnything, a 3D generation framework that addresses challenges in input formats, latent space design, and output representations. The framework uses a Variational Autoencoder (VAE) that takes RGB-D renderings as input to build a latent space supporting multi-modal conditional 3D generation. The proposed method outperforms existing approaches, particularly on text- and image-conditioned 3D generation tasks, and its native 3D diffusion model offers improved 3D consistency and editing capabilities. Concurrent studies explore similar native 3D diffusion models for 3D object generation.
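The latent design above rests on a standard VAE mechanism: an encoder maps the RGB-D input to the parameters of a Gaussian posterior, and a latent is drawn via the reparameterization trick. The sketch below illustrates only that core step; the fixed random linear "encoder" and the toy RGB-D vector are stand-ins for illustration, not the paper's actual network or data pipeline.

```python
import math
import random

def encode(rgbd, latent_dim=4, seed=0):
    """Toy stand-in encoder: maps a flattened RGB-D rendering to the mean
    and log-variance of a diagonal Gaussian posterior. A fixed random
    linear map replaces the real network, purely for illustration."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in rgbd] for _ in range(2 * latent_dim)]
    stats = [sum(w * x for w, x in zip(row, rgbd)) / len(rgbd) for row in weights]
    return stats[:latent_dim], stats[latent_dim:]  # mu, log_var

def sample_latent(mu, log_var, rng=random):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1),
    so sampling stays differentiable with respect to mu and log_var."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0, 1) for m, lv in zip(mu, log_var)]

# One hypothetical "rendering": RGB and depth values for two pixels, flattened.
rgbd = [0.2, 0.4, 0.6, 1.5, 0.1, 0.3, 0.5, 2.0]
mu, log_var = encode(rgbd)
z = sample_latent(mu, log_var)
print(len(z))  # a 4-dimensional latent vector
```

A diffusion model would then be trained in this latent space, which is what enables the conditional generation and editing described above.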
Project page: https://nirvanalan.github.io/projects/GA/