The SPAD method synthesizes 3D-consistent views of an object from a text prompt, generating images from a range of camera angles even though it is trained on only four views. Its pipeline fine-tunes a pre-trained text-to-image diffusion model to denoise multi-view images jointly, applies 3D self-attention across the views, and adds Plücker embeddings for camera control. The method achieves competitive results on 3D consistency and novel view synthesis tasks. Its distinguishing features include epipolar attention for better camera control, and design choices such as Plücker embeddings that prevent content copying between views and the generation of flipped views. The method also shows smooth transitions between views and outperforms MVDream in several respects.
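To make the Plücker embedding mentioned above more concrete, here is a minimal sketch of the standard Plücker parameterization of a camera ray, which pairs the unit ray direction with its moment about the world origin. It is an illustration rather than the authors' implementation: the function names (`cross`, `normalize`, `plucker_ray`) and the (direction, moment) ordering are assumptions chosen for this example.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def plucker_ray(origin, direction):
    """Return the 6D Plücker coordinates (d, o x d) of a ray.

    Hypothetical helper: the (direction, moment) ordering is an
    assumption for this sketch; other conventions put the moment first.
    """
    d = normalize(direction)
    m = cross(origin, d)  # moment of the ray about the world origin
    return d + m


# A ray through (0, 0, 1) pointing along +x.
ray = plucker_ray((0, 0, 1), (1, 0, 0))

# Moving the origin along the ray leaves the coordinates unchanged.
same_ray = plucker_ray((5, 0, 1), (1, 0, 0))

# Reversing the direction negates both the direction and the moment.
flipped = plucker_ray((0, 0, 1), (-1, 0, 0))
```

Two properties are visible here. The coordinates depend only on the ray itself, not on which point along it is used as the origin, and a ray pointing the opposite way gets the negated coordinates, so opposite viewing directions remain distinguishable. That second property is plausibly related to the flipped-view behavior described above, though the summary does not spell out the mechanism.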
https://yashkant.github.io/spad/