The authors’ method leverages the existing modulation mechanism in DiT models to extract concept-specific information from user-provided images. A pre-trained text-to-image DiT processes both image and text tokens through modulation, attention, and feed-forward modules. The focus is on the modulation block, where tokens are modulated by a vector derived from a pooled text embedding. TokenVerse learns a personalized adjustment to the modulation vector of each text token, based on concept images and their captions. These adjustments represent personalized directions in the modulation space and are learned with a simple reconstruction objective. During inference, the learned direction vectors are added to the modulation vectors of the corresponding text tokens, injecting the personalized concepts into generated images.
https://token-verse.github.io/
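
To make the mechanism concrete, here is a minimal PyTorch sketch of the core idea, not the authors' implementation: a frozen DiT-style block modulates a token stream with per-channel scale and shift derived from a conditioning vector, and only per-text-token offset directions in that modulation space are trained with a simple reconstruction loss. All names (`ModulatedBlock`, `offsets`), the dimensions, and the MSE stand-in for the actual diffusion denoising objective are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedBlock(nn.Module):
    """Simplified DiT-style block: tokens are normalized, modulated by
    per-channel scale/shift derived from a conditioning vector, then mixed
    by self-attention (which lets text tokens influence image tokens)."""
    def __init__(self, embed_dim: int, hidden_dim: int, num_heads: int = 4):
        super().__init__()
        self.to_scale_shift = nn.Linear(embed_dim, 2 * hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, mod_vec: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, hidden_dim); mod_vec: (B, N, embed_dim), one per token.
        scale, shift = self.to_scale_shift(mod_vec).chunk(2, dim=-1)
        modulated = self.norm(tokens) * (1 + scale) + shift
        attn_out, _ = self.attn(modulated, modulated, modulated)
        return tokens + attn_out

# Illustrative dimensions; the real model is a pre-trained text-to-image DiT.
B, T, N = 2, 8, 64            # batch, text tokens, image tokens
embed_dim, hidden_dim = 32, 48

block = ModulatedBlock(embed_dim, hidden_dim)
for p in block.parameters():
    p.requires_grad_(False)    # the base model stays frozen

# One learnable direction in modulation space per text token of the caption.
offsets = nn.Parameter(torch.zeros(T, embed_dim))
opt = torch.optim.Adam([offsets], lr=1e-2)

pooled = torch.randn(B, embed_dim)           # pooled text embedding
text_tokens = torch.randn(B, T, hidden_dim)  # stand-in text-token features
image_tokens = torch.randn(B, N, hidden_dim) # stand-in noisy-image tokens
target = torch.randn(B, N, hidden_dim)       # stand-in reconstruction target

for step in range(100):
    # Text tokens get the pooled embedding plus their personalized direction;
    # image tokens keep the unmodified pooled embedding.
    mod_text = pooled.unsqueeze(1) + offsets.unsqueeze(0)   # (B, T, embed_dim)
    mod_image = pooled.unsqueeze(1).expand(-1, N, -1)       # (B, N, embed_dim)

    tokens = torch.cat([text_tokens, image_tokens], dim=1)
    mod = torch.cat([mod_text, mod_image], dim=1)
    out = block(tokens, mod)

    # Simple reconstruction objective on the image-token outputs, standing in
    # for the model's actual diffusion denoising loss on the concept image.
    loss = F.mse_loss(out[:, T:], target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference, adding `offsets` to the modulation vectors of the matching
# text tokens injects the learned concept into generation.
```

Because only the offset vectors are optimized while the base model stays frozen, each concept costs just `T × embed_dim` parameters, and independently learned concepts can in principle be combined by applying their respective offsets to different tokens of a new prompt.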