AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling, by Jun Zhan, Junqi Dai, and Jiasheng Ye from Fudan University, introduces AnyGPT, an any-to-any multimodal language model. AnyGPT uses discrete representations to process speech, text, images, and music in a unified way, without altering the architecture of the underlying large language model. All adaptation happens at the data level through preprocessing, which allows new modalities to be integrated seamlessly. The authors build a text-centric multimodal dataset for alignment pre-training and synthesize a multimodal instruction dataset of any-to-any conversations that intertwine different modalities, demonstrating the model's ability to handle diverse inputs and outputs. Experimental results show that AnyGPT can carry out multimodal conversations while achieving performance comparable to specialized models across all modalities. The synthetic instruction data and the diverse demonstrations underscore the model's approach to multimodal understanding and generation.
https://junzhan2000.github.io/AnyGPT.github.io/
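The core idea of data-level preprocessing can be pictured as follows: modality-specific tokenizers turn each input into discrete codes, those codes are remapped into extra slots of a single shared vocabulary, and the result is interleaved with text tokens into one flat sequence for an unchanged decoder-only LLM. The sketch below illustrates this under stated assumptions; the vocabulary sizes, offsets, special tokens, and helper functions are hypothetical, not AnyGPT's actual tokenizers or vocabulary layout.

```python
# Minimal sketch of data-level multimodal preprocessing with discrete tokens.
# Vocabulary sizes, offsets, special tokens, and placeholder codes are
# illustrative assumptions, not AnyGPT's actual configuration.

TEXT_VOCAB_SIZE = 32_000      # assumed size of the base LLM text vocabulary
IMAGE_CODEBOOK_SIZE = 8_192   # assumed image tokenizer codebook size
SPEECH_CODEBOOK_SIZE = 1_024  # assumed speech unit tokenizer codebook size

# Each non-text modality gets a contiguous block of IDs appended after the
# text vocabulary; the LLM architecture itself is left untouched.
IMAGE_OFFSET = TEXT_VOCAB_SIZE
SPEECH_OFFSET = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE
SPECIAL_BASE = SPEECH_OFFSET + SPEECH_CODEBOOK_SIZE

# Boundary tokens let the model tell modalities apart inside one flat sequence.
SPECIAL = {name: SPECIAL_BASE + i
           for i, name in enumerate(["<img>", "</img>", "<sph>", "</sph>"])}


def wrap_image(codes: list[int]) -> list[int]:
    """Shift image codebook indices into the shared vocabulary and wrap them
    with image boundary tokens."""
    return [SPECIAL["<img>"]] + [IMAGE_OFFSET + c for c in codes] + [SPECIAL["</img>"]]


def wrap_speech(codes: list[int]) -> list[int]:
    """Shift speech unit indices into the shared vocabulary and wrap them
    with speech boundary tokens."""
    return [SPECIAL["<sph>"]] + [SPEECH_OFFSET + c for c in codes] + [SPECIAL["</sph>"]]


def build_sequence(text_ids: list[int], image_codes: list[int],
                   speech_codes: list[int]) -> list[int]:
    """Interleave text tokens with discretized image and speech tokens into a
    single sequence that a standard decoder-only LLM can model autoregressively."""
    return text_ids + wrap_image(image_codes) + wrap_speech(speech_codes)


if __name__ == "__main__":
    text_ids = [101, 2057, 2293]    # placeholder text token IDs
    image_codes = [12, 873, 4501]   # placeholder image codebook indices
    speech_codes = [7, 55, 903]     # placeholder speech unit indices
    print(build_sequence(text_ids, image_codes, speech_codes))
```

At generation time, tokens that fall in a non-text block of the vocabulary would be routed back to the corresponding de-tokenizer to reconstruct the image, speech, or music output.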