This repository introduces the Large Concept Model (LCM), which operates on a higher-level semantic representation called "concepts" that transcends individual languages and modalities. Each concept corresponds to a sentence, represented in the SONAR embedding space, which supports many languages. The LCM is a sequence-to-sequence model trained for autoregressive sentence prediction, using approaches such as MSE regression and diffusion-based generation. The repository provides recipes to reproduce the training and finetuning of 1.6B-parameter models on 1.3T tokens of training data, with detailed steps for installation, data preparation, pre-training, finetuning, and evaluation, along with guidelines for contributing and citation. This approach to language modeling is worth exploring.
https://github.com/facebookresearch/large_concept_model