This technical report focuses on our method for turning visual data into a unified representation that enables large-scale training of generative models, and on an evaluation of the capabilities and limitations of our model, Sora. Prior work on generative modeling of video has typically focused on specific categories of visual data or on videos of restricted length. Sora, by contrast, is a generalist model: it generates videos and images of varying durations, aspect ratios, and resolutions. We train Sora on diverse videos and images using a patch-based representation. Sora is a diffusion model that scales effectively for video generation and exhibits interesting emergent capabilities.
https://openai.com/research/video-generation-models-as-world-simulators