Video generation models as world simulators

This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks,[^1][^2][^3] generative adversarial networks,[^4][^5][^6][^7] autoregressive transformers,[^8][^9] and diffusion models.[^10][^11][^12] These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high-definition video.