LaVie

LaVie is a text-to-video (T2V) generation framework that leverages pre-trained text-to-image (T2I) models to produce visually realistic and temporally coherent videos. It is built as a cascade of latent diffusion models: a base T2V model, a temporal interpolation model, and a video super-resolution model. Two design choices stand out: temporal self-attention layers with rotary positional encoding, which capture correlations across frames, and joint image-video fine-tuning, which improves the quality and creativity of the output. LaVie also uses Vimeo25M, a dataset of 25 million text-video pairs selected for quality, diversity, and aesthetic appeal.
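To make the idea of temporal self-attention with rotary positional encoding more concrete, here is a minimal sketch in plain Python. It is not LaVie's code: the function names (`rotary_encode`, `temporal_self_attention`) are hypothetical, there are no learned query/key/value projections, and attention runs over a handful of per-frame feature vectors for a single spatial location rather than over latent tensors. It only illustrates the mechanism: each frame's features are rotated by an angle tied to its frame index, so attention scores depend on the relative distance between frames.

```python
import math


def rotary_encode(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`.

    `position` is the frame index; `base` is the usual rotary frequency base
    (an assumed default, not a value taken from LaVie).
    """
    dim = len(vec)
    assert dim % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(frames):
    """Single-head attention across time for one spatial location.

    `frames` is a list of T feature vectors, one per frame. Queries and keys
    are the rotary-encoded features; values are the raw features. A real
    model would apply learned projections first (omitted here).
    """
    dim = len(frames[0])
    rotated = [rotary_encode(f, t) for t, f in enumerate(frames)]
    output = []
    for q in rotated:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in rotated
        ]
        weights = softmax(scores)
        # Each output frame is a weighted mix of all frames' features.
        output.append(
            [sum(w * v[d] for w, v in zip(weights, frames)) for d in range(dim)]
        )
    return output


# Three frames, four features each (made-up numbers for illustration).
frames = [
    [1.0, 0.0, 0.5, 0.5],
    [0.9, 0.1, 0.4, 0.6],
    [0.0, 1.0, 0.5, 0.5],
]
out = temporal_self_attention(frames)
print(len(out), len(out[0]))  # 3 4
```

Because the rotation is applied to both queries and keys, the score between frame m and frame n depends only on m - n, which is the property that makes rotary encoding a natural fit for modeling temporal order.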