Latte is a novel Latent Diffusion Transformer designed for video generation. It first extracts spatio-temporal tokens from input videos and then employs a series of Transformer blocks to model the video distribution in the latent space. To handle the substantial number of tokens extracted from videos efficiently, Latte introduces four model variants that decompose the spatial and temporal dimensions of the input videos. Comprehensive evaluations show that Latte achieves state-of-the-art performance on standard video generation datasets, including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. Latte also extends to text-to-video (T2V) generation, achieving results comparable to recent T2V models.
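To make the spatial-temporal decomposition concrete, the sketch below shows the interleaving idea in a minimal PyTorch form: a spatial Transformer block attends over the tokens within each frame, and a temporal block attends over tokens at the same spatial position across frames. This is an illustrative sketch of the factorization, not Latte's actual implementation; the class name, layer choices, and sizes are assumptions for demonstration.

```python
import torch
import torch.nn as nn


class SpatialTemporalBlock(nn.Module):
    """Minimal sketch of an interleaved spatial/temporal Transformer pair.

    Illustrates the decomposition idea only; hyperparameters and layer
    internals are placeholders, not Latte's published configuration.
    """

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True, norm_first=True
        )
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True, norm_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim) latent video tokens
        b, f, s, d = x.shape
        # Spatial attention: fold frames into the batch so attention
        # runs over the s tokens within each frame.
        x = self.spatial(x.reshape(b * f, s, d)).reshape(b, f, s, d)
        # Temporal attention: fold spatial positions into the batch so
        # attention runs over the f frames at each spatial position.
        x = x.permute(0, 2, 1, 3).reshape(b * s, f, d)
        x = self.temporal(x).reshape(b, s, f, d).permute(0, 2, 1, 3)
        return x


# Example: 2 videos, 16 frames, 256 latent tokens per frame, width 512.
tokens = torch.randn(2, 16, 256, 512)
block = SpatialTemporalBlock(dim=512, num_heads=8)
print(block(tokens).shape)  # torch.Size([2, 16, 256, 512])
```

Each factorized attention operates over a much shorter sequence (s or f tokens) than full joint attention over all f × s tokens, which is what makes the variants efficient for video-length token counts.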