VideoGPT is a likelihood-based generative model that applies the VQ-VAE and GPT-style Transformer architectures to natural video. It works in two stages. First, a Vector Quantized-Variational AutoEncoder (VQ-VAE) built from 3D convolutions and axial self-attention compresses raw video into a downsampled grid of discrete latent codes. Second, a GPT-like autoregressive Transformer models the sequence of those discrete latents, using spatio-temporal position encodings to keep track of where each code sits in time and space. New videos are produced by sampling latent codes from the Transformer and decoding them with the VQ-VAE decoder. Despite this simple formulation, VideoGPT generates samples competitive with state-of-the-art Generative Adversarial Network (GAN) models on the BAIR Robot dataset, and produces high-fidelity natural videos when trained on UCF-101 and the Tumblr GIF dataset (TGIF).
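
The hand-off between the two stages can be illustrated with a small sketch. It is not VideoGPT's implementation: the function names, the toy two-dimensional codebook, and the raster-scan ordering are assumptions chosen for illustration. It only shows the general idea of mapping each latent vector to its nearest codebook entry and flattening the resulting time × height × width grid of indices into a token sequence that an autoregressive model could consume.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize_grid(latents, codebook) -> List[List[List[int]]]:
    """Map latents[t][h][w] (a vector) to tokens[t][h][w] (a codebook index)."""
    return [
        [[nearest_code(vec, codebook) for vec in row] for row in frame]
        for frame in latents
    ]


def flatten_tokens(tokens) -> List[int]:
    """Flatten the token grid in raster order: time, then height, then width.

    The ordering is an assumption made for this sketch.
    """
    return [tok for frame in tokens for row in frame for tok in row]


# Hypothetical toy data: 3 codebook entries, 2 frames, each a 1x2 latent grid.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [
    [[[0.1, -0.1], [0.9, 1.2]]],   # frame 0
    [[[-0.8, 0.7], [0.4, 0.4]]],   # frame 1
]

tokens = quantize_grid(latents, codebook)
sequence = flatten_tokens(tokens)
print(sequence)  # [0, 1, 2, 0]
```

In a real system the latent vectors would come from the learned encoder and the codebook would be learned jointly with it; the sequence of indices is what the autoregressive prior is trained to predict, one token at a time.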