How Do AI Models Generate Videos? An In-Depth Look
By Will Douglas Heaven, MIT Technology Review — September 12, 2025
Artificial intelligence (AI) has made spectacular strides in recent years, especially in video generation. Powerful tools are now in the hands of more people than ever before, letting casual users and professionals alike create clips that are nearly indistinguishable from filmed footage or traditional CGI animation. From OpenAI’s Sora and Google DeepMind’s Veo 3 to Runway’s Gen-4, the AI video generation landscape is booming, and Netflix has even debuted AI visual effects in a mass-market TV show, The Eternaut.
Although the demo clips companies release tend to showcase only their best outputs, widespread access through apps like ChatGPT and Gemini means that an average user can now prompt AI to produce intriguing video content, albeit with varying degrees of success and often after multiple attempts.
This article unpacks the technology behind AI-generated videos, explains why the process is both powerful and energy-intensive, and highlights recent breakthroughs that are pushing the field forward.
The Basics of AI Video Generation
For most casual users, video generation is as simple as typing a prompt like, “Make me a video of a unicorn eating spaghetti,” into an app. But behind this seemingly straightforward interaction lies complex machinery involving advanced AI models called latent diffusion transformers.
What Is a Diffusion Model?
Diffusion models are at the core of modern AI video (and image) generation. Imagine starting with an image and then repeatedly adding random noise (like static on an old TV). After many iterations, the original image becomes pure noise.
A diffusion model is a neural network trained to reverse this process: it learns to start from random noise and gradually refine it back into a coherent image. During training, the model is shown millions of images at various noise levels and learns how to “denoise” them step by step.
When you request an image from such a model, it begins with a noisy pattern and methodically clears away the noise until a new image emerges—similar to those in its training data.
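To make that concrete, here is a minimal sketch of both halves of the process in Python (using PyTorch). The linear noise schedule, the tiny stand-in "denoiser" network, and the step count are illustrative placeholders, not the configuration of any production model:

```python
# Toy sketch of diffusion: add noise to images (forward), then run the
# reverse, step-by-step denoising. All sizes and schedules are illustrative.
import torch
import torch.nn as nn

T = 1000                                       # number of noise levels
betas = torch.linspace(1e-4, 0.02, T)          # simple linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Forward process: blend clean images x0 with Gaussian noise at level t."""
    noise = torch.randn_like(x0)
    a = alphas_bar[t].sqrt().view(-1, 1, 1, 1)
    b = (1.0 - alphas_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + b * noise, noise

# Stand-in denoiser; real models are large U-Nets or transformers that also
# take the noise level (and a text embedding) as input.
denoiser = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(),
                         nn.Conv2d(64, 3, 3, padding=1))

@torch.no_grad()
def sample(shape=(1, 3, 64, 64)):
    """Reverse process: start from pure noise and clean it up step by step."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        pred_noise = denoiser(x)               # predict the noise present in x
        a_bar = alphas_bar[t]
        x0_est = (x - (1.0 - a_bar).sqrt() * pred_noise) / a_bar.sqrt()
        # Re-noise the estimate to the next (lower) noise level and repeat.
        x = x0_est if t == 0 else add_noise(x0_est, torch.tensor([t - 1]))[0]
    return x
```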
However, you don’t just want any image; you want one that matches your text prompt. The diffusion model is paired with another AI, often a large language model (LLM), which understands text descriptions and guides the image-generation process towards pictures that fit the prompt.
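One common way to do that steering is classifier-free guidance, where the denoiser is run with and without the prompt and the difference is amplified. The sketch below assumes a hypothetical `denoiser` that takes the noisy image, the noise level, and a text embedding; the names and the guidance scale are illustrative, not any particular model's API:

```python
# Hypothetical sketch of classifier-free guidance: run the denoiser with the
# prompt's embedding and with an "empty" embedding, then push the result
# toward the prompted direction. The 7.5 scale is just a typical-looking value.
def guided_noise_prediction(denoiser, x, t, text_emb, empty_emb, guidance_scale=7.5):
    cond = denoiser(x, t, text_emb)      # prediction when the prompt is provided
    uncond = denoiser(x, t, empty_emb)   # prediction with no prompt at all
    return uncond + guidance_scale * (cond - uncond)
```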
From Images to Video
While the concept is straightforward with images, video generation must handle sequences of frames that maintain continuity—objects, lighting, and motion must be consistent from one frame to the next.
Traditional diffusion models are computationally intensive, especially for video, because they must process millions of pixels per frame, multiplied across many frames.
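A rough back-of-envelope calculation shows the scale (the numbers are illustrative, not measurements from any particular system):

```python
# Illustrative arithmetic: how many values a short clip contains if the model
# works directly on pixels.
width, height, channels = 1280, 720, 3   # one 720p frame
fps, seconds = 24, 8                     # a short eight-second clip
values = width * height * channels * fps * seconds
print(f"{values:,}")                     # 530,841,600 -- over half a billion values
```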
Enter Latent Diffusion
To tackle the immense computational demands, most video generation models use latent diffusion techniques. Instead of operating on raw pixel data, these models compress video frames and text prompts into a mathematical representation called a latent space, capturing only essential features and discarding extraneous details. This compression dramatically reduces the amount of processing power needed.
Once the latent diffusion model has generated a suitable compressed video, the data is decompressed back into a watchable format, transforming the latent representation into full-resolution video frames.
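In code, the overall shape of the pipeline looks roughly like this. The tiny encoder and decoder below are stand-ins for the large pretrained video autoencoders real systems use, and the sizes are arbitrary examples:

```python
# Schematic latent-diffusion pipeline: compress frames, denoise in the smaller
# latent space, then decompress. The autoencoder here is a toy stand-in.
import torch
import torch.nn as nn

# Toy autoencoder: three stride-2 convolutions give 8x spatial downsampling.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(64, 8, 4, stride=2, padding=1),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(8, 64, 4, stride=2, padding=1), nn.SiLU(),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.SiLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
)

frames = torch.randn(16, 3, 256, 256)    # a 16-frame clip in pixel space
latents = encoder(frames)                # (16, 8, 32, 32): ~24x fewer values
# ... the diffusion model would denoise `latents` here, guided by the prompt ...
reconstructed = decoder(latents)         # back to (16, 3, 256, 256) pixel frames
print(frames.numel(), "->", latents.numel())
```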
The Role of Transformers: Latent Diffusion Transformers
Generating coherent videos isn’t just about creating individual frames—it requires ensuring consistency across the entire sequence. This is where transformer models come into play.
Transformers excel at processing long sequences of data and maintaining context throughout, which is why they’re used in popular language models like OpenAI’s GPT-5 and Google DeepMind’s Gemini. OpenAI’s Sora model innovatively combines diffusion with transformers to handle video data sliced into “cubes” of space and time—effectively treating video clips as sequences of chunks.
This hybrid approach helps maintain object permanence, consistent lighting, and other spatial-temporal coherence, resulting in videos where elements don’t randomly appear or disappear. Furthermore, it allows training on diverse video formats—from widescreen films to vertical smartphone clips—enhancing the model’s versatility.
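Sora's exact architecture isn't public, but the idea of spacetime "cubes" can be pictured with a few lines of tensor reshaping. Here a video latent is cut into small blocks of time, height, and width and flattened into a token sequence a transformer can process; the patch sizes are arbitrary examples:

```python
# Illustrative "spacetime patch" slicing: cut a (time, channels, height, width)
# video tensor into small cubes and flatten each cube into one token.
import torch

def to_spacetime_tokens(video, t_patch=2, h_patch=4, w_patch=4):
    T, C, H, W = video.shape
    x = video.reshape(T // t_patch, t_patch, C,
                      H // h_patch, h_patch,
                      W // w_patch, w_patch)
    x = x.permute(0, 3, 5, 1, 2, 4, 6)   # group the cube dimensions together
    return x.reshape(-1, t_patch * C * h_patch * w_patch)

latent = torch.randn(16, 8, 32, 32)      # 16 frames of 8-channel latents
tokens = to_spacetime_tokens(latent)
print(tokens.shape)                      # torch.Size([512, 256]): 512 cube tokens
```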
Adding Audio: Moving Beyond the Silent Era
A remarkable breakthrough introduced by Google DeepMind’s Veo 3 is the ability to generate videos with synchronized audio, including lip-synced dialogue, sound effects, and ambient noise.
By compressing audio and video data into a unified format within the diffusion model, Veo 3 can generate sound and images simultaneously, ensuring perfect synchronization. As Google DeepMind CEO Demis Hassabis noted, this development marks the end of the “silent era” of video generation.
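Veo 3's internals are not public either, but one simple way to picture a "unified format" is a single token sequence that carries both modalities. In the sketch below, video and audio latents of the same width are concatenated so one diffusion model can denoise them together; every name and size is an illustrative assumption:

```python
# Hypothetical sketch: put audio and video latent tokens into one sequence so
# a single model denoises them jointly, keeping sound and picture in sync.
import torch

d_model = 256
video_tokens = torch.randn(512, d_model)   # e.g. spacetime patches, as above
audio_tokens = torch.randn(128, d_model)   # e.g. patches of a spectrogram latent

joint = torch.cat([video_tokens, audio_tokens], dim=0)    # one shared sequence
# ... a diffusion transformer would denoise `joint` here ...
video_out, audio_out = joint[:512], joint[512:]           # split and decode separately
```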
Energy Consumption Concerns
Despite these advances, video generation remains significantly more energy-intensive than generating text or images. The workload stems from running the step-by-step denoising process over long sequences of frames (and now audio), on top of compressing and decompressing large amounts of data along the way.
The Future: Blurring Lines Between Diffusion and Transformers
While diffusion models currently dominate image, audio, and video generation, large language models rely primarily on transformers for text generation. However, researchers are exploring hybrid techniques, such as Google DeepMind’s experimental LLM that uses diffusion models to generate text, potentially offering higher efficiency.
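The experimental model's details haven't been published, but the general idea behind diffusion-style text generation can be sketched as filling in a masked sequence over a few parallel passes rather than writing one word at a time. The toy below picks words at random purely to show the control flow; a real model would predict them with a trained network:

```python
# Toy illustration of iterative "unmasking" text generation (the words are
# chosen at random here; a real model would predict them jointly).
import random

MASK = "[MASK]"

def diffusion_decode(length=8, steps=4, vocab=("the", "cat", "sat", "on", "a", "mat")):
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        k = max(1, len(masked) // (steps - step))    # unmask a fraction per pass
        for i in random.sample(masked, min(k, len(masked))):
            seq[i] = random.choice(vocab)
    return " ".join(seq)

print(diffusion_decode())
```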
Conclusion
AI video generation now incorporates intricate combinations of diffusion models, latent space compressions, and transformer architectures to create seamless, high-quality videos with synchronized audio. As this technology reaches more users and powers new applications—from entertainment to communication—its capabilities and impact will only grow.
While challenges like energy consumption and misinformation risks remain, the ongoing innovations promise a fascinating future where AI-generated videos become an integral part of digital creativity.
For more insights and explanations on emerging technologies, visit MIT Technology Review’s Explains series.