Video generative models have made tremendous strides recently. While advances in language modeling are impressive in their own right, generating realistic video poses a distinct challenge: human brains have evolved over millions of years to instinctively detect even the slightest visual inconsistencies, which makes convincing video generation a remarkably complex task. In a previous article, we discussed HunyuanVideo, a leading open-source video generation model that has caught up to impressive closed-source models like Sora and Veo 2.
Beyond the typical use case of entertainment, video generation models are drawing research interest for predicting protein-folding dynamics and for modeling real-world environments for embodied intelligence (e.g., robotics, self-driving cars). Advances in video generation could be instrumental for scientific research and for our ability to develop complex physical systems.
On February 26th, 2025, Wan 2.1, a collection of open-source video foundation models, was released. The series consists of four models in two categories: text-to-video (T2V-14B and T2V-1.3B) and image-to-video (I2V-14B-720P and I2V-14B-480P), ranging in size from 1.3 billion to 14 billion parameters. The larger 14B models particularly excel in scenarios requiring high motion, producing 720p videos with realistic physics. Meanwhile, the smaller 1.3B model offers an excellent compromise between quality and efficiency, allowing users to generate 480p videos on standard hardware in approximately four minutes.
On February 27th, 2025, Wan2.1 was integrated into ComfyUI, an open source node-based interface for creating images, videos, and audio with GenAI, and on March 3rd, 2025, Wan2.1’s T2V and I2V were integrated into Diffusers, a popular Python library developed by Hugging Face that provides tools and implementations for diffusion models.
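If you prefer the Diffusers route over ComfyUI, the sketch below shows roughly what text-to-video generation with the 1.3B checkpoint looks like. The class names (`WanPipeline`, `AutoencoderKLWan`) and the `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` repository ID follow the Diffusers integration as we understand it; treat this as a rough sketch and confirm the details against the current Diffusers documentation.

```python
# Rough sketch of the Diffusers route (assumes diffusers >= 0.33, torch, and a CUDA GPU).
# Class and repository names follow the Wan 2.1 Diffusers integration; verify them
# against the current Diffusers docs before running.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"

# The Wan VAE is typically kept in float32 for numerical stability, while the DiT runs in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

frames = pipe(
    prompt="A cat walking on grass, realistic style",
    negative_prompt="blurry, distorted, low quality",
    height=480,
    width=832,
    num_frames=81,        # roughly five seconds of video at ~15 fps
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "wan_t2v_output.mp4", fps=15)
```

The generation parameters here (`num_frames`, `guidance_scale`, resolution) are illustrative defaults; the 14B and I2V variants follow the same pattern with different checkpoints and pipelines.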
In this figure, we can see that with fewer parameters, Wan-VAE achieves higher efficiency (frames processed per unit of latency) and a peak signal-to-noise ratio (PSNR) comparable to that of HunyuanVideo.
There are two parts to this tutorial: (1) an overview of the model architecture and training methodology, and (2) an implementation where we run the model. Note that the overview may receive an update once Wan 2.1’s full technical report is released. For the first part, an understanding of deep learning fundamentals is needed to follow the theory, and prior exposure to concepts such as autoencoders, diffusion transformers, and flow matching will help. To complete the second part, a GPU is required; if you don’t have access to one, consider signing up for a DigitalOcean account to utilize a GPU Droplet. Feel free to skip the overview section if you’re only interested in implementing Wan 2.1.
An Autoencoder is a neural network designed to replicate its input as its output. For instance, an autoencoder can convert a handwritten digit image into a compact, lower-dimensional representation known as a latent representation, then reconstruct the original image. Through this process, it learns to compress data efficiently while minimizing errors in reconstructing an image. Variational Autoencoders (VAEs), on the other hand, encode data into a continuous, probabilistic latent space rather than a fixed, discrete representation as with traditional autoencoders. This allows for the generation of new, diverse data samples and smooth interpolation between them, critical for tasks like image and video generation.
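To make the difference concrete, here is a minimal, illustrative PyTorch sketch of a VAE for flattened images. The layer sizes and names are invented for illustration and have nothing to do with Wan-VAE itself.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE: encode x to a Gaussian latent, sample with the
    reparameterization trick, then decode back to the input space."""
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)       # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim)   # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

# Quick shape check on a batch of 8 flattened 28x28 "images".
recon, mu, logvar = TinyVAE()(torch.rand(8, 784))

# Training would minimize reconstruction error plus a KL term that keeps q(z|x)
# close to a standard normal prior, which is what makes the latent space
# continuous and easy to sample from.
```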
Causal convolutions are a type of convolution specifically designed for temporal data, ensuring that the model’s predictions at any given timestep t are only dependent on past timesteps (t-1, t-2, …) and not on any future timesteps (t+1, t+2, …).
(Source) Standard (left) vs. Causal (right) Convolutions
Causal convolutions can be applied in different numbers of dimensions depending on the data type: 1D for sequential data such as audio, and 3D for video, where the causality constraint runs along the temporal axis while the spatial dimensions are convolved normally.
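As a concrete illustration of the 1D case (the same idea extends along the time axis of a 3D convolution), causality can be enforced by padding only on the “past” side of the sequence. This is a generic sketch, not Wan 2.1’s actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only looks at current and past timesteps."""
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        # Pad (kernel_size - 1) * dilation steps on the left (the past) only,
        # so output[t] never depends on input[t+1], input[t+2], ...
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                 # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))  # pad the time axis on the left only
        return self.conv(x)

x = torch.randn(1, 8, 17)                 # e.g., 17 frames' worth of features
y = CausalConv1d(channels=8, kernel_size=3)(x)
print(y.shape)                             # torch.Size([1, 8, 17]), same length as the input
```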
A 3D Causal Variational Autoencoder, as implemented with Wan 2.1, is an advanced type of VAE that incorporates 3D causal convolutions, allowing it to handle both spatial and temporal dimensions in video sequences.
This novel 3D causal VAE architecture, termed Wan-VAE, can efficiently encode and decode 1080P videos of unlimited length while preserving historical temporal information, making it suitable for video generation tasks.
Processing long videos in a single pass can overflow GPU memory due to high-resolution frame data and temporal dependencies. The causal convolution module therefore uses a feature cache mechanism that provides historical context without keeping the full video in memory. Video frames are structured in a “1 + T” input format (1 initial frame + T subsequent frames), dividing the video into “1 + T/4” chunks.
For example: A 17-frame video (T=16) becomes 1 + 16/4 = 5 chunks.
Each encoding/decoding operation processes a single video chunk at a time, which corresponds to a single latent representation. To reduce the risk of GPU memory overflow, the number of frames in each processing chunk is limited to a maximum of 4. This frame limit is determined by the temporal compression ratio, which measures the compression of the time dimension.
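Assuming the chunking really is the simple “1 + T/4” split described above, the bookkeeping looks like this (a toy illustration, not Wan-VAE’s actual cache code):

```python
def num_chunks(num_frames: int, temporal_stride: int = 4) -> int:
    """Split a (1 + T)-frame video into 1 + T / temporal_stride chunks,
    where the first frame is encoded on its own and every later chunk
    holds at most `temporal_stride` frames."""
    t = num_frames - 1                      # frames after the initial one
    assert t % temporal_stride == 0, "T must be divisible by the temporal stride"
    return 1 + t // temporal_stride

print(num_chunks(17))   # 1 + 16 / 4 = 5 chunks, matching the example above
print(num_chunks(81))   # 1 + 80 / 4 = 21 chunks (and thus 21 latent representations)
```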
The T2V models generate videos from text prompts.
| Component | Description |
|---|---|
| Diffusion Transformers (DiT) + Flow Matching | Wan 2.1 is built on mainstream Diffusion Transformers, combined with the Flow Matching framework. Diffusion Transformers (DiTs) are transformer architectures applied to diffusion models, which generate data by learning to remove noise that has been added to training data. In Flow Matching, a neural network is trained directly to predict the smooth, continuous transformation between a simple noise distribution and the data distribution, which yields more stable training, faster inference, and better performance than conventional diffusion-based approaches (see the sketch after this table). |
| T5 Encoder and Cross-Attention for Text Processing | Wan 2.1 employs a T5 (Text-To-Text Transfer Transformer) encoder, UMT5, to embed text prompts for the vision model. Each transformer block uses a cross-attention mechanism over these text embeddings, which helps the model process multilingual inputs (English and Chinese) and align text prompts with visual outputs. |
| Time Embeddings | Time embeddings encode the diffusion timestep as numerical vectors. They are processed by a shared MLP (a Linear layer followed by a SiLU activation) used across all transformer blocks, while each block learns its own set of biases. Sharing the MLP reduces the parameter count and boosts efficiency, and the per-block biases still allow specialized processing, so each block can focus on different aspects of the input without a large parameter cost. |
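To give a feel for the flow-matching objective mentioned above, the toy training step below regresses a small network’s predicted velocity onto the straight-line path between noise and data. This is a generic rectified-flow-style sketch on plain vectors, not Wan 2.1’s actual DiT training code.

```python
import torch
import torch.nn as nn

# Toy velocity-prediction network; Wan 2.1 uses a Diffusion Transformer here.
model = nn.Sequential(nn.Linear(64 + 1, 256), nn.SiLU(), nn.Linear(256, 64))

def flow_matching_loss(x1):
    """One flow-matching step on a batch of clean samples x1 of shape (batch, 64)."""
    x0 = torch.randn_like(x1)                  # sample from the simple noise distribution
    t = torch.rand(x1.shape[0], 1)             # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                 # point on the straight path from noise to data
    target_velocity = x1 - x0                  # velocity of that path, d(xt)/dt
    pred = model(torch.cat([xt, t], dim=-1))   # predict the velocity at (xt, t)
    return ((pred - target_velocity) ** 2).mean()

loss = flow_matching_loss(torch.randn(32, 64))
loss.backward()   # at inference, integrating the learned velocity field turns noise into data
```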
The I2V models generate videos from images using text prompts.
| Component | Description |
|---|---|
| Condition Image | Video synthesis is controlled with a condition image that serves as the first frame. |
| Guidance Frames | Frames filled with zeros (guidance frames) are concatenated with the condition image along the temporal axis. |
| Condition Latent Representation | A 3D VAE compresses the condition image together with the guidance frames into a condition latent representation. |
| Binary Mask | A binary mask is added (1 for preserved frames, 0 for frames to generate). This mask is spatially aligned with the condition latent representation and extends temporally to match the target video’s length. |
| Mask Rearrangement | The binary mask is then reshaped to align with the VAE’s temporal stride, ensuring seamless integration with the latent representation. |
| DiT Model Input | The noise latent representation, condition latent representation, and rearranged binary mask are concatenated along the channel axis and fed into the DiT model (see the sketch after this table). |
| Channel Projection | Because the channel count is larger than in the T2V models, a supplementary projection layer, initialized with zeros, adapts the input for the I2V DiT model. |
| CLIP Image Encoder | A CLIP image encoder extracts feature representations from the condition image, capturing its visual essence. |
| Global Context MLP | The extracted features are projected by a three-layer MLP, generating a global context that encapsulates the image’s overall information. |
| Decoupled Cross-Attention | This global context is injected into the DiT model via decoupled cross-attention, allowing the model to leverage the condition image’s features throughout the video generation process. |
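The shape bookkeeping for the DiT input can be sketched as follows; all tensor sizes are placeholders chosen for illustration rather than Wan 2.1’s real latent dimensions.

```python
import torch

# Placeholder sizes: batch, latent channels, latent frames, latent height/width.
B, C, T, H, W = 1, 16, 21, 60, 104

noise_latent = torch.randn(B, C, T, H, W)      # what the DiT denoises
condition_latent = torch.randn(B, C, T, H, W)  # VAE encoding of the condition image + zero frames
mask = torch.zeros(B, 4, T, H, W)              # binary mask rearranged to the VAE's temporal stride
mask[:, :, 0] = 1.0                            # first (preserved) frame marked with ones

# I2V conditioning: concatenate everything along the channel axis before the DiT.
dit_input = torch.cat([noise_latent, condition_latent, mask], dim=1)
print(dit_input.shape)   # torch.Size([1, 36, 21, 60, 104]) -> hence the extra zero-initialized projection
```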
Wan 2.1 offers flexible implementation options. In this tutorial, we’ll utilize ComfyUI to showcase a seamless way to run the Wan 2.1 I2V model. Before following along with this tutorial, set up a GPU Droplet and find a picture you’d like to convert into a video.
For optimal performance, we recommend selecting an “AI/ML Ready” OS and utilizing a single NVIDIA H100 GPU for this project.
apt install python3-pip
pip install comfy-cli
comfy install
Select nvidia when prompted “What GPU do you have?”
cd comfy/ComfyUI/models
wget -P diffusion_models https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/diffusion_models/wan2.1_i2v_480p_14B_fp8_e4m3fn.safetensors
wget -P text_encoders https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors
wget -P clip_vision https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/clip_vision/clip_vision_h.safetensors
wget -P vae https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors
comfy launch
You’ll see a URL in the console output. You’ll need this URL in a later step to access the GUI.
In VSCode, click on “Connect to…” in the Start menu.
Choose “Connect to Host…”.
Click “Add New SSH Host…” and enter the SSH command to connect to your droplet. This command is usually in the format ssh root@[your_droplet_ip_address]. Press Enter to confirm, and a new VSCode window will open, connected to your droplet.
You can find your droplet’s IP address on the GPU droplet page.
In the new VSCode window connected to your droplet, type >sim and select “Simple Browser: Show”.
Copy the ComfyUI GUI URL from your web console (from Step 3) and paste it into the Simple Browser.
Press Enter, and the ComfyUI GUI will open.
Click the Manager button in the top right corner. In the menu that pops up, click Update ComfyUI.
You’ll be prompted to restart ComfyUI. Click “Restart” and refresh your browser if needed.
Download the workflow of your choice in Json format (here, we’re using the I2V workflow).
If you’re working with a workflow that requires additional nodes, you might encounter a “Missing Node Types” error. Go to “Manager” > “Install missing custom nodes” and install the latest versions of the required nodes.
You’ll be prompted to restart ComfyUI. Click “Restart” and refresh your browser if needed.
**Positive Prompt vs. Negative Prompt**: A “positive prompt” tells a model what to include in its generated output, essentially guiding it towards a specific desired element, while a “negative prompt” instructs the model what to exclude or avoid, acting as a filter to refine the content by removing unwanted aspects.
We will be using the following prompts to get our character to wave:

- Positive prompt: “A portrait of a seated man, his gaze engaging the viewer with a gentle smile. One hand rests on a wide-brimmed hat in his lap, while the other lifts in a gesture of greeting.”
- Negative prompt: “No blurry face, no distorted hands, no extra limbs, no missing limbs, no floating hat”
To run the workflow, select Queue. If you run into errors, ensure the correct files are passed into the nodes.
Would you look at that - we got our character to wave at us.
Feel free to play around with the different parameters to see how performance is altered.
Great work! In this tutorial, we explored the model architecture of Wan 2.1, a cutting-edge collection of video generative models. We also successfully implemented Wan 2.1’s image-to-video model using ComfyUI. This achievement underscores the rapid advancement of open-source video generation models, foreshadowing a future where AI-generated video becomes an integral tool in various industries, including media production, scientific research, and digital prototyping.
https://github.com/Wan-Video/Wan2.1
https://stable-diffusion-art.com/wan-2-1/