The race to create the greatest image generation model continues onward, and it only grows more heated. This year, we have seen FLUX rise to displace Stable Diffusion XL's near-total dominance of the open-source community, seen Ideogram and Recraft introduce next-generation models on the closed-source side that blow expectations out of the water, and seen numerous smaller projects break the mold in their own ways across a variety of sub-tasks.
In this article, we want to introduce you to one of those mold breakers that has caught our attention: NVIDIA Sana. This incredibly fast model, while very newly released, offers a plethora of important traits we believe will become standard in subsequent SOTA model releases.
Follow along with us in this article for a detailed explanation of what makes Sana powerful, capable, and different from other current popular releases, and see why it might be the right model for your image generation workflow. Afterwards, we will show in detail how to run the Sana models on a DigitalOcean cloud GPU Droplet.
To begin, we need to articulate how Sana is different from its predecessors.
In practice, Sana is a text-to-image diffusion model capable of creating images at high resolutions (4096x4096) at lightning fast speeds. These speeds and high resolutions are made possible by several new developments the NVIDIA team has made to improve on the original Latent Diffusion Model designs. These include, but are not limited to:
First, Sana uses a unique deep compression autoencoder design that compresses its images up to 32x during processing, compared to 8x in traditional autoencoders. This reduces the number of latent tokens that need to be processed during generation while preserving the image's features to a surprisingly high degree.
Second, they replaced all of the standard attention in the Diffusion Transformer (DiT) with a linear attention mechanism. In practice, this reduced the complexity of the attention mechanism from O(N^2) to O(N) while achieving results comparable to typical attention for higher-resolution generations.
Third, they replaced the T5 text encoder with a smaller model, Gemma. This allows complex, human-like inputs to be processed by the model more easily and accurately.
Finally, the model was trained using a novel, efficient paradigm built around their Flow-DPM-Solver to reduce sampling steps. They argue that this allows Sana to compete with FLUX v1 models at 1/20th their size.
In the following sections, we will elaborate on how these differentiating capabilities are made possible through Sana’s novel architectural pipeline.
In this section, let’s take a deeper look at some of those features we listed above.
Unlike previous designs, the architecture of the autoencoder for the Sana model uses an aggressive 32x compression technique. Furthermore, the authors found that the autoencoder should take full responsibility for compression, allowing the latent diffusion model to focus solely on denoising. This design effectively reduces the required tokens by 4x and decreases the associated GPU memory costs during training. In practice, this methodology allows them to close the gap with the autoencoders of powerful models like SDXL at a fraction of the cost, for both training and inference.
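To make that token reduction concrete, here is a back-of-the-envelope sketch. It assumes the common DiT patch size of 2 for a traditional 8x autoencoder and a patch size of 1 for Sana's 32x autoencoder; those patch sizes are our reading of the design, not official figures.

# Back-of-the-envelope latent token counts for a 1024x1024 image.
# Assumption (not an official spec): the 8x autoencoder baseline is paired with a
# DiT patch size of 2, while Sana's 32x autoencoder uses a patch size of 1.
def latent_tokens(image_size: int, ae_compression: int, patch_size: int) -> int:
    """Number of tokens the diffusion transformer has to process."""
    latent_side = image_size // ae_compression   # spatial side of the latent map
    return (latent_side // patch_size) ** 2      # tokens after patchification

baseline = latent_tokens(1024, ae_compression=8, patch_size=2)    # 4096 tokens
sana = latent_tokens(1024, ae_compression=32, patch_size=1)       # 1024 tokens
print(f"{baseline} vs {sana} tokens -> {baseline // sana}x fewer")  # 4x fewer

Fewer tokens mean less attention work per denoising step, which is where much of Sana's speed advantage originates.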
After compression by the autoencoder, the image data in the pipeline is processed by the Diffusion Transformer block. In Sana, unlike other models, this block uses a linear attention mechanism, which achieves higher computational efficiency in high-resolution generation without affecting performance. Additionally, it uses a Mix-FFN (Feed-Forward Network), which includes a 3x3 depthwise convolution for better token information aggregation.
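To give some intuition for why linear attention scales better, here is a minimal, single-head sketch of kernel-based linear attention in PyTorch. It is a toy illustration of the general idea (using a ReLU feature map), not Sana's exact implementation.

import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, dim). Applying a non-negative feature map to q and k
    # lets us aggregate keys and values once, so the cost grows linearly with seq_len
    # instead of quadratically as in standard softmax attention.
    q, k = F.relu(q), F.relu(k)
    kv = torch.einsum("bnd,bne->bde", k, v)                        # (batch, dim, dim)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # per-token normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = k = v = torch.randn(1, 1024, 64)
out = linear_attention(q, k, v)  # shape (1, 1024, 64); doubling seq_len roughly doubles the cost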
At this stage, the authors argue that T5 (first proposed in 2019) is insufficiently powerful for modern, SOTA understanding of human language. To ameliorate this, they integrate the Gemma encoder from Google to better “follow complex human instructions by using Chain-of-Thought (CoT) and In-context learning (ICL)” (source). In practice, this allows more human-like speech inputs to be interpreted by the model. Rather than saying “cat with sign drawing”, we could say “draw me a cat holding a sign with its paws” to add style and detail to the image.
To get started with Sana, we need sufficient GPU compute. We highly recommend using the DigitalOcean Cloud GPU Droplets, if you do not already have access to a GPU in your local environment. For more details on setting up and getting started with GPU Droplets, please check out this tutorial before proceeding.
Once your GPU Droplet has spun up, proceed to the next section using the console.
To facilitate installation, we recommend installing Miniconda onto the machine. This process takes only a couple minutes, and will allow the Sana environment to automatically be configured later. Paste the following into your terminal.
cd ../home
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Follow the instructions to install Miniconda on the machine, which will facilitate the rest of our environment setup. Select yes when prompted to complete the installation.
Next, we are going to set up the Sana environment on our GPU Droplet. To do this, paste the following code into your terminal window.
git clone https://github.com/NVlabs/Sana.git
cd Sana
./environment_setup.sh sana
conda activate sana
Using Conda, the Sana environment will now be automatically set up and installed. Once this process is complete, we can run Sana through whichever pipeline we choose: the official Gradio Sana demo, the sana_pipeline from the developers, ComfyUI, and more. In this tutorial, we will cover the first two options in detail and then briefly show how to run the ComfyUI workflow.
The fastest way we found, through trial and error, to generate images with Sana is their official Gradio demo application. Additionally, its no-code interface makes this a favorite way to deploy the model for anyone without programming experience.
Launching the web UI requires logging in to the Hugging Face Hub to access the Gemma model download. You can log in by running huggingface-cli login in the terminal and pasting in a read-only Hugging Face access token; follow the instructions on Hugging Face to create one if needed.
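Alternatively, you can authenticate from Python with the huggingface_hub library's login helper (a minimal sketch; the token string below is a placeholder, not a real token):

from huggingface_hub import login

# Paste your own read-only Hugging Face token here (placeholder value shown).
login(token="hf_your_read_only_token")

Once you are logged in, paste the following into the terminal to launch the web UI: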
DEMO_PORT=15432 \
python3 app/app_sana.py \
--share \
--config=configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
--model_path=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth
Once your downloads are complete and the web application has spun up, we can access it using the shared link.
Here we can enter our prompt to generate images. We recommend testing the different advanced options using the toggle at the bottom of the page. There, you will find sliders for values like the height and width of the outputs, the seed, the guidance scales, and the number of generated images. We found that at the maximum settings, 4 images at 4096x4096, we were able to generate the images in around 10 seconds; that translates roughly to speeds as high as 1.634 s/image on a single GPU Droplet. This is an enormous speed improvement over FLUX and Stable Diffusion models, which could take upwards of 5 minutes for similar synthesis tasks with lower-quality results.
The next way to run Sana on a cloud GPU Droplet is through a Jupyter Notebook. To continue, we first need to install Jupyter on our machine. Alternatively, we can connect from Visual Studio Code on our local machine via SSH.
pip3 install jupyterlab
jupyter lab --allow-root
From here, we can create a new IPython Notebook file. Open it, and paste the following into the first cell.
import torch
from app.sana_pipeline import SanaPipeline
from torchvision.utils import save_image

# Run on the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the 1024px Sana config and pull the checkpoint weights from the Hugging Face Hub
sana = SanaPipeline("configs/sana_config/1024ms/Sana_1600M_img1024.yaml")
sana.from_pretrained("hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth")
This will initialize the model pipeline for us, which should take anywhere from a few moments to around 5 minutes depending on whether the models are already in the Hugging Face cache.
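If you would like to skip that wait inside the notebook, you can pre-download the checkpoint into the cache ahead of time with huggingface_hub (a sketch, assuming the repository and file path implied by the hf:// URL above):

from huggingface_hub import hf_hub_download

# Pre-fetch the Sana checkpoint into the local Hugging Face cache so the
# pipeline above can load it without a download step.
path = hf_hub_download(
    repo_id="Efficient-Large-Model/Sana_1600M_1024px",
    filename="checkpoints/Sana_1600M_1024px.pth",
)
print(path)  # location of the cached .pth file

Next, paste the following code into a new cell to generate an image.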
import random

# Pick a random seed; reusing the same val later re-creates the same image
val = random.randint(0, 100000000)
generator = torch.Generator(device=device).manual_seed(val)

prompt = 'a cyberpunk cat with a neon sign that says "Sana"'

image = sana(
    prompt=prompt,
    height=4096,
    width=4096,
    guidance_scale=5.0,
    pag_guidance_scale=2.0,
    num_inference_steps=40,
    generator=generator,
)

# The pipeline outputs values in the [-1, 1] range, so normalize when saving to PNG
save_image(image, 'sana.png', nrow=1, normalize=True, value_range=(-1, 1))
If everything runs correctly, it will generate the following image:
Like with the web application, we can change these values to control the output. Along with the prompt, height, and width, try changing the val value that seeds the generator: reusing the same value re-creates an image exactly, while a new value produces a fresh variation. We can also change the guidance_scale and pag_guidance_scale (Perturbed-Attention Guidance) values to adjust how strongly the prompt steers the final output.
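For example, to compare guidance strengths while keeping everything else fixed, reuse the same seed and vary only the guidance value. The sketch below builds on the sana pipeline, prompt, and val defined above, and uses a smaller resolution and step count just to keep the sweep quick:

# Re-seed with the same val before each run so only the guidance changes
for cfg in (3.0, 5.0, 7.0):
    generator = torch.Generator(device=device).manual_seed(val)
    image = sana(
        prompt=prompt,
        height=1024,
        width=1024,
        guidance_scale=cfg,
        pag_guidance_scale=2.0,
        num_inference_steps=20,
        generator=generator,
    )
    save_image(image, f'sana_cfg_{cfg}.png', nrow=1, normalize=True, value_range=(-1, 1))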
Finally, we will show how to run the Sana model with the increasingly popular and ubiquitous ComfyUI. For more detail, check out these articles to learn how to run ComfyUI from scratch with FLUX and Stable Diffusion 3.5 Large.
For Sana, running it in the ComfyUI is actually something the Comfy Devs have automated. We simply need to follow their guide provided here. This process can be initiated with the following code:
cd /home
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
git clone https://github.com/Efficient-Large-Model/ComfyUI_ExtraModels.git custom_nodes/ComfyUI_ExtraModels
pip3 install -r requirements.txt
python3 main.py
This will download all of the relevant files and then launch the Web UI. To access it, we need to use an SSH tunnel to VS Code as shown in this article. Follow the steps shown in the article, and paste the generated URL (http://127.0.0.1:8188) into the Simple Browser input. This will open the ComfyUI in our local browser. Next, download the JSON file here, and load it into the ComfyUI. If successful, everything should look like the image shown below:
From here, we can begin generating! This is probably the slowest generation method we have found in our experiments so far, but it is familiar to many users. We can expect to see rapid improvements on this process as the open source community adopts Sana in coming weeks.
In conclusion, Sana is a very interesting project that challenges current SOTA models by achieving lower generation latency at higher resolutions. If fine-tuned Sana models are adopted by the wider Stable Diffusion community, it could present a real challenge to existing models thanks to its incredible speed. Thanks to NVIDIA for open-sourcing this amazing work!
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.