
Stable Diffusion 3.5 Large with DigitalOcean GPU Droplets

Published on October 25, 2024

Technical Evangelist // AI Arcanist


The release of Stable Diffusion 3.5 Large has already made massive waves around the image generation community. Delivering performance comparable to top models like FLUX.1 Dev, MidJourney v6, and Ideogram v2, SD3.5 Large offers some of the same powerful prompt understanding, versatile styling, and spelling capability that the best closed-source models can provide. After FLUX’s recent dominance, this represents an impressive return to form for StabilityAI.

In this article, we will show how to run Stable Diffusion 3.5 Large on a DigitalOcean GPU Droplet. We will start with a quick breakdown of what is new in Stable Diffusion 3.5 Large, and then walk through a full demo, first with Diffusers code and then with ComfyUI. Readers can expect to leave with a full understanding of how to run the new model and generate images of any kind with GPU Droplets.

For more information about GPU Droplets, please visit the landing page and check out our breakdown of what makes our GPUs so powerful.

Prerequisites

  • Python: The content of this article is highly technical. We recommend this piece to readers experienced with both Python and basic concepts in Deep Learning. For new users, this beginner tutorial may be a good place to start.
  • Cloud GPU: Running Stable Diffusion 3.5 Large requires a sufficiently powerful GPU. We recommend a machine with at least 40 GB of VRAM.

What’s new in Stable Diffusion 3.5 Large?

To start, let’s break down what has been introduced in this latest release of Stable Diffusion.

Since its initial public release, v1-4, we have now seen several generations of the model. v1-5 was the first SOTA open-source image generation model and popularized the technology, the v2 models increased the output resolution to 768x768 pixels, and XL scaled up the UNet by 3x and integrated an additional text encoder (OpenCLIP ViT-bigG/14) to massively improve prompt adherence.

Now, with Stable Diffusion 3.5 Large, the developers have taken things even further. Namely, they advertise:

  • Greater prompt adherence: exceptional ability to understand the textual meaning of the prompt and translate it into carefully connected visual features
  • Spelling capability: SD 3.5 models are capable of spelling words in different fonts with natural styling
  • Diverse outputs & versatile styles: compared to other best-in-class models, SD 3.5 Large outputs are far more likely to render diverse faces, objects, and structures. The model is also capable of numerous artistic and visual styles, something we found missing with FLUX

Multimodal Diffusion Transformer (MM-DiT)

So how does Stable Diffusion 3.5 Large achieve this? No paper has been released yet, but analysis of the HuggingFace page’s graphic has allowed us to glean a few additional insights.

First, we can infer that a number of these improvements come from the triple text encoder setup, which combines CLIP-L, CLIP-G, and T5 text encoders. This ensemble methodology allows for a better unified understanding of the prompt in the latent space. As in other diffusion models, the embeddings from the text encoders are then used as conditioning input, along with an initially empty (noised) latent image.
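You can see this triple encoder setup for yourself in the Diffusers pipeline, which exposes each encoder as a separate component. Here is a minimal sketch, using the same pipeline load shown later in this article:

import torch
from diffusers import StableDiffusion3Pipeline

# Load the pipeline; SD3.5 ships with three separate text encoders.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
)
print(type(pipe.text_encoder).__name__)    # CLIP-L encoder
print(type(pipe.text_encoder_2).__name__)  # CLIP-G encoder
print(type(pipe.text_encoder_3).__name__)  # T5 encoder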

The next major innovation seems to come from the development of a novel MM-DiT block. First introduced for Stable Diffusion 3, it uses separate weights for the two modalities. This effectively means there are two independent transformers, one for each modality, and they are joined by the attention mechanism. This allows each representation to be calculated in its own space while mutually influencing the other, so information can “flow” between text and image tokens to improve the overall comprehension and typography of the results (Source). Much of the architecture for SD3.5 appears to be the same as the original SD3 model.
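To make the idea concrete, here is a minimal, illustrative PyTorch sketch of MM-DiT-style joint attention. This is not the actual SD3.5 implementation; it only demonstrates the core pattern of per-modality weights joined by a single attention operation over the concatenated token sequence:

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionSketch(nn.Module):
    """Each modality has its own QKV and output weights, but attention
    runs over the concatenated text+image tokens, letting information
    flow between the two modalities."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.qkv_text = nn.Linear(dim, 3 * dim)   # text-only weights
        self.qkv_image = nn.Linear(dim, 3 * dim)  # image-only weights
        self.out_text = nn.Linear(dim, dim)
        self.out_image = nn.Linear(dim, dim)

    def forward(self, text, image):
        B, Lt, D = text.shape
        # Project each modality with its own weights, then concatenate.
        qkv = torch.cat([self.qkv_text(text), self.qkv_image(image)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)
        split = lambda x: x.view(B, -1, self.n_heads, D // self.n_heads).transpose(1, 2)
        # One shared attention over all tokens joins the two streams.
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(B, -1, D)
        # Route each modality back through its own output projection.
        return self.out_text(out[:, :Lt]), self.out_image(out[:, Lt:])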

Finally, based on the obvious improvements from SD3 to SD3.5, we can also infer that significant work has been done to further train the model for longer and on a wider corpus of data. This is implied by the great versatility it has when composing images of diverse styles.

Overall, Stable Diffusion 3.5 Large is a very powerful model that has stepped up to meet the standards set by the competition. Read on further to learn how to generate your own images with Stable Diffusion 3.5 Large in a GPU Droplet.

How to open a DigitalOcean GPU Droplet & setup the environment

To set up your environment for Stable Diffusion 3.5 Large, we are going to need sufficient compute resources. We recommend an NVIDIA H100 GPU, or at the very least an A100 or A6000. We recommend accessing these machines through the cloud using a remote provider like DigitalOcean.

If you are creating a DigitalOcean GPU Droplet, this tutorial on setting up your GPU Droplet environment has a full breakdown on setting up the Droplet, accessing your Droplet from your local machine using SSH, and spinning up an accessible Jupyter Notebook with Visual Studio Code and your browser.

Running Stable Diffusion 3.5 Large Diffusers code in a Jupyter Notebook

Once your Jupyter Lab notebook is open, we can begin generating! But first, we need to make sure the packages we need are installed and up to date. Note that SD3.5’s T5 text encoder additionally requires the transformers, sentencepiece, and protobuf packages. Paste the following code into the first code cell to install everything:

!pip install -U diffusers transformers accelerate sentencepiece protobuf

Diffusers is a powerful library provided by our friends at HuggingFace that makes using any diffusion model simple, and their commitment to making StabilityAI models usable has been a massive boon to the industry. We are going to use the following snippet of Diffusers code to generate an image of a woman wearing a shirt that says “I <3 DigitalOcean!”:

import torch
from diffusers import StableDiffusion3Pipeline

# Load the SD3.5 Large weights in bfloat16 and move the pipeline to the GPU.
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

# Generate one image; 28 steps and a guidance scale of 3.5 are the
# settings suggested on the model card.
image = pipe(
    'a woman wearing a shirt that says "I <3 DigitalOcean!"',
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("woman.png")

# Display the result inline in the notebook.
image
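If you run into out-of-memory errors on a smaller GPU, Diffusers provides CPU offloading. A minimal adjustment is to replace the pipe.to("cuda") line above with the following, which streams submodules to the GPU on demand at some cost to speed:

# Requires the accelerate package; trades speed for much lower peak VRAM.
pipe.enable_model_cpu_offload()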

Running Stable Diffusion 3.5 Large with ComfyUI

ComfyUI, which partners directly with StabilityAI, is the best way to run Stable Diffusion 3.5 Large for numerous reasons, the primary one being integration with other tools in a no-code environment. We have spoken at length about the effectiveness of the UI when we discussed using FLUX with the platform, and many of the same strengths hold true with SD 3.5 Large.

To get started, clone the repo onto your machine using the following command in your terminal:

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip3 install -r requirements.txt

This will also install any missing required packages. Next, we need to download our models. To access Stable Diffusion 3.5 Large, we need to accept the licensing agreement on its HuggingFace page. Once that’s complete, we can download the model files to the cache (note that the text encoders live in the repo’s text_encoders/ subfolder), and then copy them into the ComfyUI model directories using the following commands:

huggingface-cli download stabilityai/stable-diffusion-3.5-large sd3.5_large.safetensors text_encoders/clip_g.safetensors text_encoders/clip_l.safetensors text_encoders/t5xxl_fp16.safetensors
cp ~/.cache/huggingface/hub/models--stabilityai--stable-diffusion-3.5-large/snapshots/ceddf0a7fdf2064ea28e2213e3b84e4afa170a0f/sd3.5_large.safetensors ./models/checkpoints/
cp ~/.cache/huggingface/hub/models--stabilityai--stable-diffusion-3.5-large/snapshots/ceddf0a7fdf2064ea28e2213e3b84e4afa170a0f/text_encoders/clip_g.safetensors ./models/clip/
cp ~/.cache/huggingface/hub/models--stabilityai--stable-diffusion-3.5-large/snapshots/ceddf0a7fdf2064ea28e2213e3b84e4afa170a0f/text_encoders/clip_l.safetensors ./models/clip/
cp ~/.cache/huggingface/hub/models--stabilityai--stable-diffusion-3.5-large/snapshots/ceddf0a7fdf2064ea28e2213e3b84e4afa170a0f/text_encoders/t5xxl_fp16.safetensors ./models/clip/

Note: the ‘ceddf0a7fdf2064ea28e2213e3b84e4afa170a0f’ directory name is subject to change. You can get the correct value by hitting tab repeatedly after typing cp ~/.cache/huggingface/hub/models--stabilityai--stable-diffusion-3.5-large/snapshots/.
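Alternatively, if you would rather skip the cache paths entirely, the huggingface-cli download command accepts a --local-dir flag that places the files directly into a directory of your choosing. A sketch of that approach (the downloaded encoders keep their text_encoders/ sub-path, so we move them up a level afterwards):

huggingface-cli download stabilityai/stable-diffusion-3.5-large sd3.5_large.safetensors --local-dir ./models/checkpoints
huggingface-cli download stabilityai/stable-diffusion-3.5-large text_encoders/clip_g.safetensors text_encoders/clip_l.safetensors text_encoders/t5xxl_fp16.safetensors --local-dir ./models/clip
mv ./models/clip/text_encoders/* ./models/clip/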

Finally, we can launch the UI with the following command:

python3 main.py

This will output a local URL like http://127.0.0.1:8188. Copy that value, and open your Visual Studio Code window connected to the remote via SSH. Just like when opening a Jupyter Lab window, we can paste this value into the Simple Browser (accessible via the command palette with ctrl+shift+p or cmd+shift+p) to open ComfyUI in a browser tab while it remains connected to the GPU.
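If you are not using Visual Studio Code, a plain SSH tunnel from your local machine works just as well. Substituting your Droplet’s IP address for the placeholder below forwards the port so that ComfyUI is reachable at http://localhost:8188 locally:

ssh -L 8188:localhost:8188 root@<your-droplet-ip>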

With that completed, we are ready to begin generating! Download the following image file, click the load button on the far right of the screen, and load the image in. ComfyUI will read the workflow metadata embedded in the file and recreate the workflow used to generate the image shown below.

[Image: example generation produced by the loaded SD3.5 Large ComfyUI workflow]

We recommend trying all sorts of prompts to test out the versatility of the model. We were very impressed with our experiments! Here are some additional tips to help you get started:

  • Resolution & size: the model is incredibly versatile with regard to different resolutions, but we did find that its range is not as wide as the FLUX models’. Keep generations below 1600 pixels on any given axis to avoid distorted images
  • Negative prompting: long negative prompts tend to break generations, and negative prompts seem less effective than in previous releases. That said, they are far more effective than attempts to graft the same capability onto FLUX models. See the snippet after this list for passing these settings in Diffusers
  • Spelling: to get the model to spell words, put quotation marks around the desired text and include words like “spell” or “caption” in the prompt
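For readers following along in the notebook rather than ComfyUI, here is a minimal sketch of how the first two tips map onto the Diffusers call, reusing the pipe object loaded earlier (negative_prompt, height, and width are standard pipeline parameters):

# Reusing the `pipe` object from the Diffusers section above.
image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    negative_prompt="blurry, low quality",  # keep negative prompts short
    height=1216,  # stay below 1600 pixels per axis to avoid distortion
    width=832,
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("lighthouse.png")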

Overall, in our experience, this is the best way to run Stable Diffusion 3.5 Large with GPUs on the cloud!

Closing Thoughts

In conclusion, Stable Diffusion 3.5 Large is a true step forward for open-source text-to-image modeling. We are excited to see where the community takes its development in the coming months, and even more excited for the release of Stable Diffusion 3.5 Medium on October 29!
