The rise of text-to-image models marks a transformative shift in the field of artificial intelligence, unlocking new possibilities for creative expression and communication. These models leverage advanced deep learning techniques to generate realistic and contextually relevant images based on textual input. The integration of natural language processing and computer vision has paved the way for applications that can interpret and translate textual descriptions into visually compelling representations. As these models continue to evolve and improve, they hold the potential to revolutionize various industries, including design, entertainment, and education, by providing a seamless bridge between the world of language and imagery.
DeciDiffusion is an open-source, cutting-edge text-to-image latent diffusion model trained on a subset of the LAION dataset and fine-tuned on the LAION-ART dataset. With 1.02 billion parameters, it matches the image quality of the similarly sized 1.07-billion-parameter Stable Diffusion v1.5 (SD) while needing 40% fewer iterations, and it has proven to be 3x faster than SD v1.5 when run on NVIDIA A10G GPUs. This performance comes from an architecture designed for optimal efficiency with Deci's Neural Architecture Search (NAS) technology.
DeciDiffusion performance benchmarked against SD (Source)
DeciDiffusion’s improvements over SD go beyond raw benchmark numbers. Let us briefly discuss their implications.
Text-to-image generation models hold immense potential in design, art, advertising, content creation, and many other fields. The appeal of this technology lies in its ability to convert text into vivid images, a significant advance in AI capabilities. While SD's open-source release has spurred numerous innovations, its demanding computational requirements make it challenging to deploy in practice, though the rise of Turbo models and distillation may yet change that.
These challenges translate into noticeable latency and cost during both training and deployment. In contrast, DeciDiffusion stands out for its computational efficiency, delivering a smoother user experience and a reduction of nearly 66% in production costs. The result is a more accessible and feasible landscape for text-to-image applications built on DeciDiffusion compared to other latent diffusion models.
In this article, we will look at what makes DeciDiffusion so powerful and versatile, and then see it in action with a practical demonstration.
DeciDiffusion 1.0 is a text-to-image generation model that builds on Stable Diffusion's core architecture, replacing its U-Net with Deci's U-Net-NAS design. This substitution reduces the parameter count for greater computational efficiency while retaining Stable Diffusion's Variational Autoencoder (VAE) and CLIP text encoder.
Like Stable Diffusion, DeciDiffusion is a latent diffusion model; the difference is that its denoising network is based on U-Net-NAS. Latent diffusion models are probabilistic frameworks capable of producing high-quality images. They generate an image by gradually transforming random noise into a realistic result through an iterative diffusion process. Their distinctive feature is that this diffusion process operates on an encoded latent representation of the image rather than on raw pixel values.
Here are the main steps involved:

* A VAE encoder compresses images from pixel space into a lower-dimensional latent space (during generation, the process instead starts from random latent noise).
* A U-Net iteratively denoises the latents over many timesteps, conditioned on the prompt embeddings produced by CLIP's text encoder.
* The VAE decoder maps the final denoised latents back to pixel space, producing the output image.
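To make these steps concrete, here is a minimal sketch of the denoising loop a diffusers pipeline runs internally when you call it. It is simplified for illustration: classifier-free guidance is omitted, and the function name, latent resolution, and 30-step schedule are our own assumptions rather than DeciDiffusion specifics.

```python
import torch

@torch.no_grad()
def latent_diffusion_sketch(pipe, prompt, num_steps=30):
    # illustrative sketch only; real pipelines add guidance, safety checks, etc.
    device, dtype = pipe.device, pipe.unet.dtype

    # 1. Encode the text prompt into embeddings with the CLIP text encoder
    tokens = pipe.tokenizer(
        prompt, padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    ).input_ids.to(device)
    text_emb = pipe.text_encoder(tokens)[0]

    # 2. Start from pure random noise in *latent* space, not pixel space;
    #    64x64 latents decode to 512x512 pixels with SD-style VAEs
    latents = torch.randn(
        1, pipe.unet.config.in_channels, 64, 64, device=device, dtype=dtype
    ) * pipe.scheduler.init_noise_sigma

    # 3. Iteratively denoise the latents with the U-Net
    pipe.scheduler.set_timesteps(num_steps, device=device)
    for t in pipe.scheduler.timesteps:
        model_input = pipe.scheduler.scale_model_input(latents, t)
        noise_pred = pipe.unet(model_input, t, encoder_hidden_states=text_emb).sample
        latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

    # 4. Decode the final latents back to pixels with the VAE decoder
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    return image  # tensor in [-1, 1]; pipelines also post-process to PIL
```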
(a) U-Net architecture (b) The U-like backbone of the NAS-Unet architecture (Source)
This architecture defines two types of cell, DownSC and UpSC, arranged on a U-like backbone. DeciDiffusion's distinguishing feature is the flexible composition of each block: the number of ResNet and attention blocks in each is optimized for peak performance with minimal computation. By adopting this more parameter-efficient U-Net-NAS, DeciDiffusion reduces computational demands, making it a more resource-efficient alternative to Stable Diffusion.
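As a rough illustration of what "flexible composition" means, the toy configuration below expresses the idea of letting each level of the U-Net use a different number of ResNet and attention blocks; the class and field names are hypothetical and not Deci's actual API.

```python
from dataclasses import dataclass

@dataclass
class LevelConfig:
    resnet_blocks: int     # ResNet blocks at this U-Net level
    attention_blocks: int  # attention blocks at this U-Net level

# A standard SD v1.5 U-Net repeats a fixed block pattern at every level; a NAS
# search can instead allocate blocks unevenly, spending compute only where it
# helps image quality. The allocation below is purely illustrative:
flexible_unet = [
    LevelConfig(resnet_blocks=1, attention_blocks=0),  # cheap outer level
    LevelConfig(resnet_blocks=2, attention_blocks=1),
    LevelConfig(resnet_blocks=3, attention_blocks=2),  # heavier inner level
]
total = sum(l.resnet_blocks + l.attention_blocks for l in flexible_unet)
print(f"{total} blocks allocated across {len(flexible_unet)} levels")
```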
The model was trained in four phases:
| Requirements | Phase 1 | Phases 2-4 |
| --- | --- | --- |
| Hardware | 8 x 8 x A100 (80 GB) | 8 x 8 x H100 (80 GB) |
| Optimizer | AdamW | LAMB |
| Batch size | 8192 | 6144 |
| Learning rate | 1e-4 | 5e-3 |
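For reference, here is a sketch of how these optimizer settings might be instantiated in PyTorch. AdamW ships with PyTorch; LAMB does not, so the import below assumes the third-party torch_optimizer package, and `model` is a placeholder standing in for the U-Net being trained.

```python
import torch
import torch_optimizer  # assumption: third-party package providing LAMB

model = torch.nn.Linear(8, 8)  # placeholder; in practice this is the U-Net

# Phase 1: AdamW at lr = 1e-4 (built into PyTorch)
phase1_opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Phases 2-4: LAMB at lr = 5e-3, which scales better to very large batches
phase24_opt = torch_optimizer.Lamb(model.parameters(), lr=5e-3)
```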
The hardware requirements for training DeciDiffusion were quite high, demanding significant computational power for the intensive processing involved. For running the model yourself, we recommend using a GPU Droplet by DigitalOcean to streamline your workflow and get strong performance without investing in costly infrastructure.
Follow the steps below to use this model and produce some mind-blowing images!
* Install the necessary packages
```python
# install the packages using pip
!pip install --quiet git+https://github.com/huggingface/diffusers.git@d420d71398d9c5a8d9a5f95ba2bdb6fe3d8ae31f
!pip install --quiet ipython-autotime
!pip install --quiet transformers==4.34.1 accelerate==0.24.0 safetensors==0.4.0
!pip install --quiet ipyplot

# time each notebook cell automatically
%load_ext autotime
```
* Import the libraries and necessary packages
```python
# import the necessary libraries
from diffusers import StableDiffusionPipeline, DiffusionPipeline
import torch
import ipyplot
import time
```
* Load the pre-trained checkpoint “Deci/DeciDiffusion-v1-0” into a Stable Diffusion pipeline and run it with two prompts. The resulting images are stored in the img and img2 variables.
```python
# set the device and load the pre-trained model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
checkpoint = "Deci/DeciDiffusion-v1-0"

# build the DeciDiffusion pipeline, swapping in the flexible U-Net-NAS weights
pipeline = StableDiffusionPipeline.from_pretrained(checkpoint, custom_pipeline=checkpoint, torch_dtype=torch.float16)
pipeline.unet = pipeline.unet.from_pretrained(checkpoint, subfolder='flexible_unet', torch_dtype=torch.float16)
pipeline = pipeline.to(device)

# generate images by passing prompts
img = pipeline(prompt=['A photo of an astronaut riding a horse on Mars']).images[0]
img2 = pipeline(prompt=['A big owl with bright shining eyes']).images[0]
```
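The article does not show how the generation times below were collected; one simple approach, using the time and ipyplot modules imported earlier, is sketched here. The exact numbers will vary with your GPU and driver stack.

```python
# time a single generation (the first call after loading is slower due to warm-up)
start = time.perf_counter()
img = pipeline(prompt=['A photo of an astronaut riding a horse on Mars']).images[0]
print(f"Image generated in {time.perf_counter() - start:.2f} seconds")

# display the generated images side by side in the notebook
ipyplot.plot_images([img, img2], img_width=300)
```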
* Images produced by DeciDiffusion
Time taken by the SD model to generate the images:
Images generated by SD. Image creation times were 2.96, 2.93, 2.94, and 2.93 seconds.
Time taken by the DeciDiffusion model to generate the images:
Images Generated by DeciDiffusion. Image creation times were 1.11, 1.08, 1.09, and 1.08 seconds.
DeciDiffusion’s improved latency results from advancements in its architecture and from training techniques that enhance sample efficiency; integrating Infery, Deci's easy-to-use inference SDK, can push it further still. This combination yields significant cost savings during inference. First, it provides flexibility in hardware selection, enabling a move from high-end A100/H100 GPUs to the more budget-friendly A10G without sacrificing performance (though we still recommend an A100-80G or H100). Moreover, on the same hardware, DeciDiffusion proves highly cost-effective, cutting costs by 66% compared to Stable Diffusion for every 10,000 generated images.
DeciDiffusion represents a crucial advancement for generative AI applications. It not only speeds up real-time projects in content creation and advertising but also substantially reduces operational costs. In this article we compared DeciDiffusion with SD, and we can conclude that the model is faster and more efficient than SD both to train and to run at inference. It is worth mentioning, however, that the model is not intended to generate accurate or truthful representations of people or events, so employing it for such purposes goes beyond its designated capabilities. The model also shares the documented limitations of latent diffusion models of its generation. Here are a few of them:

* It cannot render legible text inside images.
* Faces and people in general may not be generated properly.
* It does not achieve perfect photorealism.
* It was trained primarily on English captions, so prompts in other languages work less well.
Nevertheless, the model has clear strengths, especially its computational efficiency and cost-effectiveness. Alongside this article we have provided two notebooks, one for DeciDiffusion and one for Stable Diffusion. We encourage you to work through these notebooks in conjunction with the article for an enriched experience.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.