Between Stable Diffusion 3.5 and FLUX.1, this year has seen another renaissance for text-to-image generation. These models have taken a step forward in prompt adherence, given open-source models the ability to spell, and continued to improve the aesthetic quality of their outputs. Nonetheless, the core mechanic behind these models has remained fundamentally the same: use a text prompt, together with either an empty latent image or an image primer, to generate a single image.
In this article, we want to shine a spotlight on an incredibly promising new architecture for text-to-image generation: OmniGen. Inspired by similar unification efforts in the Large Language Model research community, OmniGen is the first fully unified diffusion model framework, handling a wide range of downstream tasks such as image editing, subject-driven generation, and visual-conditional generation (Source).
Follow along for a breakdown of the architecture that makes OmniGen possible, an exploration of the model's capabilities, and a demonstration of how to run and test OmniGen using a DigitalOcean GPU Droplet.
OmniGen is composed of two parts: a Variational AutoEncoder (VAE) and a large pretrained Transformer. The VAE extracts continuous visual features from input images, while the Transformer generates images based on the input conditions. Specifically, OmniGen uses the VAE from Stable Diffusion XL, which is frozen during training, and a Transformer initialized from Microsoft’s Phi-3. This pairs the strength of the VAE with a Transformer that has inherited significant text-processing capability. Put together, this creates a simple but strong pipeline that removes the need for additional encoders and simplifies the model significantly: OmniGen encodes conditional information by itself. “Furthermore, OmniGen jointly models text and images within a single model, rather than independently modeling different input conditions with separate encoders as in existing works which lacks interaction between different modality conditions” (Source).
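To make the "no extra encoders" idea concrete, here is a minimal, illustrative sketch (not the official implementation) of how text tokens and VAE latents can be embedded into a single sequence for one Transformer. The class name, patch layout, and dimensions are assumptions made for illustration only.

import torch
import torch.nn as nn

class UnifiedConditioningSketch(nn.Module):
    """Illustrative only: embed text tokens and VAE image latents into one sequence."""
    def __init__(self, vocab_size=32064, latent_channels=4, patch_size=2, hidden=3072):
        super().__init__()
        self.patch_size = patch_size
        self.text_embed = nn.Embedding(vocab_size, hidden)                        # text tokens -> hidden
        self.patch_embed = nn.Linear(latent_channels * patch_size ** 2, hidden)   # VAE latents -> hidden

    def forward(self, text_ids, image_latents):
        # text_ids: (B, T) token ids; image_latents: (B, C, H, W) from the frozen VAE
        B, C, H, W = image_latents.shape
        p = self.patch_size
        patches = image_latents.unfold(2, p, p).unfold(3, p, p)                   # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        # One interleaved sequence is fed to the Phi-3-initialized Transformer
        return torch.cat([self.text_embed(text_ids), self.patch_embed(patches)], dim=1)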
For the attention mechanism, OmniGen uses a modified version of causal attention. Specifically, causal attention is applied across the elements of the sequence, while bidirectional attention is applied within each image sub-sequence. This makes it possible for each patch to “pay attention” to the other patches of the same image, while ensuring that each image can only consider image or text sequences that appeared earlier (Source).
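As a rough illustration, the helper below builds such a mask in PyTorch: a lower-triangular causal mask across the whole sequence, overridden to full bidirectional attention inside each image's span of patch tokens. The function name and the example span layout are ours, not the paper's.

import torch

def build_omnigen_style_mask(seq_len, image_spans):
    """image_spans: list of (start, end) index ranges occupied by image patch tokens."""
    mask = torch.ones(seq_len, seq_len, dtype=torch.bool).tril()   # causal base: attend only to earlier positions
    for start, end in image_spans:
        mask[start:end, start:end] = True                          # patches within one image attend to each other freely
    return mask                                                    # True = attention allowed

# e.g. a prompt of 16 text tokens followed by a 64-token image:
mask = build_omnigen_style_mask(80, [(16, 80)])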
To generate an image, the model randomly samples Gaussian noise and applies flow matching to predict the target velocity, iterating over a set number of inference steps to produce the image’s latent representation. The VAE then decodes this latent into the final image output (Source).
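The sampling loop can be sketched as simple Euler integration of the predicted velocity. In the sketch below, transformer and vae stand in for the real OmniGen components, and the step schedule is illustrative, not taken from the paper.

import torch

@torch.no_grad()
def sample_latent(transformer, condition, latent_shape, num_steps=50):
    x = torch.randn(latent_shape)                      # start from Gaussian noise
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        v = transformer(x, t, condition)               # predicted velocity at time t
        x = x + (t_next - t) * v                       # Euler step toward the data distribution
    return x                                           # latent handed to vae.decode(...) for the final image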
OmniGen is capable of numerous tasks, but, more importantly, it abstracts away extra steps from the increasingly long process of AI-based image generation and editing. Let’s briefly overview these capabilities before jumping into the coding demo.
Now that we have walked through everything OmniGen brings to the table, we are ready to begin the code demo. To proceed, we highly recommend using a DigitalOcean GPU Droplet, a cloud GPU-powered machine, to run this demo. It will give you significantly faster results than the HuggingFace Space.
To follow along on a GPU Droplet, please consult this tutorial on setting up the environment, and follow the steps within, before continuing.
Once you have successfully SSH’d into your Droplet, we can continue. First, we need to make sure we are in the right directory and clone the repo. Paste the following commands into the terminal:
cd ../home
sudo apt-get install git-lfs
git-lfs clone https://huggingface.co/spaces/Shitao/OmniGen
cd OmniGen/
pip3 install -r requirements.txt
This will do everything we need to set up the environment with all the packages needed to run OmniGen. All that’s left is to run the demo!
python3 app.py --share
Click the shareable public link to open the Gradio application in any browser window.
Now we are ready to run the demo! Begin by testing generation with a simple text prompt; we found the results very similar to baseline SDXL. Afterwards, try out the provided examples at the bottom of the page to get a feel for the model. We recommend using these examples as skeletons for any new generation prompts you write going forward, as the authors have found the best ways to use their own model.
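If you prefer scripting over the Gradio UI, the OmniGen repository also exposes a Python pipeline. The snippet below follows our reading of the project README; the model ID, parameter names, and the <img><|image_1|></img> placeholder syntax are assumptions that may change between releases, so treat it as a starting point rather than a guaranteed API.

from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Plain text-to-image generation
images = pipe(
    prompt="A photo of a lighthouse on a rocky coast at sunset",
    height=1024,
    width=1024,
    guidance_scale=2.5,
    seed=0,
)
images[0].save("text_to_image.png")

# Subject-driven generation: reference an input image inside the prompt
images = pipe(
    prompt="The person in <img><|image_1|></img> riding a bicycle through a park",
    input_images=["./person.jpg"],
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
)
images[0].save("subject_driven.png")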
OmniGen is a fascinating step forward for image generation models. In particular, we are impressed by the consolidation of the entire pipeline into a single model capable of such a diverse array of tasks, including editing, image composition, and much more. We look forward to the release of the next versions of OmniGen in the coming months as the pipeline framework spreads to other models.