Between Stable Diffusion 3.5 and FLUX.1, this year has seen another renaissance for text-to-image generation. These models have taken a step forward in prompt adherence, given open-source models the ability to spell, and continued to improve the aesthetic quality of their outputs. Nonetheless, the core mechanic behind these models has remained fundamentally the same: use a text prompt, together with either an empty latent image or an image primer, to generate a single image.
In this article, we want to shine a spotlight on an incredibly promising new architecture for text-to-image generation: OmniGen. Inspired by similar unification efforts in the Large Language Model research community, OmniGen is the first fully unified diffusion model framework, handling a wide range of downstream tasks such as image editing, subject-driven generation, and visual-conditional generation (Source).
Follow along for a breakdown of the architecture that makes OmniGen possible, an exploration of the model's capabilities, and a demonstration of how to run and test OmniGen using a DigitalOcean GPU Droplet.
OmniGen is composed of two parts: a Variational AutoEncoder (VAE) and a large pretrained Transformer. The VAE extracts continuous visual features from input images, while the Transformer generates images based on the input conditions. Specifically, OmniGen uses the VAE from Stable Diffusion XL, which is frozen during training, and a Transformer initialized from Microsoft’s Phi-3. This pairs the strength of the VAE with a Transformer that has inherited significant text-processing capability. Put together, this creates a simple but strong pipeline that removes the need for additional encoders and simplifies the model significantly: OmniGen encodes conditional information by itself. “Furthermore, OmniGen jointly models text and images within a single model, rather than independently modeling different input conditions with separate encoders as in existing works which lacks interaction between different modality conditions” (Source).
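To make the "no extra encoders" idea concrete, here is a minimal, illustrative sketch (not the official implementation) of how text tokens and VAE latents can be embedded into a single sequence for one Transformer. The class name, patch layout, and dimensions are assumptions made for illustration only.

import torch
import torch.nn as nn

class UnifiedConditioningSketch(nn.Module):
    """Illustrative only: embed text tokens and VAE image latents into one sequence."""
    def __init__(self, vocab_size=32064, latent_channels=4, patch_size=2, hidden=3072):
        super().__init__()
        self.patch_size = patch_size
        self.text_embed = nn.Embedding(vocab_size, hidden)                        # text tokens -> hidden
        self.patch_embed = nn.Linear(latent_channels * patch_size ** 2, hidden)   # VAE latents -> hidden

    def forward(self, text_ids, image_latents):
        # text_ids: (B, T) token ids; image_latents: (B, C, H, W) from the frozen VAE
        B, C, H, W = image_latents.shape
        p = self.patch_size
        patches = image_latents.unfold(2, p, p).unfold(3, p, p)                   # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        # One interleaved sequence is fed to the Phi-3-initialized Transformer
        return torch.cat([self.text_embed(text_ids), self.patch_embed(patches)], dim=1)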
For the attention mechanism, OmniGen uses a modified version of causal attention. Specifically, causal attention is applied across the elements of the sequence, while bidirectional attention is applied within each image sub-sequence. This makes it possible for each patch to “pay attention” to the other patches of the same image, while ensuring that each image can only consider image or text sequences that appeared earlier (Source).
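As a rough illustration, the helper below builds such a mask in PyTorch: a lower-triangular causal mask across the whole sequence, overridden to full bidirectional attention inside each image's span of patch tokens. The function name and the example span layout are ours, not the paper's.

import torch

def build_omnigen_style_mask(seq_len, image_spans):
    """image_spans: list of (start, end) index ranges occupied by image patch tokens."""
    mask = torch.ones(seq_len, seq_len, dtype=torch.bool).tril()   # causal base: attend only to earlier positions
    for start, end in image_spans:
        mask[start:end, start:end] = True                          # patches within one image attend to each other freely
    return mask                                                    # True = attention allowed

# e.g. a prompt of 16 text tokens followed by a 64-token image:
mask = build_omnigen_style_mask(80, [(16, 80)])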
To generate an image, the model randomly samples Gaussian noise and applies flow matching to predict the target velocity, iterating over a set number of inference steps to produce the image’s latent representation. The VAE then decodes this latent into the final image output (Source).
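The sampling loop can be sketched as simple Euler integration of the predicted velocity. In the sketch below, transformer and vae stand in for the real OmniGen components, and the step schedule is illustrative, not taken from the paper.

import torch

@torch.no_grad()
def sample_latent(transformer, condition, latent_shape, num_steps=50):
    x = torch.randn(latent_shape)                      # start from Gaussian noise
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        v = transformer(x, t, condition)               # predicted velocity at time t
        x = x + (t_next - t) * v                       # Euler step toward the data distribution
    return x                                           # latent handed to vae.decode(...) for the final image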
OmniGen is capable of numerous tasks, but, more importantly, it abstracts away extra steps from the increasingly long process of AI-based image generation and editing. Let’s briefly overview these capabilities before jumping into the coding demo.
Now that we have walked through everything OmniGen brings to the table, we are ready to begin the code demo. To proceed, we highly recommend using a DigitalOcean GPU Droplet, a cloud GPU-powered machine, to run this demo. It will give you significantly faster results than the HuggingFace Space.
To follow along on a GPU Droplet, please consult this tutorial on setting up the environment, and follow the steps within, before continuing.
Once you have successfully SSH’d into your Droplet, we can continue. First, we need to make sure we are in the right directory and clone the repo. Paste the following commands into the terminal:
cd ../home
sudo apt-get install git-lfs
git-lfs clone https://huggingface.co/spaces/Shitao/OmniGen
cd OmniGen/
pip3 install -r requirements.txt
This will do everything we need to set up the environment with all the packages needed to run OmniGen. All that’s left is to run the demo!
python3 app.py --share
Click the shareable public link to open the Gradio application in any browser window.
Now we are ready to run the demo! Begin by testing generation with a simple text prompt; we found the results very similar to baseline SDXL. Afterwards, try out the provided examples at the bottom of the page to get a feel for the model. We recommend using these examples as skeletons for any new generation prompts you write going forward, as the authors have found the best ways to use their own model.
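If you prefer scripting over the Gradio UI, the OmniGen repository also exposes a Python pipeline. The snippet below follows our reading of the project README; the model ID, parameter names, and the <img><|image_1|></img> placeholder syntax are assumptions that may change between releases, so treat it as a starting point rather than a guaranteed API.

from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Plain text-to-image generation
images = pipe(
    prompt="A photo of a lighthouse on a rocky coast at sunset",
    height=1024,
    width=1024,
    guidance_scale=2.5,
    seed=0,
)
images[0].save("text_to_image.png")

# Subject-driven generation: reference an input image inside the prompt
images = pipe(
    prompt="The person in <img><|image_1|></img> riding a bicycle through a park",
    input_images=["./person.jpg"],
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
)
images[0].save("subject_driven.png")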
OmniGen is a fascinating step forward for image generation models. In particular, we are impressed by the consolidation of the entire pipeline into a single model capable of such a diverse array of tasks, including editing, image composition, and much more. We look forward to the release of the next versions of OmniGen in the coming months as the pipeline framework spreads to other models.