In our last tutorial, we showed how to use Dreambooth Stable Diffusion to create a replicable baseline concept model to better synthesize either an object or style corresponding to the subject of the inputted images, effectively fine-tuning the model. Other attempts to fine-tune Stable Diffusion involved porting the model to use other techniques, like Guided Diffusion with glid-3-XL-stable. While effective, this strategy is prohibitively computationally expensive to run for most people without access to powerful Datacenter GPUs. Dreambooth’s robust strategy requires only 16 GB of GPU RAM to run, a significant decrease from these other techniques. This gives a far wider range of users a much more affordable and accessible entrypoint for joining in on the rapidly expanding community of Stable Diffusion users.
The other popular method for achieving a similar result is Textual Inversion. It is similarly inexpensive in terms of computation, so it represents a great additional option for tuning the model. Calling it tuning is a bit of a misnomer, however, as the diffusion model itself isn't tuned. Rather, Textual Inversion "learns to generate specific concepts, like personal objects or artistic styles, by describing them using new 'words' in the embedding space of pre-trained text-to-image models. These can be used in new sentences, just like any other word." [Source] In practice, this gives us the other end of control over the Stable Diffusion generation process: greater control over the text inputs. When combined with the concepts we trained with Dreambooth, we can begin to really influence our inference process.
In this tutorial, we will show how to train Textual Inversion on a pre-made set of images from the same data source we used for Dreambooth. Once we have walked through the code, we will demonstrate how to combine our new embedding with our Dreambooth concept in the Stable Diffusion Web UI launched from a Jupyter Notebook.
Once we are in our Notebook, we can scroll to the first code cell to begin the necessary installs for this Notebook. We will also create a directory to hold our input files for training. In the next cell, we import these packages and define a useful helper function for displaying our images later on.
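The setup cells themselves aren't reproduced here, but a minimal sketch of what they do is shown below. The package list is an assumption rather than the Notebook's pinned versions; the `image_grid` helper is the display function referred to later in this tutorial.

```python
# Minimal setup sketch: install the main dependencies (package list assumed),
# create the input directory, and define a helper for displaying image grids.
!pip install -qq diffusers transformers accelerate torchvision ftfy
!mkdir -p inputs_textual_inversion

from PIL import Image

def image_grid(imgs, rows, cols):
    """Paste equally sized PIL images into a rows x cols grid."""
    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid
```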
Now that we have set up the workspace, we need to load in our models.
To access the models, we can clone the repository directly from HuggingFace. Use the following code in the terminal to clone the repo:
apt-get install git-lfs && git-lfs clone https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
As mentioned earlier, textual inversion trains the Stable Diffusion model to better recreate a set of images' distinct features by creating a brand-new word token and ascribing those features to it. First, then, we need to source data that represents the concept we want to embody.
For this demonstration, we are going to use images of a plastic toy Groot from the Guardians of the Galaxy films. We have provided sample code to make accessing these images easy.
Now that we have our URLs, use the block of code below to download them to your desired `save_path`. We will use the `inputs_textual_inversion` directory that we created earlier.
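A sketch of such a download cell is below. It assumes `urls` is the list of Groot image URLs provided in the previous cell, and it reuses the `image_grid` helper defined during setup.

```python
import os
import requests
from io import BytesIO
from PIL import Image

save_path = "./inputs_textual_inversion"
os.makedirs(save_path, exist_ok=True)

# `urls` is assumed to be the list of Groot image URLs defined in the earlier cell.
images = []
for i, url in enumerate(urls):
    response = requests.get(url)
    image = Image.open(BytesIO(response.content)).convert("RGB")
    image.save(os.path.join(save_path, f"{i}.jpeg"))
    images.append(image)

# Display a grid of the downloaded images.
image_grid(images, rows=1, cols=len(images))
```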
This will save the files to your directory, and display a grid example of their selection like so:
Sample Groot photos
We now need to define our concept for the model to understand. We first establish our `concept_name` and `initializer_token` variables. The initializer token acts as a word that summarizes, as closely as possible, the object or style of the concept. We then define whether the concept is an object, likely a physical object in the selection of images, or a style, a consistent pattern or look across each of the images.

The concept name is also used as our `placeholder_token`. This is used in place of other words across a selection of standardized sentences that help the model associate the image features with the prompts. We will name our concept 'grooty' and use the initializer token 'groot'.
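A sketch of these variable definitions might look like the following; the exact names (for example `what_to_teach`) are assumptions based on the common Hugging Face textual inversion example rather than the Notebook's exact cell.

```python
concept_name = "grooty"

# A single existing word that is as close as possible to the concept.
initializer_token = "groot"

# The concept name doubles as the new "word" we will teach the model.
placeholder_token = concept_name

# "object" for a physical thing in the images, "style" for a consistent look.
what_to_teach = "object"
```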
If you would like to use your own custom data in place of the demonstration values, you may upload them to the directory “./inputs_textual_inversion” and change the variables above as needed.
Now that we have set up our inputs, we need to establish the sentences our `placeholder_token` will be used across. These sample sentences have a great effect on the overall capability of the textual inversion process, so consider modifying them as needed when working with different types of images. For example, the templates below work perfectly well for this demo, but would be unhelpful for trying to create a textual inversion embedding of a person.
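For reference, here is a shortened subset of the kind of object and style templates used by the public Hugging Face textual inversion example; the Notebook's full lists are longer, and you can edit them freely.

```python
# "{}" is replaced with the placeholder token at training time.
imagenet_templates_small = [
    "a photo of a {}",
    "a rendering of a {}",
    "a cropped photo of the {}",
    "the photo of a {}",
    "a photo of a clean {}",
    "a photo of my {}",
]

imagenet_style_templates_small = [
    "a painting in the style of {}",
    "a rendering in the style of {}",
    "a cropped painting in the style of {}",
    "the painting in the style of {}",
]
```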
These prompt templates are separated into object and style listings. Now we can use them with custom dataset classes to facilitate passing them to the model.
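A condensed sketch of such a dataset class is shown below, modeled on the Hugging Face textual inversion example; the Notebook's version may include more options (center cropping, interpolation modes, and so on).

```python
import os
import random
import numpy as np
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class TextualInversionDataset(Dataset):
    """Pairs each training image with a random template sentence containing
    the placeholder token, returning the pixel values and tokenized prompt."""

    def __init__(self, data_root, tokenizer, placeholder_token,
                 learnable_property="object", size=512, repeats=100, flip_p=0.5):
        self.tokenizer = tokenizer
        self.placeholder_token = placeholder_token
        self.size = size
        self.image_paths = [os.path.join(data_root, f) for f in os.listdir(data_root)]
        self.num_images = len(self.image_paths)
        self._length = self.num_images * repeats
        self.templates = (imagenet_style_templates_small
                          if learnable_property == "style"
                          else imagenet_templates_small)
        self.flip = transforms.RandomHorizontalFlip(p=flip_p)

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        image = Image.open(self.image_paths[i % self.num_images]).convert("RGB")

        # Build the text input from a randomly chosen template.
        text = random.choice(self.templates).format(self.placeholder_token)
        input_ids = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.tokenizer.model_max_length,
            return_tensors="pt",
        ).input_ids[0]

        # Resize, randomly flip, and rescale the image to [-1, 1] for the VAE.
        image = image.resize((self.size, self.size), resample=Image.BICUBIC)
        image = self.flip(image)
        pixels = np.array(image).astype(np.float32) / 127.5 - 1.0
        pixel_values = torch.from_numpy(pixels).permute(2, 0, 1)

        return {"pixel_values": pixel_values, "input_ids": input_ids}
```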
This dataset object prepares the image inputs for textual inversion, transforming and reshaping them as needed so that the model trains effectively.
Now, if we are running this in a Jupyter Notebook, we have two choices. Earlier we mounted the models in the Public Datasets directory, and they can be accessed at `../datasets/stable-diffusion-diffusers/stable-diffusion-v1-5` from the working directory.
If we want to use the online version from the RunwayML repo instead, we can swap which of the two lines is commented out. This will download the models to the cache, and will count towards storage. You will need to paste your Hugging Face token in the cell at the top of the Notebook titled "Alternate access: log in to HuggingFace for online access to models."
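A sketch of that path-selection cell: the local, pre-mounted copy is used by default, and the commented line shows the Hub alternative.

```python
# Default: the local copy mounted in the Public Datasets directory.
pretrained_model_name_or_path = "../datasets/stable-diffusion-diffusers/stable-diffusion-v1-5"

# Alternate access: pull the weights from the Hugging Face Hub instead
# (requires logging in with your Hugging Face token, and counts towards storage).
# pretrained_model_name_or_path = "runwayml/stable-diffusion-v1-5"
```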
Next, we will load the CLIPTokenizer from the model's directory and add our `placeholder_token` to it as a novel token. This way, as we train, the new token can become associated with the features in the images. We then encode the `initializer_token` and `placeholder_token` to get their token IDs. If more than one token is generated, we will be prompted to enter a single token instead; this would likely be caused by something like entering a phrase or sentence rather than a single word.
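A sketch of the tokenizer setup, assuming the standard `transformers` API:

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained(pretrained_model_name_or_path, subfolder="tokenizer")

# Register the placeholder as a brand-new token.
num_added_tokens = tokenizer.add_tokens(placeholder_token)
if num_added_tokens == 0:
    raise ValueError(f"The tokenizer already contains the token {placeholder_token}.")

# The initializer must map to exactly one existing token.
token_ids = tokenizer.encode(initializer_token, add_special_tokens=False)
if len(token_ids) > 1:
    raise ValueError("The initializer token must be a single token.")

initializer_token_id = token_ids[0]
placeholder_token_id = tokenizer.convert_tokens_to_ids(placeholder_token)
```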
Finally, we load in our `text_encoder`, `vae`, and `unet` subcomponents of the Stable Diffusion v1-5 model.
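Loading the three subcomponents might look like this, using the standard `transformers` and `diffusers` loaders:

```python
from transformers import CLIPTextModel
from diffusers import AutoencoderKL, UNet2DConditionModel

text_encoder = CLIPTextModel.from_pretrained(pretrained_model_name_or_path, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(pretrained_model_name_or_path, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(pretrained_model_name_or_path, subfolder="unet")
```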
Since we have added the `placeholder_token` to the `tokenizer`, we need to resize the token embeddings here and create a new embedding vector in the token embeddings for our `placeholder_token`. We can then initialize the newly added placeholder token with the embedding of the initializer token.
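In code, that resizing and initialization step looks roughly like this:

```python
# Make room in the embedding table for the new token...
text_encoder.resize_token_embeddings(len(tokenizer))

# ...then copy the initializer token's embedding into the new slot.
token_embeds = text_encoder.get_input_embeddings().weight.data
token_embeds[placeholder_token_id] = token_embeds[initializer_token_id]
```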
Since we are only training the newly added embedding vector, we can then freeze the rest of the model parameters here. With this, we have completed setting up the training dataset, and can load it in. We will use this to create our dataloader for training.
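A sketch of the freezing and dataset setup is below; `create_dataloader` is an assumed helper name, following the Hugging Face textual inversion example.

```python
import itertools
import torch

def freeze_params(params):
    for param in params:
        param.requires_grad = False

# The VAE and UNet are never trained.
freeze_params(vae.parameters())
freeze_params(unet.parameters())

# Freeze everything in the text encoder except its token embeddings.
freeze_params(itertools.chain(
    text_encoder.text_model.encoder.parameters(),
    text_encoder.text_model.final_layer_norm.parameters(),
    text_encoder.text_model.embeddings.position_embedding.parameters(),
))

train_dataset = TextualInversionDataset(
    data_root=save_path,
    tokenizer=tokenizer,
    placeholder_token=placeholder_token,
    learnable_property=what_to_teach,
    size=512,
    repeats=100,
)

def create_dataloader(train_batch_size=1):
    return torch.utils.data.DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True)
```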
Before we can begin running training, we need to define our noise scheduler and training hyperparameters, and create the training function itself.
We will use the DDPMScheduler for this example, but other schedulers like PLMS may yield better results. Consider choosing different schedulers to see how the training results differ.
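Creating the scheduler might look like this, with the beta settings commonly used for Stable Diffusion v1 checkpoints:

```python
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    num_train_timesteps=1000,
)
```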
We next set our hyperparameters for training. In particular, consider altering `max_train_steps` and `seed` to better control the outcome of the embedding. Higher training step values will lead to a more accurate representation of the concept, and altering the seed changes the 'randomness' the diffusion model uses to construct the sample images for calculating the loss. Additionally, we can increase `train_batch_size` if we are on a GPU with more than ~16 GB of VRAM, and change `output_dir` to wherever we choose.
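The values below are illustrative rather than the Notebook's exact settings:

```python
hyperparameters = {
    "learning_rate": 5e-04,
    "scale_lr": True,
    "max_train_steps": 2000,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "seed": 42,
    "output_dir": "concepts/grooty-demo",
}
```

The training function itself is fairly long; the following is a condensed sketch closely following the public Hugging Face textual inversion example, so the Notebook's actual function may differ in its details.

```python
import math
import os
import torch
import torch.nn.functional as F
from accelerate import Accelerator
from accelerate.utils import set_seed
from diffusers import StableDiffusionPipeline
from tqdm.auto import tqdm

def training_function(text_encoder, vae, unet):
    train_batch_size = hyperparameters["train_batch_size"]
    gradient_accumulation_steps = hyperparameters["gradient_accumulation_steps"]
    learning_rate = hyperparameters["learning_rate"]
    max_train_steps = hyperparameters["max_train_steps"]
    output_dir = hyperparameters["output_dir"]

    set_seed(hyperparameters["seed"])
    accelerator = Accelerator(gradient_accumulation_steps=gradient_accumulation_steps)
    train_dataloader = create_dataloader(train_batch_size)

    if hyperparameters["scale_lr"]:
        # Scale the learning rate by the effective total batch size.
        total_batch_size = train_batch_size * accelerator.num_processes * gradient_accumulation_steps
        learning_rate = learning_rate * total_batch_size

    # Only the token embedding table of the text encoder is optimized.
    optimizer = torch.optim.AdamW(text_encoder.get_input_embeddings().parameters(), lr=learning_rate)

    text_encoder, optimizer, train_dataloader = accelerator.prepare(text_encoder, optimizer, train_dataloader)

    # The frozen models are kept in eval mode on the training device.
    vae.to(accelerator.device).eval()
    unet.to(accelerator.device).eval()

    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / gradient_accumulation_steps)
    num_train_epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)

    progress_bar = tqdm(range(max_train_steps), disable=not accelerator.is_local_main_process)
    global_step = 0

    text_encoder.train()
    for epoch in range(num_train_epochs):
        for batch in train_dataloader:
            with accelerator.accumulate(text_encoder):
                # Encode the images into latents and add noise at a random timestep.
                latents = vae.encode(batch["pixel_values"]).latent_dist.sample().detach() * 0.18215
                noise = torch.randn_like(latents)
                timesteps = torch.randint(
                    0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device
                ).long()
                noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

                # Predict the noise with the text-conditioned UNet and take the MSE loss.
                encoder_hidden_states = text_encoder(batch["input_ids"])[0]
                noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
                loss = F.mse_loss(noise_pred, noise)
                accelerator.backward(loss)

                # Zero the gradients of every embedding except the placeholder token's.
                grads = text_encoder.get_input_embeddings().weight.grad
                index_grads_to_zero = torch.arange(len(tokenizer)) != placeholder_token_id
                grads.data[index_grads_to_zero, :] = 0

                optimizer.step()
                optimizer.zero_grad()

            if accelerator.sync_gradients:
                progress_bar.update(1)
                global_step += 1
            if global_step >= max_train_steps:
                break

    # Save the full pipeline plus the learned embedding for later reuse.
    if accelerator.is_main_process:
        pipeline = StableDiffusionPipeline.from_pretrained(
            pretrained_model_name_or_path,
            text_encoder=accelerator.unwrap_model(text_encoder),
            tokenizer=tokenizer,
            vae=vae,
            unet=unet,
        )
        pipeline.save_pretrained(output_dir)
        learned_embeds = accelerator.unwrap_model(text_encoder).get_input_embeddings().weight[placeholder_token_id]
        torch.save({placeholder_token: learned_embeds.detach().cpu()}, os.path.join(output_dir, "learned_embeds.bin"))
```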
Here is a rough breakdown of what is happening in the block above:

- The models we are not training are set to `.eval()`, as they aren't to be trained
- The `total_batch_size` is calculated from the `train_batch_size` multiplied by the number of ongoing processes (machines doing training) and `gradient_accumulation_steps`
- Training runs for the `num_train_epochs` we calculated from the total training steps
- The `text_encoder` is trained, stepping through each of the inputs batch by batch
- Finally, the learned embedding is saved as a `learned_embeds.bin` file to the directory we defined as `output_dir`
Now that we have run and tried to understand each of the steps the code is taking to generate our embedding, we can run our training function with accelerate to get our image embedding using the code cell below:
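Launching training from a Notebook can be done with Accelerate's `notebook_launcher`; the call below assumes the `training_function`, models, and data defined in the earlier cells.

```python
import accelerate

# Run the training function in this Notebook process (single GPU assumed).
accelerate.notebook_launcher(training_function, args=(text_encoder, vae, unet), num_processes=1)
```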
The embedding for this demo is saved to `concepts/grooty-demo/learned_embeds.bin`.
We can use the `StableDiffusionPipeline` class to sample Stable Diffusion models with our newly trained image embedding. First, we need to instantiate the pipeline.
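Assuming the training function saved a complete pipeline (with the updated tokenizer and text encoder) to `output_dir`, instantiation might look like this:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    hyperparameters["output_dir"],
    torch_dtype=torch.float16,
).to("cuda")
```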
We can then sample the model using our `placeholder_token` value from earlier to impart the qualities and features of our image embedding to the model's outputs. For this sample, we will use the prompt "a translucent jade chinese figurine of placeholder_token, HDR, productshot render, Cinema4D, caustics".
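A sampling cell along these lines will produce the grid; the inference settings here are illustrative.

```python
prompt = f"a translucent jade chinese figurine of {placeholder_token}, HDR, productshot render, Cinema4D, caustics"

num_cols, num_rows = 4, 5
all_images = []
for _ in range(num_rows):
    # Generate one row of images at a time to keep memory usage manageable.
    images = pipe([prompt] * num_cols, num_inference_steps=50, guidance_scale=7.5).images
    all_images.extend(images)

image_grid(all_images, num_rows, num_cols)
```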
If everything ran correctly, you should get a 5x4 grid of images like those below:
a 4x5 grid of samples from this demo
As we can see, the model has clearly been able to understand the features of the Groot toy. Notably, the large pupil-less eyes and pointy head structure came through in nearly every photo.
Try increasing or decreasing the `max_train_steps` variable to see how the fit of the embedding is affected by more or less training. Be wary of overfitting as well: if the features in your images are too consistent, the model may become unable to generalize things like background features. For example, a training dataset composed of people standing outside the same building in every photo will likely yield that building as a feature of the embedding in addition to the object.
Now that we have our new embedding, we can also use it with our Dreambooth model trained in the last session using the Stable Diffusion Web UI. All we need to do is move it to the Web UI’s embeddings folder, and we can use this embedding with any model we have with the Web UI, including Dreambooth checkpoints.
To do so:

- Find the `learned_embeds.bin` file in the concept folder, `concepts/grooty-demo` if you followed the demo
- Rename it to reflect your concept, for example `grooty.bin`
- Place it in the Web UI's `embeddings` folder

Then use your placeholder token in any prompt to get your embedding featured! Use the cell below to move the demo textual inversion embedding to the Stable Diffusion Web UI repo:
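A sketch of that cell, assuming the Web UI repo is cloned at `./stable-diffusion-webui` (adjust the destination to wherever your copy lives):

```python
import shutil

# Copy the learned embedding into the Web UI's embeddings folder,
# renaming it so the filename becomes the token used in prompts.
shutil.copy(
    "concepts/grooty-demo/learned_embeds.bin",
    "stable-diffusion-webui/embeddings/grooty.bin",
)
```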
Finally, we can launch the Web UI. Here, if you have saved your Dreambooth concept as well, we can now combine the effects of the two different training methods. Here is an example using the prompt "a translucent jade chinese figurine of a grooty sks, HDR, productshot render, Cinema4D, caustics" with our toy cat Dreambooth model. Remember that `grooty` represents our new image embedding, and `sks` prompts the cat toy object.
Textual Inversion on Groot toy with Dreambooth run on Cat toy
As we can see, we are getting a primary presentation of the cat toy object, with distinct features such as the color, eyes, and hands beginning to look more like the Groot toy. If we were to increase the weight on either term (which is done in the Web UI by wrapping it in parentheses "()"), we could increase the presentation of either feature.
In the last two articles, we looked at two different methods for fine-tuning Stable Diffusion. The first, Dreambooth, performs proper fine-tuning to generate a new checkpoint. Today we examined Textual Inversion, which instead learns to generate specific concepts, like personal objects or artistic styles, by describing them using new "words" in the embedding space of pre-trained text-to-image models.
Readers of this tutorial should now be prepared to train Textual Inversion on their own input images, and then use these embeddings to generate images with both the `StableDiffusionPipeline` and the Stable Diffusion Web UI. We encourage you to keep experimenting with these Notebooks and gain unprecedented levels of control over your Stable Diffusion outputs using Textual Inversion and Dreambooth.