
DiffBIR: High Quality Blind Image Restoration with Generative Diffusion Prior

Updated on September 13, 2024



Introduction

On this blog, we have long espoused the utility of Stable Diffusion for a wide variety of computer vision tasks, not just text-to-image synthesis. In particular, Stable Diffusion has also proven to be an extremely capable tool for image editing, 3D modeling, and much more.

Furthermore, image upscaling and blind image restoration remain some of the most visible and practical applications of AI available to consumers today. Since last year's GFPGAN and Real-ESRGAN, efforts in this field have proven extremely capable at tasks like background detail restoration and face upscaling. Blind image restoration is the domain of AI that seeks to handle these and similar tasks.

In this blog post, we will look at one of the latest and greatest efforts to tackle this task: DiffBIR. By leveraging the capabilities of the Stable Diffusion model, DiffBIR enables simple, easy-to-implement restoration for both general images and faces. We will finish with a few examples we made using the new technology to upscale our own images.

Prerequisites

To fully grasp the concepts presented in “DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior,” we highly recommend a foundational understanding of the following topics:

  • Image Restoration Techniques: Familiarity with basic image restoration tasks, such as denoising, deblurring, and super-resolution, along with common methodologies used in these areas (e.g., convolutional neural networks, GANs).

  • Deep Learning Fundamentals: A solid understanding of deep learning concepts, particularly neural networks, backpropagation, and optimization techniques. Knowledge of frameworks like PyTorch or TensorFlow will be beneficial.

  • Diffusion Models: Understanding the principles of generative diffusion models, including their training process, how they generate samples by reversing a stochastic process, and how they differ from GANs and VAEs.

  • Probability and Statistics: Familiarity with probability theory, especially concepts such as Gaussian distributions, stochastic processes, and Markov chains, which are often used in the context of diffusion models.

  • Optimization Algorithms: Familiarity with optimization methods like stochastic gradient descent (SGD) and Adam, and how they are applied in training deep learning models.

  • Machine Learning for Vision Tasks: Basic knowledge of machine learning applications in computer vision, particularly regarding tasks that involve handling degraded or noisy images.

DiffBIR Model Architecture

[Figure: overview of the DiffBIR model architecture]

DiffBIR is built around a two-stage pipeline. In the first stage, a series of degradation operations is applied to the original high-quality image to synthesize a low-quality counterpart; the blur-resize-noise process is applied three times. The pretrained restoration module then removes the degradations from the low-quality image, and the generative module reproduces the lost information. This division of labor lets the latent diffusion model's restoration process focus on texture and detail generation without being affected by noise, which promotes a much more robust reconstruction.
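As a quick illustration of what such a degradation stage might look like, here is a minimal sketch of a blur-resize-noise loop written with OpenCV and NumPy. The kernel sizes, scale range, and noise level are hypothetical placeholders chosen for readability, not the authors' actual degradation settings.

```python
import numpy as np
import cv2

def synthesize_lq(hq: np.ndarray, rounds: int = 3, seed: int = 0) -> np.ndarray:
    """Toy blur -> resize -> noise degradation, repeated `rounds` times."""
    rng = np.random.default_rng(seed)
    img = hq.astype(np.float32) / 255.0
    h, w = img.shape[:2]
    for _ in range(rounds):
        # Gaussian blur with a randomly chosen kernel size
        ksize = int(rng.choice([3, 5, 7]))
        img = cv2.GaussianBlur(img, (ksize, ksize), 0)
        # Random downscale, then resize back up to the original resolution
        scale = float(rng.uniform(0.25, 0.75))
        small = cv2.resize(img, (max(1, int(w * scale)), max(1, int(h * scale))),
                           interpolation=cv2.INTER_AREA)
        img = cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)
        # Additive Gaussian noise
        img = np.clip(img + rng.normal(0.0, 0.02, img.shape).astype(np.float32), 0.0, 1.0)
    return (img * 255.0).round().astype(np.uint8)
```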

To achieve this, the authors use a modified SwinIR as the restoration module. Specifically, they made several changes: the pixel unshuffle operation is used to downsample the input I_LQ by a factor of 8, and a 3 × 3 convolutional layer is then adopted to improve shallow feature extraction. All subsequent transformer operations are performed in this low-resolution space, similar to latent diffusion modeling. Deep feature extraction adopts several Residual Swin Transformer Blocks (RSTBs), and each RSTB contains several Swin Transformer Layers (STLs). Check out this blog post for a detailed breakdown of SwinIR and more information about what this entails. The shallow and deep features are added together in order to maintain both the low-frequency and high-frequency information. To upsample the deep features back to the original image space, the module performs nearest-neighbor interpolation three times, with each interpolation followed by one convolutional layer and one Leaky ReLU activation layer. The parameters of the restoration module are optimized by minimizing the L2 pixel loss, which can be represented by the equation below:
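Here I_reg = SwinIR(I_LQ) denotes the restored output and I_HQ the ground-truth high-quality image, so the Stage 1 objective is simply the pixel-wise L2 loss (notation ours):

$$\mathcal{L}_{\mathrm{reg}} = \left\lVert \mathrm{SwinIR}(I_{LQ}) - I_{HQ} \right\rVert_2^2$$

To make the data flow described above more concrete, below is a minimal PyTorch sketch of the restoration module. It is only a structural illustration: the RSTBs are abstracted as plain convolutional blocks (the real ones use windowed self-attention), and the class name `RestorationModule`, the channel width, and the block count are our own placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RestorationModule(nn.Module):
    """Sketch of the modified SwinIR data flow: pixel-unshuffle by 8, a shallow
    3x3 conv, deep feature blocks (stand-ins for RSTBs), shallow + deep fusion,
    then three rounds of nearest interpolation with conv + LeakyReLU."""
    def __init__(self, dim: int = 96, num_blocks: int = 4):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(8)                 # 3 -> 192 channels, H/8 x W/8
        self.shallow = nn.Conv2d(3 * 64, dim, 3, padding=1)   # shallow feature extraction
        self.deep = nn.Sequential(*[                          # placeholder "RSTB" blocks
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.LeakyReLU(0.2, inplace=True))
            for _ in range(num_blocks)
        ])
        self.up_convs = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1) for _ in range(3)])
        self.to_rgb = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, i_lq: torch.Tensor) -> torch.Tensor:
        x = self.shallow(self.unshuffle(i_lq))     # work in 1/8-resolution space
        x = x + self.deep(x)                       # fuse shallow (low-freq) and deep (high-freq) features
        for conv in self.up_convs:                 # 3 x (nearest x2 -> conv -> LeakyReLU) back to full size
            x = F.leaky_relu(conv(F.interpolate(x, scale_factor=2, mode="nearest")), 0.2)
        return self.to_rgb(x)

# Training reduces to the L2 pixel loss from the equation above (dummy tensors shown).
model = RestorationModule()
i_lq = torch.rand(1, 3, 256, 256)
i_hq = torch.rand(1, 3, 256, 256)
loss = F.mse_loss(model(i_lq), i_hq)
```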

For Stage 2, the pipeline takes the output from Stage 1, obtained through regression learning, and uses it to fine-tune the latent diffusion model. Once the diffusion model's VAE maps this output into the latent space, it is referred to as the condition latent. This follows the standard diffusion model process, where the diffusion and denoising processes are performed in the latent space: Gaussian noise with a scheduled variance is added at each step t to the encoded latent z = E(x) to produce the noisy latent, as represented by:

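Written in the usual latent diffusion notation, with ᾱ_t denoting the cumulative noise schedule and ε a standard Gaussian sample, this forward noising step takes the standard form:

$$z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$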

A network ϵ_θ is learned to predict the noise ϵ conditioned on c (i.e., text prompts) at a randomly sampled time-step t. The optimization objective of the latent diffusion model is defined as follows:
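In the standard formulation this is the ε-prediction loss, averaged over latents, conditions, time-steps, and noise samples:

$$\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z,\, c,\, t,\, \epsilon}\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, c, t) \right\rVert_2^2 \right]$$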

Since the Stage 1 restoration process tends to leave an overly smoothed image, the pipeline then leverages the pre-trained Stable Diffusion for image reconstruction using the obtained I_reg - I_HQ pairs. First, the encoder of Stable Diffusion's pretrained VAE maps I_reg into the latent space to obtain the condition latent E(I_reg). Then, the UNet runs typical latent diffusion denoising. In parallel, an additional path contains the same encoder and middle block as the UNet denoiser; it concatenates the condition latent E(I_reg) with the randomly sampled noisy latent z_t as the input to this parallel module. The outputs of the parallel module are added to the corresponding features of the original UNet decoder, and one 1 × 1 convolutional layer is applied before the addition operation at each scale.
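The wiring described above is easier to see in code. Below is a minimal, self-contained PyTorch sketch of the conditioning scheme: a trainable copy of the encoder and middle block receives the condition latent concatenated with the noisy latent, and its per-scale outputs pass through 1 × 1 convolutions before being added to the decoder's skip features. The toy encoder and decoder stand in for the real Stable Diffusion UNet stages, and all names and channel sizes here are illustrative assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the Stable Diffusion UNet stages; the real blocks contain
# attention and residual layers, these only show the wiring.
class ToyEncoder(nn.Module):
    def __init__(self, in_ch, dims=(64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for d in dims:
            self.stages.append(nn.Sequential(nn.Conv2d(ch, d, 3, stride=2, padding=1), nn.SiLU()))
            ch = d
    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                       # one feature map per scale

class ToyDecoder(nn.Module):
    def __init__(self, dims=(256, 128, 64), out_ch=4):
        super().__init__()
        self.stages = nn.ModuleList()
        for i, d in enumerate(dims):
            nxt = dims[i + 1] if i + 1 < len(dims) else out_ch
            self.stages.append(nn.Sequential(nn.ConvTranspose2d(d, nxt, 4, stride=2, padding=1), nn.SiLU()))
    def forward(self, h, skips):
        for stage, skip in zip(self.stages, reversed(skips)):
            h = stage(h + skip)            # skip-connected features enter here
        return h

class LAControlNetSketch(nn.Module):
    """Sketch of the conditioning path: a trainable copy of the encoder and
    middle block sees [condition latent, noisy latent], and its outputs are
    projected by 1x1 convs and added to the decoder's skip features."""
    def __init__(self, latent_ch=4, dims=(64, 128, 256)):
        super().__init__()
        self.encoder = ToyEncoder(latent_ch, dims)            # frozen main path
        self.middle = nn.Conv2d(dims[-1], dims[-1], 3, padding=1)
        self.decoder = ToyDecoder(tuple(reversed(dims)), latent_ch)
        # Parallel module: same structure, but takes [cond_latent, z_t] (2x channels)
        self.parallel_encoder = ToyEncoder(2 * latent_ch, dims)
        self.parallel_middle = nn.Conv2d(dims[-1], dims[-1], 3, padding=1)
        self.proj = nn.ModuleList([nn.Conv2d(d, d, 1) for d in dims])  # one 1x1 conv per scale

    def forward(self, z_t, cond_latent):
        skips = self.encoder(z_t)
        h = self.middle(skips[-1])
        c_feats = self.parallel_encoder(torch.cat([cond_latent, z_t], dim=1))
        c_h = self.parallel_middle(c_feats[-1])
        fused = [s + p(c) for s, p, c in zip(skips, self.proj, c_feats)]
        return self.decoder(h + c_h, fused)

# Shape check only; a 64x64 latent corresponds to a 512x512 image in Stable Diffusion.
model = LAControlNetSketch()
z_t = torch.randn(1, 4, 64, 64)
cond = torch.randn(1, 4, 64, 64)
print(model(z_t, cond).shape)              # -> torch.Size([1, 4, 64, 64])
```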

During fine-tuning, the parallel module and these 1 × 1 convolutional layers are optimized simultaneously, with the prompt condition set to empty. The result obtained in this stage is denoted I_diff and represents the final restored output. The original authors call this overall process LAControlNet. The model aims to minimize the following latent diffusion objective:
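Writing the condition latent as E(I_reg), that objective has the same ε-prediction form as before, now with the condition latent supplied as an additional input (notation assumed):

$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z,\, c,\, t,\, \epsilon}\left[ \left\lVert \epsilon - \epsilon_\theta\big(z_t, c, \mathcal{E}(I_{\mathrm{reg}}), t\big) \right\rVert_2^2 \right]$$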

To summarize the process, only the skip-connected features in the UNet denoiser are tuned for the restoration task. This strategy alleviates overfitting on the relatively small training dataset while inheriting Stable Diffusion's capability for high-quality generation. The conditioning mechanism is also more straightforward and effective for the image reconstruction task than alternatives like ControlNet, which trains an additional condition network from scratch to encode the condition information.

In DiffBIR's LAControlNet, the well-trained VAE encoder is able to project the condition images into the same representation space as the latent variables. This strategy significantly alleviates the burden of aligning the internal knowledge of the latent diffusion model with the external condition information. By contrast, directly utilizing ControlNet for image reconstruction leads to severe color shifts, as shown in the paper's ablation study. In practice, this full pipeline enables the extremely high-quality blind image restoration that the model boasts.

DiffBIR Demo

Now that we have gone over the underlying concepts behind DiffBIR, let's take a look at the model in action. To do this, we are going to run the DiffBIR demo provided by the original repo's authors.

We ran our tests for this demo on a single A100-80GB machine.

Setup

We can begin by running the first two code cells. The first will install the packages required to run the demo, and the second will download all the required model checkpoints. We recommend skipping the second cell on subsequent runs to avoid the roughly five-minute download. The code for the first cell may be found below:

!pip install -r requirements.txt 
!pip install -U gradio 

Then to download the models from HuggingFace, we run the next cell:

!mkdir weights
%cd weights
!wget https://huggingface.co/lxq007/DiffBIR/resolve/main/general_swinir_v1.ckpt
!wget https://huggingface.co/lxq007/DiffBIR/resolve/main/general_full_v1.ckpt
!wget https://huggingface.co/lxq007/DiffBIR/resolve/main/face_swinir_v1.ckpt
!wget https://huggingface.co/lxq007/DiffBIR/resolve/main/face_full_v1.ckpt
%cd ..

Running the demo

Now that we have everything set up, we can get started. In the next code cell, all we need to do is run the cell to spin up our demo.

!python gradio_diffbir.py \
--ckpt weights/general_full_v1.ckpt \
--config configs/model/cldm.yaml \
--reload_swinir \
--swinir_ckpt weights/general_swinir_v1.ckpt \
--device cuda

Testing with the demo

[Image: default main page for the UI]

To test out the demo, we found a free icon image of a city with a mountainous background. We recommend recreating the demo to get a feel for how the model works. There is a copy of the image we used and its prompt below:

[Image: the test image we used]

prompt: a city with tall buildings, forest trees, snowy mountain background

We ran our Blind Image Restoration test on the above image with the following settings:

  • SR Scale (how many times larger to make the image): 4
  • Image size (output size before scaling, in pixels): 512
  • Positive prompt (for Stable Diffusion guidance): a city with tall buildings, forest trees, snowy mountain background
  • Negative prompt: longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality
  • Prompt Guidance Scale (how much effect the prompt has on upscaling, 0 will remove effect): 1
  • Control Strength (how much original image guides reconstruction for LAControlNet): 1
  • Steps: 50
  • Seed: 231

[Image: before and after comparison]

As we can see, the model succeeds in accomplishing a great deal of upscaling. For fine details, we can see evidence of the model's efficacy, especially by zooming in on the snowcaps of the mountains and the office windows. The level of detail there is very complete. Of course, it isn't perfect. Notice the pointed-roof buildings towards the left of the image. In the blind image restoration version on the right, the roofs have taken on a strange slope and blend into the buildings behind them.

As for coarser details, the shadows on the side of the mountain and the contiguity of the blue sky are strong evidence of the model's efficacy. Again, the opposite can be seen in the forest greenery below the buildings: these appear almost like bushes rather than full trees.

All in all, from a qualitative perspective, there is a minimal uncanny valley effect. The only real presence we can see of it is with those curved roofs. Otherwise, from our perspective, this seems like an excellent tool for quick photo upscaling. When combined with other tools like Real ESRGAN and GFPGAN, we may see these capabilities taken even further.

We recommend testing both the face and general models on a variety of test images with different parameters to get better results. We hope this tool can be a great new addition to users' arsenals for image manipulation with AI.

Conclusion

DiffBIR offers a really valuable new tool for image restoration with AI. For both faces and general images, the technique shows incredible promise. In the coming weeks, we plan to test this technique out on old family photos to see how its capabilities stack up.

