Image Super-Resolution (ISR) is the process of improving the quality and resolution of a low-resolution (LR) image to produce a high-resolution (HR) version. This technique enhances finer details, sharpness, and clarity, making it highly valuable in various fields. ISR is widely used in medical imaging to improve the accuracy of diagnoses, in satellite imaging to extract finer geographical details, and in security applications to enhance surveillance footage. It also plays a crucial role in digital media, helping upscale old or low-quality images and videos while maintaining visual fidelity. With advances in deep learning, modern ISR methods, such as convolutional neural networks (CNNs) and generative adversarial networks (GANs), have significantly improved the effectiveness of this technology, making it indispensable in both academic research and industrial applications. A few ISR use cases are discussed in detail below:
Before diving into this review on image super-resolution, we should have a basic understanding of the following concepts:
Low-resolution images can be modeled from high-resolution images using the formula below, where D is the degradation function, Iy is the high-resolution image, Ix is the low-resolution image, and σ is the noise:

Ix = D(Iy; σ)
The degradation parameters D and σ are unknown; only the high-resolution image and the corresponding low-resolution image are provided. The task of the neural network is to learn the inverse of the degradation function using only the HR and LR image data.
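To make the degradation model concrete, here is a minimal sketch of how an LR training input could be synthesized from an HR image, assuming bicubic downsampling plus additive Gaussian noise as the (unknown) degradation. The function name, scale factor, and noise level are all illustrative choices, not part of any specific paper.

```python
import tensorflow as tf

def degrade(hr_image, scale=4, noise_std=0.01):
    """Simulate Ix = D(Iy; sigma): bicubic downsampling plus Gaussian noise.
    hr_image: float32 tensor of shape (H, W, C) with values in [0, 1]."""
    h, w = hr_image.shape[0], hr_image.shape[1]
    # Downsample the HR image to LR size with bicubic interpolation.
    lr = tf.image.resize(hr_image[tf.newaxis, ...],
                         (h // scale, w // scale), method="bicubic")[0]
    # Add noise to mimic the sigma term in the degradation model.
    lr = lr + tf.random.normal(tf.shape(lr), stddev=noise_std)
    return tf.clip_by_value(lr, 0.0, 1.0)
```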
There are many methods used to solve this task. We will cover the following:
We’ll look at several example algorithms for each.
The methods under this bracket first upsample the input using a traditional technique, such as bicubic interpolation, and then use deep learning to refine the upsampled image. The most popular method, SRCNN, was also the first to use deep learning and achieved impressive results.
SRCNN is a simple CNN architecture consisting of three layers: one for patch extraction, one for non-linear mapping, and one for reconstruction. The patch extraction layer extracts dense patches from the input and represents them using convolutional filters. The non-linear mapping layer consists of 1×1 convolutional filters used to change the number of channels and add non-linearity. As you might have guessed, the final reconstruction layer reconstructs the high-resolution image.
The network is trained with the MSE loss function, and the results are evaluated with PSNR; both are covered in more detail in the Loss Functions and Metrics sections later in this article.
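Below is a minimal Keras sketch of the three-layer design described above. The 9-1-5 kernel sizes and 64/32 filter counts follow the original SRCNN configuration, and the single input/output channel assumes training on the luminance channel only; treat it as an illustration rather than a faithful reimplementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_srcnn():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(None, None, 1)),
        # Patch extraction and representation
        layers.Conv2D(64, 9, padding="same", activation="relu"),
        # Non-linear mapping with 1x1 convolutions
        layers.Conv2D(32, 1, padding="same", activation="relu"),
        # Reconstruction of the high-resolution image
        layers.Conv2D(1, 5, padding="same"),
    ])

model = build_srcnn()
model.compile(optimizer="adam", loss="mse")  # trained with MSE, evaluated with PSNR
```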
Very Deep Super Resolution (VDSR) is an improvement on SRCNN, with the addition of the following features:
Since the feature extraction process in pre-upsampling SR occurs in the high-resolution space, the computational power required is also on the higher end. Post-upsampling SR tries to solve this by extracting features in the low-resolution space and upsampling only at the end, significantly reducing computation. Also, instead of simple bicubic interpolation, learned upsampling (via deconvolution or sub-pixel convolution) is used, making the network trainable end-to-end.
Let’s discuss a few popular techniques following this structure.
As can be seen in the above figure, the major changes between SRCNN and FSRCNN are:
FSRCNN ultimately achieves better results than SRCNN while also being faster.
ESPCN introduces the concept of sub-pixel convolution to replace the deconvolutional layer for upsampling. This solves two problems associated with deconvolution-based upsampling:
The figure below shows that sub-pixel convolution works by converting depth to space. Pixels from multiple channels in a low-resolution image are rearranged into a single high-resolution image channel. For example, an input image of size 5×5×4 can rearrange the pixels in the final four channels to a single channel, resulting in a 10×10 HR image.
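The same rearrangement can be reproduced directly with TensorFlow's depth_to_space op, which is what sub-pixel convolution boils down to; the tensor here is random and only serves to show the shape change.

```python
import tensorflow as tf

x = tf.random.normal([1, 5, 5, 4])          # a 5x5 feature map with 4 channels
y = tf.nn.depth_to_space(x, block_size=2)   # depth (4 = 2*2) becomes space
print(y.shape)                              # (1, 10, 10, 1)
```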
Let’s now discuss a few more architectures which are based on the techniques from the figure below.
The EDSR architecture is based on the SRResNet architecture and consists of multiple residual blocks. The residual block in EDSR is shown above. The major difference from SRResNet is that the Batch Normalization layers are removed. The authors state that BN normalizes the input, thus limiting the range of the network, and that removing BN results in an improvement in accuracy. The BN layers also consume memory, and removing them leads to a memory reduction of up to 40%, making network training more efficient.
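A sketch of such a BN-free residual block in Keras is shown below. The 0.1 residual scaling is the stabilization trick reported in the EDSR paper, the filter count is just an example, and the input is assumed to already have `filters` channels.

```python
import tensorflow as tf
from tensorflow.keras import layers

def edsr_res_block(x, filters=64, scaling=0.1):
    """Conv-ReLU-Conv with a skip connection and no Batch Normalization.
    Assumes x already has `filters` channels."""
    skip = x
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.Lambda(lambda t: t * scaling)(x)  # residual scaling for stable training
    return layers.Add()([skip, x])
```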
MDSR is an extension of EDSR, featuring multiple input and output modules that provide corresponding resolution outputs at 2x, 3x, and 4x. The pre-processing modules for scale-specific inputs start with two residual blocks that utilize 5×5 kernels. A larger kernel is employed in the pre-processing layers to maintain a shallow network while ensuring a high receptive field.
At the end of the scale-specific pre-processing modules are shared residual blocks, which serve as a common block for data across all resolutions. Following these shared residual blocks are the scale-specific upsampling modules. Although the overall depth of MDSR is 5 times greater than that of single-scale EDSR, the number of parameters is only 2.5 times greater, rather than 5 times, due to the shared parameters. MDSR achieves results comparable to those of scale-specific EDSR, even with fewer parameters than the combined scale-specific EDSR models.
In the paper Fast, Accurate, and Lightweight Super-Resolution with Cascading Residual Network, the authors have proposed the following advancements on top of a traditional residual network:
The global connections in CARN are visualized above. Each cascading block ends with a 1×1 convolution that receives inputs from all previous cascading blocks as well as the initial input, resulting in an effective transfer of information.
Every residual block in a cascading block ends in a 1×1 convolution with connections from all previous residual blocks along with the main input, similar to how global cascading works.
The residual block in ResNet is replaced by a newly designed Residual-E block inspired by the depthwise convolutions in MobileNet. Instead of depthwise convolutions, group convolutions are used, and the results show a 1.8x to 14x reduction in the number of computations, depending on the group size.
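A rough sketch of such a grouped-convolution residual block is shown below. The exact layer ordering and the group count are illustrative rather than the paper's precise Residual-E configuration, and Conv2D's groups argument requires TensorFlow 2.3 or newer.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_e_block(x, filters=64, groups=4):
    """Two grouped 3x3 convolutions, a pointwise 1x1 convolution, and a skip.
    Assumes x already has `filters` channels (divisible by `groups`)."""
    skip = x
    x = layers.Conv2D(filters, 3, padding="same", groups=groups, activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", groups=groups, activation="relu")(x)
    x = layers.Conv2D(filters, 1, padding="same")(x)
    return layers.ReLU()(layers.Add()([skip, x]))
```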
To further reduce the number of parameters, a shared residual block (recursive block) is used, cutting the number of parameters by up to a factor of three. As seen in (d) above, a recursive shared block helps reduce the total number of parameters.
A multi-stage design is considered in a few architectures to improve their performance by dealing with feature extraction separately in the low-resolution and high-resolution space. The first stage predicts the coarse features, while the later stage improves on them. Let’s discuss an architecture involving one of these multi-stage networks.
The figure above illustrates the structure of the BTSRN, which is composed of two stages: a low-resolution (LR) stage and a high-resolution (HR) stage. The LR stage includes six residual blocks, while the HR stage consists of four blocks. Convolution operations in the HR stage require more computational power due to the larger input size. The number of blocks in both stages is strategically chosen to strike a balance between accuracy and performance.
The output of the LR stage is upsampled before being sent to the HR stage. This is done by adding the outputs of the Deconvolution layer and Nearest Neighbor upsampling.
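A minimal sketch of this upsampling step in Keras could look like the following; it assumes the incoming feature map already has `filters` channels so the two outputs can be summed, and the kernel size is arbitrary.

```python
import tensorflow as tf
from tensorflow.keras import layers

def upsample_lr_to_hr(x, filters=64):
    """Sum of a learned deconvolution and parameter-free nearest-neighbor upsampling."""
    deconv = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
    nearest = layers.UpSampling2D(size=2, interpolation="nearest")(x)
    return layers.Add()([deconv, nearest])
```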
The authors propose a novel residual block named PConv, as seen in (d) in the figure above. Based on the results, the proposed block achieves a good trade-off between accuracy and performance.
Similar to EDSR, Batch Normalization is avoided to prevent re-centering and re-scaling, which is found to be detrimental. This is because super-resolution is a regression task, and thus, target outputs are highly correlated with inputs’ first-order statistics.
Recursive networks employ shared network parameters in convolutional layers to reduce their memory footprint, as seen in CARN-M above. Let’s discuss a few more architectures involving recursive units.
Deep Recursive Convolutional Network (DRCN) involves applying the same convolution layer multiple times. As can be seen in the figure above, the convolutional layers in the residual block are shared.
The outputs from all the intermediate shared convolutional blocks and the input are sent to the reconstruction layer, which generates the high-resolution image using all of the inputs. Since multiple inputs are used to generate the output, this architecture can be considered an ensemble of networks.
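The sketch below captures the recursive idea, assuming a single shared convolutional layer applied repeatedly after an initial embedding; the weighted-ensemble reconstruction of the paper is simplified here to a concatenation followed by a 1×1 convolution.

```python
import tensorflow as tf
from tensorflow.keras import layers

def drcn_like(inputs, filters=64, recursions=8):
    # Initial embedding so the shared layer always sees `filters` channels.
    h = layers.Conv2D(filters, 3, padding="same", activation="relu")(inputs)
    shared = layers.Conv2D(filters, 3, padding="same", activation="relu")
    intermediate = []
    for _ in range(recursions):
        h = shared(h)               # the same layer (same weights) reused every step
        intermediate.append(h)
    # Simplified reconstruction: fuse all intermediate outputs with a 1x1 convolution.
    fused = layers.Concatenate()(intermediate)
    return layers.Conv2D(1, 1, padding="same")(fused)
```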
Deep Recursive Residual Network (DRRN) is an improvement over DRCN, using residual blocks instead of simple convolutional layers. The parameters in every residual block are shared with the other residual blocks, as seen in the image above.
As the graph shows, DRRN outperforms SRCNN, ESPCN, VDSR, and DRCN while having a comparable number of parameters.
CNNs generally produce their output in a single shot, but obtaining a high-resolution image with a large scale factor (say 8x) is a difficult task for a neural network. To solve this, some network architectures increase the resolution of images in steps. Now, let’s discuss a few networks that follow this style.
LAPSRN, or MS-LAPSRN, consists of a Laplacian pyramid structure that can upscale images to 2x, 4x, and 8x using a step-by-step approach.
As can be seen in the above figure, LAPSRN consists of multiple stages. The network has two branches: the Feature Extraction Branch and the Image Reconstruction Branch. Each iterative stage consists of a Feature Embedding Block and a Feature Upsampling Block, as seen in the figure below. The input image is passed through a feature embedding layer to extract features in the low-resolution space, which are then upsampled using transposed convolution. The learned output is a residual image, which is added to the interpolated input to get the high-resolution image. The output of the Feature Upsampling Block is also passed to the next stage, which refines the high-resolution output of this stage and scales it to the next level. Since lower-resolution outputs are reused in refining later stages, this shared learning helps the network perform better.
To reduce the network’s memory footprint, the parameters in Feature Embedding, Feature Upsampling, etc., are shared across the stages recursively.
Within the feature embedding block, an individual residual block consists of shared convolution parameters (shown in the figure above) to further reduce the number of parameters.
The authors argued that since each LR input can have multiple HR representations, an L2 loss function produces a smoothed average over all of them, making the images look blurry. To deal with this, the Charbonnier loss function, which handles outliers better, is used instead.
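The Charbonnier loss is simple to write down: it is a smooth approximation of L1 that penalizes large errors less aggressively than L2. The epsilon value below is a commonly used small constant; treat the function as a sketch.

```python
import tensorflow as tf

def charbonnier_loss(y_true, y_pred, epsilon=1e-3):
    """Charbonnier loss: sqrt((y_true - y_pred)^2 + epsilon^2), averaged."""
    return tf.reduce_mean(tf.sqrt(tf.square(y_true - y_pred) + epsilon ** 2))
```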
We’ve observed a clear trend: deeper networks tend to produce better results. However, training these deeper networks can be challenging due to issues with information flow. Residual networks help to mitigate this problem by incorporating shortcut connections. Additionally, multi-branch networks enhance information flow by utilizing multiple branches, allowing information to circulate through different pathways. This results in the integration of information from various receptive fields, leading to improved training outcomes. Let’s explore some networks that utilize this technique.
Like other super-resolution frameworks, the Cascaded Multi-Scale Cross-Network (CMSC) has a feature extraction layer, cascaded sub-nets, and a reconstruction layer, as shown below.
The cascaded sub-network consists of two branches, as seen in (b). Each branch has different-sized filters, resulting in a different receptive field. Fusing information from different receptive fields across the module results in better information flow. Multiple MSC blocks are stacked one after another to iteratively reduce the difference between the output and the HR image. The outputs from all the blocks are passed together to a reconstruction block to get the final HR output.
Information Distillation Network (IDN) is proposed to achieve fast and accurate results for super-resolution. Like other multi-branch networks, IDN utilizes the capability of multiple branches to improve the information flow in a deep network.
The IDN architecture consists of FBlock for feature extraction, multiple DBlocks, and RBlock for transposed convolution to achieve learned upscaling. The paper’s contribution is in the DBlock, which consists of two units: the Enhancement Unit and the Compression Unit.
The structure of the enhancement unit is illustrated in the figure above. The input is processed through three convolutional filters, each with a size of 3×3, and then divided into slices. One part of the slice is concatenated with the original input to create a shortcut connection to the final layer. The remaining slice is passed through another set of 3×3 convolutional filters. The final output is generated by summing both the inputs and the output from the final layer. This structure effectively captures both short-range and long-range information simultaneously.
The compression unit takes the output of the enhancement unit and passes it through a 1×1 convolutional filter to compress (or reduce) the number of channels.
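A heavily simplified sketch of the DBlock idea is shown below. The channel counts (64 input channels, a 48-channel first stage, a 16-channel slice) are illustrative rather than the paper's exact configuration; the point is the slice-and-concatenate short path, the convolutional long path, their summation, and the 1×1 compression.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dblock(x, channels=64, slice_channels=16):
    f = x
    for _ in range(3):                                   # enhancement unit, first conv group
        f = layers.Conv2D(48, 3, padding="same", activation="relu")(f)
    short_slice = layers.Lambda(lambda t: t[..., :slice_channels])(f)
    long_slice = layers.Lambda(lambda t: t[..., slice_channels:])(f)
    short_path = layers.Concatenate()([x, short_slice])  # 64 + 16 = 80 channels
    g = long_slice
    for _ in range(3):                                   # second conv group
        g = layers.Conv2D(channels + slice_channels, 3, padding="same", activation="relu")(g)
    fused = layers.Add()([short_path, g])                # both paths are 80 channels wide
    return layers.Conv2D(channels, 1, padding="same")(fused)  # compression unit
```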
The networks discussed so far give equal importance to all spatial locations and channels. However, selective attention to different regions in an image can yield much better results. We shall now discuss a few architectures that help achieve this.
SelNet introduces an innovative Selection Unit at the end of convolutional blocks to help determine which information to pass on selectively. A Selection Module consists of a ReLU activation function, followed by a 1×1 convolution and a sigmoid gating mechanism. The Selection Unit is created by multiplying the Selection Module’s output with an identity connection.
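In code, the Selection Unit amounts to a learned, per-element gate on an identity branch. The sketch below assumes the incoming feature map has `filters` channels so the gate and the identity connection can be multiplied elementwise.

```python
import tensorflow as tf
from tensorflow.keras import layers

def selection_unit(x, filters=64):
    """ReLU -> 1x1 conv -> sigmoid gate, multiplied with the identity connection."""
    gate = layers.ReLU()(x)
    gate = layers.Conv2D(filters, 1, padding="same")(gate)
    gate = layers.Activation("sigmoid")(gate)
    return layers.Multiply()([x, gate])
```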
A sub-pixel layer, similar to ESPCN, is placed towards the network’s end to achieve learned upscaling. The network learns a residual high-resolution image, which is then added to the interpolated input to produce the final high-resolution image.
Throughout this article, we have observed that deeper networks improve performance. To train deeper networks, Residual Channel Attention Networks (RCAN) propose residual-in-residual (RIR) modules with channel attention.
Let’s discuss these more in detail.
The input in RCAN is passed through a single convolutional filter for feature extraction, which is then bypassed towards the final layer with a long skip connection. The long skip connection is added to carry the low-frequency signals from the LR image, while the main network (i.e., RIR) focuses on capturing the high-frequency information.
RIR consists of multiple RG blocks, each with a structure shown in the above figure. Each RG block has multiple RCAB modules and a skip connection, referred to as a short skip connection, to help transfer the low-frequency signal.
RCAB has the structure shown above: a convolutional block followed by a channel attention mechanism built around global average pooling (GAP), similar to Squeeze-and-Excitation (SE) blocks. The pooled channel statistics are passed through a sigmoid gating function, and the resulting channel-wise attention weights are multiplied with the output of the convolutional block. This product is then added to the shortcut input connection to get the final output of the RCAB block.
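A sketch of an RCAB-style block is shown below: a convolutional block, squeeze-and-excitation-style channel attention, and a shortcut addition. The reduction ratio of 16 is a typical choice, the input is assumed to already have `filters` channels, and GlobalAveragePooling2D with keepdims requires TensorFlow 2.6 or newer.

```python
import tensorflow as tf
from tensorflow.keras import layers

def rcab(x, filters=64, reduction=16):
    skip = x
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    # Channel attention: global average pooling -> bottleneck -> sigmoid gate.
    w = layers.GlobalAveragePooling2D(keepdims=True)(x)
    w = layers.Conv2D(filters // reduction, 1, activation="relu")(w)
    w = layers.Conv2D(filters, 1, activation="sigmoid")(w)
    x = layers.Multiply()([x, w])        # reweight channels
    return layers.Add()([skip, x])
```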
The networks discussed optimize the pixel difference between predicted and output HR images. Although this metric works fine, it is not ideal; humans don’t distinguish images by pixel difference but by perceptual quality. Generative models (or GANs) try to optimize the perceptual quality to produce images that are pleasant to the human eye. Finally, let’s take a look at a few GAN-related architectures.
SRGAN uses a GAN-based architecture to generate visually pleasing images. It uses the SRResNet network architecture as its backbone and employs a multi-task loss to refine the results. The loss consists of three terms:
Although the resulting PSNR values were comparatively lower, the model achieved a higher MOS (Mean Opinion Score), i.e., better perceptual quality.
EnhanceNet uses a Fully Convolutional Network with residual learning. In addition to the losses described above for SRGAN, it employs a texture loss similar to the one used in Style Transfer to capture finer texture information.
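The texture loss boils down to matching Gram matrices of feature maps (typically taken from a pretrained network such as VGG) computed on image patches. The sketch below shows the Gram-matrix computation and the resulting loss; how the features are extracted and patched is left out.

```python
import tensorflow as tf

def gram_matrix(features):
    """Gram matrix of a feature map of shape (batch, H, W, C)."""
    b, h, w, c = tf.unstack(tf.shape(features))
    flat = tf.reshape(features, [b, h * w, c])
    gram = tf.matmul(flat, flat, transpose_a=True)       # (batch, C, C)
    return gram / tf.cast(h * w * c, features.dtype)

def texture_loss(real_features, fake_features):
    """Mean squared difference between the Gram matrices of HR and SR features."""
    return tf.reduce_mean(tf.square(gram_matrix(real_features) - gram_matrix(fake_features)))
```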
ESRGAN improves on SRGAN by adding a relativistic discriminator. Instead of simply predicting whether an image is real or fake, the discriminator is trained to estimate the probability that a real image is relatively more realistic than a generated one, which gives the generator a stronger training signal. Batch normalization in SRGAN has also been removed, and Dense Blocks (inspired by DenseNet) are used to improve information flow. These dense blocks are combined into a Residual-in-Residual Dense Block (RRDB).
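As a sketch, a relativistic average discriminator loss can be written as below: each logit is compared against the mean logit of the opposite class before applying the usual binary cross-entropy. This illustrates the general idea, not ESRGAN's full training objective.

```python
import tensorflow as tf

def relativistic_d_loss(real_logits, fake_logits):
    """Discriminator learns whether real images are more realistic than the average fake."""
    bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    real_vs_fake = real_logits - tf.reduce_mean(fake_logits)
    fake_vs_real = fake_logits - tf.reduce_mean(real_logits)
    return bce(tf.ones_like(real_vs_fake), real_vs_fake) + \
           bce(tf.zeros_like(fake_vs_real), fake_vs_real)
```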
The following are some of the common datasets used to train super-resolution networks.
In this section, we shall discuss various loss functions which can be used to train the networks.
In this section, we shall discuss the metrics used to compare the performance of different models.
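As a quick example, PSNR (Peak Signal-to-Noise Ratio), the metric used to evaluate SRCNN earlier, can be computed directly with TensorFlow's built-in helper; the tensors below are random placeholders.

```python
import tensorflow as tf

hr = tf.random.uniform([1, 64, 64, 3])                                        # ground truth
sr = tf.clip_by_value(hr + tf.random.normal(hr.shape, stddev=0.05), 0.0, 1.0)  # prediction
print(tf.image.psnr(sr, hr, max_val=1.0))  # in dB; higher is better
```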
By this point you might be like:
Let’s code one of the popular architectures we have discussed so far, ESPCN.
As we know, ESPCN consists of convolutional layers for feature extraction followed by sub-pixel convolution for upsampling. We use the TensorFlow depth_to_space function to perform the sub-pixel convolution.
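Below is a compact sketch of such an ESPCN-style model in Keras. The 5-3-3 kernel sizes, tanh activations, and 64/32 filter counts follow the configuration in the original paper, but consider the snippet a starting point rather than the exact training code used here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_espcn(scale=3, channels=1):
    inputs = tf.keras.Input(shape=(None, None, channels))
    # Feature extraction in the low-resolution space.
    x = layers.Conv2D(64, 5, padding="same", activation="tanh")(inputs)
    x = layers.Conv2D(32, 3, padding="same", activation="tanh")(x)
    # Produce channels * scale^2 feature maps, then rearrange depth to space.
    x = layers.Conv2D(channels * scale ** 2, 3, padding="same")(x)
    outputs = layers.Lambda(lambda t: tf.nn.depth_to_space(t, scale))(x)
    return tf.keras.Model(inputs, outputs)

model = build_espcn(scale=3)
model.compile(optimizer="adam", loss="mse")
```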
We will be using the DIV2K dataset to train the model. We split the 2K-resolution images into 17×17 patches to provide model input for training. The authors convert RGB images to the YCrCb format and then upscale the Y channel input using ESPCN. The Cr and Cb channels are upscaled using bicubic interpolation, and all the upscaled channels are stitched together to get the final HR image. Thus, while training, we only need to provide the Y channel of the low-resolution data and the corresponding high-resolution images to the model.
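A minimal sketch of the Y-channel preparation might look like the following; the BT.601 luminance weights are standard, while the helper names and the non-overlapping 17×17 patching are illustrative choices.

```python
import numpy as np

def extract_y_channel(rgb):
    """Luminance (Y) channel from an RGB array in [0, 1] of shape (H, W, 3),
    using the ITU-R BT.601 weights."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def extract_patches(y, patch_size=17, stride=17):
    """Cut a Y-channel image into non-overlapping patches for training."""
    patches = []
    for i in range(0, y.shape[0] - patch_size + 1, stride):
        for j in range(0, y.shape[1] - patch_size + 1, stride):
            patches.append(y[i:i + patch_size, j:j + patch_size])
    return np.stack(patches)[..., np.newaxis]   # shape (N, 17, 17, 1)
```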
In this article, we covered super-resolution, its applications, the taxonomy of super-resolution algorithms, and their advantages and limitations. Then, we looked at some of the publicly available datasets to get started, as well as the different kinds of loss functions and metrics that can be used. Finally, we went through the code for the ESPCN architecture.