DETR (Detection Transformer) is a deep learning architecture first proposed as a new approach to object detection. It’s the first object detection framework to successfully integrate transformers as a central building block in the detection pipeline.
DETR departs sharply from the architecture of previous object detection systems. In this article, we look at how Detection Transformer (DETR) works and what sets it apart.
According to Wikipedia, object detection is a computer technology related to computer vision and image processing that detects instances of semantic objects of a particular class (such as humans, buildings, or cars) in digital images and videos.
It’s used in self-driving cars to detect lanes, other vehicles, and pedestrians. Object detection also helps with video surveillance and image search. Object detection algorithms use machine learning and deep learning, which let computers learn to recognize objects from many sample images and videos.
Object detection works by identifying and locating objects within an image or video. The process involves the following steps:
DETR (Detection Transformer) is a deep learning architecture proposed as a new approach to object detection and panoptic segmentation. It has several unique features.
DETR is an end-to-end trainable deep learning architecture for object detection that utilizes a transformer block. The model inputs an image and outputs a set of bounding boxes and class labels for each object query. It replaces the messy pipeline of hand-designed pieces with a single end-to-end neural network. This makes the whole process more straightforward and easier to understand.
DETR (Detection Transformer) is special primarily because it builds detection around a transformer and does without some standard components of traditional detectors, such as anchor boxes and Non-Maximum Suppression (NMS).
In traditional object detection models like YOLO and Faster R-CNN, anchor boxes play a pivotal role. These models need to predefine a set of anchor boxes, which represent a variety of shapes and scales that an object may have in the image. The model then learns to adjust these anchors to match the actual object bounding boxes.
Anchor boxes significantly improve these models’ accuracy, especially in detecting small-scale objects. However, the important caveat is that the sizes and aspect ratios of these boxes must be tuned manually, which makes anchor design a heuristic process.
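To make the idea of predefined anchors concrete, here is a minimal sketch of how a set of anchor shapes might be generated from hand-picked scales and aspect ratios. The function name `generate_anchors` and the values for `base_size`, `scales`, and `aspect_ratios` are illustrative assumptions, not the settings of any particular detector.

```python
import math


def generate_anchors(base_size, scales, aspect_ratios):
    """Return one (width, height) anchor shape per (scale, ratio) combination.

    All arguments are hand-picked hyperparameters (hypothetical values here).
    """
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in aspect_ratios:  # ratio is width / height
            width = math.sqrt(area * ratio)
            height = math.sqrt(area / ratio)
            anchors.append((width, height))
    return anchors


# Three scales times three aspect ratios gives nine anchor shapes.
anchors = generate_anchors(base_size=16, scales=[8, 16, 32], aspect_ratios=[0.5, 1.0, 2.0])
for width, height in anchors:
    print(f"{width:7.1f} x {height:7.1f}")
```

Every value passed to the function is a manual choice, which is the tuning burden that DETR avoids.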
Similarly, NMS is another hand-engineered component used in YOLO and Faster R-CNN. It’s a post-processing step to ensure that each object gets detected only once by eliminating weaker overlapping detections. While it’s necessary for these models due to the practice of predicting multiple bounding boxes around a single object, it could also cause some issues. Selecting thresholds for NMS is not straightforward and could influence the final detection performance. The traditional object detection process can be visualized in the image below:
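As a rough illustration of the NMS step described above, the sketch below implements a simple greedy variant in plain Python. The helper names `iou` and `nms`, the example boxes, and the thresholds are assumptions chosen for the example. Production detectors typically use optimized library routines instead.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes around one object, plus one separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]

print(nms(boxes, scores, iou_threshold=0.5))  # [0, 2]: the duplicate is suppressed
print(nms(boxes, scores, iou_threshold=0.9))  # [0, 1, 2]: the duplicate survives
```

The two calls differ only in the threshold, yet they return different detections, which is why threshold selection affects final performance.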
On the other hand, DETR eliminates the need for anchor boxes, managing to detect objects directly with a set-based global loss. All objects are detected in parallel, simplifying the learning and inference process. This approach reduces the need for task-specific engineering, thereby reducing the detection pipeline’s complexity.
Instead of relying on NMS to prune multiple detections, DETR uses a transformer to predict a fixed number of detections in parallel. It applies a set prediction loss to ensure each object gets detected only once. This approach effectively removes the need for NMS. We can visualize the process in the image below:
The lack of anchor boxes simplifies the model but could also reduce its ability to detect small objects because it cannot focus on specific scales or ratios. Nevertheless, removing NMS prevents the potential mishaps that could occur through improper thresholding. It also makes DETR more easily end-to-end trainable, thus enhancing its efficiency.
One thing about DETR is that its attention-based structure makes the model more understandable. We can easily see which parts of an image the model focuses on when it makes a prediction. It not only enhances accuracy but also aids in understanding the underlying mechanisms of these computer vision models.
This understanding is crucial for improving the models and identifying potential biases. DETR broke new ground in taking transformers from NLP into the vision world, and its interpretable predictions are a nice bonus from the attention approach. The unique structure of DETR has several real-world applications where it has proved to be beneficial:
DETR utilizes a set-based global loss function that compels unique predictions through bipartite matching, a distinctive aspect of DETR. This helps ensure that the model produces accurate and reliable predictions. The set-based loss matches the predicted bounding boxes with the ground truth boxes, so that each ground truth box is matched with exactly one predicted box and no predicted box is matched with more than one ground truth box.
The diagram represents the process of computing the set-based loss.
Following the diagram above, the process starts with an input stage where predicted and ground truth objects are fed into the system. Next, a cost matrix is computed that scores each possible pairing of a prediction with a ground truth object.
The Hungarian algorithm then finds the optimal matching between predicted and ground-truth objects, and the classification and bounding box losses are computed for each matched pair.
Predictions that fail to find a counterpart are assigned the “no object” label, and their classification loss is evaluated against that label. All these losses are summed to compute the total set-based loss, which is the output of the process.
This unique matching forces the model to make distinct predictions for each object. Because the complete set of predictions is evaluated together against the ground truths, the network is driven to make coherent detections across the entire image. So, the special pairing loss provides supervision at the level of the whole prediction set, ensuring robust and consistent object localization.
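The matching and loss steps described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not DETR's implementation: it finds the optimal assignment by brute force over permutations, as a stand-in for the Hungarian algorithm that is only practical for tiny inputs, and it uses only a class-probability term and an L1 box term. The names `match_cost`, `match`, and `set_loss`, the `box_weight` value, and the toy predictions are all hypothetical. In practice, `scipy.optimize.linear_sum_assignment` is a common way to solve the same assignment problem efficiently.

```python
import math
from itertools import permutations

NO_OBJECT = 2  # assumption: the last class index means "no object"


def l1_distance(box_a, box_b):
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(pred, target, box_weight=5.0):
    """Cost of pairing one prediction with one ground-truth object (simplified)."""
    class_cost = -pred["probs"][target["label"]]
    return class_cost + box_weight * l1_distance(pred["box"], target["box"])


def match(preds, targets):
    """Return {prediction index: target index} with minimal total cost.

    Brute force over permutations; assumes len(preds) >= len(targets).
    """
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(preds)), len(targets)):
        # perm[t] is the prediction assigned to target t
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return {p: t for t, p in enumerate(best_perm)}


def set_loss(preds, targets):
    """Sum classification and box losses over matched pairs; unmatched
    predictions are scored against the "no object" label."""
    assignment = match(preds, targets)
    total = 0.0
    for p, pred in enumerate(preds):
        if p in assignment:
            target = targets[assignment[p]]
            total += -math.log(pred["probs"][target["label"]])
            total += l1_distance(pred["box"], target["box"])
        else:
            total += -math.log(pred["probs"][NO_OBJECT])
    return total


# Toy data: three predictions (object queries) and two ground-truth objects.
# Boxes are (cx, cy, w, h) in normalized coordinates.
preds = [
    {"probs": [0.1, 0.8, 0.1], "box": (0.2, 0.2, 0.1, 0.1)},
    {"probs": [0.9, 0.05, 0.05], "box": (0.6, 0.6, 0.3, 0.4)},
    {"probs": [0.2, 0.1, 0.7], "box": (0.5, 0.5, 0.2, 0.2)},
]
targets = [
    {"label": 0, "box": (0.6, 0.6, 0.3, 0.4)},
    {"label": 1, "box": (0.2, 0.2, 0.1, 0.1)},
]

assignment = match(preds, targets)
unmatched = [p for p in range(len(preds)) if p not in assignment]
loss = set_loss(preds, targets)
print(assignment, unmatched, round(loss, 4))
```

Prediction 2 has no counterpart here, so it contributes only a classification term against the “no object” label. The other two predictions are each paired with exactly one ground-truth object.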
We can look at the diagram of the DETR architecture below. We encode the image on one side and then pass it to the Transformer decoder on the other side. No crazy feature engineering or anything manual anymore. It’s all learned automatically from data by the neural network.
As shown in the image, DETR’s architecture consists of the following components:
The Transformer architecture adopted by DETR is shown in the picture below:
DETR brings some new concepts to the table for object detection. It uses object queries, together with keys and values, as part of the Transformer’s attention mechanism.
The number of object queries is set beforehand and doesn’t change based on how many objects are actually in the image. The queries are learned embeddings that are fed to the decoder. The keys and values come from the image features, which are extracted by the CNN backbone and refined by the Transformer encoder. The keys, combined with positional encodings, show where different spots are in the image, while the values hold information about features. In the decoder, each object query attends to these keys and values so the model can determine which parts of the image are most important for each prediction.
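To show how a fixed set of queries can read from image features, here is a minimal sketch of scaled dot-product attention in plain Python. The vectors are made-up toy values and the names `softmax` and `attend` are hypothetical. The actual model uses multi-head attention with learned projections and positional encodings, which this sketch leaves out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Each query produces a weighted mix of the values, weighted by query-key similarity."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Two object queries (the count is fixed in advance) and three image positions.
queries = [[4.0, 0.0], [0.0, 4.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # one key per image position
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one feature vector per position

outputs, weights = attend(queries, keys, values)
for i, w in enumerate(weights):
    print(f"query {i} attention over positions:", [round(x, 3) for x in w])
```

The attention weights for each query sum to one, and inspecting them shows which image positions a given query relies on. This is the same signal that makes DETR's predictions easier to interpret.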
Multi-head attention is central to how DETR works. It lets DETR understand complex relationships and connections between different objects in the image. Each attention head can focus on different parts of the image simultaneously.
The facebook/detr-resnet-50 model is an implementation of the DETR model. At its core, it’s powered by a transformer architecture.
Specifically, this model uses an encoder-decoder transformer and a backbone ResNet-50 convolutional neural network. This means it can analyze an image, detect various objects within it, and identify what those objects are.
The researchers trained this model on a vast dataset called COCO that has tons of labeled everyday images with people, animals, and cars. This way, the model learned to detect everyday real-world objects like a pro. The provided code demonstrates the usage of the DETR model for object detection.
from transformers import DetrImageProcessor, DetrForObjectDetection
import torch
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# you can specify the revision tag if you don't want the timm dependency
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50", revision="no_timm")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50", revision="no_timm")
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
# convert outputs (bounding boxes and class logits) to COCO API
# let's only keep detections with score > 0.9
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
Output:
The code loads an image from a URL and uses DetrImageProcessor to prepare it for the model. It loads the DetrForObjectDetection model from “facebook/detr-resnet-50” using the from_pretrained method. The revision="no_timm" parameter specifies the revision tag if the timm dependency is not desired. The processor prepares the image for input, and the model performs the object detection task. The outputs are then passed to the processor.post_process_object_detection method to obtain the final detection results.
DETR is a deep learning model for object detection that leverages the Transformer architecture, which was initially designed for natural language processing (NLP) tasks, as its main component to address the object detection problem in a unique and highly effective way.
DETR treats the object detection problem differently from traditional object detection systems like Faster R-CNN or YOLO. It simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, like spatial anchors or non-maximal suppression.
It uses a set-based global loss function that compels the model to generate unique predictions for each object by matching predictions to ground truth objects in pairs. This trick helps DETR make good predictions that we can trust.