Real-Time Object Detection: A Comprehensive Guide to Implementing Baidu's RT-DETR

Introduction

In this article we introduce the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector, which addresses the high computational cost of existing DETRs. The recent research paper DETRs Beat YOLOs on Real-Time Object Detection, from Baidu Inc., analyzed the negative impact of non-maximum suppression (NMS) on real-time detectors and proposed an efficient hybrid encoder for multi-scale feature processing, along with IoU-aware query selection to enhance performance. RT-DETR-L achieves 53.0% AP on COCO val2017 at 114 FPS, and RT-DETR-X achieves 54.8% AP at 74 FPS, outperforming YOLO detectors in both speed and accuracy. RT-DETR-R50 achieves 53.1% AP at 108 FPS, outperforming DINO-Deformable-DETR-R50 by 2.2% AP while running roughly 21 times faster.

[Image (Source)]

Object detection is the task of identifying and localizing objects in an image or video. Object detection models have practical applications across many domains, such as:

  1. Autonomous Vehicles: Object detection is crucial for enabling autonomous vehicles to identify and track pedestrians, vehicles, traffic signs, and other objects on the road.
  2. Retail Analytics: In retail, object detection helps track and analyze customer behavior, monitor inventory levels, and reduce theft through the identification of suspicious activities.
  3. Facial Recognition: Object detection is a fundamental component of facial recognition systems, used in applications such as access control, identity verification, and security.
  4. Environmental Monitoring: Object detection models can be applied in environmental monitoring to track and analyze wildlife movements, monitor deforestation, or assess changes in ecosystems.
  5. Gesture Recognition: Object detection is used to interpret and recognize human gestures, facilitating interaction with devices through gestures in applications like gaming or virtual reality.
  6. Agriculture: Object detection models can assist in crop monitoring, pest detection, and yield estimation by identifying and analyzing objects such as plants, fruits, or pests in agricultural images.

These are just a few examples; there are many more use cases where object detection plays a crucial role.

Recently, transformer-based detectors have achieved remarkable performance by utilizing Vision Transformers (ViT). RT-DETR processes multi-scale features effectively by separating intra-scale interaction from cross-scale fusion, and it is highly adaptable, allowing the inference speed to be flexibly adjusted through different numbers of decoder layers without retraining.

What is new in RT-DETR

To enable real-time object detection, a streamlined hybrid encoder substitutes the original transformer encoder. This redesigned encoder efficiently manages the processing of multi-scale features by separating intra-scale interaction and cross-scale fusion, allowing for effective feature processing across different scales.

To further enhance performance, IoU-aware query selection is introduced during the training phase; it provides higher-quality initial object queries to the decoder through IoU constraints. Additionally, the proposed detector allows the inference speed to be conveniently adjusted using different numbers of decoder layers, leveraging the DETR architecture’s decoder design. This streamlines practical deployment of the real-time detector without requiring retraining, and it makes RT-DETR the new state of the art for real-time object detection. Existing real-time object detectors can be roughly classified into two categories: anchor-based and anchor-free.

Anchor-Based Object Detectors

  • In anchor-based detectors, predefined anchor boxes or regions of interest are used to predict the presence of objects and their bounding boxes.
  • These anchor boxes are generated at various scales and aspect ratios across the image.
  • The detector predicts two main components for each anchor box: class probabilities (is there an object or not) and bounding box offsets (adjustments to the anchor box to tightly fit the object); a decoding sketch follows this list.
  • Popular examples of anchor-based detectors include Faster R-CNN, R-FCN (Region-based Fully Convolutional Networks), and RetinaNet.
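To make the offset idea concrete, here is a minimal sketch of the standard anchor-box decoding used by detectors such as Faster R-CNN. The function and variable names are illustrative, not taken from any particular library.

import numpy as np

def decode_anchor(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (cx, cy, w, h)."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    # Shift the center proportionally to the anchor size,
    # and rescale width/height exponentially (the usual parameterization).
    return (cx + tx * w, cy + ty * h, w * np.exp(tw), h * np.exp(th))

# Example: a 100x100 anchor centered at (50, 50), nudged right and widened
print(decode_anchor((50, 50, 100, 100), (0.1, 0.0, 0.2, 0.0)))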

Anchor-Free Object Detectors

  • Anchor-free detectors do not rely on predefined anchor boxes. Instead, they directly predict bounding boxes and object presence without the need for anchor boxes.
  • These detectors often employ keypoint-based or center-ness prediction methods.
  • Keypoint-based methods identify key points (corners, center, etc.) and use them to estimate object bounding boxes.
  • Center-ness prediction focuses on determining the likelihood of a pixel being the center of an object, and bounding boxes are constructed based on these centers (see the sketch after this list).
  • Popular examples of anchor-free detectors include CenterNet and FCOS (Fully Convolutional One-Stage).
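Similarly, the center-based idea can be sketched in a few lines. Here a pixel predicts its distances to the four sides of the box, in the style of FCOS; the names are illustrative.

def decode_center(px, py, distances):
    """FCOS-style decoding: pixel (px, py) predicts distances
    (left, top, right, bottom) to the four sides of its object's box."""
    l, t, r, b = distances
    return (px - l, py - t, px + r, py + b)  # box as (x1, y1, x2, y2)

# Example: pixel (60, 40) at the center of an 80x50 box
print(decode_center(60, 40, (40, 25, 40, 25)))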

End-to-end object detection was first proposed by Carion et al. with DETR (DEtection TRansformer), a Transformer-based object detector that has attracted significant attention for its distinctive design. DETR removes the hand-designed anchor and Non-Maximum Suppression (NMS) components found in traditional detection pipelines. Instead, it uses bipartite matching to directly predict a one-to-one set of objects. This approach simplifies the detection pipeline and removes the performance bottleneck associated with NMS. However, DETR faces challenges of its own, including slow training convergence and difficulty in optimizing queries.
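The one-to-one set prediction at the heart of DETR relies on bipartite (Hungarian) matching between predictions and ground-truth objects. Below is a minimal sketch using SciPy; the cost matrix here is a toy placeholder, whereas DETR's real matching cost combines classification and box (L1 + GIoU) terms.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows are predictions, columns are ground-truth objects.
cost = np.array([
    [0.2, 0.9, 0.8],
    [0.7, 0.1, 0.6],
    [0.5, 0.8, 0.3],
])

# Hungarian matching finds the one-to-one assignment with minimum total cost.
pred_idx, gt_idx = linear_sum_assignment(cost)
for p, g in zip(pred_idx, gt_idx):
    print(f"prediction {p} <-> ground truth {g} (cost {cost[p, g]:.1f})")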

Use of NMS

Non-Maximum Suppression (NMS) is a widely used post-processing algorithm in object detection. It resolves overlapping prediction boxes by first filtering out those with scores below a specified threshold and then discarding lower-scored boxes whenever their Intersection over Union (IoU) with a higher-scored box exceeds a second threshold. NMS iterates over all boxes for each category, so its execution time depends on the number of input prediction boxes and on the two hyperparameters: the score threshold and the IoU threshold.
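To make the role of those two hyperparameters explicit, here is a minimal NumPy sketch of greedy NMS for a single class; production implementations are further vectorized and run per category.

import numpy as np

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Greedy NMS over boxes in (x1, y1, x2, y2) format.

    Returns indices (into the score-filtered arrays) of the kept boxes."""
    keep_mask = scores >= score_thresh           # hyperparameter 1: score threshold
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]               # process highest-scored boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current top box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou < iou_thresh]      # hyperparameter 2: IoU threshold
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much

Because the loop length depends on how many boxes survive the score threshold, NMS latency varies from image to image; this is precisely the post-processing overhead that RT-DETR eliminates.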

Model Architecture

The RT-DETR model comprises a backbone, a hybrid encoder, and a transformer decoder with auxiliary prediction heads. The architecture feeds the features from the last three stages of the backbone {S3, S4, S5} into the encoder, which applies intra-scale interaction and cross-scale fusion to transform the multi-scale features into a sequence of image features. IoU-aware query selection is then applied to choose a fixed number of image features from the encoder output as initial queries for the decoder. The decoder, along with the auxiliary prediction heads, iteratively refines these queries to generate object boxes and confidence scores.
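A simplified sketch of the query-selection step just described: pick the top encoder features by classification score to initialize the decoder queries. In RT-DETR the scores are additionally supervised with an IoU constraint during training, so a high score also implies a well-localized box; the tensor shapes below are illustrative.

import torch

def select_queries(encoder_features, class_logits, k=300):
    """Pick the top-k encoder tokens as initial object queries.

    encoder_features: (num_tokens, d_model) flattened image features
    class_logits:     (num_tokens, num_classes) per-token class scores
    """
    scores = class_logits.sigmoid().max(dim=-1).values  # best class score per token
    topk = torch.topk(scores, k=k).indices
    return encoder_features[topk]                       # (k, d_model) initial queries

queries = select_queries(torch.randn(8400, 256), torch.randn(8400, 80))
print(queries.shape)  # torch.Size([300, 256])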

[Image: Overview of RT-DETR (Source)]

A novel Efficient Hybrid Encoder is proposed for RT-DETR. This encoder consists of two modules: the Attention-based Intra-scale Feature Interaction (AIFI) module and the CNN-based Cross-scale Feature Fusion Module (CCFM). Further, to generate a scalable version of RT-DETR, the ResNet backbone was replaced with HGNetv2.
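The division of labor between the two modules can be sketched as follows: AIFI applies self-attention only to the highest-level feature map S5, where attention is cheapest and the features are most semantic, while CCFM fuses across scales with convolutions. This skeleton is a simplification for illustration; the layer internals stand in for the paper's actual blocks.

import torch
import torch.nn as nn

class HybridEncoderSketch(nn.Module):
    """Illustrative skeleton of AIFI + CCFM; internals are simplified."""

    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        # AIFI: a transformer encoder layer, applied to S5 only.
        self.aifi = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # CCFM stand-in: 1x1 convs as placeholders for the fusion blocks.
        self.fuse = nn.ModuleList(nn.Conv2d(d_model, d_model, 1) for _ in range(3))

    def forward(self, s3, s4, s5):
        b, c, h, w = s5.shape
        # Intra-scale interaction: attention over flattened S5 tokens.
        tokens = s5.flatten(2).permute(0, 2, 1)        # (B, H*W, C)
        s5 = self.aifi(tokens).permute(0, 2, 1).reshape(b, c, h, w)
        # Cross-scale fusion: upsample and merge top-down (simplified).
        s4 = self.fuse[1](s4 + nn.functional.interpolate(s5, size=s4.shape[-2:]))
        s3 = self.fuse[0](s3 + nn.functional.interpolate(s4, size=s3.shape[-2:]))
        return s3, s4, self.fuse[2](s5)

s3, s4, s5 = (torch.randn(1, 256, n, n) for n in (80, 40, 20))
print([o.shape for o in HybridEncoderSketch()(s3, s4, s5)])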

Dataset Used

The model was trained on the COCO train2017 dataset and validated on COCO val2017. ResNet and HGNetv2 series backbones, pretrained on ImageNet with SSLD from PaddleClas, were used in the model. For IoU-aware query selection, the top 300 encoder features are chosen to initialize the object queries of the decoder. The training strategy and hyperparameters of the decoder closely align with the DINO approach. The detectors were trained using the AdamW optimizer, and data augmentation was conducted with random {color distort, expand, crop, flip, resize} operations.
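For orientation, the optimizer and augmentation setup described above looks roughly like the following in PyTorch and torchvision; the model stand-in, learning rate, and jitter strengths are placeholders, not the paper's exact values.

import torch
import torchvision.transforms as T

# Stand-in module so the snippet runs; in practice this is the full detector.
model = torch.nn.Conv2d(3, 3, 3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# Image-only analogues of the random {color distort, expand, crop, flip, resize}
# operations; a real detection pipeline must transform the boxes as well.
augment = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # color distort
    T.RandomResizedCrop(640),                                     # crop + resize
    T.RandomHorizontalFlip(p=0.5),                                # flip
])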

Comparisons with Other SOTA Models

The RT-DETR, when compared to other real-time and end-to-end object detectors, successfully demonstrates superior performance. Specifically, RT-DETR-L achieves 53.0% Average Precision (AP) at 114 Frames Per Second (FPS), and RT-DETR-X achieves 54.8% AP at 74 FPS. These results outperform current state-of-the-art YOLO detectors in terms of both speed and accuracy. Additionally, RT-DETR-R50 achieves 53.1% AP at 108 FPS, and RT-DETR-R101 achieves 54.3% AP at 74 FPS, surpassing the state-of-the-art end-to-end detectors with the same backbone in both speed and accuracy. RT-DETR allows for flexible adjustment of inference speed by applying different decoder layers, all without requiring retraining. This feature enhances the practical applicability of the real-time detector.

Ultralytics RT-DETR Pre-trained Model

Ultralytics is committed to the development of top-notch artificial intelligence models globally. Their open-source projects on GitHub showcase state-of-the-art solutions across a diverse array of AI tasks, encompassing detection, segmentation, classification, tracking, and pose estimation.

The Ultralytics Python API provides pre-trained RT-DETR models at different scales:

  • RT-DETR-L: 53.0% AP on COCO val2017, 114 FPS on T4 GPU
  • RT-DETR-X: 54.8% AP on COCO val2017, 74 FPS on T4 GPU

The example code snippet below offers straightforward training and inference illustrations for RT-DETR using an Ultralytics pre-trained model. For comprehensive documentation on these and other modes, refer to the Predict, Train, Val, and Export pages in the documentation.

Use pip to install the package.

!pip install ultralytics

from ultralytics import RTDETR

# Load a COCO-pretrained RT-DETR-l model
model = RTDETR('rtdetr-l.pt')

# Display model information (optional)
model.info()

# Train the model on the COCO8 example dataset for 100 epochs
results = model.train(data='coco8.yaml', epochs=100, imgsz=640)

# Run inference with the RT-DETR-l model on the 'bus.jpg' image
results = model('path/to/bus.jpg')

[Image: A snippet displaying the model summary]
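Each element of results exposes the detections through the Ultralytics Results API; a quick way to inspect them:

# Inspect the first result: box corners, confidences, and class indices
boxes = results[0].boxes
print(boxes.xyxy)  # (N, 4) tensor of (x1, y1, x2, y2) box corners
print(boxes.conf)  # (N,) confidence scores
print(boxes.cls)   # (N,) class indices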

Let us now run inference on an image and on a video saved in a local folder!


results = model.predict('https://m.media-amazon.com/images/I/61fNoq7Y6+L._AC_UF894,1000_QL80_.jpg', show=True)

[Image: Inference result on the sample image]


results = model.predict(source='input_video/input_video.mp4', show=True)
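To keep the annotated video on disk rather than only displaying it, pass the save flag supported by the Ultralytics predict mode:

# Save the annotated output video in addition to displaying it
results = model.predict(source='input_video/input_video.mp4', show=True, save=True)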

Conclusions

In this article we discussed Baidu’s Real-Time Detection Transformer (RT-DETR). The model stands out for its end-to-end object detection, delivering real-time performance without compromising accuracy. RT-DETR harnesses the capabilities of vision transformers to handle multi-scale features effectively. The model’s key features include the Efficient Hybrid Encoder, IoU-aware query selection, and adaptable inference speed. We used the pre-trained models from Ultralytics to demonstrate the model’s performance on images and videos.

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
