YOLO continues to evolve in the world of computer vision, where real-time object detection powers applications across industries. Many start-ups are building products around algorithms like YOLO, from autonomous vehicles and surveillance systems to robotics, smart retail, and smart sunglasses. The YOLO series sits at the heart of these advancements and continues to push object detection forward.
YOLO (You Only Look Once) is a single-shot object detection model that processes an entire image in one pass, making it extremely fast and efficient. Unlike traditional object detection models that first propose regions and then classify them (like Faster R-CNN), YOLO directly predicts objects and their locations in a single neural network run. This single-pass design lets YOLO models detect objects with speed, accuracy, and efficiency.
YOLOv12 introduces novel advancements that make it faster, more accurate, and more efficient than ever before. By combining an attention-centric YOLO framework, optimized feature aggregation, and a refined architecture, YOLOv12 surpasses previous YOLO models and also outperforms end-to-end detectors like RT-DETR.
In this article, we will understand how YOLOv12 takes things to the next level.
With innovations like the Area Attention (A²) module, Residual Efficient Layer Aggregation Networks (R-ELAN), and FlashAttention, YOLOv12 outperforms its predecessors while maintaining low latency. Notably, YOLOv12-N achieves 40.6% mAP with just 1.64 ms latency on a T4 GPU, surpassing YOLOv10-N and YOLOv11-N at a comparable speed. It also beats end-to-end real-time detectors like RT-DETR and RT-DETRv2, running 42% faster while using fewer parameters and computations.
Let’s understand YOLOv12 in detail and learn how to use it with a DigitalOcean GPU Droplet powered by an H100.
YOLO models are typically evaluated with the following metrics:

- Mean Average Precision (mAP): the mean of the Average Precision over all classes, mAP = (1/N) Σ APᵢ, where AP (Average Precision) is computed for each class. A higher mAP indicates better detection accuracy.
- F1-score: the harmonic mean of precision and recall, F1 = 2 · (precision · recall) / (precision + recall). A higher F1-score indicates better performance.
- Intersection over Union (IoU): the overlap between a predicted box and its ground-truth box divided by the area of their union (see the sketch after this list). A higher IoU means better localization.
- Frames Per Second (FPS): measures inference speed (how fast the model processes images). Higher FPS means a faster model.
- FLOPs: measures computational cost. Lower FLOPs generally means a faster model, but might reduce accuracy.
- Latency: the time taken to process a single image. Lower latency means faster inference.
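For a concrete sense of how IoU is computed, here is a small, self-contained Python sketch (not tied to any particular library) for two boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Overlap area (zero if the boxes do not intersect)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction that partially overlaps a ground-truth box
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39
```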
YOLOv12 introduces three major advancements to increase speed, accuracy, and efficiency while keeping computational costs low. These improvements focus on better attention mechanisms, optimized feature aggregation, and architectural refinements.
Residual Efficient Layer Aggregation Networks (R-ELAN) refine the ELAN design in two main ways:

- Block-level residual design: a single, scaled shortcut wraps the entire aggregation block, which helps stabilize optimization as models grow larger.
- Redesigned feature aggregation: features are combined through a simplified aggregation structure that reduces parameters and computation.

A simplified sketch of the block-level residual idea is shown below.
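This is not the official R-ELAN implementation; it is a minimal PyTorch sketch, with an assumed small scaling factor on the shortcut, that only illustrates a residual connection wrapped around a whole block of layers rather than around each layer individually.

```python
import torch
import torch.nn as nn

class BlockLevelResidual(nn.Module):
    """Hypothetical sketch: one scaled residual connection spans an entire
    stack of conv layers instead of one shortcut per layer."""

    def __init__(self, channels: int, num_convs: int = 3, scale: float = 0.01):
        super().__init__()
        self.scale = scale  # assumed small scaling factor on the shortcut
        self.block = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            )
            for _ in range(num_convs)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The shortcut skips the whole block, scaled to keep early training stable.
        return x + self.scale * self.block(x)

# Quick shape check on a dummy feature map
if __name__ == "__main__":
    feats = torch.randn(1, 64, 32, 32)
    print(BlockLevelResidual(64)(feats).shape)  # torch.Size([1, 64, 32, 32])
```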
To further optimize YOLOv12’s speed and efficiency, the architecture has been refined in several key ways:
Based on the designs above, YOLOv12 comes in five model sizes optimized for modern GPUs: YOLOv12-N, S, M, L, and X.
The YOLO series introduces new advancements and innovations with each version. The early versions (YOLOv1–v3) laid the framework and architectural foundations, while later versions (YOLOv7 through YOLOv11) shifted toward better gradient flow using ELAN variants along with various other techniques to improve model efficiency.
| YOLO Version | Key Innovations | Improvements |
|---|---|---|
| YOLO (1–3) | Established the YOLO framework | Introduced real-time object detection with a single-stage pipeline |
| YOLOv4 | CSPNet, data augmentation, multiple feature scales | Improved model efficiency and accuracy |
| YOLOv5 | CSPNet enhancements, streamlined architecture | Faster inference, better deployment adaptability |
| YOLOv6 | BiC, SimCSPSPPF, anchor-aided training | Optimized backbone and neck for improved performance |
| YOLOv7 | E-ELAN (Extended Efficient Layer Aggregation Networks), bag-of-freebies | Enhanced gradient flow and overall efficiency |
| YOLOv8 | C2f block for feature extraction | Improved accuracy and computational efficiency |
| YOLOv9 | GELAN for architecture optimization, PGI for better training | Reduced training overhead and model refinement |
| YOLOv10 | NMS-free training with dual assignments | Increased efficiency in object detection |
| YOLOv11 | C3K2 module, lightweight depthwise separable convolution | Lower latency and improved accuracy |
| RT-DETR | Efficient encoder, uncertainty-minimal query selection | Real-time end-to-end object detection |
| RT-DETRv2 | Additional bag-of-freebies | Further optimization of end-to-end detection models |
| YOLOv12 | Attention-centric architecture | Utilizes attention mechanisms for improved detection |
This table highlights how each YOLO iteration introduced advancements in model architecture, efficiency, and accuracy.
As depicted in the image, the progression from CSPNet → ELAN → C3K2 → R-ELAN represents increasing architectural sophistication, aimed at improving gradient flow, feature reuse, and computational efficiency with each iteration.
With the increasing demand for high-performance object detection models, deploying YOLOv12 efficiently requires powerful hardware capable of handling real-time inference. DigitalOcean’s GPU Droplets can be a great solution for running YOLOv12 inference to deliver speed and optimal accuracy using high-performance NVIDIA GPUs.
To run YOLOv12, create a GPU Droplet with the following specifications:
Once the droplet is created, install the necessary libraries:
Install PyTorch and Ultralytics YOLO, which supports YOLOv12 models.
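The exact commands can vary with your environment; as a minimal setup, assuming an Ubuntu GPU Droplet with the NVIDIA drivers and CUDA toolkit already in place, something like the following should work:

```bash
# Install PyTorch (CUDA builds are the default on Linux) and the Ultralytics package
pip install torch torchvision
pip install ultralytics
```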
Use the following command to download a pre-trained YOLOv12 model:
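The snippet below uses the Ultralytics Python API; the weight name `yolo12n.pt` (the nano model) follows Ultralytics' published naming, but check their documentation for the exact names available in your installed version. The weights are downloaded automatically the first time they are referenced:

```python
from ultralytics import YOLO

# Loading by name downloads the pre-trained checkpoint on first use
model = YOLO("yolo12n.pt")

# Print a summary of layers, parameters, and FLOPs
model.info()
```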
To perform object detection on images or videos using DigitalOcean’s GPU, run the code provided below:
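Here is a minimal inference example with the Ultralytics API; the image path is a placeholder, and `device=0` targets the Droplet's first GPU:

```python
from ultralytics import YOLO

model = YOLO("yolo12n.pt")

# Run detection on an image (a video file or stream URL works the same way)
results = model.predict("path/to/image.jpg", device=0, conf=0.25)

# Save an annotated copy and print the raw detections
for r in results:
    r.save(filename="prediction.jpg")
    print(r.boxes.xyxy)   # bounding boxes
    print(r.boxes.cls)    # class indices
    print(r.boxes.conf)   # confidence scores
```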
YOLOv12 has been validated on the MS COCO 2017 dataset across five model scales: YOLOv12-N, S, M, L, and X. All models were trained for 600 epochs using the SGD optimizer with a 0.01 learning rate, similar to YOLOv11. Latencies were measured on a T4 GPU with TensorRT FP16. YOLOv11 serves as the baseline, keeping its scaling strategy and C3K2 blocks without additional modifications.
Here’s a breakdown of how each version performs:
YOLOv12-N (smallest version) is more accurate than YOLOv6, YOLOv8, YOLOv10, and YOLOv11 by up to 3.6% in mean Average Precision (mAP). Despite this, it remains efficient, processing each image in just 1.64 milliseconds while using the same or fewer resources.
YOLOv12-S (small version) has 21.4G FLOPs (a measure of computational cost) and 9.3 million parameters. It achieves 48.0 mAP while taking 2.61 milliseconds per image, making it faster and more efficient than YOLOv8-S, YOLOv9-S, YOLOv10-S, and YOLOv11-S. It also performs better than RT-DETR models, which are end-to-end detectors, while using less computing power.
YOLOv12-M (medium version), with 67.5G FLOPs and 20.2 million parameters, reaches 52.5 mAP and processes each image in 4.86 milliseconds. It outperforms Gold-YOLO-M, YOLOv8-M, YOLOv9-M, YOLOv10-M, YOLOv11-M, and RT-DETR models, making it a strong choice among medium-sized models.
YOLOv12-L (large version) is more efficient than YOLOv10-L, using 31.4G fewer FLOPs while achieving higher accuracy. It also outperforms YOLOv11 by 0.4% mAP while maintaining similar efficiency. Compared to RT-DETR models, it is 34.6% more efficient in computations and uses 37.1% fewer parameters, making it faster and lighter.
YOLOv12-X (largest version) achieves even better results, improving accuracy over YOLOv10-X and YOLOv11-X while maintaining similar speed and efficiency. It is also significantly faster and more efficient than RT-DETR models, using 23.4% less computing power and 22.2% fewer parameters.
This table compares the performance of YOLOv12 with various models in the YOLO series (from YOLOv9 to YOLOv12). The table shows the performance evaluation on different GPUs across various model scales, from Tiny/Nano to Extra Large. The comparison is based on FLOPs (computational complexity) and inference speed, measured in frames per second (FPS) on three NVIDIA GPUs (RTX 3080, A5000, A6000).
Smaller models (Tiny, Nano, Small) tend to be faster but less accurate, while larger models (Large, Extra Large) have higher FLOPs and slower speeds. Inference speed is presented with two values, which likely correspond to different batch sizes. Overall, the performance across the different GPUs is quite similar, though the A6000 and A5000 GPUs exhibit slightly higher efficiency in some cases.
YOLOv12 is the latest iteration of the YOLO object detection model. It introduces attention-based mechanisms to improve detection accuracy while maintaining real-time performance. Key innovations include Area Attention, Residual Efficient Layer Aggregation Networks (R-ELAN), and optimized training strategies. These advancements make YOLOv12 one of the most efficient and accurate object detection models to date.
YOLOv12 improves upon YOLOv11 in several ways: it adopts an attention-centric design built around the Area Attention (A²) module, replaces standard aggregation blocks with R-ELAN, and leverages FlashAttention to reduce memory-access overhead.
Overall, YOLOv12 provides a better latency-accuracy trade-off than YOLOv11.
YOLOv12’s ability to process images and videos in real time makes it ideal for a range of applications, including autonomous vehicles, surveillance and security systems, robotics, and smart retail.
To train YOLOv12 on a custom dataset (a code sketch of these steps follows below):

1. Prepare Your Data: organize images and annotations in YOLO format.
2. Install Dependencies: install PyTorch and the Ultralytics package as described above.
3. Train the Model: start from pre-trained weights and fine-tune them on your dataset.
4. Evaluate Performance: use `model.val()` to check mAP scores.
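As a rough sketch of this workflow with the Ultralytics API (the dataset YAML path and hyperparameters are placeholders to replace with your own):

```python
from ultralytics import YOLO

# Start from pre-trained weights (name per Ultralytics' naming convention)
model = YOLO("yolo12n.pt")

# Fine-tune on a custom dataset described by a YOLO-format data.yaml
model.train(data="path/to/data.yaml", epochs=100, imgsz=640, device=0)

# Evaluate on the validation split to check mAP scores
metrics = model.val()
print(metrics.box.map)    # mAP@0.5:0.95
print(metrics.box.map50)  # mAP@0.5
```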
YOLOv12 requires GPUs that support FlashAttention, which includes NVIDIA's Turing, Ampere, Ada Lovelace, and Hopper architectures (for example, the T4, RTX 20/30/40 series, A100, and H100).
For optimal performance, H100 on platforms like DigitalOcean GPU Droplets is recommended.
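To confirm that your Droplet exposes a suitable GPU and that the flash-attn package is importable, a quick check might look like this:

```python
import torch

# Confirm a CUDA-capable GPU is visible, e.g., an H100 on a GPU Droplet
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU detected")

# Check whether the flash-attn package is installed
try:
    import flash_attn
    print("FlashAttention version:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not installed")
```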
YOLOv12 successfully brings attention-based mechanisms into the YOLO framework while maintaining real-time performance. Attention-based models have traditionally been considered too slow for fast inference, but YOLOv12 makes them practical through Area Attention and Residual Efficient Layer Aggregation Networks (R-ELAN).
These enhancements improve feature extraction, making object detection more accurate while maintaining high-speed performance. By refining attention mechanisms to align with YOLO’s real-time constraints, YOLOv12 achieves state-of-the-art accuracy and efficiency. This advancement challenges the dominance of purely CNN-based YOLO models and paves the way for smarter, more efficient object detection systems.
Despite its improvements, YOLOv12 has a few limitations. Most notably, it relies on FlashAttention, which is supported only on relatively recent NVIDIA GPU architectures (Turing and newer), so older hardware cannot benefit from its full speed.
Despite its limitations, YOLOv12 marks a significant advancement in the field of object detection. It demonstrates that attention-based architectures can improve real-time detection without compromising speed. This model sets a new standard for balancing accuracy, efficiency, and scalability in this area.