Object detection has become one of the most popular and practical uses of AI. A breakthrough came in 2015 with the release of YOLO (You Only Look Once) by Joseph Redmon and his team, which introduced real-time object detection in a single pass. Since then, the YOLO models have continued to improve and inspire further research in deep learning-based detection.
In this article, we’ll go back to the basics, look at what’s new with YOLOv8 from Ultralytics—and show you how to fine-tune a custom YOLOv8 model using Roboflow and DigitalOcean GPU Droplets with the updated Ultralytics API. By the end, you’ll be able to train YOLOv8 on your own labeled image dataset in no time.
To start, let’s discuss the basics of how YOLO works. Here is a short quote breaking down the sum of the model’s functionality from the original YOLO paper:
“A single convolutional network is used to predict multiple bounding boxes along with their class probabilities. Unlike traditional object detection methods, YOLO is trained on entire images and directly optimizes detection performance. This unified approach offers significant advantages in both speed and accuracy.”
As stated above, the model can predict the location and identity of multiple entities in an image, provided it has been trained to recognize those features beforehand. It does this in a single stage by dividing the image into N grids, each of size s×s. These regions are parsed simultaneously to detect and localize any objects they contain. For each grid cell, the model predicts B bounding box coordinates, along with a label and a confidence score for the object it contains.
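To make that output structure concrete, here is a minimal sketch of the prediction tensor described in the original paper. The S, B, and C values below are the paper's PASCAL VOC settings; this illustrates the shape of the output only, not YOLOv8's actual detection head.

```python
import torch

# Conceptual sketch of the original YOLO output tensor.
# For an S x S grid, each cell predicts B boxes, each with 5 values
# (x, y, w, h, confidence), plus C class probabilities shared per cell.
S, B, C = 7, 2, 20  # values from the original YOLO paper (PASCAL VOC)

prediction = torch.zeros(S, S, B * 5 + C)
print(prediction.shape)  # torch.Size([7, 7, 30])
```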
Combining these capabilities, YOLO emerges as a powerful technology that can perform object classification, object detection, and image segmentation. Since the core architecture of YOLO remains consistent across versions, this holds true for YOLOv8 as well. For a more detailed explanation of how YOLO functions, you can refer to our previous articles covering YOLOv7 and the original YOLO research paper.
YOLOv8 introduces several significant improvements over its predecessors. Developed by Ultralytics, YOLOv8 is built on a redesigned architecture that offers better accuracy and speed across various computer vision tasks, including object detection, instance segmentation, pose estimation, and image classification. It features a modular and scalable design, improved training workflows, and support for dynamic input shapes. YOLOv8 also integrates native export to popular deployment formats such as ONNX, TensorRT, and CoreML, enabling seamless deployment across diverse platforms. With its focus on ease of use, performance optimization, and compatibility with modern ML pipelines, YOLOv8 sets a new standard for real-time vision models.
Credit to the creator: RangeKing
According to the official release, YOLOv8 features a new backbone network, an anchor-free detection head, and a new loss function. GitHub user RangeKing has shared this outline of the YOLOv8 model infrastructure, showing the updated backbone and head structures. By comparing this diagram with an equivalent examination of YOLOv5, RangeKing identified the following changes in their post:
The C2f module, credit to RoboFlow (Source)

- Replaced the `C3` module with the `C2f` module. In `C2f`, all the outputs from the `Bottleneck` (the two 3x3 convs with residual connections) are concatenated, while in `C3` only the output of the last `Bottleneck` was used. (Source)

The first Conv of each version. Credit to RangeKing

- Replaced the first `6x6 Conv` with a `3x3 Conv` block in the `Backbone`
- Deleted two `Conv`s (No. 10 and No. 14 in the YOLOv5 config)

Comparison of the two model backbones. Credit to RangeKing

- Replaced the first `1x1 Conv` with a `3x3 Conv` in the `Bottleneck`
- Deleted the `objectness` branch in favor of the new anchor-free, decoupled head

In addition to the old methodology of cloning the GitHub repo and setting up the environment manually, users can now access YOLOv8 for training and inference through the new Ultralytics API. Check out the Training your model section below for details on setting up the API.
YOLOv8 now features anchor-free bounding box prediction. In previous iterations of YOLO, users were required to manually define anchor boxes to facilitate the object detection process: predefined bounding boxes of fixed size and aspect ratio that capture the scale and proportions of specific object classes in the dataset. Calculating the offset from these anchors to the predicted object helps the model localize it more precisely.

With YOLOv8, box centers are instead predicted directly at the center of an object, removing the need for hand-tuned anchors.
At each epoch during training, YOLOv8 sees a slightly different version of the images it has been provided. These changes are called augmentations. One of these, mosaic augmentation, combines four images into one, forcing the model to learn object identities in new locations, under partial occlusion, and against greater variation in the surrounding pixels. However, applying mosaic throughout the entire training run has been shown to degrade prediction accuracy, so YOLOv8 disables it during the final epochs of training. This captures the benefit of the augmentation without letting it hurt the final model, as the sketch below shows.
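In the Ultralytics trainer, this behavior is exposed through the `close_mosaic` training argument, which turns off mosaic augmentation for the last N epochs of a run. A minimal sketch, assuming the bundled `coco128.yaml` sample dataset and placeholder epoch counts:

```python
from ultralytics import YOLO

# Load a pretrained YOLOv8 nano model
model = YOLO("yolov8n.pt")

# close_mosaic=10 disables mosaic augmentation for the final 10 epochs
model.train(data="coco128.yaml", epochs=100, close_mosaic=10)
```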
The main reason we are all here is the big boost to accuracy and efficiency during both inference and training. The authors at Ultralytics have provided useful sample data that we can use to compare the new release with other versions of YOLO. The plot above shows that YOLOv8 outperforms YOLOv7, YOLOv6-2.0, and YOLOv5-7.0 in mean Average Precision at comparable model sizes and inference latencies.
| Model | size (pixels) | mAP val 50-95 | Speed CPU ONNX (ms) | Speed A100 TensorRT (ms) | params (M) | FLOPs (B) |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv8n | 640 | 37.3 | 80.4 | 0.99 | 3.2 | 8.7 |
| YOLOv8s | 640 | 44.9 | 128.4 | 1.20 | 11.2 | 28.6 |
| YOLOv8m | 640 | 50.2 | 234.7 | 1.83 | 25.9 | 78.9 |
| YOLOv8l | 640 | 52.9 | 375.2 | 2.39 | 43.7 | 165.2 |
| YOLOv8x | 640 | 53.9 | 479.1 | 3.53 | 68.2 | 257.8 |
In their respective GitHub pages, we can find the statistical comparison tables for the different-sized YOLOv8 models. As we can see from the table above, mAP increases with parameter count and FLOPs, at the cost of inference speed. The largest YOLOv5 model, YOLOv5x, achieved a maximum mAP value of 50.7, so YOLOv8x's 53.9 represents a 3.2-point improvement in capability. This improvement is observed across all model sizes, with the newer YOLOv8 models consistently outperforming YOLOv5, as shown by the data below.
| Model | size (pixels) | mAP val 50-95 | mAP val 50 | Speed CPU b1 (ms) | Speed V100 b1 (ms) | Speed V100 b32 (ms) | params (M) | FLOPs @640 (B) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv5n | 640 | 28.0 | 45.7 | 45 | 6.3 | 0.6 | 1.9 | 4.5 |
| YOLOv5s | 640 | 37.4 | 56.8 | 98 | 6.4 | 0.9 | 7.2 | 16.5 |
| YOLOv5m | 640 | 45.4 | 64.1 | 224 | 8.2 | 1.7 | 21.2 | 49.0 |
| YOLOv5l | 640 | 49.0 | 67.3 | 430 | 10.1 | 2.7 | 46.5 | 109.1 |
| YOLOv5x | 640 | 50.7 | 68.9 | 766 | 12.1 | 4.8 | 86.7 | 205.7 |
Overall, we can see that YOLOv8 represents a significant step up from YOLOv5 and other competing frameworks.
The process for fine-tuning a YOLOv8 model can be broken down into three steps: creating and labeling the dataset, training the model, and deploying it. In this tutorial, we will cover the first two steps in detail and show how to use our new model on any incoming video file or stream.
In order to follow along, we need a GPU-powered machine. We recommend accessing one on the cloud, like DigitalOcean’s GPU Droplets. Once the GPU machine is accessible, we will operate under the assumption that you are working in a Jupyter Notebook, which makes it far easier to execute the code in this tutorial sequentially.
To follow this demo, clone the following repo using the code snippet below and launch the Jupyter environment.
```bash
git clone https://github.com/gradient-ai/YOLOv8-Ballhandler
cd YOLOv8-Ballhandler
jupyter lab
```
We are going to be recreating the experiment we used for YOLOv7 to compare the two models, so we will be using the Basketball dataset on Roboflow. Since we are using a previously made dataset, we just need to pull the data in for now. Below is the command used to pull the data into a Notebook environment. Use this same process for your own labeled dataset, but replace the workspace and project values with your own to access your dataset in the same manner.
Be sure to change the API key to your own if you want to use the script below to follow the demo in the Notebook.
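For reference, the standard Roboflow download flow looks roughly like the snippet below. The workspace, project, and version values here are placeholders; swap them for your own (or for the demo's values from the Notebook).

```python
from roboflow import Roboflow  # pip install roboflow

# Authenticate with your own API key
rf = Roboflow(api_key="YOUR_API_KEY")

# Placeholders: point these at your Roboflow workspace and project
project = rf.workspace("your-workspace").project("your-project")

# Download the dataset in YOLOv8 format; this writes a folder
# containing the images, labels, and a data.yaml file
dataset = project.version(1).download("yolov8")
```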
With the new Python API, we can use the `ultralytics` library to do all of this work within a Jupyter Notebook environment. We will build our `YOLOv8n` model from the provided config, load the pretrained weights, and then fine-tune it on the dataset we just loaded into the environment using the `model.train()` method.
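A minimal sketch of that flow is below. The `data.yaml` path is an assumption based on the Roboflow export's default layout, and the epoch count matches the short training run used in this demo.

```python
from ultralytics import YOLO

# Build the model from its config, then transfer the pretrained weights
model = YOLO("yolov8n.yaml").load("yolov8n.pt")

# Fine-tune on the downloaded dataset; the data.yaml path assumes
# Roboflow's YOLOv8 export layout
results = model.train(data="data.yaml", epochs=10, imgsz=640)
```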
We can evaluate our new model on the validation set using the `model.val()` method. This will output a nice table showing how our model performed in the output window. Seeing as we only trained here for ten epochs, this relatively low mAP 50-95 is to be expected.
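A sketch of the call, along with how to read the headline metrics back programmatically:

```python
# Evaluate on the validation split defined in data.yaml
metrics = model.val()

# The returned object also exposes the box metrics directly
print(metrics.box.map)    # mAP 50-95
print(metrics.box.map50)  # mAP 50
```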
From there, it’s simple to submit any photo: the model will output the predicted values for the bounding boxes, overlay those boxes on the image, and save the result to the runs/detect/predict folder.
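A minimal sketch of single-image inference; the image path here is a placeholder:

```python
# Run inference on one image; save=True writes the annotated copy
# to runs/detect/predict. The source path is a placeholder.
results = model.predict(source="assets/test_image.jpg", save=True)

# Each result holds the predicted boxes, confidences, and class ids
for box in results[0].boxes:
    print(box.xyxy, box.conf, box.cls)
```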
We are left with the predictions for the bounding boxes and their labels, printed like this:
These are then applied to the image, like the example below:
As we can see, our lightly trained model can distinguish the players on the court from the players and spectators on the sidelines, with one exception in the corner. More training is almost certainly required, but the model clearly gained an understanding of the task very quickly.
If we are satisfied with our model training, we can then export the model in the desired format. In this case, we will export an ONNX version.
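The export is a single call:

```python
# Export the fine-tuned model to ONNX; returns the path to the exported file
onnx_path = model.export(format="onnx")
print(onnx_path)
```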
In this tutorial, we explored the key updates introduced in Ultralytics’ YOLOv8 model. We examined how its architecture has evolved from YOLOv5 and tested its easy-to-use Python API with our Ballhandler dataset. The results showed that YOLOv8 greatly simplifies the process of fine-tuning object detection models. We also demonstrated its effectiveness in real-world tasks, such as identifying which player holds the ball in an NBA game using a single in-game photo. To run these experiments smoothly, we recommend using DigitalOcean GPU Droplets, which offer the computing power needed to train and deploy YOLOv8 models efficiently.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.