YOLO-World is a state-of-the-art (SOTA) model that joins the YOLO series. This new model can perform object detection on (allegedly) any object without the need to train the model on that object first. Now that’s something new and incredible!
In this article, let’s dig into YOLO-World, a groundbreaking zero-shot object detector that boasts a remarkable 20-fold speed enhancement compared to earlier open-vocabulary detectors. We’ll explore its architecture, dissect the primary factors contributing to its exceptional speed, and most importantly, we will go through the process of running the model to analyze both images and videos.
If we talk about traditional object detection models like Faster R-CNN, Single Shot Detectors (SSD), or YOLO for that matter, these models are confined to detecting objects within the predefined categories (such as the 80 categories in the COCO dataset). This is a facet of all supervised learning models. Recently, researchers have turned their attention towards developing open-vocabulary models. These models aim to address the need for detecting new objects without the necessity of creating new datasets, a process which is both time-consuming and costly.
The YOLO-World Model presents a cutting-edge, real-time method built upon Ultralytics YOLOv8, revolutionizing Open-Vocabulary Detection tasks. This advancement allows for the identification of various objects in images using descriptive texts. With reduced computational requirements yet maintaining top-tier performance, YOLO-World proves to be adaptable across a wide array of vision-based applications.
Speed-and-accuracy curve comparison. Models are evaluated on LVIS minival, and inference speeds are measured on one NVIDIA V100 without TensorRT. The size of each circle represents the model’s size. Source
YOLO-World, unlike traditional YOLO detectors, integrates text input by employing a Text Encoder to encode text embeddings. Simultaneously, an Image Encoder processes the input image into multi-scale features. The RepVL-PAN model then merges image and text features at multiple levels. Finally, YOLO-World predicts bounding boxes and object embeddings to match the categories or nouns mentioned in the input text.
The architecture of YOLO-World consists of a YOLO detector, a text encoder, and a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN).
YOLO-World primarily builds upon YOLOv8, which includes a Darknet backbone serving as the image encoder, a path aggregation network (PAN) for generating multi-scale feature pyramids, and a head for both bounding box regression and object embeddings.
In short, the detector extracts multi-scale features from the input image, the text encoder converts the prompt into embeddings, and the RepVL-PAN fuses the image features with the text embeddings at multiple levels before the head produces the final predictions.
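If you would like to peek at this structure yourself once the ultralytics package is installed (we cover installation in the walkthrough below), the following minimal sketch prints a layer-by-layer summary of the underlying YOLOv8-style detector, including its backbone, PAN neck, and detection head, along with parameter and GFLOP counts.
from ultralytics import YOLOWorld

# Load the small YOLO-World checkpoint (downloaded automatically if not cached)
model = YOLOWorld('yolov8s-world.pt')

# Print a detailed layer-by-layer summary of the detector
model.info(detailed=True, verbose=True)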
YOLO-World achieves its speed through two main strategies. First, it adopts a lighter and faster CNN network as its backbone. Second, it employs a “prompt-then-detect” paradigm. Unlike traditional methods that encode text prompts on every inference run, YOLO-World uses CLIP to convert the text into embeddings once. These embeddings are then cached and reused, eliminating the need for real-time text encoding and thereby improving speed and efficiency.
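To make the prompt-then-detect idea concrete, here is a minimal sketch using the Ultralytics API that we walk through below (the image file names are placeholders): the vocabulary is encoded into text embeddings once when set_classes() is called, and every subsequent predict() call reuses those cached embeddings rather than re-encoding the prompt.
from ultralytics import YOLOWorld

model = YOLOWorld('yolov8s-world.pt')

# The prompt is encoded into text embeddings once, up front
model.set_classes(["person", "bicycle", "traffic light"])

# Repeated inference reuses the cached embeddings; no text encoding
# happens inside this loop (the image paths are placeholders)
for image_path in ["frame_001.jpg", "frame_002.jpg", "frame_003.jpg"]:
    results = model.predict(image_path)
    print(image_path, len(results[0].boxes), "detections")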
Let us start by checking the available GPU:
!nvidia-smi
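Alternatively, if PyTorch is already available in your environment (it is pre-installed on most GPU notebook images and is also a dependency of ultralytics), you can run the same check from Python:
import torch

# True if a CUDA-capable GPU is visible to PyTorch
print(torch.cuda.is_available())

# Name of the first GPU, if one is present
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))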
Now that we have confirmed that the session has GPU support, it’s time to install the necessary libraries.
!pip install -U ultralytics
Once the requirement is satisfied, move on to importing the library and checking its version:
import ultralytics
ultralytics.__version__
Output-
'8.1.28'
from ultralytics import YOLOWorld
# Initialize a YOLO-World model
model = YOLOWorld('yolov8s-world.pt') # or select yolov8m/l-world.pt for different sizes
# Execute inference with the YOLOv8s-world model on the specified image
results = model.predict('dog.png', save=True)
# Show results
results[0].show()
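Beyond the saved, annotated image, each Results object exposes the raw detections, so you can read boxes, confidences, and class names programmatically. A small sketch:
# Inspect the detections from the first (and only) image
boxes = results[0].boxes
for box in boxes:
    class_id = int(box.cls)                  # predicted class index
    class_name = results[0].names[class_id]  # human-readable label
    confidence = float(box.conf)             # detection confidence
    x1, y1, x2, y2 = box.xyxy[0].tolist()    # bounding box corners
    print(f"{class_name}: {confidence:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")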
Now, if we want the model to detect only certain objects in the image without any training, we can do that by simply passing a list of class names to the model.set_classes() function.
Let us use the same image and try to detect the backpack, the truck, and the car in the background.
# Define custom classes
model.set_classes(["backpack", "car","dog","person","truck"])
# Execute prediction for specified categories on an image
results = model.predict('/notebooks/data/dog.jpg', save=True)
# Show results
results[0].show()
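If a custom vocabulary produces low-confidence false positives, you can tighten the confidence threshold through the conf argument of predict() (the 0.3 value below is just an example):
# Only keep detections with confidence >= 0.3 (example threshold)
results = model.predict('/notebooks/data/dog.jpg', conf=0.3, save=True)
results[0].show()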
Next, let us experiment with another image; here we will detect only shoes.
# Define custom classes
model.set_classes(["shoes"])
# Execute prediction for specified categories on an image
results = model.predict('/notebooks/data/Two-dogs-on-a-walk.jpg', save=True)
# Show results
results[0].show()
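Once you are happy with a custom vocabulary, Ultralytics also lets you save the model with those classes baked in, so it can later be reloaded like a regular YOLOv8 checkpoint. A minimal sketch (the output filename below is arbitrary):
# Persist the model together with the current custom vocabulary ("shoes")
model.save("yolov8s-world-shoes.pt")

# The saved checkpoint can later be reloaded with the standard YOLO class
from ultralytics import YOLO
custom_model = YOLO("yolov8s-world-shoes.pt")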
Let us now try out the ‘yolov8s-world.pt’ model on a video. We will execute the following command to run object detection on a saved video file.
!yolo detect predict model=yolov8s-world.pt source="/content/pexels-anthony-shkraba-8064146 (1440p).mp4"
This command will generate a “runs” folder in your current directory; the annotated video is saved in the “predict” subfolder inside the “detect” folder (runs/detect/predict).
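The same can be done from Python. With stream=True, predict() returns a generator that yields one Results object per frame, which keeps memory usage low for long videos (the video path below is a placeholder):
from ultralytics import YOLOWorld

model = YOLOWorld('yolov8s-world.pt')

# stream=True processes the video frame by frame instead of collecting
# all results in memory at once; save=True writes the annotated output
for frame_results in model.predict(source='path/to/video.mp4', stream=True, save=True):
    print(f"{len(frame_results.boxes)} objects detected in this frame")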
The table below details the available models and the tasks they support.
Ultralytics offers both a Python API and CLI commands, each designed to simplify development.
We tried the model’s detection capabilities with our own defined categories. Here are a few images that we ran through YOLOv8m-world. Please feel free to try the other YOLO-World models.
In this article we introduced YOLO-World, an advanced real-time detector that aims to bring efficient open-vocabulary detection to practical settings. The approach extends the traditional YOLO architecture to support open-vocabulary pre-training and detection, utilizing RepVL-PAN to integrate vision and language information effectively. Our experiments with different images demonstrate YOLO-World’s speed and strong performance, showcasing the benefits of vision-language pre-training on compact models. We envision YOLO-World becoming a new benchmark for real-world open-vocabulary detection tasks.