YOLO-World: A Demo to Real-Time, Zero-Shot Object Detection

Updated on November 8, 2024

Technical Writer

Introduction

YOLO-World is a state-of-the-art (SOTA) model that joins the YOLO series. It can detect objects from, in principle, any category described in a text prompt, without first training the model on that category. Now that’s something new and incredible!

In this article, let’s dig into YOLO-World, a zero-shot object detector reported to be about 20 times faster than its predecessors. We’ll explore its architecture, look at the main factors behind its speed, and, most importantly, run the model on both images and videos.

What’s new in YOLO-World

Traditional object detection models such as Faster R-CNN, Single Shot Detectors (SSD), and earlier YOLO versions can only detect objects from a predefined set of categories (such as the 80 categories in the COCO dataset). This limitation is shared by any detector trained with supervision on a fixed label set. Recently, researchers have turned their attention to open-vocabulary models, which aim to detect new objects without the time-consuming and costly process of creating new datasets.

YOLO-World is a real-time open-vocabulary detector built upon Ultralytics YOLOv8. It identifies objects in an image from descriptive text prompts. Because it keeps computational requirements low while maintaining strong accuracy, YOLO-World is adaptable to a wide array of vision-based applications.

Figure: Speed-and-accuracy curve comparison. Models were evaluated on LVIS minival, and inference speeds were measured on one NVIDIA V100 without TensorRT. The size of each circle represents the model’s size.

Model Architecture

YOLO-World, unlike traditional YOLO detectors, integrates text input by employing a Text Encoder to encode text embeddings. Simultaneously, an Image Encoder processes the input image into multi-scale features. The RepVL-PAN model then merges image and text features at multiple levels. Finally, YOLO-World predicts bounding boxes and object embeddings to match the categories or nouns mentioned in the input text.

The YOLO-World architecture consists of a YOLO detector, a Text Encoder, and a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN).

YOLO-World primarily builds upon YOLOv8, which includes a Darknet backbone serving as the image encoder, a path aggregation network (PAN) for generating multi-scale feature pyramids, and a head for both bounding box regression and object embeddings.

The detector extracts multi-scale features from the input image, the text encoder converts the text into embeddings, and RepVL-PAN, a custom network, performs the multi-level fusion of image features and text embeddings.
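
To make the final matching step more concrete, the sketch below assigns a label to each object embedding by comparing it with the text embeddings using cosine similarity. This is a toy illustration in pure Python, not YOLO-World's actual implementation: the function names, the 3-dimensional vectors, and the 0.5 threshold are all assumptions made for the example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def match_objects_to_text(object_embeddings, text_embeddings, threshold=0.5):
    """Return (label, score) per object; label is None if no prompt scores above threshold."""
    matches = []
    for obj in object_embeddings:
        scores = {label: cosine_similarity(obj, emb)
                  for label, emb in text_embeddings.items()}
        best_label = max(scores, key=scores.get)
        best_score = scores[best_label]
        matches.append((best_label if best_score >= threshold else None, best_score))
    return matches

# Made-up embeddings, for illustration only.
text_embeddings = {"dog": [1.0, 0.0, 0.0], "backpack": [0.0, 1.0, 0.0]}
object_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]

matches = match_objects_to_text(object_embeddings, text_embeddings)
labels = [label for label, _ in matches]
print(labels)  # ['dog', 'backpack', None]
```

The third object matches neither prompt, so it gets no label. This mirrors why an open-vocabulary detector only reports objects that correspond to the words it was given.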

YOLO-World achieves its speed through two main strategies. First, it adopts a lighter and faster CNN as its backbone. Second, it employs a “prompt-then-detect” paradigm. Instead of encoding the text prompt every time inference runs, YOLO-World uses CLIP to convert the text into embeddings once. These embeddings are then cached and reused, which removes the need for real-time text encoding and improves speed and efficiency.
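
The caching idea behind prompt-then-detect can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not how YOLO-World is implemented: `PromptCache` is a hypothetical helper, and `fake_text_encoder` is a made-up stand-in for a real text encoder such as CLIP.

```python
class PromptCache:
    """Caches text embeddings so that each prompt is encoded only once."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._cache = {}
        self.encoder_calls = 0  # counts how often the (expensive) encoder runs

    def get(self, prompt):
        if prompt not in self._cache:
            self._cache[prompt] = self._encode_fn(prompt)
            self.encoder_calls += 1
        return self._cache[prompt]

    def set_vocabulary(self, prompts):
        """Return embeddings for all prompts, encoding only the unseen ones."""
        return [self.get(p) for p in prompts]

def fake_text_encoder(prompt):
    """Hypothetical stand-in for a real text encoder; returns a tiny fake embedding."""
    return [float(len(prompt)), float(sum(map(ord, prompt)) % 97)]

cache = PromptCache(fake_text_encoder)
vocabulary = ["backpack", "car", "dog"]

# "Prompt": encode the vocabulary once, before any image is processed.
cache.set_vocabulary(vocabulary)

# "Detect": every frame reuses the cached embeddings.
for _frame in range(100):
    embeddings = cache.set_vocabulary(vocabulary)

print(cache.encoder_calls)  # 3
```

Even after 100 simulated frames, the encoder has run only three times, once per prompt. The text-encoding cost is paid up front rather than on every inference call.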

Code Demo

Let us start by checking the available GPU:

!nvidia-smi
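
Outside a notebook, where the `!` shell escape is not available, a similar check could be scripted as in the sketch below. The helper name `gpu_summary` is an assumption for this example. It relies only on the `nvidia-smi -L` command, which lists the detected GPUs, and returns None when the tool is missing or fails.

```python
import shutil
import subprocess

def gpu_summary():
    """Return the output of `nvidia-smi -L`, or None if the tool is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None

summary = gpu_summary()
print(summary if summary else "No NVIDIA GPU detected.")
```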

Now that the output confirms the session has GPU support, it’s time to install the necessary libraries.

!pip install -U ultralytics

Once the requirement is satisfied, move on to importing the library and checking its version:

import ultralytics
ultralytics.__version__

Output:

'8.1.28'

from ultralytics import YOLOWorld

# Initialize a YOLO-World model
model = YOLOWorld('yolov8s-world.pt')  # or select yolov8m/l-world.pt for different sizes

# Execute inference with the YOLOv8s-world model on the specified image
results = model.predict('dog.png', save=True)

# Show results
results[0].show()

Now, if we want the model to detect specific objects in the image without any training, we can do so by passing the class names to the model.set_classes() function.

Let us use the same image and try to detect the backpack, along with the truck and the car in the background.

# Define custom classes
model.set_classes(["backpack", "car", "dog", "person", "truck"])

# Execute prediction for specified categories on an image
results = model.predict('/notebooks/data/dog.jpg', save=True)

# Show results
results[0].show()
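
Beyond viewing the annotated image, it is often useful to summarize what was detected. The sketch below counts detections per class name from plain Python lists. The helper `summarize_detections`, the 0.25 confidence cutoff, and the sample values are assumptions made for illustration; they are not output from the run above.

```python
from collections import Counter

def summarize_detections(class_ids, confidences, names, min_conf=0.25):
    """Count detections per class name, ignoring those below min_conf."""
    counts = Counter()
    for cls_id, conf in zip(class_ids, confidences):
        if conf >= min_conf:
            counts[names[int(cls_id)]] += 1
    return dict(counts)

# Made-up values standing in for a real result, for illustration only.
names = {0: "backpack", 1: "car", 2: "dog", 3: "person", 4: "truck"}
class_ids = [2.0, 3.0, 0.0, 1.0, 1.0]
confidences = [0.91, 0.88, 0.40, 0.31, 0.12]

summary = summarize_detections(class_ids, confidences, names)
print(summary)  # {'dog': 1, 'person': 1, 'backpack': 1, 'car': 1}
```

With a real prediction, the three inputs should be obtainable from results[0].boxes.cls.tolist(), results[0].boxes.conf.tolist(), and results[0].names.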

Next, let us experiment with another image. This time we will detect only a pair of shoes.

# Define custom classes
model.set_classes(["shoes"])

# Execute prediction for specified categories on an image
results = model.predict('/notebooks/data/Two-dogs-on-a-walk.jpg', save=True)

# Show results
results[0].show()

Object Detection using a video

Let us now try out the ‘yolov8s-world.pt’ model to detect objects in a video. The following command runs object detection on a saved video.

!yolo detect predict model=yolov8s-world.pt source="/content/pexels-anthony-shkraba-8064146 (1440p).mp4"

This command creates a “runs” folder in your current directory. The annotated video is saved in the “predict” subfolder inside the “detect” folder, that is, under runs/detect/predict.
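
If the command is run several times, each run typically gets its own numbered folder (for example “predict2”), so it helps to locate the newest one programmatically. The sketch below is one way to do that with the standard library. The helper name `latest_run_dir` is an assumption for this example, and the demo uses a temporary directory with explicitly set timestamps rather than real prediction output.

```python
import os
import tempfile
from pathlib import Path

def latest_run_dir(root="runs/detect", prefix="predict"):
    """Return the most recently modified `predict*` folder under root, or None."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    candidates = [p for p in root_path.iterdir()
                  if p.is_dir() and p.name.startswith(prefix)]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)

# Demo on a temporary directory that mimics two prediction runs.
with tempfile.TemporaryDirectory() as tmp:
    for i, name in enumerate(["predict", "predict2"]):
        run_dir = Path(tmp) / name
        run_dir.mkdir()
        # Set modification times explicitly so that "predict2" is the newest.
        os.utime(run_dir, (1_000_000 + i, 1_000_000 + i))
    newest_name = latest_run_dir(tmp).name

print(newest_name)  # predict2
```

Calling latest_run_dir() with no arguments looks under runs/detect in the current working directory.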

Available Models

The table below lists the available models and the tasks they support.

image

Ultralytics offers both a Python API and CLI commands, which keeps development straightforward.

Detection Capabilities with Defined Categories

We tested the model’s detection capabilities with our own defined categories. Here are a few images that we tried with the YOLOv8m variant. Please feel free to try other models from YOLO-World.

image

image

Conclusion

In this article, we looked at YOLO-World, a real-time detector that aims to bring efficiency and open-vocabulary capability to practical settings. It extends the traditional YOLO architecture to support open-vocabulary pre-training and detection, using RepVL-PAN to integrate vision and language information effectively. In our experiments with different images, the model detected user-defined categories without any additional training, which illustrates the benefits of vision-language pre-training on compact models. We see YOLO-World as a strong baseline for real-world open-vocabulary detection tasks.
