Visual Question Answering with Llama 3.2 and Hugging Face

Introduction

We recently evaluated Llama 3.2 11B Vision on a DigitalOcean H100 GPU Droplet (1 GPU, 80GB VRAM, 20 vCPUs, 240GB RAM) and found it highly effective for Visual Question Answering (VQA) tasks.

In this tutorial, you will learn a scalable, cost-efficient, and streamlined approach to AI-driven image processing. By using GPU Droplets for compute and DigitalOcean Spaces for storage, deploying and managing AI applications becomes seamless, offering high performance and reliability without the complexity of traditional on-premise setups.

What is Visual Question Answering (VQA)?

Visual Question Answering (VQA) is a subfield of artificial intelligence that focuses on training models to answer questions about images. It combines computer vision and natural language processing to enable machines to understand and interpret visual data, generating human-like responses to questions about the content of images.

Benefits of Visual Question Answering

The benefits of VQA are numerous, including:

  • Enhanced image understanding: VQA models can analyze images and provide insights that would be difficult or impossible for humans to extract manually.
  • Improved accessibility: VQA can be used to assist visually impaired individuals by providing audio descriptions of images.
  • Automation of tasks: VQA can automate tasks such as image classification, object detection, and image captioning, freeing up human resources for more complex tasks.

Who is Visual Question Answering For?

VQA is beneficial for a wide range of industries and applications, including:

  • Healthcare: VQA can be used to analyze medical images, such as X-rays and MRIs, to assist in diagnosis and treatment.
  • Retail: VQA can be used in e-commerce to automatically generate product descriptions and improve customer experience.
  • Education: VQA can be used to create interactive learning tools that provide students with a more engaging and immersive learning experience.

LLaMA 3.2 11B Vision Specifications

  • Architecture: Natively multimodal; an adapter combines a pre-trained image encoder with the pre-trained Llama 3.1 language model, trained on text-image pairs.
  • Model variants: Instruction-tuned models for visual recognition, image reasoning, and assistant-like chat with images; pre-trained models adapted for a variety of image reasoning tasks.
  • Sequence length: 128k tokens
  • Licensing: Llama 3.2 Community License (commercial and research use)

Prerequisites

Before proceeding, ensure you have:

  • A DigitalOcean account with a GPU Droplet deployed (this tutorial uses a single H100 with 80GB VRAM).
  • A DigitalOcean Spaces bucket, along with its name, region, access key, and secret key.
  • A Hugging Face account with an access token and approved access to the gated meta-llama/Llama-3.2-11B-Vision-Instruct model.
  • Basic familiarity with Python and the command line.

Step 1 - Set Up the Environment

Once your GPU Droplet is deployed, follow the steps below, starting by connecting to it over SSH.

SSH into Your GPU Droplet

ssh root@your-server-ip  

Install Python & Create a Virtual Environment

apt update  
apt install python3.10-venv -y  
python3.10 -m venv llama-env  

Activate the Virtual Environment

source llama-env/bin/activate  

Step 2 - Install Required Dependencies

Let’s install and configure the necessary packages and dependencies on the GPU Droplet.

Install PyTorch & Hugging Face CLI

pip install torch torchvision torchaudio  
pip install -U "huggingface_hub[cli]"  
huggingface-cli login  
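
Note: Llama 3.2 is a gated model on Hugging Face. Before it can be downloaded, request access on the meta-llama/Llama-3.2-11B-Vision-Instruct model page and log in above with a token that has read access. You can confirm the login succeeded with:

huggingface-cli whoami  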

Install the Transformers Library

pip install --upgrade transformers  
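
Optionally, sanity-check the installation before continuing. This one-liner (not part of the app) verifies that PyTorch detects the GPU and prints the installed Transformers version:

python3 -c "import torch, transformers; print('CUDA available:', torch.cuda.is_available()); print('Transformers:', transformers.__version__)"  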

Install Flask & AWS SDK (Boto3)

Boto3 is required to interact with DigitalOcean Spaces, which is S3-compatible.

pip install flask boto3  
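
To confirm your Spaces credentials work before wiring them into the app, you can run a quick standalone check. The snippet below is a minimal sketch; the Space name, region, and keys are placeholders you must replace with your own values:

check_spaces.py
import boto3

SPACE_NAME = "your-space-name"   # placeholder: your Space's name
SPACE_REGION = "nyc3"            # placeholder: your Space's region

# Spaces is S3-compatible, so boto3's standard S3 client works against its endpoint
s3 = boto3.client(
    "s3",
    region_name=SPACE_REGION,
    endpoint_url=f"https://{SPACE_REGION}.digitaloceanspaces.com",
    aws_access_key_id="your-access-key",       # placeholder
    aws_secret_access_key="your-secret-key",   # placeholder
)

# List a few objects; an error here means the keys or endpoint are wrong
response = s3.list_objects_v2(Bucket=SPACE_NAME, MaxKeys=5)
print("Connected. Objects found:", response.get("KeyCount", 0))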

Step 3 - Install & Configure Nginx

Install Nginx to serve your Flask application on the GPU Droplet.

sudo apt install nginx -y  
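
Installing the package alone does not route traffic to the app. Below is a minimal reverse-proxy sketch, assuming the Flask app listens on port 5000 as configured later in this tutorial; the server_name value is a placeholder. Save it as /etc/nginx/sites-available/llama-webapp:

server {
    listen 80;
    server_name your_server_ip;

    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;    # model inference can take a while
        client_max_body_size 20M;   # allow image uploads beyond the 1MB default
    }
}

Then enable the site and reload Nginx:

sudo ln -s /etc/nginx/sites-available/llama-webapp /etc/nginx/sites-enabled/  
sudo nginx -t && sudo systemctl reload nginx  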

Step 4 - Set Up the Flask Web Application

Application Folder Structure

llama-webapp/  
├── app.py                # Main Flask app file  
├── static/  
│   └── styles.css        # Optional: CSS file for styling  
└── templates/  
    └── index.html        # HTML template for the web page 

Python Code for the Application

Please create a file called app.py inside the directory llama-webapp on the GPU Droplet and copy-paste the code below:

Note: Please keep your DigitalOcean Spaces name, region, access key and the secret key handy as you will need to add them in this step.

llama-webapp/app.py
import os
import requests
from PIL import Image
from flask import Flask, request, render_template, session
from transformers import MllamaForConditionalGeneration, AutoProcessor
import boto3
import torch
import re

app = Flask(__name__)
app.secret_key = "your-secure-random-key"  # replace with a long random value in production

# Load Llama 3.2 11B Vision in bfloat16; device_map="auto" places it on the GPU
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

SPACE_NAME = "gpupro"
SPACE_REGION = "nyc3"
ACCESS_KEY = "your-access-key"
SECRET_KEY = "your-secret-key"

s3 = boto3.client(
    "s3",
    region_name=SPACE_REGION,
    endpoint_url=f"https://{SPACE_REGION}.digitaloceanspaces.com",
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY
)

def clean_text(text):
    # Strip special tokens such as <|begin_of_text|> and any other <...> markers
    text = re.sub(r"<[^>]+>", "", text)
    # Drop leading role labels like "user:" or "assistant:" on any line
    text = re.sub(r"(?im)^(user|assistant):?", "", text)
    # Remove Markdown bullet markers (* or -) at the start of lines
    text = re.sub(r"^[\*\-]+", " ", text, flags=re.MULTILINE)
    # Collapse repeated whitespace
    text = re.sub(r"\s{2,}", " ", text).strip()
    return text

@app.route("/", methods=["GET", "POST"])
def index():
    result = None
    image_url = session.get("image_url")
    if request.method == "POST":
        prompt = request.form["prompt"]
        image_file = request.files.get("image")
        if image_file:
            # Save the upload locally, then push it to Spaces
            filename = image_file.filename
            image_path = os.path.join("/tmp", filename)
            image_file.save(image_path)
            s3.upload_file(
                image_path,
                SPACE_NAME,
                filename,
                ExtraArgs={'ACL': 'public-read'}  # public so the app can re-fetch it by URL
            )
            image_url = f"https://{SPACE_NAME}.{SPACE_REGION}.digitaloceanspaces.com/{filename}"
            # Remember the image across requests so follow-up questions reuse it
            session["image_url"] = image_url
        if not image_url:
            result = "Please upload an image to generate a description."
        else:
            image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
            messages = [{"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": prompt}
            ]}]
            input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
            inputs = processor(image, input_text, return_tensors="pt").to(model.device)
            # Cap the answer length; generation time grows with max_new_tokens
            output = model.generate(**inputs, max_new_tokens=512)
            raw_result = processor.decode(output[0])
            result = clean_text(raw_result)
            if not result:
                result = "No description was generated. Please try again."
    return render_template("index.html", result=result, image_url=image_url)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
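
HTML Template for the Web Page

The app renders templates/index.html, which is not shown above. The following is a minimal sketch that works with the variables app.py passes to render_template (result and image_url) and the form fields it reads (prompt and image); the exact markup and styling are assumptions you can adapt freely.

llama-webapp/templates/index.html
<!DOCTYPE html>
<html>
<head>
    <title>Visual Question Answering with Llama 3.2</title>
    <!-- Optional stylesheet from the static/ folder -->
    <link rel="stylesheet" href="{{ url_for('static', filename='styles.css') }}">
</head>
<body>
    <h1>Visual Question Answering with Llama 3.2</h1>
    <!-- enctype is required so Flask receives the uploaded file -->
    <form method="POST" enctype="multipart/form-data">
        <input type="file" name="image" accept="image/*">
        <input type="text" name="prompt" placeholder="Ask a question about the image" required>
        <button type="submit">Ask</button>
    </form>
    {% if image_url %}
        <img src="{{ image_url }}" alt="Uploaded image" width="400">
    {% endif %}
    {% if result %}
        <p>{{ result }}</p>
    {% endif %}
</body>
</html>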

Step 5 - Run & Access the Application

Start the Flask application:

python app.py  

Open your browser and visit:

http://your_server_ip:5000  

Upload an image, ask a question about it, and verify that the file appears in your DigitalOcean Space. If you configured the Nginx reverse proxy above, you can also reach the app at http://your_server_ip on port 80.

How Good is LLaMA 3.2 11B Vision Instruct?

We ran a series of visual language prompts to test LLaMA 3.2. Here are some results:

Example image uploaded for VQA

  1. Prompt 1: What is the price of the service?
     Response: The price of the service is 14.00 euros.

  2. Prompt 2: Who is selling the product or service?
     Response: The product or service is being sold by the Personenschiffahrt (German for passenger boat service), as indicated by the text on the ticket.

  3. Prompt 3: Based on the information in this image, how much do 10 tickets cost?
     Response:
     • Cost of one ticket: 14.00 euros
     • Cost of 10 tickets: 14.00 euros × 10 = 140.00 euros

LLaMA 3.2 excelled at recognizing text fields in the image and making logical connections. It also provided a step-by-step breakdown of the price calculation for 10 tickets.

FAQs

1. What is the primary use case for LLaMA 3.2 Vision?

LLaMA 3.2 Vision is specifically designed for Visual Question Answering (VQA) tasks, which involve processing and analyzing images to answer questions about their content. This technology enables AI-driven image processing and analysis, making it an ideal solution for applications that require image understanding and interpretation.

2. What are the system requirements for running LLaMA 3.2 Vision?

To ensure optimal performance, a DigitalOcean GPU Droplet with at least 1 GPU, 80GB VRAM, 20 vCPUs, and 240GB RAM is recommended. This configuration provides the necessary computing power and memory to handle the complex image processing tasks that LLaMA 3.2 Vision is designed for. Additionally, a high-performance storage solution like DigitalOcean Spaces can be used to store and manage large datasets of images.

3. Can I use LLaMA 3.2 Vision for other tasks beyond VQA?

Yes, LLaMA 3.2 Vision is a versatile model that can be adapted for various image reasoning tasks beyond VQA. Its capabilities extend to image captioning, object recognition, and document and chart understanding, making it a valuable tool for a wide range of applications that involve image analysis. For example, it can be used in image search engines, accessibility tools, or medical imaging analysis.

4. How do I integrate LLaMA 3.2 Vision with my existing infrastructure?

Integrating LLaMA 3.2 Vision with your existing infrastructure is a straightforward process. You can leverage DigitalOcean’s GPU-optimized Droplets for compute power and Spaces for storage, ensuring a seamless deployment and management process. This allows you to scale your application as needed, without worrying about the underlying infrastructure. Additionally, DigitalOcean’s cloud-based infrastructure provides a flexible and cost-effective solution for deploying and managing AI applications.

5. What is the cost of using LLaMA 3.2 Vision on DigitalOcean?

The cost of using LLaMA 3.2 Vision on DigitalOcean depends on several factors, including the size and type of GPU Droplet you choose, as well as the amount of storage you require. You can estimate costs using DigitalOcean’s pricing calculator, which provides a transparent and predictable pricing model. This allows you to plan and budget your resources effectively, ensuring that you can deploy and manage your AI applications in a cost-effective manner.

Note: You can also sign up now and get a free $200 credit to try DigitalOcean products over 60 days!

6. Is LLaMA 3.2 Vision suitable for real-time applications?

Yes, LLaMA 3.2 Vision is well-suited for real-time applications that require rapid image analysis and processing. Its ability to process images quickly and accurately makes it an ideal solution for applications such as live image analysis, chatbots, or autonomous systems that require real-time decision-making based on visual data. Additionally, its integration with DigitalOcean’s cloud infrastructure ensures that it can scale to meet the demands of real-time applications, providing a reliable and efficient solution for processing large volumes of image data.

Conclusion

With LLaMA 3.2 11B Vision, you’ve got a powerful tool for Visual Question Answering (VQA) that excels at reading text in images and explaining its thought process. By combining it with DigitalOcean’s high-performance infrastructure and Hugging Face’s Transformers library, you’ve created a solution that’s both efficient and easy to use. This technology has the potential to revolutionize various industries, from document processing to customer support and beyond. As AI continues to evolve, integrating models like LLaMA 3.2 will unlock new opportunities for AI-driven image analysis.

Next Steps

  1. Deploy Your AI Chatbot on DigitalOcean GenAI.
  2. Explore DigitalOcean SaaS Hosting Solutions.

Continue building with DigitalOcean Gen AI Platform.

About the author

Rohan Khamkar, Sr. Solutions Architect
