One of the most immediately usable applications of AI has to be the chatbot. Though chatbots have existed in some capacity for roughly 30 years, the last two years have seen the true rise of these personal assistant chat interfaces. They have percolated out from highly technical circles into all walks of life, social situations, and business use cases.
While they are widely available and easy to use, many of the best chatbots are locked behind the walled gardens of closed-source technology. These environments come with their own advantages, like internet search connectivity, custom user profiles for interacting with the agents, and access to some of the best Large Language Models (LLMs) available. However, the open-source community has kept pace with these developments, and open-source tools come with their own advantages, namely customization, fine-tuning, and complex integrations with other technologies.
In this article, we are going to take a look at one of the latest and greatest open-source model suites: LLaMA 3.2. Its 11B parameter, vision-enabled models boast comparable performance to GPT-4o mini and Claude 3 Haiku across a wide variety of vision-instruction-tuned benchmarks at a significantly lower cost. Additionally, its lightweight 1B and 3B parameter text-only models offer comparable performance to SOTA edge models like Phi Mini IT and Gemma 2B. We will begin this tutorial with a breakdown of what makes these models special. Afterwards, we will show how to take this powerful open-source model and create a personal assistant chatbot. This personal assistant can help with anything from writing to coding to translation, and is fully enabled to work with image data as easily as text.
Follow along with us for a deeper dive into the advantages of using these incredible models on DigitalOcean GPU Droplets, with our guide to running a powerful assistant application adapted from HuggingFace Projects staffer merve’s original project.
In order to follow along with this article, you will need experience with Python code and a beginner’s understanding of Deep Learning. We will operate under the assumption that all readers have access to sufficiently powerful machines to run the code provided.
If you do not have access to a GPU, we suggest accessing one through the cloud. There are many cloud providers that offer GPUs. DigitalOcean GPU Droplets are currently in Early Availability; learn more and sign up for interest in GPU Droplets here.
For instructions on getting started with Python code, we recommend trying this beginner’s guide to set up your system and prepare to run beginner tutorials.
The LLaMA 3.2 model suite comprises four main models: a small 11B vision-language model, a larger 90B vision-language model, and two lightweight text-only models (1B and 3B). The vision-language models can view and understand both text and image data, while the text-only models stick to text. Each of these offers SOTA or near-SOTA performance for its class.
The creators posit that their models offer comparable performance to the noted foundation models GPT-4o Mini and Claude 3 Haiku. We can see from the graphic above that these claims are backed up by benchmarking across a wide variety of vision-language tasks, including math and problem-solving tasks. The comparative strength of the 11B model is particularly interesting, as it shows the potential to run these peak-performing LLMs on weaker GPUs or even near-edge hardware. This means that we can expect competitive performance from these models when used side by side with top-tier closed-source models.
In the next graphic, we can see the reported comparison of the lightweight, text-only models with the top competitors Phi Mini IT and Gemma 2B. Across all benchmarks, the lightweight models performed better than or nearly the same as these noted foundation models. This shows that LLaMA 3.2 models can perform at the highest level while running on weaker processors, with the potential to generate text even on handheld devices.
As we can see, both categories of the model suite are highly competitive with SOTA foundation models of the same caliber, and can even outperform closed-source alternatives.
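To make this concrete, the lightweight text-only models can be run with just a few lines of the Hugging Face transformers library. The sketch below loads the 3B instruct model through a text-generation pipeline; the model ID and generation settings are illustrative assumptions, and the gated meta-llama checkpoints require requesting access and logging in to Hugging Face first.

# A minimal sketch of running a lightweight Llama 3.2 text-only model.
# The model ID and settings below are assumptions for illustration.
from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",  # lightweight text-only model (gated repo)
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

messages = [
    {"role": "user", "content": "Explain in two sentences why small language models matter for edge devices."}
]
output = pipe(messages, max_new_tokens=128)
# The pipeline returns the full chat; the last message is the assistant's reply
print(output[0]["generated_text"][-1]["content"])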
When looking at strengths and weaknesses, the first critical thing to understand about LLaMA 3.2 is that it does not come with the same features attached as many closed-source tools, but it can be customized in ways that closed-source models cannot. Things like user profiles with saved information and custom settings, internet connectivity or search, and low-code add-ons for RAG or other features are not going to be integrated with LLaMA 3.2 unless we build them ourselves.
With that said, let’s take a look at each of the main strengths of LLaMA 3.2 itself in comparison to other extant models:
Now that we have examined where the model shines, let’s show how to run the demo. This will let us examine the model intimately as we create our own personal assistant tool.
For this demo, we are going to be borrowing from the work of the HuggingFace Projects team. They have released a truly perfect Gradio demo on their website that uses the LLaMA 3.2 11B Vision Models. Before continuing, please be sure to go see their original work here.
To adapt the application, we are going to use the newly released DigitalOcean GPU Droplets. In order to follow along, we recommend using this guide to create a new GPU Droplet with a Jupyter Notebook. We are going to run the code shown in this section within a Jupyter Notebook running on our GPU Droplet, accessed in our local browser through SSH hosting with Visual Studio Code.
Once your machine is running, you have spun up the Notebook, and have accessed it in your local browser, continue with the next section.
The LLaMA 3.2 application from HuggingFace is excellent on its own. It lets you chat with a powerful chatbot capable of interacting with any image we upload. This alone allows us to do complicated tasks like Optical Character Recognition (OCR), object detection, or even reading text from documents. Additionally, because it is written with Gradio, the application is completely FastAPI enabled and can be interacted with directly through the API endpoint it creates.
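Before running the application, make sure the required libraries are installed in your notebook environment and that you are authenticated with Hugging Face, since the meta-llama checkpoints are gated. A minimal setup cell might look like the following; the package versions are left unpinned here, and you will need to have requested access to the Llama 3.2 models on the Hugging Face Hub beforehand:

# Install the libraries used by the demo (versions left unpinned here)
!pip install torch transformers accelerate gradio pillow

# The meta-llama checkpoints are gated on the Hugging Face Hub: request access on
# the model page first, then log in with an access token that can download them.
from huggingface_hub import login
login()  # prompts for your Hugging Face access token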
Take a look at the code block below, which contains the entire application with some small adjustments made to save space.
from transformers import MllamaForConditionalGeneration, AutoProcessor, TextIteratorStreamer
from PIL import Image
import requests
import torch
from threading import Thread
import gradio as gr
from gradio import FileData
import time
import spaces  # only needed on Hugging Face Spaces (ZeroGPU); remove this import and the @spaces.GPU decorator when running on your own GPU

# Load the 11B vision-instruct checkpoint and its processor onto the GPU
ckpt = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(ckpt,
    torch_dtype=torch.bfloat16).to("cuda")
processor = AutoProcessor.from_pretrained(ckpt)

@spaces.GPU
def bot_streaming(message, history, max_new_tokens=250):
    txt = message["text"]
    ext_buffer = f"{txt}"

    messages = []
    images = []

    # Rebuild the chat history in the format expected by the processor's chat template
    for i, msg in enumerate(history):
        if isinstance(msg[0], tuple):  # an image turn
            messages.append({"role": "user", "content": [{"type": "text", "text": history[i+1][0]}, {"type": "image"}]})
            messages.append({"role": "assistant", "content": [{"type": "text", "text": history[i+1][1]}]})
            images.append(Image.open(msg[0][0]).convert("RGB"))
        elif isinstance(history[i-1], tuple) and isinstance(msg[0], str):
            # messages are already handled
            pass
        elif isinstance(history[i-1][0], str) and isinstance(msg[0], str):  # text only turn
            messages.append({"role": "user", "content": [{"type": "text", "text": msg[0]}]})
            messages.append({"role": "assistant", "content": [{"type": "text", "text": msg[1]}]})

    # Add the current message, with or without an attached image
    if len(message["files"]) == 1:
        if isinstance(message["files"][0], str):  # examples
            image = Image.open(message["files"][0]).convert("RGB")
        else:  # regular input
            image = Image.open(message["files"][0]["path"]).convert("RGB")
        images.append(image)
        messages.append({"role": "user", "content": [{"type": "text", "text": txt}, {"type": "image"}]})
    else:
        messages.append({"role": "user", "content": [{"type": "text", "text": txt}]})

    texts = processor.apply_chat_template(messages, add_generation_prompt=True)

    if images == []:
        inputs = processor(text=texts, return_tensors="pt").to("cuda")
    else:
        inputs = processor(text=texts, images=images, return_tensors="pt").to("cuda")

    # Stream tokens back to the UI as they are generated
    streamer = TextIteratorStreamer(processor, skip_special_tokens=True, skip_prompt=True)
    generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=max_new_tokens)
    generated_text = ""

    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    buffer = ""

    for new_text in streamer:
        buffer += new_text
        generated_text_without_prompt = buffer
        time.sleep(0.01)
        yield buffer

demo = gr.ChatInterface(fn=bot_streaming, title="Multimodal Llama",
    textbox=gr.MultimodalTextbox(),
    additional_inputs=[gr.Slider(
        minimum=10,
        maximum=1024,
        value=250,
        step=10,
        label="Maximum number of new tokens to generate",
    )
    ],
    cache_examples=False,
    description="Try Multimodal Llama by Meta with transformers in this demo. Upload an image, and start chatting about it, or simply try one of the examples below. To learn more about Llama Vision, visit [our blog post](https://huggingface.co/blog/llama32). ",
    stop_btn="Stop Generation",
    fill_height=True,
    multimodal=True)

demo.launch(debug=True)
Effectively, what is shown above is an incredibly concise pipeline. First, the script loads the model files onto the GPU. Next, it launches the Gradio application, which lets us interact with the model via our web browser. If we follow the instructions from the previously mentioned article, we can take the URL output by running the cell, paste it into the Simple Browser URL bar in VS Code, and open the new Gradio window in our browser. Alternatively, you can set the share parameter of demo.launch() to True in order to open a publicly accessible link.
All put together, when run we get something like this:
This interactive application allows us to interact with the model using either text or image inputs. In our experiments, we found the model to be incredibly robust with regard to image understanding, OCR, and object recognition. We tried a variety of different image types and received consistently strong results, showing a deep integration of textual and visual knowledge in the model.
Furthermore, the application on its own works quite well for normal LLM tasks. We were able to use it to generate working Python code, outline a story, and even assist with writing a short portion of this blog post! All in all, this application is incredibly versatile, and we will continue to use it for our LLM tasks going forward.
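Because the demo is built with Gradio, it also exposes an API endpoint while it is running, so the same assistant can be called programmatically from other scripts. Below is a rough sketch using the gradio_client package; the endpoint name, parameter order, and message format are assumptions based on how gr.ChatInterface typically exposes its API, so check the “Use via API” link at the bottom of the running app for the exact signature.

# A hedged sketch of calling the running Gradio app programmatically.
# pip install gradio_client
from gradio_client import Client

# URL printed by demo.launch(); use the public link instead if you set share=True
client = Client("http://127.0.0.1:7860/")

result = client.predict(
    {"text": "Write a haiku about GPU Droplets.", "files": []},  # multimodal message
    250,                                                          # max_new_tokens slider value
    api_name="/chat",                                             # assumed endpoint name for gr.ChatInterface
)
print(result)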
LLaMA 3.2 is incredible. In terms of open-source releases, it represents a tangible step towards the quality of the popular closed-source models that have dominated the market for the past year. It can do nearly everything they can at nearly the same level, from code generation to visual understanding to long-form storywriting. We encourage all our readers to try the HuggingFace Projects demo on a DigitalOcean GPU Droplet.
The possibilities for these models are really endless. From dynamic security cameras to comprehensive supply management to better chatbots, LLaMA is already improving all sorts of technologies around us. One thing we are interested in seeing in the near future is whether LLaMA 3.2 could be adapted as a suitable replacement for T5 models in popular text-to-image modeling, a technique being leveraged by teams like PixArt in their latest model to achieve SOTA quality.
And if our expectations are correct, be sure to watch out for LLaMA 4 in the very near future, too!