
Creating a Personal Assistant with LLaMA 3.2

Published on October 21, 2024



One of the most immediately useful applications of AI has to be the chatbot. Though chatbots have existed in some capacity for roughly 30 years, the last two years have seen the true rise of these personal assistant chat interfaces. They have percolated out from highly technical circles into all walks of life, social situations, and business use cases.

While they are widely available and easy to use, many of the best chatbots are locked behind the walled gardens of closed-source technology. These environments come with their own advantages, like internet search connectivity, custom profiles to interact with the agents, and access to some of the best Large Language Models (LLMs) available. However, the open-source community has always kept in step with these developments, and open-source tools come with their own advantages, namely customization, fine-tuning, and complex integrations with other technologies.

In this article, we are going to take a look at one of the latest and greatest open-source model suites: LLaMA 3.2. Its 11B parameter, vision-enabled models boast comparable performance to GPT-4o mini and Claude 3 Haiku across a wide variety of vision-instruction-tuned benchmarks at a significantly lower cost. Additionally, its lightweight 1B and 3B parameter text models offer comparable performance to SOTA edge models like Phi 3.5-mini and Gemma 2 2B. We will begin this tutorial with a breakdown of what makes these models special. Afterwards, we will show how to take this powerful open-source model and create a personal assistant chatbot. This personal assistant can help with anything from writing to coding to translation, and is fully enabled to work with image data as easily as text.

Follow along with us for a deeper dive into the advantages of using these incredible models on DigitalOcean GPU Droplets, with our guide to running a powerful assistant application adapted from Hugging Face Projects staffer merve’s original project.

Prerequisites

In order to follow along with this article, you will need experience with Python code and a beginner’s understanding of Deep Learning. We will operate under the assumption that all readers have access to sufficiently powerful machines, so they can run the code provided.

If you do not have access to a GPU, we suggest accessing one through the cloud. There are many cloud providers that offer GPUs. DigitalOcean GPU Droplets are currently in Early Availability; learn more and sign up for interest in GPU Droplets here.

For instructions on getting started with Python code, we recommend trying this beginner’s guide to set up your system and prepare to run beginner tutorials.

The LLaMA 3.2 Model Suite

The LLaMA 3.2 model suite comprises four main models: a small 11B vision-language model, a larger 90B vision-language model, and two lightweight text-only models (1B and 3B). The former models are capable of viewing and understanding both text and image data, while the latter stick to text. Each of these offers SOTA or near-SOTA performance for its class.

Source: Meta

The creators posit that their models offer comparable performance to the noted foundation models GPT-4o mini and Claude 3 Haiku. We can see from the graphic above that these claims are backed up by benchmarking across a wide variety of vision-language tasks, including math and problem-solving. The relative strength of the 11B model is particularly interesting, as it shows the potential to run near-peak-performance LLMs on weaker GPUs or even near-edge hardware. In short, we can expect competitive performance from these models when used side by side with top-tier closed-source models.

Source: Meta

In the next graphic, we can see the reported comparison of the lightweight, text-only models with the top competitors Phi 3.5-mini and Gemma 2 2B. Across all benchmarks, the lightweight LLaMA models performed better than or nearly the same as these noted foundation models. This shows that the LLaMA 3.2 models can deliver top-tier performance while running on weaker processors, with the potential to generate text even on handheld devices.

As we can see, both categories of the model suite are highly competitive with SOTA foundation models of the same caliber, and can even outperform closed-source alternatives.

Strengths and Weaknesses of LLaMA 3.2

When looking at strengths and weaknesses, the first critical thing to understand about LLaMA 3.2 is that it does not come with the same features attached as many closed-source tools, but it can be customized in ways that closed-source models cannot. Features like user profiles with saved information and custom settings, internet connectivity or search, and low-code add-ons for RAG are not built into LLaMA 3.2 unless we implement them ourselves.

With that said, let’s take a look at each of the main strengths of LLaMA 3.2 itself in comparison to other extant models:

Strengths

  • Fine-tuning & LoRAs: the ability to customize the existing model will always be the greatest strength of open-source models over their closed-source counterparts. Thanks to these techniques, it’s simple to take the incredible work already done on these LLMs and specialize them for our own task (see the brief sketch after this list). Whether this is for making a customer-service chatbot or a math machine, the possibilities here are fairly endless.
  • SOTA performance: comparable performance to state-of-the-art models of every size, including powerful closed-source models.
  • Custom Integrations: models like LLaMA 3.2 are easy to integrate into a wide variety of existing applications without the need to construct or use a potentially expensive or difficult to use API client.
  • Visual language analysis: where LLaMA has really impressed us is with understanding image data. It is capable of anything from OCR to interpretation to object detection, though its outputs are obviously limited to text.
  • Summarization and rewriting: as shown in their release blog, the models are highly capable at a wide variety of text tasks, but are highlighted especially with regards to these two tasks.
  • Cost: LLaMA 3.2 is both free to download and free to use (subject to Meta’s community license), and it is one of the most efficient models in its category. The lightweight models can run on extremely weak processors like those found in handheld devices.
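As promised above, here is a minimal sketch of how fine-tuning one of these models with LoRA might look, using the Hugging Face transformers and peft libraries. The checkpoint name and hyperparameters below are illustrative assumptions, and you would still need to supply your own dataset and training loop (for example, with transformers’ Trainer or trl’s SFTTrainer).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative choice: the lightweight text-only 3B checkpoint (gated; requires accepting Meta's license)
ckpt = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

# LoRA trains small low-rank update matrices on top of the frozen base weights
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank updates (assumed value)
    lora_alpha=32,                        # scaling factor for the updates (assumed value)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total parameters

# From here, train on task-specific data with the standard Trainer or trl's SFTTrainer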

Weaknesses

  • Achieving peak performance: despite the major strides forward, LLaMA 3.2 is still not quite at the level of the best closed-source models available, like Claude 3.5 Sonnet or GPT-4o.
  • Internet accessibility: without significant development on the part of the user, the models are incapable of accessing the internet or learning new information unless something like RAG is put in place to provide it. This is a key advantage held by many of the current top offerings, like those from Anthropic.
  • License: unlike some completely open-source models, it’s worth noting that LLaMA 3.2 is released under Meta’s community license, which places restrictions on how the models can be used commercially.

Now that we have examined where the model shines, let’s show how to run the demo. This will let us examine the model intimately as we create our own personal assistant tool.

Launching the LLaMA 3.2 demo on a DigitalOcean GPU Droplet

For this demo, we are going to be borrowing from the work of the Hugging Face Projects team. They have released a truly excellent Gradio demo on their website that uses the LLaMA 3.2 11B Vision-Instruct model. Before continuing, please be sure to go see their original work here.

To adapt the application, we are going to use the newly released DigitalOcean GPU Droplets. In order to follow along, we recommend using this guide to create a new GPU Droplet with a Jupyter Notebook. We are going to run the code shown in this section within a Jupyter Notebook running on our GPU Droplet, accessed in our local browser through an SSH connection in Visual Studio Code.

Once your machine is running, you have spun up the Notebook, and have accessed it in your local browser, continue with the next section.
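Before loading an 11B parameter model, it is worth confirming that the notebook can actually see the Droplet’s GPU and that the needed libraries (we used transformers, torch, accelerate, gradio, and huggingface_hub) are installed. Since the LLaMA 3.2 checkpoints are gated, you will also need to accept Meta’s license on Hugging Face and authenticate with huggingface-cli login. The snippet below is just a quick sanity check; exact versions and setup are up to you.

import torch

# Quick sanity check that the GPU Droplet's accelerator is visible to PyTorch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"VRAM: {vram_gb:.1f} GB")  # the 11B model in bfloat16 needs roughly 22 GB plus overhead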

The LLaMA 3.2 Personal Assistant Application Breakdown

The LLaMA 3.2 application from Hugging Face is, on its own, excellent. It lets you chat with a powerful chatbot that can interact with any image we upload. This alone allows us to do complicated tasks like Optical Character Recognition (OCR), object detection, or even reading. Additionally, because it is written with Gradio (which is built on FastAPI), the application also exposes an API endpoint that we can interact with directly.

Take a look at the code block below, which contains the entire application, with some small adjustments for space saving measures.

from transformers import MllamaForConditionalGeneration, AutoProcessor, TextIteratorStreamer
from PIL import Image
import torch
from threading import Thread
import gradio as gr
import time
import spaces  # the @spaces.GPU decorator only has an effect on Hugging Face Spaces; it is harmless elsewhere

# Load the gated 11B vision-instruct checkpoint in bfloat16 onto the GPU
ckpt = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(ckpt,
    torch_dtype=torch.bfloat16).to("cuda")
processor = AutoProcessor.from_pretrained(ckpt)


@spaces.GPU
def bot_streaming(message, history, max_new_tokens=250):
    # The current user turn: text plus any files attached via the multimodal textbox
    txt = message["text"]

    # Rebuild the prior conversation in the chat-template format the processor expects,
    # collecting any previously uploaded images along the way
    messages = []
    images = []

    for i, msg in enumerate(history): 
        if isinstance(msg[0], tuple):  # this turn included an uploaded image
            messages.append({"role": "user", "content": [{"type": "text", "text": history[i+1][0]}, {"type": "image"}]})
            messages.append({"role": "assistant", "content": [{"type": "text", "text": history[i+1][1]}]})
            images.append(Image.open(msg[0][0]).convert("RGB"))
        elif isinstance(history[i-1], tuple) and isinstance(msg[0], str):
            # messages are already handled
            pass
        elif isinstance(history[i-1][0], str) and isinstance(msg[0], str): # text only turn
            messages.append({"role": "user", "content": [{"type": "text", "text": msg[0]}]})
            messages.append({"role": "assistant", "content": [{"type": "text", "text": msg[1]}]})

    # add current message
    if len(message["files"]) == 1:
        
        if isinstance(message["files"][0], str): # examples
            image = Image.open(message["files"][0]).convert("RGB")
        else: # regular input
            image = Image.open(message["files"][0]["path"]).convert("RGB")
        images.append(image)
        messages.append({"role": "user", "content": [{"type": "text", "text": txt}, {"type": "image"}]})
    else:
        messages.append({"role": "user", "content": [{"type": "text", "text": txt}]})


    # Render the conversation with the model's chat template, then tokenize (with images if present)
    texts = processor.apply_chat_template(messages, add_generation_prompt=True)

    if images == []:
        inputs = processor(text=texts, return_tensors="pt").to("cuda")
    else:
        inputs = processor(text=texts, images=images, return_tensors="pt").to("cuda")

    # Stream tokens back to the UI as they are produced
    streamer = TextIteratorStreamer(processor, skip_special_tokens=True, skip_prompt=True)
    generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=max_new_tokens)

    # Run generation in a background thread so we can yield partial output to Gradio as it streams
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    buffer = ""
    for new_text in streamer:
        buffer += new_text
        time.sleep(0.01)
        yield buffer


demo = gr.ChatInterface(fn=bot_streaming, title="Multimodal Llama", 
      textbox=gr.MultimodalTextbox(), 
      additional_inputs = [gr.Slider(
              minimum=10,
              maximum=1024,
              value=250,
              step=10,
              label="Maximum number of new tokens to generate",
              
          )
        ],
      cache_examples=False,
      description="Try Multimodal Llama by Meta with transformers in this demo. Upload an image, and start chatting about it, or simply try one of the examples below. To learn more about Llama Vision, visit [our blog post](https://huggingface.co/blog/llama32). ",
      stop_btn="Stop Generation", 
      fill_height=True,
    multimodal=True)
    
demo.launch(debug=True)

Effectively, what is shown above is an incredibly concise pipeline. First, the script loads the model weights onto the GPU. Next, it launches the Gradio application, which lets us interact with the model from our web browser. If we follow the instructions from the previously mentioned article, we can take the URL output by the running cell, paste it into the Simple Browser URL bar in VS Code, and open the Gradio window in our browser. Alternatively, you can pass share=True to demo.launch() to get a publicly accessible link.
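Because Gradio serves an API alongside the web UI, the running demo can also be called programmatically. Below is a rough sketch using the gradio_client package; the endpoint name, payload shape, and the sample.jpg path are assumptions that may need adjusting for your Gradio version and setup.

# Sketch: calling the running demo's API from another Python process (paths and endpoint are illustrative)
from gradio_client import Client, handle_file

client = Client("http://127.0.0.1:7860/")    # or the public share link printed by demo.launch(share=True)
result = client.predict(
    {"text": "Describe this image in one sentence.",
     "files": [handle_file("sample.jpg")]},  # hypothetical local image path
    250,                                     # value for the "Maximum number of new tokens" slider
    api_name="/chat",
)
print(result)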

All put together, when run we get something like this:

[Screenshot: the Multimodal Llama Gradio interface running in the browser]

This interactive application allows us to interact with the model using either text or image inputs. In our experiments, we found the model to be remarkably capable at image understanding, OCR, and object recognition. We tried a variety of different image types and consistently received strong results, showing a deep integration of textual and visual knowledge within the model.

Furthermore, the application works quite well for normal LLM tasks on its own. We were able to use it to generate working Python code, draft a story outline, and even assist with writing a short portion of this blog post! All in all, this application is incredibly versatile, and we will continue to use it for our LLM tasks going forward.

Closing Thoughts

LLaMA 3.2 is incredible. In terms of open-source releases, it represents a tangible step towards the quality of the popular closed-source models that have dominated the market for the past year. It can handle nearly everything they can, at nearly the same level, from code generation to visual understanding to long-form storywriting. We encourage all our readers to try the Hugging Face Projects demo on a DigitalOcean GPU Droplet.

The possibilities for these models are nearly endless. From dynamic security cameras to comprehensive supply management to better chatbots, LLaMA is already improving all sorts of technologies around us. One thing we are interested in seeing in the near future is whether LLaMA 3.2 could be adapted as a suitable replacement for T5 models in popular text-to-image pipelines, a technique leveraged by teams like PixArt in their latest model to achieve SOTA quality.

And if our expectations are correct, be sure to watch out for LLaMA 4 in the very near future, too!

