Tutorial

Turning Your 1-Click Model GPU Droplets Into A Personal Assistant

Published on November 15, 2024


1-Click Models are a new collaborative project from DigitalOcean and Hugging Face that gives you an easy way to interface with some of the best open-source Large Language Models (LLMs) on the most powerful GPUs available in the cloud. Together, the two platforms let users make the most of the best open-source models with no setup hassle or coding required.

In this tutorial, we walk through the development of a voice-enabled personal assistant tool designed to run on any 1-Click Model enabled GPU Droplet. The application is built with Gradio and is fully API enabled through FastAPI. Follow along to learn about the advantages of using 1-Click Models, the basics of querying a deployed 1-Click Model GPU Droplet, and how to use the personal assistant on your own machines!

1-Click Hugging Face Models with DigitalOcean GPU Droplets

The new 1-Click Models come with a wide variety of LLM options suited to different use cases, namely:

  • meta-llama/Meta-Llama-3.1-8B-Instruct
  • meta-llama/Meta-Llama-3.1-70B-Instruct
  • meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
  • Qwen/Qwen2.5-7B-Instruct
  • google/gemma-2-9b-it
  • google/gemma-2-27b-it
  • mistralai/Mixtral-8x7B-Instruct-v0.1
  • mistralai/Mistral-7B-Instruct-v0.3
  • mistralai/Mixtral-8x22B-Instruct-v0.1
  • NousResearch/Hermes-3-Llama-3.1-8B
  • NousResearch/Hermes-3-Llama-3.1-70B
  • NousResearch/Hermes-3-Llama-3.1-405B
  • NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO

Creating a new GPU Droplet with any of these models requires nothing more than the standard GPU Droplet setup process, as shown here.

Watch the following video for a full step-by-step guide to creating a 1-Click Model GPU Droplet, and check out this article for more details on launching a new instance.

Once you have set up your new machine, navigate to the next section for more detail on interacting with your 1-Click Model.

Interacting with the 1-Click Model Deployment

Connecting to the 1-Click Model deployment is simple if we want to interact with it on the same machine. “When connected to the HUGS Droplet, the initial SSH message will display a Bearer Token, which is required to send requests to the public IP of the deployed HUGS Droplet. Then you can send requests to the Messages API via either localhost if connected within the HUGS Droplet, or via its public IP.” (Source). To access the deployment from other machines, we need that Bearer Token: connect to the Droplet over SSH, copy the token, and save it for later. If we only want to interact with the inference endpoint from the GPU Droplet itself, things are simpler, since the variable is already saved to the environment.
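If you want to double-check that the token is actually visible on whichever machine you plan to query from, here is a minimal sanity check in Python (assuming the same BEARER_TOKEN variable name used throughout this tutorial):

import os

# Already set for you on the GPU Droplet itself; on any other machine, export
# BEARER_TOKEN with the value shown in the Droplet's initial SSH message first.
token = os.getenv("BEARER_TOKEN")
assert token is not None, "BEARER_TOKEN is not set"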

Once the Bearer Token variable is set on the machine we have chosen to use, we can begin running inference against the model. There are currently two routes for doing this: cURL and Python. The endpoint automatically runs on port 8080, so requests default to our own machine. If we are querying from a different machine, change the localhost value below to the Droplet's public IPv4 address.

cURL

curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"user","content":"What is Deep Learning?"}],"temperature":0.7,"top_p":0.95,"max_tokens":128}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $BEARER_TOKEN"

This code will ask the model “What is Deep Learning?” and issue a response in the following format:

{"object":"chat.completion","id":"","created":1731532721,"model":"hfhugs/Meta-Llama-3.1-8B-Instruct","system_fingerprint":"2.3.1-dev0-sha-169178b","choices":[{"index":0,"message":{"role":"assistant","content":"**Deep Learning: A Subfield of Machine Learning**\n=====================================================\n\nDeep learning is a subfield of machine learning that focuses on the use of artificial neural networks to analyze and interpret data. It is inspired by the structure and function of the human brain and is particularly well-suited for tasks such as image and speech recognition, natural language processing, and data classification.\n\n**Key Characteristics of Deep Learning:**\n\n1. **Artificial Neural Networks**: Deep learning models are composed of multiple layers of interconnected nodes or \"neurons\" that process and transform inputs into outputs.\n2. **Non-Linear Transformations**: Each layer applies a non-linear"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":40,"completion_tokens":128,"total_tokens":168}}

This can then be plugged into a variety of web development applications as needed.
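For example, here is a minimal sketch of sending the same payload from Python with the requests package (an assumption on our part; any HTTP client will do) and pulling out just the assistant's reply:

import os
import requests

# Swap localhost for the Droplet's public IPv4 address if querying from another machine
url = "http://localhost:8080/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('BEARER_TOKEN')}",
}
payload = {
    "messages": [{"role": "user", "content": "What is Deep Learning?"}],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 128,
}
response = requests.post(url, json=payload, headers=headers).json()
print(response["choices"][0]["message"]["content"])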

Python

The model can also be accessed from Python using either the Hugging Face Hub or OpenAI packages. We will use the Hugging Face Hub reference code for this demonstration.

### Hugging Face Hub
import os
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8080", api_key=os.getenv("BEARER_TOKEN"))

chat_completion = client.chat.completions.create(
    messages=[
        {"role":"user","content":"What is Deep Learning?"},
    ],
    temperature=0.7,
    top_p=0.95,
    max_tokens=128,
)

This will return a formatted response as a ChatCompletionOutput object.

## HuggingFace Hub
ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='length', index=0, message=ChatCompletionOutputMessage(role='assistant', content='**Deep Learning: An Overview**\n\nDeep Learning is a subset of Machine Learning that involves the use of Artificial Neural Networks (ANNs) with multiple layers to analyze and interpret data. These networks are inspired by the structure and function of the human brain, with each layer processing the input data in a hierarchical manner.\n\n**Key Characteristics:**\n\n1.  **Multiple Layers:** Deep Learning models typically have 2 or more hidden layers, allowing them to learn complex patterns and relationships in the data.\n2.  **Neural Networks:** Deep Learning models are based on artificial neural networks, which are composed of interconnected nodes (neurons) that process', tool_calls=None), logprobs=None)], created=1731532948, id='', model='hfhugs/Meta-Llama-3.1-8B-Instruct', system_fingerprint='2.3.1-dev0-sha-169178b', usage=ChatCompletionOutputUsage(completion_tokens=128, prompt_tokens=40, total_tokens=168))

We can print just the output with:

chat_completion.choices[0]['message']['content']
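Since the deployment also works with the OpenAI client, the same request can be made that way. The snippet below is a minimal sketch under a few assumptions: the openai package is installed, the same BEARER_TOKEN environment variable is set, and the model field is treated as a placeholder since the Droplet serves a single model.

import os
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key=os.getenv("BEARER_TOKEN"))

chat_completion = client.chat.completions.create(
    model="tgi",  # placeholder value (assumption); the deployment serves a single model
    messages=[{"role": "user", "content": "What is Deep Learning?"}],
    temperature=0.7,
    top_p=0.95,
    max_tokens=128,
)

print(chat_completion.choices[0].message.content)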

Creating a Voice Enabled Personal Assistant

To make the best use of this powerful new tool, we have developed a new personal assistant application to run with the models. The application is fully voice enabled: it can listen to spoken inputs and read its outputs back out loud. To make this possible, the demo uses Whisper to transcribe an audio input (or takes plain text directly) and passes it to the LLM powered by the 1-Click GPU Droplet to generate a text response. We then use Coqui-AI's XTTS v2 model to convert that text into an understandable audio output. It is worth noting that the software uses voice cloning to generate the output audio, so users will hear a voice close to their own speaking voice.

Take a look at the code below:

import os

import gradio as gr
import torch
import scipy.io.wavfile as wavfile
from huggingface_hub import InferenceClient
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from TTS.api import TTS


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

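# Speech-to-text: load Whisper large-v3 to transcribe the user's recorded audio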
model_id_w = "openai/whisper-large-v3"

model_w = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id_w, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model_w.to(device)

processor = AutoProcessor.from_pretrained(model_id_w)

pipe_w = pipeline(
    "automatic-speech-recognition",
    model=model_w,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

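# Chat client pointed at the 1-Click model endpoint on port 8080; swap localhost
# for the Droplet's public IP if this app runs on a different machine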
client = InferenceClient(base_url="http://localhost:8080", api_key=os.getenv("BEARER_TOKEN"))

# Example voice cloning with YourTTS in English, French and Portuguese
# tts = TTS("tts_models/multilingual/multi-dataset/bark", gpu=True)

# get v2.0.2
tts = TTS(model_name="xtts_v2.0.2", gpu=True)

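# Build the Gradio interface: chat window, text and audio inputs, plus sliders for the generation parameters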
with gr.Blocks() as demo:
    chatbot = gr.Chatbot(type="messages")
    with gr.Row():
        msg = gr.Textbox(label = 'Prompt')
        audi = gr.Audio(label = 'Transcribe audio')
    with gr.Row():
        submit = gr.Button('Submit')
        submit_audio = gr.Button('Submit Audio')
        read_audio = gr.Button('Transcribe Text to Audio')
        clear = gr.ClearButton([msg, chatbot])
    with gr.Row():
        token_val = gr.Slider(label = 'Max new tokens', value = 512, minimum = 128, maximum = 1024, step = 8, interactive=True)
        temperature_ = gr.Slider(label = 'Temperature', value = .7, minimum = 0, maximum =1, step = .1, interactive=True)
        top_p_ = gr.Slider(label = 'Top P', value = .95, minimum = 0, maximum =1, step = .05, interactive=True)

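    # Handle a typed prompt: query the LLM and append both turns to the chat history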
    def respond(message, chat_history, token_val, temperature_, top_p_):
        bot_message = client.chat.completions.create(
            messages=[{"role": "user", "content": f"{message}"}],
            temperature=temperature_, top_p=top_p_, max_tokens=token_val,
        ).choices[0]['message']['content']
        chat_history.append({"role": "user", "content": message})
        chat_history.append({"role": "assistant", "content": bot_message})
        # tts.tts_to_file(bot_message, speaker_wav="output.wav", language="en", file_path="output.wav")

        return "", chat_history, #"output.wav"
    
    def respond_audio(audi, chat_history, token_val, temperature_, top_p_):
        # Gradio's Audio component returns (sample_rate, data); keep the original sample rate
        wavfile.write("output.wav", audi[0], audi[1])
        result = pipe_w('output.wav')
        message = result["text"]
        print(message)
        bot_message = client.chat.completions.create(
            messages=[{"role": "user", "content": f"{message}"}],
            temperature=temperature_, top_p=top_p_, max_tokens=token_val,
        ).choices[0]['message']['content']
        chat_history.append({"role": "user", "content": message})
        chat_history.append({"role": "assistant", "content": bot_message})
        # tts.tts_to_file(bot_message, speaker_wav="output.wav", language="en", file_path="output2.wav")
        # tts.tts_to_file(bot_message,
                # file_path="output.wav",
                # speaker_wav="output.wav",
                # language="en")
        return "", chat_history, #"output.wav"
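
    # Playback: read the latest assistant reply aloud, cloning the voice from the
    # user's recorded audio saved in output.wav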
    def read_text(chat_history):
        print(chat_history)
        print(type(chat_history))
        tts.tts_to_file(chat_history[-1]['content'],
                file_path="output.wav",
                speaker_wav="output.wav",
                language="en")
        return 'output.wav'


    msg.submit(respond, [msg, chatbot, token_val, temperature_, top_p_], [msg, chatbot])
    submit.click(respond, [msg, chatbot, token_val, temperature_, top_p_], [msg, chatbot])
    submit_audio.click(respond_audio, [audi, chatbot, token_val, temperature_, top_p_], [msg, chatbot])
    read_audio.click(read_text, [chatbot], [audi])
demo.launch(share = True)

Put together, this integrated system makes it possible to take full advantage of the speed and availability of a cloud GPU to act as a personal assistant for all kinds of tasks. We have been using it in place of popular closed source tools like Gemini and ChatGPT, and have really been impressed with the results.

Setting up & Running the Demo

To install the required packages onto your GPU Droplet, paste the following into the terminal:

pip install gradio tts huggingface_hub transformers datasets scipy torch torchaudio

To run the demo, paste the code above into an empty Python file (we will call it app.py) on your 1-Click Model enabled GPU Droplet, and run it with python3 app.py.

Closing Thoughts

The personal assistant application developed for this tutorial has already proven useful in our daily lives, and we hope others can find some utility in it as well. Furthermore, the new 1-Click Model GPU Droplets offer an interesting alternative to enterprise LLM software. While costly for single users, there are a number of use cases (notably running the largest open-source LLMs) that can justify the expenditure. Our new offerings include the largest Mixtral and Llama models available, so this is an interesting opportunity to test the power of these models against the best of the competition.

Thank you for reading!
