Tutorial

An Overview of Sesame’s Conversational Speech Model

Published on April 12, 2025

Introduction

Widespread voice interfaces have the potential to drastically impact how we interact with technology. The issue is that these voice models are not yet efficient or intelligent enough to be seamlessly integrated with our devices. An effective voice model needs to understand context, handle ambiguity, and pick up subtle nuances in tone, while also responding promptly and appropriately.

Where are we now?

Traditional voice pipelines, in which a speech-to-text (STT) model is followed by a Large Language Model (LLM) and a text-to-speech (TTS) model, can generate increasingly natural responses, but they often suffer from latency and limited context, making sustained natural conversation difficult.
In a previous article, we explored various TTS models including F5-TTS, Kokoro TTS, Spark TTS, and Sesame CSM, highlighting their capabilities and limitations across multilingual speech generation and voice cloning. This article will take a closer look at Sesame CSM, with a step-by-step implementation guide for deploying the model on a DigitalOcean GPU Droplet.

Sesame CSM

Sesame seeks to close the gap in high-quality conversational models with their Conversational Speech Model (CSM), which aims to generate more natural and contextually appropriate speech by conditioning on the history of the conversation.

Sesame intends to achieve “voice presence”, which means factoring various aspects of effective communication, such as emotional intelligence, timing, pauses, interruptions, emphasis, tone, style, and consistency, into their models.

While Sesame acknowledges that they haven’t quite achieved voice presence, we’re very excited about their potential to drastically improve the landscape of speech models. Their demo features two voices, Maya and Miles, developed by fine-tuning their base model for friendliness and expressivity. We encourage you to check out their demo.

Primer on Audio Tokenization

One method for processing audio with transformer models involves converting continuous audio waveforms into sequences of discrete tokens using audio tokenizers. This transformation allows transformer architectures, which traditionally work with discrete data, to effectively model and generate audio content.
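As a toy illustration of this idea (not Sesame’s actual tokenizer, which is far more sophisticated), the sketch below discretizes a waveform into integer token IDs with a simple uniform quantizer and then inverts the mapping:

```python
import numpy as np

# Toy illustration: turn a continuous waveform into discrete token IDs
# by uniform quantization, then map the IDs back to approximate samples.
def tokenize(waveform, n_tokens=256):
    # Map samples in [-1, 1] to integer token IDs in [0, n_tokens - 1].
    ids = np.clip(((waveform + 1) / 2 * (n_tokens - 1)).round(), 0, n_tokens - 1)
    return ids.astype(np.int64)

def detokenize(ids, n_tokens=256):
    # Invert the mapping back to approximate sample values in [-1, 1].
    return ids / (n_tokens - 1) * 2 - 1

wave = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)  # 0.1 s of a 440 Hz tone
tokens = tokenize(wave)
recon = detokenize(tokens)
print(tokens[:5], np.abs(wave - recon).max())
```

Real audio tokenizers learn their discrete codes rather than quantizing raw samples, but the principle is the same: the transformer sees a sequence of integers, not a continuous signal.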

Traditionally, there are two types of audio tokens: semantic tokens, which are speaker-invariant and capture meaning, and acoustic tokens, which are fine-grained acoustic encodings often generated from Residual Vector Quantization (RVQ).

Residual Vector Quantization

Vector Quantization (VQ) is a data compression technique, initially used in signal processing and later in machine learning, that approximates high-dimensional vectors using a smaller, finite set of representative vectors known as codewords, stored in a structure called a codebook. RVQ builds on VQ by employing a multi-stage quantization process: instead of using a single codebook, RVQ uses a series of codebooks, where each codebook quantizes the residual error, i.e., the difference between the original vector and the approximation obtained from the previous stage.
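The multi-stage process can be sketched in a few lines. The codebook count, codebook size, and vector dimension below are arbitrary illustrative values, not those of any production tokenizer:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(vec, codebook):
    # Nearest codeword by Euclidean distance.
    idx = int(np.argmin(np.linalg.norm(codebook - vec, axis=1)))
    return idx, codebook[idx]

def rvq_encode(vec, codebooks):
    # Each stage quantizes the residual left over by the previous stage.
    residual = vec.copy()
    indices = []
    for cb in codebooks:
        idx, codeword = quantize(residual, cb)
        indices.append(idx)
        residual = residual - codeword  # next stage sees what's left
    return indices

def rvq_decode(indices, codebooks):
    # Reconstruction is the sum of the selected codewords across stages.
    return sum(cb[i] for i, cb in zip(indices, codebooks))

dim, n_codewords, n_stages = 8, 16, 4
codebooks = [rng.normal(size=(n_codewords, dim)) for _ in range(n_stages)]
vec = rng.normal(size=dim)
idx = rvq_encode(vec, codebooks)
recon = rvq_decode(idx, codebooks)
print(idx, np.linalg.norm(vec - recon))
```

In a trained codec the codebooks are learned so that each successive stage meaningfully shrinks the residual; here they are random, so only the mechanics are shown.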

RVQ methods suffer from a significant delay issue due to their sequential codebook processing. With N codebooks, it takes N forward passes before the first audio chunk can be decoded, making them impractical for real-time applications despite being acceptable in offline contexts like audiobooks.
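A back-of-the-envelope estimate makes the delay concrete; both numbers below are illustrative assumptions, not measurements of any particular model:

```python
# With N codebooks decoded sequentially, the first audio chunk is only
# available after N forward passes. Assuming 8 codebooks and 10 ms per
# forward pass (illustrative values), the first chunk arrives after:
n_codebooks = 8
ms_per_pass = 10.0
first_chunk_latency_ms = n_codebooks * ms_per_pass
print(first_chunk_latency_ms)  # 80.0
```

For a real-time conversational system, that delay is incurred before any audio at all can be played, which is why sequential RVQ decoding is a poor fit for live dialogue.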

CSM Model Architecture

Sesame CSM is an end-to-end multimodal text and speech model comprising two autoregressive transformers: a multimodal backbone and an audio decoder. Both transformers are variants of the Llama architecture.

Since CSM is a transformer-based architecture, it requires tokens for processing. Sesame uses a Llama tokenizer for generating text tokens and a split-RVQ tokenizer (Mimi) for producing one semantic codebook and N – 1 acoustic codebooks per frame at 12.5 Hz.
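To get a feel for the resulting token rate, the arithmetic below assumes N = 32 codebooks per frame; that value is an illustrative assumption, not something stated in this article:

```python
# Back-of-the-envelope token count for Mimi-style tokenization:
# at 12.5 frames per second with N codebooks per frame, one second of
# audio becomes 12.5 * N discrete audio tokens.
frame_rate_hz = 12.5
n_codebooks = 32          # assumed: 1 semantic + (N - 1) acoustic
seconds = 10.0

tokens_per_second = frame_rate_hz * n_codebooks
total_tokens = tokens_per_second * seconds
print(tokens_per_second, total_tokens)  # 400.0 tokens/s, 4000.0 tokens
```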

The figure below depicts the split-RVQ tokenizer, which was also used in Moshi, a speech-text foundation model for real-time dialogue.

During inference, the multimodal backbone is sequentially fed interleaved text and audio tokens.

Sesame faces training challenges due to the audio decoder’s high memory demands: it autoregressively processes an effective batch of B × S × N elements, where B is the batch size, S the sequence length, and N the number of RVQ codebook levels, which slows training and limits scalability.

To address this, Sesame uses compute amortization to reduce memory usage while maintaining full RVQ codebook fidelity. Specifically, the audio decoder is trained on a random 1/16 subset of frames, while the zeroth codebook is trained on every frame, achieving this efficiency without any noticeable loss in performance.
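The masking scheme can be sketched as follows; the shapes and the per-sequence sampling strategy are illustrative assumptions rather than details confirmed by Sesame:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the compute-amortization idea: the zeroth codebook
# contributes a loss on every frame, while the remaining codebooks
# are trained on a random 1/16 subset of frames.
batch, seq_len = 4, 64

# Zeroth-codebook loss mask: all frames.
zeroth_mask = np.ones((batch, seq_len), dtype=bool)

# Decoder loss mask: 1/16 of frames, sampled independently per sequence.
n_subset = seq_len // 16
subset = np.stack([rng.choice(seq_len, size=n_subset, replace=False)
                   for _ in range(batch)])
decoder_mask = np.zeros((batch, seq_len), dtype=bool)
np.put_along_axis(decoder_mask, subset, True, axis=1)

print(zeroth_mask.sum(), decoder_mask.sum())  # 256 frames vs 16 frames
```

In a training loop, the expensive per-codebook decoder loss would only be computed where `decoder_mask` is true, cutting the decoder’s memory footprint by roughly a factor of 16.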

Try it yourself

Sesame is releasing an open-source base generation model that hasn’t been fine-tuned for any specific voice. While the current model only supports one language, Sesame plans to expand language support to over 20 languages in future releases. The models will be open-sourced under the Apache 2.0 license.

Feel free to try out the current model in their Hugging Face space or access their code on GitHub.

Implementing the model on a DigitalOcean GPU Droplet

Begin by setting up a DigitalOcean GPU Droplet: select AI/ML and choose the NVIDIA H100 option.

Furthermore, you will need access to the model weights, which can be requested on the model’s Hugging Face page.

Additionally, a Hugging Face Token is necessary to run this model, which can be obtained from the Hugging Face Access Token page. Note that you may need to create a Hugging Face account.
When creating a token, make sure to check “Read access to contents of all public gated repos you can access”.


# Clone the CSM repository and install its dependencies
git clone https://github.com/SesameAILabs/csm
cd csm
pip install -r requirements.txt

# Disable torch.compile and authenticate with your Hugging Face token
export NO_TORCH_COMPILE=1
huggingface-cli login

# Generate a sample conversation as a .wav file
python run_csm.py

The output of run_csm.py is a .wav file that you can open up and listen to. Lines 87-90 of run_csm.py can be modified to alter the conversation to your liking.

Conclusion

This article explored Sesame’s Conversational Speech Model (CSM) and provided instructions for deploying it on a DigitalOcean GPU Droplet.
Sesame’s CSM seeks to improve digital conversation by weaving in the warm pauses, gentle tone shifts, and thoughtful responses we cherish in human conversation. An effective voice interface goes a long way in improving how we interact with technology – whether that be through smart devices or digital assistants.

References

Sesame: Crossing the Uncanny Valley of Conversational Voice

Paper: Recent Advances in Discrete Speech Tokens: A Review

About the author
Melani Maheswaran
