
Choosing the Best Text-to-Speech Models: F5-TTS, Kokoro, SparkTTS, and Sesame CSM

Published on April 2, 2025

James Skelton // Technical Evangelist // AI Arcanist


Large Language Modeling has been, for very good reason, one of the most prominent and effective developments to come from the AI revolution. These models have enabled numerous applications in different fields, including knowledgeable chatbots, functional agents, and general text generation. Correspondingly, there has been a race to combine other modalities with the power of these models: from vision understanding to function calling to speech generation, the push has been to make these models even more connected and useful.

One of the most exciting potential use cases for Large Language Models is generating large swathes of text for audio subject matter, like podcasts, scripts, or even entire stories. With that comes an interesting question: can AI produce human-sounding speech?

In this article, we are going to review four of the best open-source Text-to-Speech (TTS) models. Specifically, we will compare the effectiveness of F5-TTS, Kokoro, SparkTTS, and the newly released Sesame CSM at generating a paragraph of speech audio. We will make a qualitative assessment of both the speech's fidelity to the input text and its handling of punctuation and pauses. Together, we hope these tests give a concrete answer as to which model might be the best for a given use case. We will also note where some models are faster than others, though almost all of them are blindingly fast.

Kokoro

Kokoro is the first TTS model we are going to cover in this short review. We don't know much about Kokoro's internals because there has never been a paper release. What we do know can be gleaned from the HuggingFace Model Card, which shows that the architecture is based on the work of StyleTTS2.

Kokoro is an extremely lightweight TTS model released under the Apache license. With only 82 million parameters, the model can be deployed in all sorts of environments, from production to edge devices to personal projects. The model is multilingual, capable of generating voices in a plethora of languages including Japanese, Hindi, and Thai. Notably, it doesn't have the native voice cloning capabilities of the other models we will look at. Instead, there is a library of available voices, represented as tensor objects, for the user to choose from. Each of these is curated and effective.

Kokoro was trained entirely on public-domain audio and audio released under permissive licenses (Apache, MIT, etc.). Interestingly, the training data comprises less than a thousand hours of recorded audio. This enabled the model to be trained for only 1,000 USD on NVIDIA A100 GPUs, a real accomplishment in an era of expensive LLM, image, and video model training.

Running Kokoro TTS on a GPU Droplet

Kokoro TTS is so efficient, and DigitalOcean's GPU Droplets are so powerful, that Kokoro can generate speech faster than it can be spoken when executed with Python code. To see and hear this, we can spin up a GPU Droplet and set up our environment with access to Jupyter Lab. Follow this guide for more detailed instructions on doing so.

Once your Droplet has spun up, we can get started by cloning the repository onto our machine. We will then install the required packages into a virtual environment and launch the web application GUI provided to serve the TTS models. We can do this all at once by pasting the following into the terminal:

git clone https://github.com/hexgrad/kokoro
cd kokoro
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cd demo/
python app.py --share

This will launch a Gradio web application serving the Kokoro TTS models. With this, we are given a wide variety of male and female voices to use for generating our speech. Additionally, to help provide example text, there is a random quote generator and two book quote generators drawing from Frankenstein and The Great Gatsby. We suggest testing the available voices to see how they compare with one another.

In our testing, Kokoro generated a 30-second speech sample, using the third paragraph of this article, in less than a second. The audio was of high quality with very little distortion, and could be generated all at once or streamed. The model was excellent at handling both pauses and punctuation, and sounded human-like. Our only minor complaint would be that the audio still retains a slightly stilted, emotionless delivery that makes it obvious the sample was generated with AI.
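
If you would rather skip the Gradio app and generate speech directly from Python, the kokoro pip package exposes a small pipeline interface. The snippet below is a minimal sketch based on the usage shown on the HuggingFace Model Card; the voice name (af_heart), language code, and 24 kHz sample rate are taken from that card and may change between releases, so treat it as a starting point rather than a definitive recipe.

# Minimal sketch of programmatic Kokoro generation, adapted from the model card.
# Assumes: pip install kokoro soundfile (and espeak-ng installed on the system).
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code="a")  # "a" selects American English voices

text = "In this article, we are going to review four of the best open-source Text-to-Speech models."

# The pipeline yields audio chunk by chunk, so output can be streamed or saved all at once.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"kokoro_out_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio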

Kokoro proves to be a potent TTS system, but it lacks some of the capabilities that the other models in this article offer. Notably, Kokoro cannot do voice cloning from an audio sample. The next two models we cover, SparkTTS and F5-TTS, are built around voice cloning.

SparkTTS

The next TTS system we will cover is SparkTTS. Spark is powered by a novel innovation called BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. Unlike traditional systems, this disentangled tokenization ensures that the generated audio better mimics the reference speech sample.

“This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). Their experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis.” (Source)

Run Spark TTS

Similar to Kokoro, the creators of SparkTTS provide a convenient web demo for running their models. Paste the following into the terminal to set up the environment, download all the pretrained model files, and then run the web demo.

git clone https://github.com/SparkAudio/Spark-TTS
cd Spark-TTS/
pip install -r requirements.txt
mkdir pretrained_models
apt-get install git-lfs
cd pretrained_models/ 
git-lfs clone https://huggingface.co/SparkAudio/Spark-TTS-0.5B
cd ..
python webui.py --device 0

Once the web demo is spun up, we can begin testing using our own audio samples. We recommend recording a ten-second sample of your own voice to see how the model performs directly! In our experiments, we found Spark to be inferior to Kokoro in nearly every way. Specifically, generation took upwards of ten seconds, the voice cloning was ineffective and frequently added regional accents, and punctuation and pauses were handled poorly, creating long gaps between statements. While Spark shows a lot of promise because of its innovative design, we would argue that this release is not quite comparable to other SOTA TTS models.

F5-TTS

Next up is F5-TTS, our personal favorite of the releases we are covering in this article. F5 is an evolution of the earlier E2, another excellent TTS model. To put it succinctly, both models are designed as a “fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation” (Source). Where they differ is that F5 adds an initial step that models the text input with ConvNeXt, refining the text representation so that it is easier to align with the speech before the denoising is performed. See the training process illustrated in the graphic below.

(Figure: F5-TTS training overview. Image source: https://arxiv.org/abs/2212.09748)

For inference, the process is inverted. To generate speech, the model starts with an audio prompt and its mel spectrogram features, the prompt's transcription, and a text prompt containing the content to be spoken. The audio prompt provides the speaker characteristics, while the text prompt guides the content of the generated speech. The speech is then generated through the denoising process, using the text tokens and audio prompt as guidance.

Run F5-TTS on GPU Droplets

To run F5-TTS on a GPU Droplet, we can follow instructions much like those for the previous models. There are a few extra steps to install additional required packages, but these are all included below. Copy and paste the following into the terminal in the desired directory.

git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install --upgrade pip
pip install ffmpeg-python
apt-get install ffmpeg
pip install -e .
f5-tts_infer-gradio

This will launch the F5 Gradio inference web UI. We recommend testing each of the available demos, including basic speech generation, multi-speaker speech generation, and voice chatting. The multi-speaker speech generation in particular is very impressive, and until recently F5 was arguably the best model at this task.
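
For scripted generation and voice cloning outside the Gradio UI, F5-TTS also installs a Python API (alongside a command-line entry point). The sketch below follows the pattern in the project README at the time of writing; the class and parameter names (F5TTS, ref_file, ref_text, gen_text) come from that README and may differ between versions, and the reference audio path is a placeholder you should replace with your own roughly ten-second sample.

# Sketch of zero-shot voice cloning with the F5-TTS Python API, adapted from the README.
# Replace ref_clip.wav and its transcript with your own short reference sample.
from f5_tts.api import F5TTS

f5tts = F5TTS()  # downloads the default F5-TTS checkpoint on first use

wav, sr, spect = f5tts.infer(
    ref_file="ref_clip.wav",                       # reference audio providing the speaker's voice
    ref_text="Transcript of the reference clip.",  # transcription of the reference audio
    gen_text="This sentence will be spoken in the cloned voice.",
    file_wave="f5_out.wav",                        # write the generated audio to disk
    seed=-1,                                       # -1 picks a random seed
)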

Now let’s look at the TTS Model taking the internet by storm recently, Sesame CSM.

Sesame CSM

(Source: https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice)

Sesame Conversational Speech Model (CSM) is a multimodal text-and-speech model that operates directly on Residual Vector Quantization (RVQ) tokens representing both the semantic and acoustic elements learned from the training data. To achieve this, it uses two transformer models, split at the zeroth codebook. The first, a multimodal backbone, processes interleaved text and audio to model the zeroth codebook. The second, an audio decoder, uses a distinct linear head for each codebook and models the remaining N - 1 codebooks to reconstruct speech from the backbone's representations. The decoder is significantly smaller than the backbone, enabling low-latency generation while keeping the model end-to-end.
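
To make that split concrete, here is a toy, illustrative sketch of the two-transformer layout in PyTorch. This is not Sesame's code: the layer sizes, codebook count, and vocabulary size are made-up example values, and the real model interleaves text and audio tokens with far more machinery. It only illustrates the idea of a large backbone predicting codebook zero and a small decoder with one head per remaining codebook.

# Toy illustration (not Sesame's actual implementation) of CSM's backbone/decoder split.
import torch.nn as nn

NUM_CODEBOOKS = 32   # example value for the N RVQ codebooks
VOCAB = 2048         # example codebook vocabulary size
D_BACKBONE, D_DECODER = 1024, 512

class ToyCSM(nn.Module):
    def __init__(self):
        super().__init__()
        # Large multimodal backbone: consumes interleaved text/audio embeddings
        # and is responsible for predicting the zeroth codebook.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_BACKBONE, nhead=8, batch_first=True), num_layers=12)
        self.codebook0_head = nn.Linear(D_BACKBONE, VOCAB)
        # Much smaller audio decoder: maps the backbone state down and uses a
        # distinct linear head per remaining codebook (1..N-1), keeping latency low.
        self.proj = nn.Linear(D_BACKBONE, D_DECODER)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_DECODER, nhead=8, batch_first=True), num_layers=2)
        self.codebook_heads = nn.ModuleList(
            [nn.Linear(D_DECODER, VOCAB) for _ in range(NUM_CODEBOOKS - 1)])

    def forward(self, interleaved_embeddings):        # (batch, seq, D_BACKBONE)
        h = self.backbone(interleaved_embeddings)
        logits0 = self.codebook0_head(h)               # zeroth-codebook logits
        d = self.decoder(self.proj(h))                 # cheap pass for the rest
        logits_rest = [head(d) for head in self.codebook_heads]
        return logits0, logits_rest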

CSM has perhaps the most impressive TTS demonstration we have found on the web, hosted on Sesame's main site. It can speak with the knowledge of a large language model, but behaves in a remarkably human-like way. We recommend trying out that demo before trying CSM on DigitalOcean's GPU Droplets, mainly because it has clearly been optimized further than the open-source release.

Run Sesame CSM on GPU Droplets

We recommend running CSM from Python in order to customize the outputs. One way to do this is with a Jupyter Notebook, but one could alternatively execute a Python script. Conveniently, a demo script is provided in the repository. We can install the required packages and get access to the models on HuggingFace with the following commands.

git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export NO_TORCH_COMPILE=1

# You will need access to CSM-1B and Llama-3.2-1B
huggingface-cli login

You will be prompted for your HuggingFace access token (a read-only token works) at this point. Enter it, then go to the CSM-1B and Llama-3.2-1B model pages to request access. Once that's complete, we can execute the demo script as-is or edit it with vim/nano. We recommend testing the unmodified script first.

python run_csm.py

This will generate a “full_conversation.wav” file containing a conversation between two speakers. We can replace the speaker prompts with our own audio samples to do voice cloning. In our experiments, this is the best multi-speaker model currently available. However, it was not as good as F5 qualitatively or with regard to word error rate on longer generations.
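
If you want a single utterance rather than the full two-speaker demo, the repository also exposes a small Python API used by run_csm.py. The sketch below follows the pattern in the project README; the load_csm_1b helper and the generate arguments are taken from that README and may change between releases, so treat it as a sketch.

# Sketch of single-utterance generation with CSM, adapted from the project README.
# Run inside the cloned csm/ directory with the virtual environment active.
import torchaudio
from generator import load_csm_1b

generator = load_csm_1b(device="cuda")  # pulls the CSM-1B weights from HuggingFace

audio = generator.generate(
    text="Hello from a DigitalOcean GPU Droplet.",
    speaker=0,                 # speaker id; pass prior Segments in `context` to clone a voice
    context=[],                # empty context uses the model's default voice
    max_audio_length_ms=10_000,
)

torchaudio.save("csm_out.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)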

Choosing the Best Model for TTS

Choosing the best TTS model comes down to three main factors: word-error-rate (WER) minimization, voice cloning, and acoustic tokenization of non-verbal vocal cues and tones.

With regard to the first, Kokoro and F5 are the clear standouts from our experimentation. The WER was extremely low in all of our tests and in the demos the projects provide. We recommend Kokoro if cloning a voice is not important.

With regard to voice cloning, F5 and Spark both offer zero-shot cloning from a short reference sample. We recommend F5 over Spark due to the noticeably higher quality of its cloned speech in our tests, but both are built with voice cloning as a first-class feature.

Finally, Sesame's CSM is the clear winner with regard to acoustic tokenization of non-verbal vocal cues and tones. The promise of CSM with fine-tuning, as displayed in the online demo, is truly remarkable. At present, though, the voice cloning and audio quality of the open-source model just don't quite meet the standard set by F5.

All of the models we covered in this review are exceptional TTS models. They each excel in different places, but at the end of the day, we recommend F5 as our favorite TTS model overall.

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.


About the author(s)

James Skelton, Technical Evangelist // AI Arcanist