AI tools are useful for manipulating images, audio, or video to produce a novel result. Until recently, automatically editing images or audio was challenging to implement without using a significant amount of time and computing power, and even then it was often only possible to run turnkey filters to remove certain frequencies from sounds or change the color palette of images. Newer approaches, using AI models and enormous amounts of training data, are able to run much more sophisticated filtering and transformation techniques.
Spleeter and Whisper are open source AI tools designed for audio analysis and manipulation. Both were developed and released along with their own pre-trained models, making it possible to run them directly on your own input, such as MP3 or AAC audio files, without any additional configuration. Spleeter separates vocal tracks from the instrumental tracks of music. Whisper transcribes spoken language and can generate subtitles from it. Each has many uses individually, and they have a particular use together: generating karaoke tracks from regular audio files. In this tutorial, you’ll use Whisper and Spleeter together to make your own karaoke selections, which you can also integrate into another application stack.
These tools are available on most platforms. This tutorial will provide installation instructions for an Ubuntu 22.04 server, following our guide to Initial Server Setup with Ubuntu 22.04. You will need at least 3GB of memory to run Whisper and Spleeter, so if you are running on a resource-constrained server, you should consider enabling swap for this tutorial.
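As a rough way to check the memory requirement before you start, the sketch below reads the total RAM that Linux reports. The function names are hypothetical, and the 3000 MB threshold is an assumption: the kernel reserves some memory, so a nominal 3GB server reports slightly less than 3072 MB.

```shell
#!/bin/sh
# Print total physical memory in whole megabytes.
# The optional argument is a meminfo-format file; it defaults to
# /proc/meminfo, which only exists on Linux.
mem_total_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed if the reported total is at least $1 megabytes.
#   $1: minimum in MB, $2: optional meminfo-format file
has_enough_memory() {
    [ "$(mem_total_mb "$2")" -ge "$1" ]
}

if [ -r /proc/meminfo ]; then
    if has_enough_memory 3000; then
        echo "Memory looks sufficient: $(mem_total_mb) MB"
    else
        echo "Only $(mem_total_mb) MB of RAM detected; consider enabling swap"
    fi
fi
```

If the check reports too little memory, enabling swap as suggested above is the simplest workaround.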
Both Spleeter and Whisper are Python libraries and require you to have installed Python and pip, the Python package manager. On Ubuntu, you can refer to Step 1 of How To Install Python 3 and Set Up a Programming Environment on an Ubuntu 22.04 Server.
Additionally, both Spleeter and Whisper use machine learning libraries that can optionally run up to 10-20x more quickly on a GPU. If a GPU is not detected, they will automatically fall back to running on your CPU. Configuring GPU support is outside the scope of this tutorial, but should work after installing PyTorch in GPU-enabled environments.
First, you’ll need to use pip, Python’s package manager, to install the tools you’ll be using for this project. In addition to spleeter, you should also install youtube-dl, a script that can be used to download YouTube videos locally, which you’ll use to retrieve a sample video. Install them with pip install:
- sudo pip install spleeter youtube-dl
Rather than installing Whisper directly, you can install another library called yt-whisper directly from GitHub, also by using pip. yt-whisper includes Whisper itself as a dependency, so you’ll have access to the regular whisper command after installation, but this way you’ll also get the yt_whisper script, which makes downloading and subtitling videos from YouTube a one-step process. pip install can parse GitHub links to Python repositories when you precede them with git+:
- sudo pip install git+https://github.com/m1guelpf/yt-whisper.git
Finally, you’ll want to make sure you have ffmpeg installed to do some additional audio and video manipulation. ffmpeg is a universal tool for manipulating, merging, and reencoding audio and video files. On Ubuntu, you can install it using the system package manager by running an apt update followed by apt install:
- sudo apt update
- sudo apt install ffmpeg
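Before moving on, one way to confirm that everything installed correctly is to check that each command is on your PATH. This is a minimal sketch; the function name missing_tools is hypothetical, and the list of commands is taken from the installation steps above.

```shell
#!/bin/sh
# Print the name of every listed command that is NOT found on PATH.
# Returns non-zero if anything is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

if missing_tools spleeter youtube-dl yt_whisper whisper ffmpeg; then
    echo "All tools found"
else
    echo "Install the tools listed above before continuing" >&2
fi
```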
Now that you have the necessary tools installed, you’ll obtain sample audio and video in the next step.
youtube-dl, which you installed in Step 1, is a tool for downloading videos from YouTube to your local environment. Although you should take care when using potentially copyrighted material out of context, this can be useful in a number of contexts, especially when you need to run some additional processing on videos or use them for source material.
Using youtube-dl, download the video that you’ll be using for this tutorial. This sample link is to a public domain song called “Lie 2 You”, but you can use another:
- youtube-dl --no-playlist "https://www.youtube.com/watch?v=dA2Iv9evEK4&list=PLzCxunOM5WFJxaj103IzbkAvGigpclBjt"
youtube-dl will download the song along with some metadata and merge it into a single .webm video file. You can play this video in a local media player such as mpv, but that will depend on your environment.
Note: Because the use of youtube-dl is not explicitly supported by YouTube, downloads can occasionally be slow.
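If you plan to script this step, a predictable file name is easier to work with than the title-based default. The sketch below is one way to do that, assuming youtube-dl's -o output template and --no-playlist options; the function name download_video and the default base name source are hypothetical.

```shell
#!/bin/sh
# Download a single video to a predictable file name.
#   $1: video URL (always quote it, because an unquoted & is
#       interpreted by the shell)
#   $2: base name for the output file, without extension
#       (hypothetical default: "source")
download_video() {
    url=$1
    base=${2:-source}
    # --no-playlist: ignore any playlist the URL also refers to.
    # -o: output template; youtube-dl fills in %(ext)s itself.
    youtube-dl --no-playlist -o "${base}.%(ext)s" "$url"
}

# Example call (needs network access):
# download_video "https://www.youtube.com/watch?v=dA2Iv9evEK4&list=PLzCxunOM5WFJxaj103IzbkAvGigpclBjt" karaoke
```

The extension is still chosen by youtube-dl, so the result may be karaoke.webm or another container depending on the formats available.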
Next, you’ll separate the audio track from the video you just downloaded. This is a task where ffmpeg excels. You can use the following ffmpeg command to output the audio to a new file called audio.mp3:
- ffmpeg -i "Lie 2 You (ft. Dylan Emmet) – Leonell Cassio (No Copyright Music)-dA2Iv9evEK4.webm" -c:a libmp3lame -qscale:a 1 audio.mp3
This is an example of ffmpeg command syntax. In brief:
- -i /path/to/input is the path to your input file, in this case the .webm video you just downloaded.
- -c:a libmp3lame specifies an audio codec to encode to. All audio and video needs to be encoded somehow, and libmp3lame is the most common mp3 encoder.
- -qscale:a 1 specifies the bitrate of your output mp3, in this case corresponding to a variable bit rate around 220kbps. You can review other options in the ffmpeg documentation.
- audio.mp3 is the name of your output file, presented at the end of the command without any other flags.
After running this command, FFmpeg will create a new file called audio.mp3.
Note: You can learn more about ffmpeg options from ffmprovisr, a community-maintained catalog of ffmpeg command examples, or refer to the official documentation.
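The audio extraction command can be wrapped in a small function so that the long file name only has to be typed once. This is a sketch rather than a required step: the function name extract_audio and the file name source.webm are hypothetical, and -vn (which tells ffmpeg to skip the video stream) is an addition to the options explained above.

```shell
#!/bin/sh
# Extract the audio of a video into an MP3 file.
#   $1: input video
#   $2: output file (defaults to audio.mp3)
extract_audio() {
    input=$1
    output=${2:-audio.mp3}
    # -vn drops the video stream explicitly; the codec and quality
    # options match the ffmpeg command used in this step.
    ffmpeg -i "$input" -vn -c:a libmp3lame -qscale:a 1 "$output"
}

# Example call, assuming a downloaded file named source.webm:
# extract_audio source.webm audio.mp3
```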
In the next step, you’ll use Spleeter to isolate the instrumental track from your new audio.mp3 file.
Now that you have your standalone audio file, you’re ready to use spleeter to separate the vocal track. Spleeter contains several models for use with the spleeter separate command, allowing you to perform even more sophisticated separation of piano, guitar, drum, bass tracks and so on, but for now, you’ll use the default 2stems model. Run spleeter separate on your audio.mp3, also providing a path to an output directory with -o output:
- spleeter separate -p spleeter:2stems -o output audio.mp3
If you are running Spleeter without a GPU, this command may take a few minutes to complete. This will produce a new directory called output, with a subdirectory named audio after your input file, containing two files called vocals.wav and accompaniment.wav. These are your separated vocal and instrumental tracks. If you encounter any errors, or need to further customize your Spleeter output, refer to the documentation.
You can try listening to these files in mpv or another audio player. They will have a relatively large file size for now because spleeter decodes them directly to raw WAV output, but in the next steps, you’ll encode them back into a single video.
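A quick way to confirm that the separation worked is to check that both stems exist and are not empty. The sketch below assumes the stems are under output/audio, which is the path the final ffmpeg command in this tutorial reads from; the function name check_stems is hypothetical.

```shell
#!/bin/sh
# Verify that Spleeter's 2stems output exists and is non-empty.
#   $1: directory holding the stems (defaults to output/audio, that is,
#       the -o directory plus the base name of the input file)
check_stems() {
    dir=${1:-output/audio}
    for stem in vocals accompaniment; do
        if [ ! -s "$dir/$stem.wav" ]; then
            echo "missing or empty: $dir/$stem.wav" >&2
            return 1
        fi
    done
    echo "both stems present in $dir"
}

# Example call after running spleeter separate:
# check_stems output/audio
```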
Now that you have your instrumental audio track, you just need to generate captions from the original video. You could run whisper directly on the .webm video you downloaded, but it will be even quicker to run the yt_whisper command on the original YouTube video link:
- yt_whisper "https://www.youtube.com/watch?v=dA2Iv9evEK4"
If you review the yt_whisper source code, you can understand the presets that yt_whisper is passing to whisper to generate captions from a YouTube video. For example, it defaults to the --model small parameter. The Whisper documentation suggests that this model provides a good tradeoff between memory requirements, performance, and accuracy. If you ever need to run whisper by itself on another input source or with different parameters, you can use these presets as a frame of reference.
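For reference, running whisper by itself on a local file might look like the sketch below. It uses the --model, --language, and --output_dir options of the whisper command; the function name caption_file is hypothetical, and --language en is an assumption that the lyrics are in English (omit it to let Whisper detect the language).

```shell
#!/bin/sh
# Caption a local audio or video file with Whisper.
#   $1: input file
#   $2: model name (defaults to "small", the preset mentioned above)
#   $3: directory for the generated caption files (defaults to ".")
caption_file() {
    input=$1
    model=${2:-small}
    outdir=${3:-.}
    whisper "$input" --model "$model" --language en --output_dir "$outdir"
}

# Example call:
# caption_file audio.mp3 small captions
```

Larger models are generally more accurate but need more memory and time, so small remains a sensible starting point on a CPU-only server.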
If you are running Whisper without a GPU, the yt_whisper command may take a few minutes to complete. It will generate a caption file for the video in the .vtt format. You can inspect the captions using head or a text editor to verify that they match the song lyrics:
- head -20 Lie_2_You__ft__Dylan_Emmet____Leonell_Cassio__No_Copyright_Music.vtt
Output
WEBVTT
00:00.000 --> 00:07.000
I need feeling you on me And I guess in a way you do
00:07.000 --> 00:19.000
All my breath on revelin' emotions I need some space to think this through
00:19.000 --> 00:29.000
Call me all night long Try to give you hints in a hard to see
00:29.000 --> 00:39.000
Right on the line, no Losing it on you is the last thing I need
00:39.000 --> 00:49.000
If I'm honest, I'll just make you cry And I don't wanna fight with you
00:49.000 --> 00:57.000
I would rather lie to you But if I'm honest, now's not the right time
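Beyond reading the first few lines, you can sanity-check the whole caption file by counting its cues and looking at where the last one ends, which should be close to the length of the song. This is a small sketch with hypothetical function names; it relies on the fact that each WebVTT cue has exactly one timing line containing "-->".

```shell
#!/bin/sh
# Count the cues in a WebVTT file (one timing line per cue).
count_cues() {
    grep -c -- '-->' "$1"
}

# Print the end timestamp of the last cue.
last_cue_end() {
    awk '/-->/ { end = $3 } END { print end }' "$1"
}

# Example calls:
# count_cues Lie_2_You__ft__Dylan_Emmet____Leonell_Cassio__No_Copyright_Music.vtt
# last_cue_end Lie_2_You__ft__Dylan_Emmet____Leonell_Cassio__No_Copyright_Music.vtt
```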
You now have your separate audio tracks and your caption file. In the final step, you’ll assemble them all back together using ffmpeg.
Finally, it’s time to combine your outputs into a finalized video containing 1) the original background video, 2) the isolated instrumental track you generated using Spleeter, and 3) the captions you generated using Whisper. This can be done with a single, slightly more complicated, ffmpeg command:
- ffmpeg -i "Lie 2 You (ft. Dylan Emmet) – Leonell Cassio (No Copyright Music)-dA2Iv9evEK4.webm" -i output/audio/accompaniment.wav -i "Lie_2_You__ft__Dylan_Emmet____Leonell_Cassio__No_Copyright_Music.vtt" -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng -c:v copy -c:a aac -c:s mov_text final.mp4
Unlike the earlier ffmpeg command, this command uses three different inputs: the .webm video, the .wav audio, and the .vtt captions. It uses several -map arguments to select the video track from the first input (input 0, counting from 0), the audio track from the second, and the subtitles from the third, and then tags the subtitle track as English, like so: -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng. Next, it specifies the codecs being used for each track:
- -c:v copy means that you are preserving the original video source and not reencoding it. This usually saves time and preserves video quality (video encoding is usually the most CPU-intensive use of ffmpeg by far) as long as the original source is in a compatible format. youtube-dl will almost always default to using the common H264 format, which can be used for streaming video, standalone .mp4 files, Blu-ray discs, and so on, so you should not need to change this.
- -c:a aac means that you are reencoding the audio to the AAC format. AAC is the default for most .mp4 video, is supported in virtually all environments, and provides a good balance between file size and audio quality.
- -c:s mov_text specifies the subtitle format you are encoding. Even though your subtitles were in vtt format, mov_text is a typical subtitle format to embed within a video itself.
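Note: You may also want to shift your subtitles a couple of seconds earlier to help viewers anticipate which lines are coming next. You can do this by adding -itsoffset -2 to the ffmpeg command, directly before the -i option for the .vtt file, because -itsoffset applies to the input that follows it.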
Finally, you provide an output file, final.mp4. Notice that you did not actually specify .mp4 output other than in this filename: ffmpeg will automatically infer an output format based on the output path you provide. When working with audio and video files, the codecs you use are generally more important than the file types themselves, which act as containers for the content. The important differences are in which video players expect to be able to read which kinds of files. An .mp4 file containing H264 video and AAC audio is, as of this writing, the most common media file used anywhere, and will play in almost any environment, including directly in a browser without needing to download the file or configure a streaming server, and it can contain subtitles, so it is a very safe target. .mkv is another popular container format that supports more features, but it is not as widely deployed.
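To make the subtitle offset from the note above concrete, the sketch below wraps the final ffmpeg command in a function with an optional offset argument. The function name mux_karaoke is hypothetical. The key detail is placement: -itsoffset is an input option, so it has to come directly before the -i for the caption file.

```shell
#!/bin/sh
# Combine video, instrumental audio, and captions into one MP4.
#   $1: source video   $2: instrumental audio   $3: caption file
#   $4: output file    $5: caption offset in seconds (defaults to 0;
#       a negative value such as -2 shows each line earlier)
mux_karaoke() {
    video=$1 audio=$2 subs=$3 output=$4 offset=${5:-0}
    # -itsoffset applies only to the -i that follows it, so it is
    # placed immediately before the caption input.
    ffmpeg -i "$video" -i "$audio" -itsoffset "$offset" -i "$subs" \
        -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng \
        -c:v copy -c:a aac -c:s mov_text "$output"
}

# Example call with captions shown two seconds early:
# mux_karaoke source.webm output/audio/accompaniment.wav captions.vtt final.mp4 -2
```

With a negative offset, the first cue may be shortened if it would otherwise start before the beginning of the video, so it is worth checking the opening line after encoding.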
Your final.mp4 video can now be downloaded, shared, or projected up on the wall for karaoke night. Good luck with your performance!
You now have an end-to-end karaoke video solution using four tools. These can be combined into a standalone script, integrated into another application, or run interactively as needed.
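As one possible shape for such a standalone script, the sketch below chains the steps for a video that has already been downloaded. It is an illustration under several assumptions, not a definitive implementation: the function name make_karaoke and the karaoke_work directory are hypothetical, it runs whisper on the extracted audio instead of using yt_whisper, and it checks two possible caption file names because Whisper releases differ in how they name their output.

```shell
#!/bin/sh
# Build a karaoke video from an already-downloaded source video.
#   $1: source video
#   $2: output file (defaults to final.mp4)
#   $3: working directory for intermediate files (defaults to karaoke_work)
make_karaoke() {
    video=$1
    output=${2:-final.mp4}
    work=${3:-karaoke_work}
    mkdir -p "$work" || return 1

    # 1. Extract the audio track.
    ffmpeg -i "$video" -vn -c:a libmp3lame -qscale:a 1 "$work/audio.mp3" || return 1

    # 2. Split vocals from accompaniment; stems land in $work/stems/audio/.
    spleeter separate -p spleeter:2stems -o "$work/stems" "$work/audio.mp3" || return 1

    # 3. Generate captions, then locate the .vtt file under either name.
    whisper "$work/audio.mp3" --model small --output_dir "$work" || return 1
    subs="$work/audio.vtt"
    [ -f "$subs" ] || subs="$work/audio.mp3.vtt"

    # 4. Mux video, instrumental track, and captions.
    ffmpeg -i "$video" -i "$work/stems/audio/accompaniment.wav" -i "$subs" \
        -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng \
        -c:v copy -c:a aac -c:s mov_text "$output"
}

# Example call, assuming a downloaded file named source.webm:
# make_karaoke source.webm final.mp4
```

Each step stops the function if it fails, so a problem in separation or captioning does not produce a half-finished video.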
In this tutorial, you used two machine learning tools to create an isolated instrumental track and a set of captions from a source video, then joined them back together. This is particularly useful for making karaoke videos from existing audio sources, but can also be applied to many other tasks.
Next, you may want to configure a video streaming server, or experiment with some other AI or Machine Learning libraries.