How to transcribe long audios fast with open source (colab included)

In this post I share code that you can run yourself in Colab (or on your own machine with a GPU) to transcribe long audios in many languages very quickly.

The setup uses the Whisper model from Hugging Face, Google Drive and Colab.

I used it to transcribe multi hour podcasts within a couple of minutes (about 1-5 minutes depending on the length).

Motivation

Some time ago I was getting frustrated that the podcasts that I was listening to didn’t have transcriptions available.

Let’s pick this one as an example - it’s a massive, almost 3 hour long podcast by Huberman. It’s knowledge packed!

Could I get the transcription? It’s not included anywhere, but I could make my own.

Sure, there are services for it, one that looked promising to me is listen411 - it charges 0.06 USD for 1 minute of audio summarization + 1 USD per file, interesting!

A podcast like the one I was interested in would cost me about 13 USD, because it’s a very long one. I probably listen to 5 podcasts per week or more, so let’s say 20 per month. Ugh, the costs could add up fast.

If you go one level down, there are also APIs from cloud providers, which could be a good alternative. I checked the costs of transcription in Google and AWS: 0.024 USD and 0.016 USD per minute respectively. Not bad, but at those rates an almost 3 hour podcast still comes to roughly 3-4 USD.

So I wondered, can I do better myself? Based on my knowledge of the current state of ML, audio transcription is pretty much a solved problem and there are excellent models available publicly for free. With open source libraries and Colab (easily accessible GPUs) I could build a DIY solution that would be much cheaper.

Cost of running the DIY solution

The costs of the DIY solution are pretty much just the Colab running costs. When you break it down, it’s about 3 orders of magnitude cheaper - provided that it’s a fast process! It can also be totally free if you use the free tier of the Colab service.

Colab has a free tier that I think could be sufficient for that, but if you want more reliable access to resources, you can use the paid version of Colab as well. The Colab GPU pricing is 0.196 USD per hour (T4, as of 2024).

Transcribing one podcast takes about 3 minutes of GPU time (T4), which costs approximately 0.01 USD.

And if you were to scale it up, you could also run it on a hosted or self-hosted GPU, driving the cost even lower.
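The estimate is simple arithmetic. Here is a rough sketch using only the figures quoted in this post (the T4 rate, about 3 minutes of GPU time, and the roughly 13 USD estimate for the paid service). The function and constant names are made up for illustration.

```python
# Back-of-the-envelope cost comparison using the figures quoted in this post.
T4_USD_PER_HOUR = 0.196   # Colab T4 rate quoted above (as of 2024)
GPU_MINUTES = 3           # approximate GPU time for one long podcast
SERVICE_USD = 13.0        # paid-service estimate quoted above for the same podcast


def gpu_cost_usd(gpu_minutes: float, usd_per_hour: float) -> float:
    """Cost of renting the GPU for the given number of minutes."""
    return gpu_minutes / 60 * usd_per_hour


diy_cost = gpu_cost_usd(GPU_MINUTES, T4_USD_PER_HOUR)
ratio = SERVICE_USD / diy_cost

print(f"DIY cost per podcast: {diy_cost:.4f} USD")
print(f"Paid service is about {ratio:.0f}x more expensive")
```

The ratio comes out at a bit over a thousand, which is where the "about 3 orders of magnitude" figure comes from.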

Obstacles on the way

The code that I got at the end is very fast and provides high quality results, but it took some trial and error to get things right…

Major issues I encountered were slowness and poor quality of results.

I hate slow running models, I have no patience for long running pipelines - it also gets expensive fast as you pay for the computation time.

My plan was to use Hugging Face (as it’s a great ecosystem) and I used a tutorial with speaker diarization as a starting point. Diarization automatically annotates the transcription with the different speakers and when they speak. It appeared relevant and promising.

But in practice copying the approach with diarization wasn’t feasible due to slowness for audios longer than 5 minutes. It just seemed stuck for hours without showing much progress. I spent quite a long time waiting for the transcriptions.

I also spent a good bunch of time fiddling around and learning API details of the Hugging Face libraries. That tutorial confused me more than it helped.

I wondered if the models I was using were too expensive to run, so I played with different models, trying to assess if I could get my transcriptions faster, and some results weren’t the best…

In the end, the breakthrough resource that helped me was the Insanely Fast Whisper repository.

Especially this notebook.

Insights from that notebook were enough to simplify my code greatly and speed it up dramatically!

DIY solution

The setup

The overall solution I built and I’m sharing is:

upload audio files to a personal Google Drive directory (either manually or with another script, e.g. an RSS feed client) - the directory is called audiotranscriptions
mount Google Drive in Colab and scan the directory for audio files missing a transcription file
transcribe the audios that need transcriptions and save the transcriptions as txt files in Google Drive

Transcription logic

Without performance optimizations the code is as simple as:

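from transformers import pipeline


# Load Whisper through the Hugging Face speech recognition pipeline on the first GPU.
pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v2",
                device="cuda:0")

filename = "./myfile.mp3"
# The pipeline returns a dict; the transcript is under the "text" key.
outputs = pipe(filename)
transcribed_text = outputs["text"]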

As the pipeline is so simple and it uses Whisper (a state of the art model!), it’s already pretty good, but it can be improved in two ways:

using half precision - this will let the model use less GPU memory
chunking and batching - chunking will dramatically decrease complexity, as the computation time is usually not linearly proportional to the duration but grows like a higher order polynomial. Chunking puts a limit on that time; it’s like processing a bunch of smaller audios and stitching the results together. Batching helps to parallelize that. Additionally, batching goes hand in hand with using lower precision: if a single chunk takes less memory, you can fit more chunks into a batch.
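To make the chunking and batching idea concrete, here is a simplified, library-free sketch. It is not how the transformers pipeline implements chunking internally. The 5 second overlap and the helper names `chunk_bounds` and `batches` are assumptions for illustration only.

```python
def chunk_bounds(total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) times in seconds of overlapping chunks covering the audio."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap_s must be smaller than chunk_s")
    step = chunk_s - overlap_s
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the last chunk already reaches the end of the audio
        start += step
    return bounds


def batches(items, batch_size: int = 16):
    """Group items into consecutive lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


chunks = chunk_bounds(total_s=3 * 3600)  # a 3 hour podcast
groups = batches(chunks)
print(len(chunks), "chunks in", len(groups), "batches")
```

Every chunk has a bounded length, so the work per chunk is bounded too, and the total work grows roughly linearly with the number of chunks. The overlap gives the stitching step some shared context at the chunk borders.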

The code modifications are very simple.

Enabling half precision:

import torch
from transformers import pipeline


pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v2",
                torch_dtype=torch.float16,
                device="cuda:0")
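To get a feeling for what half precision saves, here is a rough estimate of the memory taken by the model weights alone. It assumes about 1.55 billion parameters for Whisper large and ignores activations and other runtime overhead, so treat the numbers as a lower bound. The helper name `weights_gib` is made up for this sketch.

```python
PARAMS = 1.55e9  # approximate parameter count of Whisper large (assumption)

# Size of one parameter in bytes for each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gib(n_params: float, dtype: str) -> float:
    """Memory needed to store the weights, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_gib(PARAMS, dtype):.1f} GiB")
```

Halving the weight size leaves more of the GPU memory for larger batches.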

Enabling batching and chunking:

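filename = "./myfile.mp3"
# chunk_length_s splits the audio into 30 second chunks; batch_size processes 16 of them at once.
outputs = pipe(filename,
               chunk_length_s=30,
               batch_size=16,
               return_timestamps=True)
transcribed_text = outputs["text"]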
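With return_timestamps=True the pipeline output contains, next to "text", a "chunks" list where each entry has a "timestamp" tuple and its own "text". Below is a small sketch of turning that into timestamped lines. The sample dict is invented so the snippet runs without a GPU, and the helper names are hypothetical.

```python
def fmt(seconds):
    """Format seconds as HH:MM:SS; None (an open-ended segment) becomes '??:??:??'."""
    if seconds is None:
        return "??:??:??"
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def to_timestamped_lines(outputs):
    """Build one '[start - end] text' line per chunk of the pipeline output."""
    lines = []
    for chunk in outputs["chunks"]:
        start, end = chunk["timestamp"]
        lines.append(f"[{fmt(start)} - {fmt(end)}] {chunk['text'].strip()}")
    return lines


# Made-up output in the shape described above (not real model output).
sample_outputs = {
    "text": " Welcome to the show. Today we talk about sleep.",
    "chunks": [
        {"timestamp": (0.0, 3.5), "text": " Welcome to the show."},
        {"timestamp": (3.5, None), "text": " Today we talk about sleep."},
    ],
}

lines = to_timestamped_lines(sample_outputs)
for line in lines:
    print(line)
```

The end timestamp of the final segment can be missing, which is why `fmt` handles None.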

Helper logic

The rest of the code from the attached notebook glues it all together. Here are the descriptions of the main parts.

Mount the Google Drive and navigate to the directory audiotranscriptions containing files needing transcriptions:

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

%cd /content/drive/MyDrive/audiotranscriptions

Utility functions that decide whether a file still needs a transcription:

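from pathlib import Path


def get_file_prefix(filename: Path) -> str:
  # File name without its last extension, e.g. "episode.1.mp3" -> "episode.1".
  # Using stem keeps names that contain dots distinct from each other.
  return filename.stem

def needs_transcription(audio: Path, transcriptions: set[str]) -> bool:
  # True when no transcription with the same prefix exists yet.
  return get_file_prefix(audio) not in transcriptions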

transcribe_file calls the previously created pipeline for a given audio filename and saves the transcription under transcription_filename. transcribe_all does that for every audio that still needs a transcription.

def transcribe_file(pipe, filename, transcription_filename):
  print("transcribing", filename)
  outputs = pipe(filename,
                 chunk_length_s=30,
                 batch_size=16,
                 return_timestamps=True)
  text = outputs["text"]

  # Write as UTF-8 so non-English transcriptions are saved correctly.
  with open(transcription_filename, "w", encoding="utf-8") as f:
    f.write(text)

def transcribe_all(pipe, audios, transcriptions):
  for audio in audios:
    if needs_transcription(audio, transcriptions):
      # Same name as the audio file, with a .txt extension.
      transcription_filename = audio.with_suffix(".txt")
      # Pass the full path as a string so it does not depend on the working directory.
      transcribe_file(pipe, str(audio), transcription_filename)
      transcriptions.add(get_file_prefix(audio))

Putting it together:

scan the directory
find audio and text files
transcribe whatever is needed

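# Scan the current directory (audiotranscriptions) once and reuse the listing.
files = list(Path.cwd().iterdir())
audios = [file for file in files if file.suffix in [".mp3", ".wav"]]
# Prefixes of files that already have a transcription.
transcriptions = {get_file_prefix(file) for file in files if file.suffix == ".txt"}
transcribe_all(pipe, audios, transcriptions)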
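If you want to check the scanning logic without a GPU or Google Drive, you can do a dry run like the sketch below. It uses a temporary directory and a stand-in `fake_pipe` function instead of the real pipeline. Both `fake_pipe` and `transcribe_missing` are hypothetical names for this example, not part of the notebook.

```python
import tempfile
from pathlib import Path


def fake_pipe(filename, **kwargs):
    """Stand-in for the real pipeline: returns a dict with a 'text' key."""
    return {"text": f"transcript of {Path(filename).name}"}


def transcribe_missing(pipe, directory: Path):
    """Transcribe every .mp3/.wav file in directory that has no matching .txt file."""
    done = {f.stem for f in directory.iterdir() if f.suffix == ".txt"}
    created = []
    # sorted() materializes the listing before any new files are written.
    for audio in sorted(directory.iterdir()):
        if audio.suffix in (".mp3", ".wav") and audio.stem not in done:
            target = audio.with_suffix(".txt")
            target.write_text(pipe(str(audio))["text"], encoding="utf-8")
            done.add(audio.stem)
            created.append(target.name)
    return created


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "episode1.mp3").touch()
    (folder / "episode2.wav").touch()
    (folder / "episode1.txt").write_text("already done", encoding="utf-8")

    first_run = transcribe_missing(fake_pipe, folder)   # only episode2 is missing
    second_run = transcribe_missing(fake_pipe, folder)  # nothing left to do

print(first_run, second_run)
```

The second run does nothing, which is the property that makes it safe to rerun the notebook whenever new audio files land in the directory.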

Code & results

You can see the Colab notebook here.

I have tested the script on podcasts using English as well as Spanish and Polish and the transcriptions looked very good to my naked eye. I haven’t done a deeper quality assessment.

What’s next?

Once you have the text, the fun is just starting!

I have further used LLMs to strip promotional content from the transcripts, make summaries and extract wisdom out of the longer audio. But that is a topic for another article!