
Transcription and Diarization with new GPT-4o

OpenAI have released a new speech-to-text model that also supports diarization (speaker separation and attribution): GPT-4o Transcribe Diarize (aka “gpt-4o-transcribe-diarize”). It’s priced at the same (estimated) cost as regular gpt-4o-transcribe and Whisper. Immediate findings:

  • chunk limit of 1400 seconds (see the splitting sketch at the end of this post)
  • not available over the Realtime API
  • prompting (e.g. to establish abbreviations or provide context from prior chunks) is not supported
  • asked about the Word Error Rate (WER) compared to other models, Peter Bakkum @OpenAI clarified: “Would consider it roughly comparable but would default to the other models if you don’t need diarization”. Artificial Analysis reports a WER of 21% for GPT-4o Transcribe, but notes that “Compared to Whisper, GPT-4o Transcribe smooths transcripts, which results in lower word-for-word accuracy, especially on less structured speech (e.g., meetings, earnings calls).”
  • speaker attribution works by supplying reference clips for known speakers alongside the audio file (see the sketch after this list). If no references are given, speaker labels are returned as “A:”, “B:”, etc. - so the model doesn’t learn speaker names from the audio and assign them automatically. This is different from Google Gemini.
  • a quick check at the office shows hallucinations and omissions on a short chat in German (see below)
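
To illustrate the speaker-reference mechanism, a minimal sketch: the parameter names known_speaker_names and known_speaker_references follow my reading of the API reference for this model, and I pass them via extra_body in case the installed openai SDK version doesn’t expose them as typed arguments yet. All file names are placeholders; the speaker names are taken from the sample at the end of this post.

import base64
from pathlib import Path

from openai import OpenAI


def ref_data_url(path: Path) -> str:
    # Encode a short clip of a known speaker as a data URL; audio/mp4
    # suits .m4a clips (same approach as to_data_url() further down).
    return "data:audio/mp4;base64," + base64.b64encode(path.read_bytes()).decode()


client = OpenAI()

with open("meeting.m4a", "rb") as audio_file:  # placeholder recording
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",
        # Assumed parameter names; with references supplied, segments should
        # come back labeled "Nils"/"Mascha" instead of "A"/"B".
        extra_body={
            "known_speaker_names": ["Nils", "Mascha"],
            "known_speaker_references": [
                ref_data_url(Path("nils_sample.m4a")),
                ref_data_url(Path("mascha_sample.m4a")),
            ],
        },
    )

for segment in transcript.segments:
    print(segment.speaker, segment.text)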

Implementations

An implementation with streaming support (but limited to 23 minutes) is part of my OAI Chat. A simplistic stand-alone implementation looks like this (sample output below):

import argparse
import base64
import mimetypes
from pathlib import Path

from openai import OpenAI


def to_data_url(path: Path) -> str:
    """
    Convert an audio file to a base64 data URL using the best-effort MIME type.
    """
    mime_type, _ = mimetypes.guess_type(path.name)
    mime_type = mime_type or "application/octet-stream"
    with path.open("rb") as handle:
        encoded = base64.b64encode(handle.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Transcribe an audio file with diarization using gpt-4o-transcribe-diarize."
    )
    parser.add_argument(
        "audio_path",
        type=Path,
        help="Path to the audio file (e.g., an .m4a recording).",
    )
    return parser.parse_args()


def main() -> None:
    args = parse_args()

    audio_path: Path = args.audio_path.expanduser().resolve()
    if not audio_path.exists():
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    client = OpenAI()

    with audio_path.open("rb") as audio_file:
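        # Request diarized JSON so each segment carries a speaker label
        # plus start/end timestamps (in seconds).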
        transcript = client.audio.transcriptions.create(
            model="gpt-4o-transcribe-diarize",
            file=audio_file,
            response_format="diarized_json",
        )

    for segment in transcript.segments:
        print(segment.speaker, segment.text, segment.start, segment.end)


if __name__ == "__main__":
    main()

Sample output with two speakers, showing erroneous additions with strikethrough and erroneous omissions in bold:

A Hallo, 0.95 1.5000000000000002
A ich 1.6000000000000003 2.0500000000000007
B Hallo, 2.0500000000000007 2.7
A ich bin Nils. 2.8 4.2
B ich bin Mascha. 5.25 5.95
A Tschüss, Mascha. 7.1000000000000005 8.200000000000001
B Tschüss
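
Finally, back to the 1400-second chunk limit: longer recordings have to be split up front. A sketch using ffmpeg’s segment muxer; the 1380-second segment length is a safety margin of my choosing, not anything the API prescribes:

import subprocess
from pathlib import Path


def split_audio(path: Path, out_dir: Path, seconds: int = 1380) -> list[Path]:
    """Split a recording into chunks under the 1400 s limit via ffmpeg."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pattern = out_dir / f"{path.stem}_%03d{path.suffix}"
    subprocess.run(
        [
            "ffmpeg", "-hide_banner", "-i", str(path),
            "-f", "segment", "-segment_time", str(seconds),
            "-c", "copy",  # stream copy: fast, but cut points are approximate
            str(pattern),
        ],
        check=True,
    )
    return sorted(out_dir.glob(f"{path.stem}_*{path.suffix}"))

Mind that time-based cuts can land mid-word, and because prompting with prior chunks isn’t supported, the anonymous A/B labels won’t stay consistent across chunks - passing the same speaker references with every chunk works around that.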