Caption Genie — AI Transcription Service

The Problem

When you upload a podcast episode or a course recording, you usually need three artifacts: the audio, the transcript, and a captions file that lines up with the audio. Most tools produce one of these cleanly; the others cost minutes per file in manual cleanup.

Caption Genie was an attempt to produce all three from a single upload, with a caption editor that respected what real editors actually do: fixing speaker labels, splitting lines on breath pauses, and exporting the result in the format the downstream tool wants.

The Build

A pnpm monorepo with three publishable packages:

@caption-genie/frontend — Vite + React app. The editor lives here.
@caption-genie/backend — Node API service. Auth, upload, queue.
caption-genie — the umbrella package that ties them and the deployment scripts together.

Transcription ran through a Whisper pipeline. Output was a structured caption tree (segments → lines → tokens) the editor could manipulate without losing word-level timing. Storage went through Supabase.
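
As a rough illustration of the segments → lines → tokens shape, here is a minimal TypeScript sketch. The interface and field names (Token, Line, Segment, startMs, endMs, speaker) are assumptions made for this example, not the project's actual schema. The point it shows is that timing is stored only on tokens, and a line's span is derived from them.

```typescript
// Hypothetical shapes for illustration; not the project's real schema.
interface Token {
  text: string;
  startMs: number; // word-level timing lives on the token
  endMs: number;
}

interface Line {
  tokens: Token[];
}

interface Segment {
  speaker: string; // assumed: speaker labels attach at the segment level
  lines: Line[];
}

// A line's timing is derived from its first and last token, never stored.
function lineSpan(line: Line): { startMs: number; endMs: number } | null {
  const first = line.tokens[0];
  const last = line.tokens[line.tokens.length - 1];
  if (!first || !last) return null; // empty line has no span
  return { startMs: first.startMs, endMs: last.endMs };
}

// The display text of a line is just its tokens joined by spaces.
function lineText(line: Line): string {
  return line.tokens.map((t) => t.text).join(" ");
}

const opening: Line = {
  tokens: [
    { text: "Welcome", startMs: 0, endMs: 400 },
    { text: "back", startMs: 420, endMs: 700 },
  ],
};

const segment: Segment = {
  speaker: "Host",
  lines: [opening, { tokens: [{ text: "everyone", startMs: 900, endMs: 1500 }] }],
};
```

Because nothing above the token stores a timestamp, moving tokens between lines cannot leave a stale start or end time behind.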

What I Learned

This was the project that convinced me to stop building “transcribe and export” tools and start treating captions as a graph the editor walks through. Word-level timing is a property of the token, not the line — once you model it that way, every operation (split, merge, retime, speaker reassignment) becomes trivial. Every transcription tool I’ve touched since has used some version of this internal model.
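
To make that concrete, here is a small TypeScript sketch of split, merge, and retime written against token-level timing. The names (Token, Line, splitLine, mergeLines, retime) are hypothetical and chosen for this example; the original editor's code may have looked quite different. Speaker reassignment is left out, since it would presumably operate one level up.

```typescript
// Hypothetical minimal model: timing is stored per token only.
interface Token {
  text: string;
  startMs: number;
  endMs: number;
}

interface Line {
  tokens: Token[];
}

// Split a line before the token at `index`. No timing is recomputed,
// because each token already carries its own start and end.
function splitLine(line: Line, index: number): [Line, Line] {
  if (index <= 0 || index >= line.tokens.length) {
    throw new RangeError("split index must fall strictly inside the line");
  }
  return [
    { tokens: line.tokens.slice(0, index) },
    { tokens: line.tokens.slice(index) },
  ];
}

// Merging is concatenation of the two token lists, in order.
function mergeLines(a: Line, b: Line): Line {
  return { tokens: [...a.tokens, ...b.tokens] };
}

// Retiming shifts every token by the same offset and returns a new line.
function retime(line: Line, offsetMs: number): Line {
  return {
    tokens: line.tokens.map((t) => ({
      ...t,
      startMs: t.startMs + offsetMs,
      endMs: t.endMs + offsetMs,
    })),
  };
}

const line: Line = {
  tokens: [
    { text: "Welcome", startMs: 0, endMs: 400 },
    { text: "back", startMs: 420, endMs: 700 },
    { text: "everyone", startMs: 900, endMs: 1500 },
  ],
};

const [head, tail] = splitLine(line, 2); // split on the pause before "everyone"
const merged = mergeLines(head, tail);   // round-trips to the original tokens
const shifted = retime(line, 250);       // whole line moved 250 ms later
```

Each operation returns new objects rather than mutating its input, which is one plausible way to keep undo and redo simple in an editor; that choice is an assumption of this sketch, not something stated about the original.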

Shelved when faster commercial alternatives (hosted Whisper services) collapsed the moat. The editor model survived.