Video Processing, Whisper, Next.js, Modal, S3

YouTube Splitter — Chunked Transcription

A web app that ingests a YouTube URL, splits the video into configurable time-segment chunks, transcribes each chunk through Whisper, and stores the result with presigned-URL access. Built for the people who treat YouTube as a research corpus.

The Problem

A 90-minute YouTube interview is full of value and impossible to mine. You can scrub. You can read the auto-generated captions and get a 70%-accurate wall of text with no structure. What you actually want is the part where they answered question 3, transcribed cleanly, with a deep link back to the timestamp.

YouTube Splitter was the tool for that — give it a URL, get back a clean chunked transcript with timestamps you can actually navigate.

The Architecture

Six components, each doing the thing it’s best at:

Component          Stack
Frontend           Next.js + Clerk + Tailwind
API                Node + Express
Video processing   Modal (serverless Python) + FFmpeg
Transcription      Faster Whisper
Storage            S3 with presigned URLs
Auth               Clerk

Modal handles the variable load gracefully: a one-hour video and a five-minute video share the same code path, but only the heavy one spins up a GPU. Faster Whisper is the cost-quality sweet spot for English transcription; it’s substantially cheaper than the OpenAI API at comparable accuracy on long-form content.
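
To make the FFmpeg half of the video-processing tier concrete, here is a minimal sketch of how one fixed-length cut could be expressed as an FFmpeg command line. The write-up doesn't show the actual command, so everything here is an assumption: the function name `ffmpeg_chunk_args`, the file names, and the choice to extract 16 kHz mono audio (a common input format for Whisper-family models). The flags themselves (`-ss`, `-t`, `-vn`, `-ac`, `-ar`, `-y`) are standard FFmpeg options.

```python
def ffmpeg_chunk_args(src: str, out: str, start_s: int, length_s: int) -> list[str]:
    """Build the argv for cutting one audio chunk out of a video file.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    if start_s < 0 or length_s <= 0:
        raise ValueError("start_s must be >= 0 and length_s must be > 0")
    return [
        "ffmpeg",
        "-y",                  # overwrite the output file if it already exists
        "-ss", str(start_s),   # seek to the chunk start (input option, before -i)
        "-t", str(length_s),   # read at most this many seconds of input
        "-i", src,
        "-vn",                 # drop the video stream; only audio is transcribed
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz (assumed transcription input)
        out,
    ]
```

The returned list can be handed to `subprocess.run(args, check=True)` inside whatever worker does the cutting. Building the argv as a list rather than a shell string sidesteps quoting problems with odd file names.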

The Trick

The interesting design call was the chunk schema. Videos got split on configurable time boundaries, not on detected speech or scene change. Two reasons:

  1. Speech-detected chunks are great for podcasts and terrible for interview videos with cross-talk and laughter.
  2. Time-based chunks make the timestamp math trivial: counting from 1, chunk 4 of a 5-minute-chunk video starts at 15:00, every time.
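
The timestamp math from the second point can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's actual code: the `Chunk` record, `plan_chunks`, `format_timestamp`, and `deep_link` are hypothetical names, chunks are numbered from 1, and the deep link uses YouTube's `t=<seconds>s` query parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    number: int   # 1-based, so "chunk 4" means the fourth chunk
    start_s: int  # offset from the start of the video, in seconds
    end_s: int    # exclusive end; the final chunk may be shorter


def plan_chunks(duration_s: int, chunk_s: int) -> list[Chunk]:
    """Split [0, duration_s) into fixed-length chunks of chunk_s seconds."""
    if duration_s < 0 or chunk_s <= 0:
        raise ValueError("duration_s must be >= 0 and chunk_s must be > 0")
    return [
        Chunk(number, start, min(start + chunk_s, duration_s))
        for number, start in enumerate(range(0, duration_s, chunk_s), start=1)
    ]


def format_timestamp(seconds: int) -> str:
    """Render seconds as MM:SS, or H:MM:SS once past the hour mark."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def deep_link(video_id: str, start_s: int) -> str:
    """Link that opens the video at the chunk's start time."""
    return f"https://www.youtube.com/watch?v={video_id}&t={start_s}s"


# A 90-minute video in 5-minute chunks: chunk 4 starts at 15:00.
chunks = plan_chunks(90 * 60, 5 * 60)
fourth = chunks[3]
print(fourth.number, format_timestamp(fourth.start_s))
```

Because the start of chunk n is always (n - 1) × chunk length, nothing about the audio has to be inspected to compute a link back into the video. That is the property speech- or scene-detected boundaries would give up.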

A scene-detected version was on the roadmap. Time-based got the tool out the door, and the operator (me) never missed the upgrade.

What I Learned

This was the project that taught me that serverless GPU is fine when the load curve is bursty and impossible to predict. Modal made the entire video-processing tier essentially free at the volume I was running. The same architecture choice surfaces in Studio — the video pipeline I run today is structurally a descendant of this.