YouTube Channel Transcription Pipeline

A complete pipeline for transcribing an entire YouTube channel with speaker diarization (identifying who said what), then converting the transcripts into clean, categorized HTML pages.

What This Does

Downloads audio from all videos on a YouTube channel
Transcribes speech to text using OpenAI's Whisper model (locally)
Identifies speakers using voice diarization (pyannote) and voice matching (Resemblyzer)
Fixes speaker names and extracts guest information from video metadata
Generates clean HTML with YouTube embeds, descriptions, and formatted transcripts
Creates a categorized index organizing all videos by topic

Requirements

Software

Python 3.10+
ffmpeg (for audio processing)
CUDA-compatible GPU recommended (CPU works but is ~20x slower)

Python Packages

pip install openai-whisper pyannote.audio resemblyzer yt-dlp \
    python-dotenv google-api-python-client torch torchaudio

API Keys

Hugging Face Token: Required for pyannote speaker diarization model
- Get one at: https://huggingface.co/settings/tokens
- Accept the model agreement at: https://huggingface.co/pyannote/speaker-diarization
YouTube Data API Key: Required for fetching video descriptions
- Get one from Google Cloud Console

Setup

Clone the repository:

git clone https://github.com/yourusername/transcriptor.git
cd transcriptor

Install dependencies:

pip install -r requirements.txt

Configure your environment:

cp .env.example .env

Then edit .env with your API keys:

HF_TOKEN: Get from https://huggingface.co/settings/tokens (required)
YOUTUBE_API_KEY: Get from Google Cloud Console (required)
YOUTUBE_CHANNEL_URL: Your channel to transcribe (required)
REFERENCE_VIDEO_URL: A video where you're speaking (optional, for voice ID)

Create a voice profile for the host (optional but improves accuracy):

python create_voice_profile.py

This extracts voice embeddings from a reference video where the host is speaking.

How It Works

Step 1: Build Video List

The system uses yt-dlp to enumerate all videos on the channel:

ydl_opts = {
    'extract_flat': True,
    'quiet': True,
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    channel_info = ydl.extract_info(channel_url, download=False)

This creates video_list.json with video IDs, titles, and durations.

Step 2: Download and Transcribe

For each video:

Download audio as WAV using yt-dlp with rate limiting to avoid bot detection:

ydl_opts = {
    'format': 'bestaudio/best',
    'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'wav'}],
    'cookiefile': 'cookies.txt',  # Optional: helps avoid rate limiting
    'sleep_interval': 2,
    'http_headers': {'User-Agent': '...'}
}

Transcribe with Whisper:

model = whisper.load_model("large-v2")
result = model.transcribe(audio_path, language="en")

Diarize speakers with pyannote:

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN
)
diarization = pipeline(audio_path)

Match speakers to known voices using Resemblyzer embeddings:

encoder = VoiceEncoder()
embedding = encoder.embed_utterance(audio_segment)
similarity = np.dot(embedding, host_profile)

Step 3: Generate HTML

The create_html_transcripts.py script:

Reads each transcript from the output/ directory
Fetches video description from YouTube API
Extracts guest names from title patterns like:
- "Topic | Guest Name | Company"
- "Topic with Guest Name"
- "Guest Name and Jon Radoff discuss..."
Fixes name misspellings (e.g., "John Raidoff" → "Jon Radoff")
Generates SEO-friendly filenames from titles
Creates HTML with:
- YouTube video embed
- Title and description
- Transcript with bold speaker names and paragraph breaks

Step 4: Create Index

The index.html organizes videos into categories based on content analysis:

AI & Generative Technology
Game Development
Metaverse & Virtual Worlds
Web3, Blockchain & NFTs
DePIN & Decentralized Infrastructure
Creator Economy & Digital Identity

Usage

Full Pipeline (GPU recommended)

# 1. Create voice profile for the host
python create_voice_profile.py

# 2. Transcribe all videos
python transcribe_channel_gpu.py

# 3. Generate HTML files
python create_html_transcripts.py

CPU-Only (slower, ~1.5-2 hours per video)

python transcribe_channel_batch.py

Generate HTML Only (if transcripts exist)

python create_html_transcripts.py

Output Structure

transcriptor/
├── .env                          # API keys and config
├── video_list.json               # Channel video metadata
├── voice_profiles/
│   └── host_voice_profile.pkl    # Voice embedding for speaker matching
├── downloads/                    # Downloaded audio files (WAV)
├── output/                       # Raw transcripts (TXT)
│   ├── VIDEO_ID.txt
│   └── ...
└── html/                         # Final HTML output
    ├── index.html                # Categorized index page
    ├── video-title-slug.html     # Individual transcript pages
    └── ...

Performance

Hardware	Time per Video	Notes
NVIDIA L4 GPU	~5-6 minutes	Recommended
NVIDIA T4 GPU	~8-10 minutes	Good balance of cost/speed
CPU (M1 Mac)	~90-120 minutes	Works but slow

Handling YouTube Rate Limiting

YouTube may block downloads after many requests. Mitigations:

Rate limiting: 10-second delays between videos
User agent spoofing: Appear as a normal browser
Cookie authentication: Export cookies from your browser
Retry logic: Exponential backoff on failures

# Extract cookies from Chrome
yt-dlp --cookies-from-browser chrome --cookies cookies.txt ...

Customization

Adding New Name Corrections

Edit the JON_RADOFF_VARIANTS list in create_html_transcripts.py:

JON_RADOFF_VARIANTS = [
    "john radoff", "jon raidoff", "john raidoff", ...
]

Changing Categories

Edit the category sections in index.html or create a script to auto-categorize based on keywords in titles/descriptions.

Different Whisper Models

Available models (speed vs accuracy tradeoff):

tiny - Fastest, least accurate
base - Fast, decent accuracy
small - Balanced
medium - Good accuracy
large-v2 - Best accuracy (used here)

Troubleshooting

"Sign in to confirm you're not a bot"

Add longer delays between downloads
Use browser cookies: --cookies-from-browser chrome
Try at different times of day

PyTorch/pyannote version conflicts

Ensure compatible versions:

pip install torch==2.10.0 torchaudio==2.10.0 pyannote.audio==4.0.4

Out of GPU memory

Use a smaller Whisper model (medium instead of large-v2)
Process shorter audio segments
Use CPU for diarization, GPU for transcription

Technical Stack

yt-dlp: YouTube video downloading
OpenAI Whisper: State-of-the-art speech-to-text transcription (local)
pyannote.audio: Speaker diarization (who speaks when)
Resemblyzer: Speaker recognition (voice matching)
YouTube Data API v3: Channel video listing and descriptions
google-api-python-client: YouTube API access

License

See LICENSE for full details.

Dependencies

This pipeline uses the following open source libraries:

OpenAI Whisper - MIT License
pyannote.audio - MIT License
Resemblyzer - Apache 2.0
yt-dlp - Unlicense

How to Create This from Scratch with Claude Code

This entire pipeline was built using Claude Code, Anthropic's AI coding assistant. Here's how you can replicate it for your own YouTube channel without writing any code yourself.

Prerequisites

Install Claude Code CLI
Have a YouTube channel URL ready
Get API keys (Hugging Face, YouTube Data API)

Prompt Sequence

Prompt 1: Set up the project

I want to transcribe all the videos on my YouTube channel. The channel is at
https://www.youtube.com/@MyChannelName. Create a Python project that:
- Downloads all videos from the channel
- Transcribes them using Whisper (locally, not API)
- Identifies different speakers (I'm the host, guests vary per video)
- Outputs timestamped transcripts with speaker labels

I have a Hugging Face token for pyannote speaker diarization.

Prompt 2: Create voice profile for speaker identification

Create a voice profile for me so the system can identify when I'm speaking vs guests.
Use this video as a reference where I'm the main speaker:
https://www.youtube.com/watch?v=VIDEO_ID (I start speaking at 0:29)

Prompt 3: Run batch transcription

Now transcribe all the videos on my channel. Start with a few to test, then do the rest.
Skip any videos that have already been transcribed.

Prompt 4: Handle rate limiting (if needed)

YouTube is blocking downloads with "Sign in to confirm you're not a bot".
Can you fix this? We should use the YouTube API or cookies to avoid detection.

Prompt 5: Speed up with GPU (optional)

The transcription is too slow on CPU. I have access to a cloud GPU
(GCP/AWS with NVIDIA T4 or L4). Can we run the transcription there instead?

Prompt 6: Generate HTML output

Create an html subdirectory with clean HTML versions of these transcripts:
- Fix any misspellings of my name (correct: Your Name)
- Identify guests from video titles and descriptions
- Add YouTube embed at top of each page
- Put each speaker's name in bold
- Create SEO-friendly filenames from titles
- Add the video description below the embed

Prompt 7: Create categorized index

Create an index.html that organizes all videos into logical categories
based on their content. Analyze the titles and descriptions to determine
appropriate categories.

Tips for Success

Be specific about your channel - Give Claude the exact URL and any naming conventions you use in titles
Provide reference content - Point to a video where you're clearly the speaker for voice profiling
Iterate on problems - When you hit issues (rate limiting, GPU memory, etc.), just describe the error and Claude will fix it
Let Claude choose tools - Don't prescribe specific libraries; describe what you want and let Claude pick appropriate tools
Review and refine - Check the output after each step. If speaker identification is wrong or names are misspelled, tell Claude specifically what to fix

Example Customization Prompts

For different name corrections:

In the transcripts, my name appears as "John Smith" and "Jon Smyth" sometimes.
The correct spelling is "John Smith". Fix all variations.

For custom categories:

I want to categorize my videos differently. My channel covers:
- Product tutorials
- Customer interviews
- Industry news
- Behind the scenes

Re-organize the index.html with these categories.

For different output format:

Instead of HTML, I want the transcripts as:
- Markdown files
- With YAML frontmatter containing title, date, guest name
- Suitable for a Jekyll or Hugo static site

For subtitle generation:

Generate SRT subtitle files from these transcripts that I can upload back to YouTube.

Publishing to a CMS with LightCMS

Once you have your HTML transcripts generated, you can easily import them into a website using LightCMS, an AI-native content management system designed for agentic workflows.

LightCMS provides a Model Context Protocol (MCP) server with 38+ tools for managing your entire website through Claude Code or other AI assistants. Instead of manually uploading files through an admin interface, you can simply describe what you want and let the AI handle the import.

Example: Bulk Import Transcripts

With the LightCMS MCP server connected to Claude Code, use this prompt to import all your transcripts:

On my website I want you to import all of the files from /Users/yourname/transcriptor/html
into the "videos" folder with one content page per html file. Preserve the original filename
as the slug. We will use the Standard Page template, where the ingested file comprises the
content field. Make the title of each page match the title of the video mentioned inside
each page.

IMPORTANT: it's OK to inspect these files using an LLM for things like the description.
But do NOT parse the content through an LLM to generate the page; we need to preserve the
exact original content of each html page into the content field of each content item in
the content management system.

Claude will then:

Read each HTML file from your html/ directory
Extract the video title from the page content
Create a content page in LightCMS with the correct slug
Preserve the exact HTML content in the content field

This agentic approach makes the "last mile" of publishing incredibly simple—no manual copying, no form filling, just describe what you want and it happens.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
USAGE.md		USAGE.md
create_html_transcripts.py		create_html_transcripts.py
create_voice_profile.py		create_voice_profile.py
requirements.txt		requirements.txt
setup.sh		setup.sh
simple_transcribe.py		simple_transcribe.py
test_single_video.py		test_single_video.py
transcribe_channel.py		transcribe_channel.py
transcribe_channel_full.py		transcribe_channel_full.py
transcribe_channel_gpu.py		transcribe_channel_gpu.py
transcribe_with_speakers.py		transcribe_with_speakers.py
video_list.json		video_list.json
youtube_fetcher.py		youtube_fetcher.py

License

jonradoff/transcriptor

Folders and files

Latest commit

History

Repository files navigation

YouTube Channel Transcription Pipeline

What This Does

Requirements

Software

Python Packages

API Keys

Setup

How It Works

Step 1: Build Video List

Step 2: Download and Transcribe

Step 3: Generate HTML

Step 4: Create Index

Usage

Full Pipeline (GPU recommended)

CPU-Only (slower, ~1.5-2 hours per video)

Generate HTML Only (if transcripts exist)

Output Structure

Performance

Handling YouTube Rate Limiting

Customization

Adding New Name Corrections

Changing Categories

Different Whisper Models

Troubleshooting

"Sign in to confirm you're not a bot"

PyTorch/pyannote version conflicts

Out of GPU memory

Technical Stack

License

Dependencies

How to Create This from Scratch with Claude Code

Prerequisites

Prompt Sequence

Tips for Success

Example Customization Prompts

Publishing to a CMS with LightCMS

Example: Bulk Import Transcripts

Credits

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages