yt

Utilities for Content Creation From YouTube and Local MP4 Videos

This notebook provides utilities for working with video content:

Key functions:

- yt_chapters(): Generate video summaries and chapter timestamps (works with both YouTube URLs and MP4 files)
- transcribe(): Get transcripts from YouTube or transcribe local MP4 files with timestamps
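A minimal usage sketch (the import path is an assumption; adjust it to wherever this module lives in your package):

# Assumed import path -- adjust to your package layout
from yt import yt_chapters, transcribe

chapters = yt_chapters("https://youtu.be/1x3k0V2IITo")   # YouTube URL or local .mp4 path
transcript = transcribe("https://youtu.be/1x3k0V2IITo")  # timestamped transcript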

YouTube/Video Chapter Creation

Automate chapter creation + description for YouTube videos or local MP4 files


source

yt_chapters

 yt_chapters (url_or_path)

Generate YouTube Summary and Chapters from a video (YouTube URL or local MP4).

Here’s what it looks like for Kelly Hong’s “Context Rot” talk (a local MP4):

chp = yt_chapters('context_rot/context_rot.mp4')
print(chp)
Kelly Hong from Chroma presents the "Context Rot" technical report, demonstrating that Large Language Models' performance is not uniform across input lengths. She reveals through several experiments that performance degrades significantly as input tokens increase, especially when tasks involve semantic ambiguity or distracting information, highlighting the critical need for thoughtful context engineering over simply expanding context windows.

00:00 - Introduction to Kelly Hong and the "Context Rot" Report
00:32 - What is Context Rot?
01:47 - The Rise of Long Context Windows in Frontier Models
02:07 - The Common Assumption: More Context is Always Better
03:33 - Explaining the Needle in a Haystack (NIAH) Benchmark
04:49 - Experiment 1: Adding Ambiguity (Semantic vs. Lexical Matching)
06:10 - Q&A: Clarification on High vs. Low Performance Model Graphs
06:54 - Q&A: Was the Original Needle in a Haystack Benchmark Pointless?
08:08 - Experiment 1: Implications for Real-World Applications
09:39 - Experiment 2: Adding Distractors
11:43 - Experiment 2: Implications in Domain-Specific Contexts
12:55 - Model Hallucinations and Failure Modes
14:12 - Experiment 3: Shuffling Haystack Content
15:34 - Surprising Results of Shuffling the Haystack
16:03 - Q&A: What About Needles that Logically Fit the Context?
17:54 - Experiment 4: Conversational Memory
19:20 - Experiment 5: Text Replication
20:15 - Key Takeaways
21:07 - Context Engineering Example: Coding Agent
22:47 - Further Reading and Resources
23:14 - Q&A: Is There One Model that Consistently Resists Context Rot?
24:54 - Q&A: The Term "Context Rot" and the "RAG is Dead" Debate
27:32 - Q&A: Advice for Detecting and Managing Context Rot in AI Applications
29:05 - Q&A: Does the U-Shaped Retrieval Curve (Beginning/End Priority) Still Hold?
30:44 - Conclusion
chp = yt_chapters("https://youtu.be/1x3k0V2IITo")
print(chp)
In this presentation, Antoine Chaffin from LightOn explains the intrinsic limitations of single-vector search models, particularly their struggle with out-of-domain generalization and long contexts due to information loss from pooling. He introduces late-interaction (multi-vector) models as a superior alternative that retains all token-level information and presents his PyLate library, designed to make these powerful models accessible and easy to train.

00:00 - Introduction
00:32 - About the Speaker: Antoine Chaffin
01:40 - How Dense (Single) Vector Search Works
03:07 - Why Single Vector Search is the Go-To for RAG
03:54 - Performance Evaluation with Leaderboards (MTEB)
04:17 - The BEIR Benchmark and Goodhart's Law
05:36 - Limitations Not Captured by Benchmarks: Long Context
06:32 - Limitations Not Captured by Benchmarks: Reasoning-Intensive Retrieval
08:24 - The Intrinsic Flaw of Dense Models: Pooling and Information Compression
10:42 - Why BM25 Remains Competitive
11:32 - Replacing Pooling with Late Interaction (Multi-Vector) Models
12:17 - Why Late Interaction is Better Than Just a Larger Single Vector
13:51 - Evidence of Late Interaction's Superior Performance
17:42 - Why Aren't Late Interaction Models Mainstream?
18:43 - Introducing PyLate: Extending Sentence Transformers for Multi-Vector Models
21:28 - Evaluating Models with PyLate
22:49 - Future Research Avenues for Late Interaction
24:36 - Conclusion and Key Takeaways
25:52 - Q&A Start
25:57 - Q1: What are the latency tradeoffs of late interaction vs. dense models?
26:42 - Q2: Why are late interaction models not yet mainstream despite their advantages?
31:00 - Q3: Does the performance gap hold when comparing fine-tuned late interaction models to fine-tuned dense models?
33:20 - Q4: How easy is it to fine-tune with PyLate and are there common pitfalls?
# Works with local MP4 files too
# chp_local = yt_chapters("path/to/your/video.mp4")
# print(chp_local)
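The returned string is just the summary paragraph followed by "MM:SS - Title" lines, so it is easy to post-process; a small sketch (output format assumed from the examples above):

# Split the summary paragraph from the chapter lines (format assumed from the output above)
summary, _, chapter_block = chp.partition("\n\n")
chapters = [line.split(" - ", 1) for line in chapter_block.splitlines() if " - " in line]
print(summary)
print(chapters[:3])  # e.g. [['00:00', 'Introduction'], ['00:32', 'About the Speaker: Antoine Chaffin'], ...]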

Fetch YouTube Transcript or Transcribe Local MP4

Fetch the transcript of a public YouTube video, or transcribe a local MP4 file using OpenAI Whisper.


source

transcribe

 transcribe (url_or_path, seconds_only=False)

Download YouTube transcript or transcribe local video.


source

transcribe_local_video

 transcribe_local_video (file_path, seconds_only=False)

Transcribe local MP4 video using Whisper.

t = transcribe("https://youtu.be/1x3k0V2IITo")
print(t[:500])
[00:00:00] Hello everyone, my name is Chapan and I
[00:00:02] am a research engineer at Leighton and
[00:00:05] today I will detail some of the limits
[00:00:08] of single vector search that have been
[00:00:10] highlighted by recent usages and
[00:00:13] evaluations and then I will introduce
[00:00:16] multi vector models also known as late
[00:00:18] interaction models and how they can
[00:00:21] overcome this and to finish I will
[00:00:24] briefly present the pilot library that
[00:00:26] al
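The seconds_only flag is not demonstrated above; assuming it swaps the HH:MM:SS prefixes for raw second offsets, usage would look like this (behavior assumed, not verified here):

# Hedged sketch: seconds_only=True presumably emits raw-second offsets instead of HH:MM:SS
t_secs = transcribe("https://youtu.be/1x3k0V2IITo", seconds_only=True)
print(t_secs[:200])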

Local MP4 Transcription

You can also transcribe local MP4 files using OpenAI Whisper:

Note: Requires ffmpeg and openai-whisper installed:

- macOS: brew install ffmpeg
- Ubuntu: apt-get install ffmpeg
- pip install openai-whisper
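A quick preflight check can fail early with a clear message if either dependency is missing (a sketch; it only assumes ffmpeg is on PATH and the openai-whisper package is importable):

import shutil, importlib.util

# Fail fast if the external/runtime dependencies are missing
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH (e.g. brew install ffmpeg)"
assert importlib.util.find_spec("whisper") is not None, "openai-whisper not installed (pip install openai-whisper)"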

t_local = transcribe("_videos/test_video.mp4")
print(t_local[:500])
100%|█████████████████████████████████████| 1.51G/1.51G [00:34<00:00, 47.3MiB/s]
/Users/hamel/.venv/lib/python3.12/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00:00] Hello, this is a super short test recording where I'm going to
[00:00:04] say 1, 2, 3, 4, 5, 6 and say my name, Hamil Hussain.