This notebook provides a minimal interface to Google’s Gemini API. The goal is to make it dead simple to:
- Generate text with just a prompt
- Analyze files (PDFs, images, MP4 videos)
- Process videos (YouTube URLs or local MP4 files)
All through a single gem() function that just works.
Setup
First, make sure you have your Gemini API key set:

# export GEMINI_API_KEY='your-api-key'
import os
assert os.environ.get("GEMINI_API_KEY"), "Please set GEMINI_API_KEY environment variable"
Building blocks
Let’s start with the simple helper functions that make everything work.
Client creation
We need a Gemini client to talk to the API:
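The cell itself is collapsed here, so below is a minimal sketch of what it likely does, assuming the google-genai SDK (`pip install google-genai`); the helper name `mk_client` is hypothetical:

```python
import os
from google import genai

def mk_client():
    "Create a Gemini client from the GEMINI_API_KEY environment variable"
    return genai.Client(api_key=os.environ["GEMINI_API_KEY"])
```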
Video upload
upload_file
def upload_file(
    pth # Path of the file to upload
):
Upload a file to Gemini
upload_file_async
def upload_file_async(
    pth # Path of the file to upload
):
Async wrapper around upload_file
myfile = upload_file("_videos/test_video.mp4")
assert myfile.state == 'ACTIVE'
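For reference, here's a minimal sketch of what `upload_file` does under the hood, assuming the google-genai Files API; the polling loop is what drove the progress bar in the original notebook output:

```python
import time

def upload_file(pth):
    "Upload a file to the Gemini Files API and poll until it is ACTIVE"
    cli = mk_client()               # hypothetical helper from the sketch above
    f = cli.files.upload(file=pth)  # video files are processed server-side
    while f.state == 'PROCESSING':  # poll until processing finishes
        time.sleep(2)
        f = cli.files.get(name=f.name)
    return f
```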
Converting attachments to Parts
Gemini expects different types of content (files, URLs) to be wrapped in “Parts”. This helper handles that conversion:
_part = _make_part('_videos/test_video.mp4')
_part
Part(
file_data=FileData(
file_uri='https://generativelanguage.googleapis.com/v1beta/files/9as0r29gc45g',
mime_type='video/mp4'
)
)
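A sketch of how such a conversion could work with google-genai types; the exact branching inside `_make_part` may differ (for instance, in how text files and PDFs are handled):

```python
import mimetypes, pathlib
from google.genai import types

def _make_part(o):
    "Wrap a URL or local file in a Part Gemini can consume"
    if isinstance(o, str) and o.startswith('http'):
        # YouTube and other URLs can be referenced directly by URI
        return types.Part(file_data=types.FileData(file_uri=o))
    mime = mimetypes.guess_type(str(o))[0] or 'application/octet-stream'
    if mime.startswith('video'):
        f = upload_file(o)  # large videos go through the Files API first
        return types.Part(file_data=types.FileData(file_uri=f.uri, mime_type=f.mime_type))
    # smaller files (PDFs, images, text) can be passed inline as bytes
    return types.Part.from_bytes(data=pathlib.Path(o).read_bytes(), mime_type=mime)
```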
The main interface
Now we can build our main gem() function that handles all use cases:
gem
def gem(
    prompt, # Text prompt
    o:NoneType=None, # Optional file/URL attachment or list of attachments
    model:str='gemini-2.5-flash', # Model name
    thinking:int=-1, # Thinking budget in tokens (-1 lets the model decide)
    search:bool=False # Enable Google Search grounding
):
Generate content with Gemini
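A sketch of the core of `gem()`, assuming the google-genai SDK and the hypothetical helpers above; `thinking` maps to a ThinkingConfig budget and `search` to the Google Search grounding tool:

```python
from google.genai import types

def gem(prompt, o=None, model='gemini-2.5-flash', thinking=-1, search=False):
    "Generate content with Gemini (minimal sketch)"
    cli = mk_client()
    contents = [prompt]
    if o is not None:
        contents += [_make_part(x) for x in (o if isinstance(o, list) else [o])]
    cfg = types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=thinking),  # -1 = dynamic
        tools=[types.Tool(google_search=types.GoogleSearch())] if search else None)
    return cli.models.generate_content(model=model, contents=contents, config=cfg).text
```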
gem_async
def gem_async(
    prompt, # Text prompt
    o:NoneType=None, # Optional file/URL attachment or list of attachments
    model:str='gemini-2.5-flash', thinking:int=-1, search:bool=False,
    stream:bool=False # Set True to get an async iterator of text chunks
):
Async wrapper around gem using background threads. Set stream=True for an async iterator.
await gem_async(
"Summarize this markdown file in one sentence.",
"_test_files/sample.md"
)'This markdown file is a test document demonstrating basic formatting and evaluating text/markdown MIME type support.'
Async interface
Need non-blocking calls? Use gem_async (plus helpers like upload_file_async) to run the same flow via asyncio.to_thread, so it plays nicely inside notebooks and event-loop frameworks.
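The non-streaming path can be as small as a single `asyncio.to_thread` call; a sketch:

```python
import asyncio

async def gem_async(prompt, o=None, **kw):
    "Run the blocking gem() on a worker thread (non-streaming sketch)"
    return await asyncio.to_thread(gem, prompt, o, **kw)
```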
Streaming (async only)
Pass stream=True to gem_async to get an async iterator. Await the call once to obtain the stream, then iterate over it to consume chunks as Gemini sends them.
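One way to bridge the SDK's blocking stream into an async iterator, sketched below (the real `gem_async` also wires in attachments, thinking, and search):

```python
import asyncio

async def _astream(prompt, o=None, model='gemini-2.5-flash'):
    "Yield text chunks from a blocking generate_content_stream without blocking the loop"
    cli = mk_client()
    contents = [prompt] + ([_make_part(o)] if o is not None else [])
    it = iter(cli.models.generate_content_stream(model=model, contents=contents))
    # pull each chunk on a worker thread so the event loop stays responsive
    while (chunk := await asyncio.to_thread(next, it, None)) is not None:
        yield chunk.text or ''
```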
stream = await gem_async(
"Write a detailed synopsis followed by a bullet summary of this video.",
"https://youtu.be/1x3k0V2IITo",
stream=True
)
async for chunk in stream:
print(chunk, end="")The video features Antoine Chaffin, a Research Engineer at LightOn, discussing the limitations of single vector search and introducing multi-vector models, also known as late interaction models, as a superior alternative, especially for modern RAG (Retrieval-Augmented Generation) pipelines.
Chaffin begins by introducing himself and his background, which includes a PhD in multimodal misinformation detection, where he studied information retrieval and generative models. At LightOn, he focuses on information retrieval, particularly encoder models and late interaction, co-creating the ModernBERT encoder and the PyLate library. He also works on OCR-free RAG pipelines and visual rerankers.
He then dives into the core topic, explaining dense (single) vector search. This method involves feeding a query and documents into a transformer model (like ModernBERT) to create contextualized vector representations for each token. These token vectors are then "pooled" into a single vector (using max, mean, or first token pooling), and the cosine similarity between the query's single vector and the document's single vector determines relevance. While widely adopted for RAG pipelines due to their performance, wide availability, and ease of deployment with vector DBs, Chaffin highlights their limitations.
Chaffin uses the MTEB leaderboard and the BEIR benchmark to illustrate the challenges. He explains that while BEIR was designed for heterogeneous zero-shot evaluation across domains, its widespread adoption led to "Goodhart's Law" – when a measure becomes a target, it ceases to be a good measure. Models became overfit to BEIR, performing poorly in actual use cases. He stresses the importance of running custom evaluations.
A major flaw of dense models, according to Chaffin, is the "pooling" operation. Compressing all token information into a single vector leads to selective information encoding, where the model prioritizes information relevant to its training domain (e.g., actors in movie reviews) and ignores other aspects (e.g., plot, music). This results in poor out-of-domain performance, difficulty with longer contexts (as more information needs compression), and challenges with reasoning-intensive retrieval, as the model learns only one specific notion of similarity. In contrast, simpler lexical methods like BM25 often outperform dense models in these challenging scenarios, precisely because they avoid compression.
This leads to the introduction of late interaction (multi-vector) models. Instead of pooling all token vectors into one, late interaction models retain all token vectors. Similarity is computed by summing the maximum similarity of each query token vector with any document token vector (MaxSim). This approach avoids information compression, allowing for more nuanced matching. The training process becomes less noisy, as updates modify only relevant document tokens. This allows for a "soft lexical matching" fallback, akin to BM25 but leveraging deep learning.
Empirically, late interaction models demonstrate strong out-of-domain performance, often surpassing in-domain dense models. Chaffin cites examples where GTE-ModernCoBERT, a late interaction model, significantly outperforms single-vector models on long-context and reasoning-intensive benchmarks, even when trained on shorter contexts or fewer parameters. This superior performance is attributed directly to the late interaction mechanism. An additional benefit is interpretability: by observing which token vectors match, granular insights into document relevance are possible, aiding debugging and user understanding in RAG applications.
Chaffin addresses why dense models are still mainstream:
1. **Storing cost:** Storing multiple vectors is more expensive, though techniques like adapted indexes, quantization, and footprint reduction are making this more manageable.
2. **VectorDB support:** Historically, vector DB providers didn't natively support late interaction models, but most now do.
3. **Lack of accessible tools:** Sentence-transformers has been a big reason for the popularity of dense models, and a similar accessible library for late interaction was missing.
To solve the third point, LightOn developed **PyLate**, a library that extends Sentence-Transformers for multi-vector models. PyLate supports all base models, allows efficient multi-GPU training, integrates into the Hugging Face ecosystem for easy sharing and model card creation, and maintains a familiar syntax. This makes training, experimenting, and evaluating late interaction models much more accessible.
Finally, Chaffin discusses future avenues:
* **Reducing cost:** Further research into optimal trade-offs between storage size and performance for multi-vector indexes.
* **Applying to other modalities:** Extending late interaction to modalities beyond text, such as image-text matching (ColBERT, GINA) and audio.
* **Better similarity functions:** Improving upon the naive MaxSim operator with learnable scoring functions for enhanced relevance.
He concludes by encouraging viewers to try out PyLate and existing late interaction models on their data to experience their benefits firsthand.
---
### Bullet Summary:
* **Speaker:** Antoine Chaffin, Research Engineer at LightOn.
* **Topic:** Limitations of single vector search and the advantages of multi-vector (late interaction) models for RAG pipelines.
**Antoine Chaffin's Background:**
* PhD in multimodal misinformation detection.
* Focuses on information retrieval, encoder models, and late interaction at LightOn.
* Co-creator of ModernBERT encoder and PyLate library.
* Works on OCR-free RAG pipelines and visual rerankers.
**Dense (Single) Vector Search:**
* **Process:** Query/documents fed to transformer (e.g., ModernBERT) -> contextualized token vectors -> pooled into a single vector -> cosine similarity computed.
* **Popularity:** Good out-of-the-box performance, wide model availability, easy deployment with vector DBs for RAG.
* **Limitations:**
* **Overfitting (Goodhart's Law):** Benchmarks like BEIR led to models overfit to specific domains, performing poorly in real-world use cases. Emphasizes custom evaluations.
* **Pooling is an Intrinsic Flaw:** Compressing all tokens into a single vector leads to selective information encoding.
* Model focuses on information deemed "helpful" during training (e.g., actors in movie reviews), ignoring other critical details (plot, music).
* Struggles with out-of-domain contexts, longer documents (more compression needed), and reasoning-intensive retrieval (learns only one notion of similarity).
* **BM25 Comparison:** Simple lexical methods like BM25 often outperform dense models in challenging out-of-domain or reasoning-intensive tasks because they don't compress information, even though they lack semantic understanding.
**Late Interaction (Multi-Vector) Models:**
* **Process:** Query/documents fed to transformer -> contextualized token vectors retained (no pooling) -> similarity computed by summing the maximum similarity of each query token vector with any document token vector (MaxSim).
* **Advantages:**
* **No Information Compression:** Retains all token information, allowing for more nuanced and comprehensive matching.
* **Improved Generalization:** Strong out-of-domain performance, often surpassing in-domain dense models.
* **Long-Context Handling:** Better performance with longer documents due to lack of compression.
* **Reasoning-Intensive Retrieval:** Outperforms dense models, even much larger ones, on tasks requiring reasoning by providing a "soft lexical matching" fallback.
* **Interpretability:** Granular matching allows identifying specific matching sub-chunks within documents, aiding debugging and user experience.
**Why Dense is Still Mainstream (and solutions):**
* **Storage Cost:** Storing multiple vectors requires more space. Solutions: Adapted indexes, quantization, footprint reduction techniques.
* **VectorDB Support:** Historically, many vector DBs didn't natively support multi-vector search. Now, most major providers (Vespa, Weaviate, Qdrant, LanceDB, Milvus, Elastic) do.
* **Lack of Accessible Tools:** Sentence-Transformers made dense models easy; a similar tool for multi-vector was needed.
**PyLate Library:**
* **Purpose:** Extends Sentence-Transformers to support late interaction (multi-vector) models.
* **Features:**
* Supports all base transformer models.
* Efficient training (multi-GPU, FP16/BF16, checkpointing, W&B integration).
* Seamless integration with the Hugging Face ecosystem.
* Familiar syntax, allowing easy migration from Sentence-Transformers boilerplate.
* Built-in efficient PLAID index for fast retrieval.
* Helper functions for evaluation metrics using the popular RAG library.
* Supports standard queries/documents/QRELs format (MTEB, TREC datasets).
**Future Avenues:**
* **Cost Reduction:** Optimize index footprint for multi-vector models to find the optimal trade-off between performance and storage.
* **Multi-Modality Application:** Extend late interaction to image-text (ColBERT, GINA) and audio-text retrieval.
* **Improved Similarity Functions:** Develop more sophisticated, learnable scoring functions beyond the current naive MaxSim.
**Conclusion:**
* Late interaction models overcome single vector search limitations, making them well-suited for modern real-world use cases (out-of-domain, long context, reasoning-intensive).
* PyLate provides an easy-to-use ecosystem for training, experimenting, and evaluating these powerful models.
* Encourages users to try out existing late interaction models and train their own on custom data.
Examples
One function handles everything:

- Just text? Pass a prompt.
- Have a file? Pass it as the second argument.
- Got a YouTube URL? Same thing.
Let’s test it out:
Text generation
The simplest case - just generate some text:
gem("Write a haiku about Python programming")'Simple, readable code,\nIndented, clean, logic flows,\nPower in each line.'
await gem_async("Write a haiku about Python programming")'Clear, simple code,\nLike a serpent, logic flows,\nIdeas come to life.'
Video analysis
Perfect for creating YouTube chapters or summaries:
prompt = "5 word summary of this video."
gem(prompt, "https://youtu.be/1x3k0V2IITo")'This video explains that traditional single vector search (dense models) compress token information, leading to limitations in out-of-domain and long-context scenarios. The speaker introduces "late interaction" (multi-vector) models as a solution, which keep all token vectors and use MaxSim for similarity, providing better generalization and interpretability. They highlight the PyLate library for easier implementation of these models, noting improved performance in reasoning-intensive and long-context retrieval tasks compared to dense models.\n\nHere\'s a 5-word summary: **Multi-vector search outperforms single-vector.**'
await gem_async(prompt, "https://youtu.be/1x3k0V2IITo")'The speaker, Antoine Chaffin, explains the limitations of single vector search in information retrieval due to pooling operations that compress token information. He then introduces "late interaction" (multi-vector) models as a solution, which avoid pooling and use a token-level similarity operation (MaxSim) to retain all information. This approach improves performance, especially for out-of-domain, long-context, and reasoning-intensive retrieval tasks, even outperforming larger single-vector models. He also highlights PyLate, a library for training and evaluating these multi-vector models, and discusses future research avenues like reducing storage costs and applying late interaction to other modalities.'
Local MP4 Video Analysis
You can also analyze local MP4 video files:
# Example with local MP4 file (if you have one)
gem("Summarize this video in 3 sentences.", "_videos/test_video.mp4") |----------------------------------------| 0.00% [0/30 00:00<?]
'A bald man with glasses introduces the video as a "super short test recording." He then proceeds to recite the numbers "1 2 3 4 5 6." The man concludes the test by stating his name as "Hamil Hussain."'
await gem_async("Summarize this video in 3 sentences.", "_videos/test_video.mp4") |----------------------------------------| 0.00% [0/30 00:00<?]
'The video features a man with a bald head and glasses conducting a short test recording. During the test, he recites the numbers one through six. He concludes the recording by stating his name as Hamil Hussein.'
File analysis
Great for extracting information from PDFs or images:
gem("3 sentence summary of this presentation.", "NewFrontiersInIR.pdf")'This presentation explores new frontiers in Information Retrieval (IR) by enabling models to follow complex instructions and perform reasoning, akin to large language models (LLMs). It introduces two main models: "Promptriever," a fast bi-encoder trained with synthetic instructions for promptable retrieval, and "Rank1," a powerful but slower cross-encoder that utilizes test-time compute for deeper reasoning. These instruction-trained retrievers significantly enhance search capabilities by unlocking new types of natural language queries, moving beyond simple keywords, and achieving higher accuracy in nuanced retrieval tasks.'
await gem_async("3 sentence summary of this presentation.", "NewFrontiersInIR.pdf")'This presentation explores "New Frontiers in IR: Instruction Following and Reasoning," arguing that traditional search, even with LLM wrappers, hasn\'t fully evolved to handle complex user instructions. It introduces two main models: "Promptriever," an instruction-trained bi-encoder that can be prompted like a language model for efficient instruction following, and "Rank1," a cross-encoder that uses test-time compute for robust, reasoning-based reranking. Both models demonstrate significant improvements over existing methods across various tasks, unlocking new types of natural language queries and enhancing the ability to retrieve nuanced and highly relevant documents.'
gem("What's in this image?", "anton.png")'This image is a promotional thumbnail, likely for a video or article, focusing on a technical topic, most probably related to AI/Machine Learning, specifically "vectors" and "RAG" (Retrieval-Augmented Generation).\n\nHere\'s a breakdown of the elements:\n\n* **Background:** A dark, solid blue-black color, providing a high contrast for the text and graphics.\n* **Person:** In the bottom left, a young man with light skin and brown hair is visible from the chest up, smiling broadly and looking slightly to his right (viewer\'s left). He is wearing a white V-neck t-shirt.\n* **Emoji:** Directly above the man\'s head, slightly to the left, is a yellow "sad face" or "worried face" emoji, creating a visual contrast with his smile.\n* **Text:**\n * In the top left, in large, bold white letters: "Single Vector?"\n * Below that, overlapping with the man\'s head and the emoji, in large, bold yellow letters: "YOU\'RE MISSING OUT"\n* **Graphics (Network/Diagram):**\n * **Top Right:** A network of glowing blue circles (nodes) connected by thin blue lines, resembling a neural network or a graph.\n * **Bottom Right:** A series of smaller glowing blue nodes connected by lines, leading to a central node, which then branches out into multiple lines converging into a rectangular box.\n * **Arrows:** Arrows indicate the flow of information:\n * An arrow from the main top-right network points downwards.\n * Arrows from the bottom-right series of nodes point towards the rectangular box.\n * **RAG Box:** A dark blue rectangular box with rounded corners and a lighter blue border. Inside, in bold white letters, it says "RAG".\n\n**Overall Message:** The image uses visual contrast (smiling person with a sad emoji) and direct text ("YOU\'RE MISSING OUT") to suggest that relying on a "Single Vector" approach might be insufficient or outdated, and that integrating with a "RAG" system (represented by the interconnected networks) is a better, more advanced method. It aims to create curiosity and convey a sense of urgency about adopting this "RAG" approach in the context of vector-based data processing or AI.'
await gem_async("What's in this image?", "anton.png")'This image is a digital graphic, likely a thumbnail for a video or article, set against a dark blue or black background.\n\nHere\'s a breakdown of its contents:\n\n1. **Text:**\n * At the top left, in large white font, is the phrase: "Single Vector?"\n * Below that, in large yellow font, are the stacked phrases: "YOU\'RE MISSING OUT".\n\n2. **Emoji & Person:**\n * Above the word "YOU\'RE," there is a yellow sad or worried face emoji.\n * To the left and slightly below the text and emoji, a young man with brown hair and a white V-neck shirt is smiling broadly, looking directly at the viewer.\n\n3. **Diagram/Graphic:**\n * On the right side of the image, there is a glowing blue, abstract diagram resembling a network or graph. It consists of multiple interconnected blue circles (nodes) and lines (edges).\n * An arrow from a cluster of these nodes points downwards to a rectangular box.\n * This box is dark blue with rounded corners and contains the white capital letters: "RAG".\n * There\'s also a smaller set of interconnected blue dots and lines originating from the bottom left, with an arrow feeding into the larger network structure.\n\nThe overall theme, combining the text "Single Vector? YOU\'RE MISSING OUT" with the RAG (Retrieval-Augmented Generation) diagram, strongly suggests a topic related to artificial intelligence, machine learning, or large language models, likely discussing advanced techniques beyond simple vector representations or emphasizing the importance of RAG.'
Text file analysis
Works with common text formats like .txt, .vtt, .md:
gem("What type of file is this?", "_test_files/sample.txt")'This is a **plain text file**.\n\nIt even mentions its MIME type within the content: "text/plain".'
await gem_async("What type of file is this?", "_test_files/sample.txt")'This is a **plain text file**.\n\nMore specifically, it aligns with the `text/plain` MIME type, as it explicitly states and its content is simple human-readable characters without any special formatting, encoding, or binary data.'
gem("How many subtitle entries are in this file?", "_test_files/sample.vtt")'There are **2** subtitle entries in this VTT file.'
gem("List the items in this markdown file.", "_test_files/sample.md")'Based on the markdown file provided, the items listed are:\n\n1. Item 1\n2. Item 2'
Change Model
You can also choose a different model and control the thinking budget:
gem("What is Hamel Husain's current job?", model="gemini-2.5-pro")"Based on his public profiles, Hamel Husain's current job is **Head of Data Science & Machine Learning at Argonaut**.\n\nHe is also well-known for his previous role as a Principal Machine Learning Scientist at **GitHub**, where he created popular open-source tools like `nbdev`, `fastpages`, and `ghapi`."
Grounded Search
As the previous answer shows, the model's built-in knowledge can be out of date, so grounded search is sometimes required to get things right!
gem("What is Hamel Husain's current job?.", search=True)'Hamel Husain is currently working as an independent consultant, assisting companies in building and operationalizing AI products, with a particular focus on Large Language Models (LLMs). He also co-teaches a course titled "AI Evals for Engineers & PMs."\n\nPreviously, Hamel Husain held the position of Staff Machine Learning Engineer at GitHub. He has over 20 years of experience as a machine learning engineer, having worked with companies like Airbnb and GitHub, where his work included early LLM research utilized by OpenAI for code understanding. He has also contributed to numerous popular open-source machine learning tools.'
await gem_async("What is Hamel Husain's current job?.", search=True)'Hamel Husain is currently an independent consultant, assisting companies with building and operationalizing AI products, especially those involving Large Language Models (LLMs). He also works as an AI consultant at Parlance Labs and co-teaches a course titled "AI Evals for Engineers & PMs".\n\nPreviously, Hamel Husain held the position of Staff Machine Learning Engineer at GitHub. He has over 20 years of experience as a machine learning engineer and has contributed to numerous popular open-source machine learning tools.'
Multiple Attachments
You can analyze multiple files/URLs at once by passing a list:
prompt = "Is this PDF and YouTube video related or are they different talks? Answer with very short yes/no answer."
gem(prompt, ["https://youtu.be/Trps2swgeOg?si=yK7CO0Zk4E1rfp6s", "NewFrontiersInIR.pdf"])'No'
await gem_async(prompt, ["https://youtu.be/YB3b-wPbSH8?si=WI0LqflY5SYIsRz9", "NewFrontiersInIR.pdf"])'Yes.'
gem("What do these slides and this video have in common in terms of content/subject matter if at all? Provide a 1 sentence summary of each.", ["NewFrontiersInIR.pdf", "_videos/test_video.mp4"]) |█---------------------------------------| 3.33% [1/30 00:10<04:53]
'The video is a brief test recording where a man introduces himself and recites numbers to check audio quality.\n\nThe slides present research on "New Frontiers in IR," exploring how to make information retrieval systems understand and follow natural language instructions and perform reasoning, much like large language models, using methods like "Promptriever" and "Rank1."\n\nThese two pieces of content have **no commonality** in terms of subject matter or content. The video is a personal audio test, while the slides are a technical presentation on advanced information retrieval.'
await gem_async("What do these slides and this video have in common in terms of content/subject matter if at all? Provide a 1 sentence summary of each.", ["NewFrontiersInIR.pdf", "_videos/test_video.mp4"]) |----------------------------------------| 0.00% [0/30 00:00<?]
'The video and the slides have **no commonality** in terms of content or subject matter.\n\n* **Video Summary:** The video shows a man conducting a brief audio test, reciting numbers and his name.\n* **Slideshow Summary:** The slideshow presents "Promptriever" and "Rank1" as novel information retrieval models capable of instruction following and complex reasoning, similar to large language models.'