This notebook provides a minimal interface to Google’s Gemini API. The goal is to make it dead simple to:
- Generate text with just a prompt
- Analyze files (PDFs, images, MP4 videos)
- Process videos (YouTube URLs or local MP4 files)
All through a single gem() function that just works.
Setup
First, make sure you have your Gemini API key set:

# export GEMINI_API_KEY='your-api-key'
import os
assert os.environ.get("GEMINI_API_KEY"), "Please set GEMINI_API_KEY environment variable"
Building blocks
Let’s start with the simple helper functions that make everything work.
Client creation
We need a Gemini client to talk to the API:
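The cell itself is collapsed here, so below is a minimal sketch of what it likely does, assuming the google-genai SDK (`pip install google-genai`); the helper name `mk_client` is hypothetical:

```python
import os
from google import genai

def mk_client():
    "Create a Gemini client from the GEMINI_API_KEY environment variable"
    return genai.Client(api_key=os.environ["GEMINI_API_KEY"])
```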
Video upload
upload_file
def upload_file(
    pth # Path of the file to upload
):
Upload a file to Gemini
upload_file_async
def upload_file_async(
    pth # Path of the file to upload
):
Async wrapper around upload_file
myfile = upload_file("_videos/test_video.mp4")
assert myfile.state == 'ACTIVE'
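For reference, here's a minimal sketch of what `upload_file` does under the hood, assuming the google-genai Files API; the polling loop is what drove the progress bar in the original notebook output:

```python
import time

def upload_file(pth):
    "Upload a file to the Gemini Files API and poll until it is ACTIVE"
    cli = mk_client()               # hypothetical helper from the sketch above
    f = cli.files.upload(file=pth)  # video files are processed server-side
    while f.state == 'PROCESSING':  # poll until processing finishes
        time.sleep(2)
        f = cli.files.get(name=f.name)
    return f
```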
Converting attachments to Parts
Gemini expects different types of content (files, URLs) to be wrapped in “Parts”. This helper handles that conversion:
_part = _make_part('_videos/test_video.mp4')
_part
Part(
file_data=FileData(
file_uri='https://generativelanguage.googleapis.com/v1beta/files/9as0r29gc45g',
mime_type='video/mp4'
)
)
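A sketch of how such a conversion could work with google-genai types; the exact branching inside `_make_part` may differ (for instance, in how text files and PDFs are handled):

```python
import mimetypes, pathlib
from google.genai import types

def _make_part(o):
    "Wrap a URL or local file in a Part Gemini can consume"
    if isinstance(o, str) and o.startswith('http'):
        # YouTube and other URLs can be referenced directly by URI
        return types.Part(file_data=types.FileData(file_uri=o))
    mime = mimetypes.guess_type(str(o))[0] or 'application/octet-stream'
    if mime.startswith('video'):
        f = upload_file(o)  # large videos go through the Files API first
        return types.Part(file_data=types.FileData(file_uri=f.uri, mime_type=f.mime_type))
    # smaller files (PDFs, images, text) can be passed inline as bytes
    return types.Part.from_bytes(data=pathlib.Path(o).read_bytes(), mime_type=mime)
```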
The main interface
Now we can build our main gem() function that handles all use cases:
gem
def gem(
    prompt, # Text prompt
    o:NoneType=None, # Optional file/URL attachment or list of attachments
    model:str='gemini-2.5-flash', # Model name
    thinking:int=-1, # Thinking budget in tokens (-1 lets the model decide)
    search:bool=False # Enable Google Search grounding
):
Generate content with Gemini
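A sketch of the core of `gem()`, assuming the google-genai SDK and the hypothetical helpers above; `thinking` maps to a ThinkingConfig budget and `search` to the Google Search grounding tool:

```python
from google.genai import types

def gem(prompt, o=None, model='gemini-2.5-flash', thinking=-1, search=False):
    "Generate content with Gemini (minimal sketch)"
    cli = mk_client()
    contents = [prompt]
    if o is not None:
        contents += [_make_part(x) for x in (o if isinstance(o, list) else [o])]
    cfg = types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=thinking),  # -1 = dynamic
        tools=[types.Tool(google_search=types.GoogleSearch())] if search else None)
    return cli.models.generate_content(model=model, contents=contents, config=cfg).text
```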
gem_async
def gem_async(
    prompt, # Text prompt
    o:NoneType=None, # Optional file/URL attachment or list of attachments
    model:str='gemini-2.5-flash', thinking:int=-1, search:bool=False,
    stream:bool=False # Set True to get an async iterator of text chunks
):
Async wrapper around gem using background threads. Set stream=True for an async iterator.
await gem_async(
"Summarize this markdown file in one sentence.",
"_test_files/sample.md"
)'This markdown file is a test document demonstrating basic formatting and evaluating text/markdown MIME type support.'
Async interface
Need non-blocking calls? Use gem_async (plus helpers like upload_file_async) to run the same flow via asyncio.to_thread, so it plays nicely inside notebooks and event-loop frameworks.
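The non-streaming path can be as small as a single `asyncio.to_thread` call; a sketch:

```python
import asyncio

async def gem_async(prompt, o=None, **kw):
    "Run the blocking gem() on a worker thread (non-streaming sketch)"
    return await asyncio.to_thread(gem, prompt, o, **kw)
```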
Streaming (async only)
Pass stream=True to gem_async to get an async iterator. Await the call once to obtain the stream, then iterate over it to consume chunks as Gemini sends them.
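One way to bridge the SDK's blocking stream into an async iterator, sketched below (the real `gem_async` also wires in attachments, thinking, and search):

```python
import asyncio

async def _astream(prompt, o=None, model='gemini-2.5-flash'):
    "Yield text chunks from a blocking generate_content_stream without blocking the loop"
    cli = mk_client()
    contents = [prompt] + ([_make_part(o)] if o is not None else [])
    it = iter(cli.models.generate_content_stream(model=model, contents=contents))
    # pull each chunk on a worker thread so the event loop stays responsive
    while (chunk := await asyncio.to_thread(next, it, None)) is not None:
        yield chunk.text or ''
```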
stream = await gem_async(
"Write a detailed synopsis followed by a bullet summary of this video.",
"https://youtu.be/1x3k0V2IITo",
stream=True
)
async for chunk in stream:
print(chunk, end="")The video features Antoine Chaffin, a Research Engineer at LightOn, discussing the limitations of single vector search and introducing multi-vector models, also known as late interaction models, as a superior alternative, especially for modern RAG (Retrieval-Augmented Generation) pipelines.
Chaffin begins by introducing himself and his background, which includes a PhD in multimodal misinformation detection, where he studied information retrieval and generative models. At LightOn, he focuses on information retrieval, particularly encoder models and late interaction, co-creating the ModernBERT encoder and the PyLate library. He also works on OCR-free RAG pipelines and visual rerankers.
He then dives into the core topic, explaining dense (single) vector search. This method involves feeding a query and documents into a transformer model (like ModernBERT) to create contextualized vector representations for each token. These token vectors are then "pooled" into a single vector (using max, mean, or first token pooling), and the cosine similarity between the query's single vector and the document's single vector determines relevance. While widely adopted for RAG pipelines due to their performance, wide availability, and ease of deployment with vector DBs, Chaffin highlights their limitations.
Chaffin uses the MTEB leaderboard and the BEIR benchmark to illustrate the challenges. He explains that while BEIR was designed for heterogeneous zero-shot evaluation across domains, its widespread adoption led to "Goodhart's Law" – when a measure becomes a target, it ceases to be a good measure. Models became overfit to BEIR, performing poorly in actual use cases. He stresses the importance of running custom evaluations.
A major flaw of dense models, according to Chaffin, is the "pooling" operation. Compressing all token information into a single vector leads to selective information encoding, where the model prioritizes information relevant to its training domain (e.g., actors in movie reviews) and ignores other aspects (e.g., plot, music). This results in poor out-of-domain performance, difficulty with longer contexts (as more information needs compression), and challenges with reasoning-intensive retrieval, as the model learns only one specific notion of similarity. In contrast, simpler lexical methods like BM25 often outperform dense models in these challenging scenarios, precisely because they avoid compression.
This leads to the introduction of late interaction (multi-vector) models. Instead of pooling all token vectors into one, late interaction models retain all token vectors. Similarity is computed by summing the maximum similarity of each query token vector with any document token vector (MaxSim). This approach avoids information compression, allowing for more nuanced matching. The training process becomes less noisy, as updates modify only relevant document tokens. This allows for a "soft lexical matching" fallback, akin to BM25 but leveraging deep learning.
Empirically, late interaction models demonstrate strong out-of-domain performance, often surpassing in-domain dense models. Chaffin cites examples where GTE-ModernCoBERT, a late interaction model, significantly outperforms single-vector models on long-context and reasoning-intensive benchmarks, even when trained on shorter contexts or fewer parameters. This superior performance is attributed directly to the late interaction mechanism. An additional benefit is interpretability: by observing which token vectors match, granular insights into document relevance are possible, aiding debugging and user understanding in RAG applications.
Chaffin addresses why dense models are still mainstream:
1. **Storing cost:** Storing multiple vectors is more expensive, though techniques like adapted indexes, quantization, and footprint reduction are making this more manageable.
2. **VectorDB support:** Historically, vector DB providers didn't natively support late interaction models, but most now do.
3. **Lack of accessible tools:** Sentence-transformers has been a big reason for the popularity of dense models, and a similar accessible library for late interaction was missing.
To solve the third point, LightOn developed **PyLate**, a library that extends Sentence-Transformers for multi-vector models. PyLate supports all base models, allows efficient multi-GPU training, integrates into the Hugging Face ecosystem for easy sharing and model card creation, and maintains a familiar syntax. This makes training, experimenting, and evaluating late interaction models much more accessible.
Finally, Chaffin discusses future avenues:
* **Reducing cost:** Further research into optimal trade-offs between storage size and performance for multi-vector indexes.
* **Applying to other modalities:** Extending late interaction to modalities beyond text, such as image-text matching (ColBERT, GINA) and audio.
* **Better similarity functions:** Improving upon the naive MaxSim operator with learnable scoring functions for enhanced relevance.
He concludes by encouraging viewers to try out PyLate and existing late interaction models on their data to experience their benefits firsthand.
---
### Bullet Summary:
* **Speaker:** Antoine Chaffin, Research Engineer at LightOn.
* **Topic:** Limitations of single vector search and the advantages of multi-vector (late interaction) models for RAG pipelines.
**Antoine Chaffin's Background:**
* PhD in multimodal misinformation detection.
* Focuses on information retrieval, encoder models, and late interaction at LightOn.
* Co-creator of ModernBERT encoder and PyLate library.
* Works on OCR-free RAG pipelines and visual rerankers.
**Dense (Single) Vector Search:**
* **Process:** Query/documents fed to transformer (e.g., ModernBERT) -> contextualized token vectors -> pooled into a single vector -> cosine similarity computed.
* **Popularity:** Good out-of-the-box performance, wide model availability, easy deployment with vector DBs for RAG.
* **Limitations:**
* **Overfitting (Goodhart's Law):** Benchmarks like BEIR led to models overfit to specific domains, performing poorly in real-world use cases. Emphasizes custom evaluations.
* **Pooling is an Intrinsic Flaw:** Compressing all tokens into a single vector leads to selective information encoding.
* Model focuses on information deemed "helpful" during training (e.g., actors in movie reviews), ignoring other critical details (plot, music).
* Struggles with out-of-domain contexts, longer documents (more compression needed), and reasoning-intensive retrieval (learns only one notion of similarity).
* **BM25 Comparison:** Simple lexical methods like BM25 often outperform dense models in challenging out-of-domain or reasoning-intensive tasks because they don't compress information, even though they lack semantic understanding.
**Late Interaction (Multi-Vector) Models:**
* **Process:** Query/documents fed to transformer -> contextualized token vectors retained (no pooling) -> similarity computed by summing the maximum similarity of each query token vector with any document token vector (MaxSim).
* **Advantages:**
* **No Information Compression:** Retains all token information, allowing for more nuanced and comprehensive matching.
* **Improved Generalization:** Strong out-of-domain performance, often surpassing in-domain dense models.
* **Long-Context Handling:** Better performance with longer documents due to lack of compression.
* **Reasoning-Intensive Retrieval:** Outperforms dense models, even much larger ones, on tasks requiring reasoning by providing a "soft lexical matching" fallback.
* **Interpretability:** Granular matching allows identifying specific matching sub-chunks within documents, aiding debugging and user experience.
**Why Dense is Still Mainstream (and solutions):**
* **Storage Cost:** Storing multiple vectors requires more space. Solutions: Adapted indexes, quantization, footprint reduction techniques.
* **VectorDB Support:** Historically, many vector DBs didn't natively support multi-vector search. Now, most major providers (Vespa, Weaviate, Qdrant, LanceDB, Milvus, Elastic) do.
* **Lack of Accessible Tools:** Sentence-Transformers made dense models easy; a similar tool for multi-vector was needed.
**PyLate Library:**
* **Purpose:** Extends Sentence-Transformers to support late interaction (multi-vector) models.
* **Features:**
* Supports all base transformer models.
* Efficient training (multi-GPU, FP16/BF16, checkpointing, W&B integration).
* Seamless integration with the Hugging Face ecosystem.
* Familiar syntax, allowing easy migration from Sentence-Transformers boilerplate.
* Built-in efficient PLAID index for fast retrieval.
* Helper functions for evaluation metrics using the popular RAG library.
* Supports standard queries/documents/QRELs format (MTEB, TREC datasets).
**Future Avenues:**
* **Cost Reduction:** Optimize index footprint for multi-vector models to find the optimal trade-off between performance and storage.
* **Multi-Modality Application:** Extend late interaction to image-text (ColBERT, GINA) and audio-text retrieval.
* **Improved Similarity Functions:** Develop more sophisticated, learnable scoring functions beyond the current naive MaxSim.
**Conclusion:**
* Late interaction models overcome single vector search limitations, making them well-suited for modern real-world use cases (out-of-domain, long context, reasoning-intensive).
* PyLate provides an easy-to-use ecosystem for training, experimenting, and evaluating these powerful models.
* Encourages users to try out existing late interaction models and train their own on custom data.
Examples
One function handles everything:

- Just text? Pass a prompt.
- Have a file? Pass it as the second argument.
- Got a YouTube URL? Same thing.
Let’s test it out:
Text generation
The simplest case - just generate some text:
gem("Write a haiku about Python programming")'Simple, readable code,\nIndented, clean, logic flows,\nPower in each line.'
await gem_async("Write a haiku about Python programming")'Clear, simple code,\nLike a serpent, logic flows,\nIdeas come to life.'
Video analysis
Perfect for creating YouTube chapters or summaries:
prompt = "5 word summary of this video."
gem(prompt, "https://youtu.be/1x3k0V2IITo")'This video explains that traditional single vector search (dense models) compress token information, leading to limitations in out-of-domain and long-context scenarios. The speaker introduces "late interaction" (multi-vector) models as a solution, which keep all token vectors and use MaxSim for similarity, providing better generalization and interpretability. They highlight the PyLate library for easier implementation of these models, noting improved performance in reasoning-intensive and long-context retrieval tasks compared to dense models.\n\nHere\'s a 5-word summary: **Multi-vector search outperforms single-vector.**'
await gem_async(prompt, "https://youtu.be/1x3k0V2IITo")'The speaker, Antoine Chaffin, explains the limitations of single vector search in information retrieval due to pooling operations that compress token information. He then introduces "late interaction" (multi-vector) models as a solution, which avoid pooling and use a token-level similarity operation (MaxSim) to retain all information. This approach improves performance, especially for out-of-domain, long-context, and reasoning-intensive retrieval tasks, even outperforming larger single-vector models. He also highlights PyLate, a library for training and evaluating these multi-vector models, and discusses future research avenues like reducing storage costs and applying late interaction to other modalities.'
Local MP4 Video Analysis
You can also analyze local MP4 video files:
# Example with local MP4 file (if you have one)
gem("Summarize this video in 3 sentences.", "_videos/test_video.mp4") |----------------------------------------| 0.00% [0/30 00:00<?]
'A bald man with glasses introduces the video as a "super short test recording." He then proceeds to recite the numbers "1 2 3 4 5 6." The man concludes the test by stating his name as "Hamil Hussain."'
await gem_async("Summarize this video in 3 sentences.", "_videos/test_video.mp4") |----------------------------------------| 0.00% [0/30 00:00<?]
'The video features a man with a bald head and glasses conducting a short test recording. During the test, he recites the numbers one through six. He concludes the recording by stating his name as Hamil Hussein.'
File analysis
Great for extracting information from PDFs or images:
gem("3 sentence summary of this presentation.", "NewFrontiersInIR.pdf")'This presentation explores new frontiers in Information Retrieval (IR) by enabling models to follow complex instructions and perform reasoning, akin to large language models (LLMs). It introduces two main models: "Promptriever," a fast bi-encoder trained with synthetic instructions for promptable retrieval, and "Rank1," a powerful but slower cross-encoder that utilizes test-time compute for deeper reasoning. These instruction-trained retrievers significantly enhance search capabilities by unlocking new types of natural language queries, moving beyond simple keywords, and achieving higher accuracy in nuanced retrieval tasks.'
await gem_async("3 sentence summary of this presentation.", "NewFrontiersInIR.pdf")'This presentation explores "New Frontiers in IR: Instruction Following and Reasoning," arguing that traditional search, even with LLM wrappers, hasn\'t fully evolved to handle complex user instructions. It introduces two main models: "Promptriever," an instruction-trained bi-encoder that can be prompted like a language model for efficient instruction following, and "Rank1," a cross-encoder that uses test-time compute for robust, reasoning-based reranking. Both models demonstrate significant improvements over existing methods across various tasks, unlocking new types of natural language queries and enhancing the ability to retrieve nuanced and highly relevant documents.'
gem("What's in this image?", "anton.png")'This image is a promotional thumbnail, likely for a video or article, focusing on a technical topic, most probably related to AI/Machine Learning, specifically "vectors" and "RAG" (Retrieval-Augmented Generation).\n\nHere\'s a breakdown of the elements:\n\n* **Background:** A dark, solid blue-black color, providing a high contrast for the text and graphics.\n* **Person:** In the bottom left, a young man with light skin and brown hair is visible from the chest up, smiling broadly and looking slightly to his right (viewer\'s left). He is wearing a white V-neck t-shirt.\n* **Emoji:** Directly above the man\'s head, slightly to the left, is a yellow "sad face" or "worried face" emoji, creating a visual contrast with his smile.\n* **Text:**\n * In the top left, in large, bold white letters: "Single Vector?"\n * Below that, overlapping with the man\'s head and the emoji, in large, bold yellow letters: "YOU\'RE MISSING OUT"\n* **Graphics (Network/Diagram):**\n * **Top Right:** A network of glowing blue circles (nodes) connected by thin blue lines, resembling a neural network or a graph.\n * **Bottom Right:** A series of smaller glowing blue nodes connected by lines, leading to a central node, which then branches out into multiple lines converging into a rectangular box.\n * **Arrows:** Arrows indicate the flow of information:\n * An arrow from the main top-right network points downwards.\n * Arrows from the bottom-right series of nodes point towards the rectangular box.\n * **RAG Box:** A dark blue rectangular box with rounded corners and a lighter blue border. Inside, in bold white letters, it says "RAG".\n\n**Overall Message:** The image uses visual contrast (smiling person with a sad emoji) and direct text ("YOU\'RE MISSING OUT") to suggest that relying on a "Single Vector" approach might be insufficient or outdated, and that integrating with a "RAG" system (represented by the interconnected networks) is a better, more advanced method. It aims to create curiosity and convey a sense of urgency about adopting this "RAG" approach in the context of vector-based data processing or AI.'
await gem_async("What's in this image?", "anton.png")'This image is a digital graphic, likely a thumbnail for a video or article, set against a dark blue or black background.\n\nHere\'s a breakdown of its contents:\n\n1. **Text:**\n * At the top left, in large white font, is the phrase: "Single Vector?"\n * Below that, in large yellow font, are the stacked phrases: "YOU\'RE MISSING OUT".\n\n2. **Emoji & Person:**\n * Above the word "YOU\'RE," there is a yellow sad or worried face emoji.\n * To the left and slightly below the text and emoji, a young man with brown hair and a white V-neck shirt is smiling broadly, looking directly at the viewer.\n\n3. **Diagram/Graphic:**\n * On the right side of the image, there is a glowing blue, abstract diagram resembling a network or graph. It consists of multiple interconnected blue circles (nodes) and lines (edges).\n * An arrow from a cluster of these nodes points downwards to a rectangular box.\n * This box is dark blue with rounded corners and contains the white capital letters: "RAG".\n * There\'s also a smaller set of interconnected blue dots and lines originating from the bottom left, with an arrow feeding into the larger network structure.\n\nThe overall theme, combining the text "Single Vector? YOU\'RE MISSING OUT" with the RAG (Retrieval-Augmented Generation) diagram, strongly suggests a topic related to artificial intelligence, machine learning, or large language models, likely discussing advanced techniques beyond simple vector representations or emphasizing the importance of RAG.'
Text file analysis
Works with common text formats like .txt, .vtt, .md:
gem("What type of file is this?", "_test_files/sample.txt")'This is a **plain text file**.\n\nIt even mentions its MIME type within the content: "text/plain".'
await gem_async("What type of file is this?", "_test_files/sample.txt")'This is a **plain text file**.\n\nMore specifically, it aligns with the `text/plain` MIME type, as it explicitly states and its content is simple human-readable characters without any special formatting, encoding, or binary data.'
gem("How many subtitle entries are in this file?", "_test_files/sample.vtt")'There are **2** subtitle entries in this VTT file.'
gem("List the items in this markdown file.", "_test_files/sample.md")'Based on the markdown file provided, the items listed are:\n\n1. Item 1\n2. Item 2'
Change Model
You can also choose a different model and control the thinking budget:
gem("What is Hamel Husain's current job?", model="gemini-2.5-pro")"Based on his public profiles, Hamel Husain's current job is **Head of Data Science & Machine Learning at Argonaut**.\n\nHe is also well-known for his previous role as a Principal Machine Learning Scientist at **GitHub**, where he created popular open-source tools like `nbdev`, `fastpages`, and `ghapi`."
Grounded Search
As the previous answer shows, the model's built-in knowledge can be out of date, so grounded search is sometimes required to get things right!
gem("What is Hamel Husain's current job?.", search=True)'Hamel Husain is currently working as an independent consultant, assisting companies in building and operationalizing AI products, with a particular focus on Large Language Models (LLMs). He also co-teaches a course titled "AI Evals for Engineers & PMs."\n\nPreviously, Hamel Husain held the position of Staff Machine Learning Engineer at GitHub. He has over 20 years of experience as a machine learning engineer, having worked with companies like Airbnb and GitHub, where his work included early LLM research utilized by OpenAI for code understanding. He has also contributed to numerous popular open-source machine learning tools.'
await gem_async("What is Hamel Husain's current job?.", search=True)'Hamel Husain is currently an independent consultant, assisting companies with building and operationalizing AI products, especially those involving Large Language Models (LLMs). He also works as an AI consultant at Parlance Labs and co-teaches a course titled "AI Evals for Engineers & PMs".\n\nPreviously, Hamel Husain held the position of Staff Machine Learning Engineer at GitHub. He has over 20 years of experience as a machine learning engineer and has contributed to numerous popular open-source machine learning tools.'
Multiple Attachments
You can analyze multiple files/URLs at once by passing a list:
prompt = "Is this PDF and YouTube video related or are they different talks? Answer with very short yes/no answer."
gem(prompt, ["https://youtu.be/Trps2swgeOg?si=yK7CO0Zk4E1rfp6s", "NewFrontiersInIR.pdf"])'No'
await gem_async(prompt, ["https://youtu.be/YB3b-wPbSH8?si=WI0LqflY5SYIsRz9", "NewFrontiersInIR.pdf"])'Yes.'
gem("What do these slides and this video have in common in terms of content/subject matter if at all? Provide a 1 sentence summary of each.", ["NewFrontiersInIR.pdf", "_videos/test_video.mp4"]) |█---------------------------------------| 3.33% [1/30 00:10<04:53]
'The video is a brief test recording where a man introduces himself and recites numbers to check audio quality.\n\nThe slides present research on "New Frontiers in IR," exploring how to make information retrieval systems understand and follow natural language instructions and perform reasoning, much like large language models, using methods like "Promptriever" and "Rank1."\n\nThese two pieces of content have **no commonality** in terms of subject matter or content. The video is a personal audio test, while the slides are a technical presentation on advanced information retrieval.'
await gem_async("What do these slides and this video have in common in terms of content/subject matter if at all? Provide a 1 sentence summary of each.", ["NewFrontiersInIR.pdf", "_videos/test_video.mp4"]) |----------------------------------------| 0.00% [0/30 00:00<?]
'The video and the slides have **no commonality** in terms of content or subject matter.\n\n* **Video Summary:** The video shows a man conducting a brief audio test, reciting numbers and his name.\n* **Slideshow Summary:** The slideshow presents "Promptriever" and "Rank1" as novel information retrieval models capable of instruction following and complex reasoning, similar to large language models.'