gem

Simple utilities for working with Google’s Gemini API

This notebook provides a minimal interface to Google’s Gemini API. The goal is to make it dead simple to:

  1. Generate text with just a prompt
  2. Analyze files (PDFs, images, and text files such as .txt, .vtt, .md)
  3. Process videos (YouTube URLs or local MP4 files)

All through a single gem() function that just works.

Setup

First, make sure you have your Gemini API key set:

import os

# export GEMINI_API_KEY='your-api-key'
assert os.environ.get("GEMINI_API_KEY"), "Please set the GEMINI_API_KEY environment variable"

Building blocks

Let’s start with the simple helper functions that make everything work.

Client creation

We need a Gemini client to talk to the API:
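The notebook's own client code is not shown on this page, so here is a minimal sketch of what client creation could look like. It assumes the google-genai package; the helper name get_client is hypothetical and not part of this library.

```python
import os

def get_client(api_key=None):
    """Return a Gemini client, reading GEMINI_API_KEY when no key is passed."""
    key = api_key or os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("Please set the GEMINI_API_KEY environment variable")
    # Imported lazily so the key check works even before the SDK is installed.
    from google import genai  # assumption: the google-genai package
    return genai.Client(api_key=key)
```

Checking the key before importing the SDK keeps the error message focused on the most common setup mistake.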

Video upload



upload_file

 upload_file (pth)
myfile = upload_file("_videos/test_video.mp4")
assert myfile.state == 'ACTIVE'
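The assert above only passes once the uploaded file has finished processing. A rough sketch of the waiting step, assuming the upload goes through the states PROCESSING, ACTIVE, or FAILED; the helper name wait_until_active, the get_state callable, and the default of 30 checks with a 10 second pause are all illustrative assumptions, not the notebook's actual code:

```python
import time

def wait_until_active(get_state, tries=30, delay=10.0, sleep=time.sleep):
    """Poll get_state() until it returns 'ACTIVE' or the attempts run out."""
    for attempt in range(tries):
        state = get_state()
        if state == "ACTIVE":
            return attempt  # number of waits that were needed
        if state == "FAILED":
            raise RuntimeError("file processing failed")
        sleep(delay)  # still processing, so wait before checking again
    raise TimeoutError(f"file not ACTIVE after {tries} checks")
```

Passing get_state as a callable keeps the loop independent of any particular SDK; in practice it would re-fetch the uploaded file and return its current state.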

Converting attachments to Parts

Gemini expects different types of content (files, URLs) to be wrapped in “Parts”. This helper handles that conversion:

_part = _make_part('_videos/test_video.mp4')
_part
Part(
  file_data=FileData(
    file_uri='https://generativelanguage.googleapis.com/v1beta/files/y328hzqczeyj',
    mime_type='video/mp4'
  )
)
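The body of _make_part is not shown here, but the first decision it has to make is whether an attachment is a URL or a local file, and what MIME type a local file has. A small standard-library sketch of that decision; describe_attachment and its returned dictionary are hypothetical and only illustrate the idea:

```python
import mimetypes
from urllib.parse import urlparse

def describe_attachment(o):
    """Classify an attachment as a remote URL or a local file with a MIME type."""
    s = str(o)
    if urlparse(s).scheme in ("http", "https"):
        # Remote content such as a YouTube link can be referenced by URI.
        return {"kind": "url", "uri": s}
    # Local files need a MIME type; fall back to a generic one if unknown.
    mime, _ = mimetypes.guess_type(s)
    return {"kind": "file", "path": s, "mime_type": mime or "application/octet-stream"}
```

For example, describe_attachment("_videos/test_video.mp4") reports a local file with MIME type video/mp4, which matches the mime_type in the Part shown above.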

The main interface

Now we can build our main gem() function that handles all use cases:



gem

 gem (prompt, o=None, model='gemini-2.5-flash', thinking=-1, search=False)

Generate content with Gemini

Parameter  Type      Default           Details
prompt                                 Text prompt
o          NoneType  None              Optional file/URL attachment or list of attachments
model      str       gemini-2.5-flash
thinking   int       -1
search     bool      False
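Since o can be nothing, a single attachment, or a list of attachments, one plausible first step inside gem() is to normalize it to a list. This is a sketch under that assumption; as_attachment_list is a hypothetical helper, not the notebook's code:

```python
def as_attachment_list(o=None):
    """Normalize None, a single attachment, or a list of attachments to a list."""
    if o is None:
        return []
    if isinstance(o, (list, tuple)):
        return list(o)
    return [o]  # a single path or URL becomes a one-item list
```

With this in place, the rest of the function can loop over attachments without caring how the caller passed them.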

Examples

One function handles everything:

  - Just text? Pass a prompt.
  - Have a file? Pass it as the second argument.
  - Got a YouTube URL? Same thing.

Let’s test it out:

Text generation

The simplest case - just generate some text:

gem("Write a haiku about Python programming")
'Clear and simple lines,\nIndentation guides the way,\nCode starts to run free.'

Video analysis

Perfect for creating YouTube chapters or summaries:

prompt = "5 word summary of this video."
gem(prompt, "https://youtu.be/1x3k0V2IITo")
'The speaker discusses the limitations of single vector search and introduces **late interaction models** (multi-vector models) as a solution. These models are shown to improve **out-of-domain generalization**, **long-context handling**, and **reasoning-intensive retrieval**. He also introduces **PyLate**, a library for training and evaluating late interaction models, which integrates well with the Hugging Face ecosystem.'

Local MP4 Video Analysis

You can also analyze local MP4 video files:

# Example with local MP4 file (if you have one)
gem("Summarize this video in 3 sentences.", "_videos/test_video.mp4")
'The video features a man conducting a brief test recording. During the recording, he recites the numbers "1 2 3, 4 5 6" and introduces himself as Hamil Hussain. He is bald, wears glasses and a dark polka-dotted shirt, and is seated indoors in front of a window.'

File analysis

Great for extracting information from PDFs or images:

gem("3 sentence summary of this presentation.", "NewFrontiersInIR.pdf")
'This presentation introduces "New Frontiers in IR," focusing on enabling Information Retrieval systems to perform instruction following and reasoning, much like Large Language Models (LLMs). It proposes two main systems: "Promptriever," a fast, instruction-trained bi-encoder, and "Rank1," a powerful but slower reasoning-based reranker that leverages LLMs for test-time computation. Through these approaches, the research demonstrates that retrievers can be made promptable and capable of sophisticated reasoning, significantly improving performance on complex queries and unlocking new possibilities for search beyond traditional keyword matching.'
gem("What's in this image?", "anton.png")
'This image is a striking visual, likely a thumbnail for a video or article, set against a dark, deep blue or black background. It combines text, an emoji, a human face, and a technical diagram.\n\nHere\'s a breakdown of its contents:\n\n1.  **Text:**\n    *   In the upper left, large white text reads "Single Vector?".\n    *   Below that, a prominent stack of large, yellow, capitalized text spells out:\n        *   "YOU\'RE"\n        *   "MISSING"\n        *   "OUT"\n    *   On the right side, within a blue rectangular box, white text reads "RAG".\n\n2.  **Emoji:**\n    *   A yellow, sad or worried emoji with downturned eyes and mouth is positioned above and slightly to the left of the "YOU\'RE" text.\n\n3.  **Person:**\n    *   A young man with light skin, short brown hair, and a wide, friendly smile is prominently featured on the bottom left side of the image. He is wearing a white v-neck or t-shirt. He appears to be looking directly at the viewer.\n\n4.  **Diagram/Graphics:**\n    *   On the right side, a stylized diagram is depicted with glowing blue elements. It consists of multiple interconnected blue circles (nodes) joined by blue lines (edges), resembling a neural network or a data flow graph.\n    *   Arrows indicate a flow from a denser cluster of nodes at the top-right towards a more linear path of nodes at the bottom-right.\n    *   This flow ultimately leads into a rectangular blue box with rounded corners and a lighter blue border, which contains the "RAG" text.\n\nThe overall composition suggests a topic related to technology, possibly artificial intelligence or data processing (given "Vector" and "RAG" which stands for Retrieval-Augmented Generation in AI), implying that the viewer might be at a disadvantage if they are only using a "Single Vector" approach.'

Text file analysis

Works with common text formats like .txt, .vtt, .md:

gem("What type of file is this?", "_test_files/sample.txt")
'This is a **plain text file**.\n\nSpecifically, the content itself mentions "Testing text/plain MIME type support," which confirms its type.'
gem("How many subtitle entries are in this file?", "_test_files/sample.vtt")
'There are **2** subtitle entries in this file.'
gem("List the items in this markdown file.", "_test_files/sample.md")
'The list items in this Markdown file are:\n\n*   Item 1\n*   Item 2'

Change Model

You can also pick a different model via the model argument and adjust thinking via the thinking argument (the example below changes only the model):

gem("What is Hamel Husain's current job?", model="gemini-2.5-pro")
"As of my last update, Hamel Husain's current job is **Head of Machine Learning at Outerbounds**.\n\nHe joined Outerbounds in November 2023. Outerbounds is a company focused on building tools for MLOps and is the commercial entity behind the popular open-source project **Metaflow**, which originated at Netflix.\n\nBefore this role, he was a well-known Staff Machine Learning Engineer at **Airtable**. He is also highly regarded in the data science community for his open-source work, including creating projects like `nbdev` and `fastpages`, and for his affiliation with **fast.ai**."

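How thinking and search are wired up internally is not shown on this page. As a sketch, assuming thinking is passed through as a thinking budget (where -1 lets the model decide) and search switches on Google Search grounding, the generation config could be built like this. build_config is a hypothetical helper; the dictionary keys follow the field names of the google-genai GenerateContentConfig:

```python
def build_config(thinking=-1, search=False):
    """Build a generation config dict from gem()-style arguments (assumed mapping)."""
    # Assumption: `thinking` is forwarded as the thinking budget.
    config = {"thinking_config": {"thinking_budget": thinking}}
    if search:
        # Assumption: `search=True` enables the Google Search tool.
        config["tools"] = [{"google_search": {}}]
    return config
```

A dict like this could then be handed to the client's generate_content call through its config argument, alongside the chosen model name.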
Multiple Attachments

You can analyze multiple files/URLs at once by passing a list:

prompt = "Is this PDF and YouTube video related or are they different talks? Answer with very short yes/no answer."
gem(prompt, ["https://youtu.be/Trps2swgeOg?si=yK7CO0Zk4E1rfp6s", "NewFrontiersInIR.pdf"])
'No.'
gem(prompt, ["https://youtu.be/YB3b-wPbSH8?si=WI0LqflY5SYIsRz9", "NewFrontiersInIR.pdf"])
'Yes.'
gem("What do these slides and this video have in common in terms of content/subject matter if at all? Provide a 1 sentence summary of each.", ["NewFrontiersInIR.pdf", "_videos/test_video.mp4"])
'The video and the slides **do not share any common content or subject matter**.\n\n*   **Video Summary:** The video is a brief personal test recording where the speaker introduces himself and counts numbers to check the audio/video quality.\n*   **Slides Summary:** The slides present research on "New Frontiers in IR: Instruction Following and Reasoning," detailing how information retrieval systems can be designed to understand and execute natural language instructions and perform reasoning, similar to large language models, through models like Promptriever and Rank1.'