Writing Utils

Various utilities to help with writing

PDF to Images

Split PDF files into individual slide images. Requires poppler-utils installed (brew install poppler on macOS or apt-get install poppler-utils on Ubuntu).


pdf2imgs


def pdf2imgs(
    pdf_path, output_dir:str='.', prefix:str='slide'
):

Split a PDF file into individual slide images using poppler’s pdftoppm.

For example, you can split the NewFrontiersInIR.pdf file into individual slide images:

# Split NewFrontiersInIR.pdf into individual slides
output_folder = "slides_output"
image_files = pdf2imgs("NewFrontiersInIR.pdf", output_dir=output_folder)

# Show number of slides created
print(f"Created {len(image_files)} slide images in {output_folder}/")
Created 65 slide images in slides_output/
!rm -rf slides_output/

Gather Context From Webpages

I often want to gather context from a set of web pages.


gather_urls


def gather_urls(
    urls, tag:str='example'
):

Gather contents from URLs.


jina_get


def jina_get(
    url
):

Get a website as md with Jina.

For example, these are what I might use as context for annotated posts:

_annotated_post_content = gather_urls(_annotated_post_urls)
print(_annotated_post_content[:500])
<examples>
<example-1>
Title: 

URL Source: https://raw.githubusercontent.com/hamelsmu/hamel-site/refs/heads/master/notes/llm/evals/inspect.qmd

Markdown Content:
---
title: "Inspect AI, An OSS Python Library For LLM Evals"
description: "A look at Inspect AI with its creator, JJ Allaire."
author: ["Hamel Husain"]
date: 2025-06-23
order: 8
image: inspect_images/inspect_cover.png
---

A few weeks ago, I had the pleasure of hosting [JJ Allaire](https://en.wikipedia.org/wiki/Joseph_J._Allaire) for a
def outline_slides(slide_path):
    "Ask Gemini for a numbered, one-sentence-per-slide summary of the deck at `slide_path`."
    # Keep the instruction text in a named local so the call reads cleanly.
    instructions = (
        "Provide a numbered list of each slide with a one sentence summary of each.  "
        "Just a numbered list please, no other asides or meta explanations of the task are required."
    )
    return gem(instructions, slide_path)
110419

outline_slides


def outline_slides(
    slide_path
):
_o = outline_slides('NewFrontiersInIR.pdf')
print(_o[:300])
---------------------------------------------------------------------------
ServerError                               Traceback (most recent call last)
Cell In[9], line 1
----> 1 _o = outline_slides('NewFrontiersInIR.pdf')
      2 print(_o[:300])

Cell In[8], line 3, in outline_slides(slide_path)
      2 def outline_slides(slide_path):
----> 3     return gem("Provide a numbered list of each slide with a one sentence summary of each.  Just a numbered list please, no other asides or meta explanations of the task are required.", slide_path)

File ~/git/prompts/hamel/hamel/gem.py:220, in gem(prompt, o, model, thinking, search)
    218 contents = _content_payload(prompt, attachments, parts)
    219 cfg = _build_config(thinking, search, parts)
--> 220 resp = _generate_content(model, contents, cfg)
    221 return resp.text

File ~/git/prompts/hamel/hamel/gem.py:206, in _generate_content(model, contents, cfg)
    204 def _generate_content(model, contents, cfg):
    205     with _client() as client:
--> 206         return client.models.generate_content(model=model, contents=contents, config=cfg)

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/google/genai/models.py:5230, in Models.generate_content(self, model, contents, config)
   5228 while remaining_remote_calls_afc > 0:
   5229   i += 1
-> 5230   response = self._generate_content(
   5231       model=model, contents=contents, config=parsed_config
   5232   )
   5234   function_map = _extra_utils.get_function_map(parsed_config)
   5235   if not function_map:

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/google/genai/models.py:4012, in Models._generate_content(self, model, contents, config)
   4009 request_dict = _common.convert_to_dict(request_dict)
   4010 request_dict = _common.encode_unserializable_types(request_dict)
-> 4012 response = self._api_client.request(
   4013     'post', path, request_dict, http_options
   4014 )
   4016 if config is not None and getattr(
   4017     config, 'should_return_http_response', None
   4018 ):
   4019   return_value = types.GenerateContentResponse(sdk_http_response=response)

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/google/genai/_api_client.py:1388, in BaseApiClient.request(self, http_method, path, request_dict, http_options)
   1378 def request(
   1379     self,
   1380     http_method: str,
   (...)
   1383     http_options: Optional[HttpOptionsOrDict] = None,
   1384 ) -> SdkHttpResponse:
   1385   http_request = self._build_request(
   1386       http_method, path, request_dict, http_options
   1387   )
-> 1388   response = self._request(http_request, http_options, stream=False)
   1389   response_body = (
   1390       response.response_stream[0] if response.response_stream else ''
   1391   )
   1392   return SdkHttpResponse(headers=response.headers, body=response_body)

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/google/genai/_api_client.py:1224, in BaseApiClient._request(self, http_request, http_options, stream)
   1221     retry = tenacity.Retrying(**retry_kwargs)
   1222     return retry(self._request_once, http_request, stream)  # type: ignore[no-any-return]
-> 1224 return self._retry(self._request_once, http_request, stream)

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/tenacity/__init__.py:477, in Retrying.__call__(self, fn, *args, **kwargs)
    475 retry_state = RetryCallState(retry_object=self, fn=fn, args=args, kwargs=kwargs)
    476 while True:
--> 477     do = self.iter(retry_state=retry_state)
    478     if isinstance(do, DoAttempt):
    479         try:

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/tenacity/__init__.py:378, in BaseRetrying.iter(self, retry_state)
    376 result = None
    377 for action in self.iter_state.actions:
--> 378     result = action(retry_state)
    379 return result

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/tenacity/__init__.py:420, in BaseRetrying._post_stop_check_actions.<locals>.exc_check(rs)
    418 retry_exc = self.retry_error_cls(fut)
    419 if self.reraise:
--> 420     raise retry_exc.reraise()
    421 raise retry_exc from fut.exception()

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/tenacity/__init__.py:187, in RetryError.reraise(self)
    185 def reraise(self) -> t.NoReturn:
    186     if self.last_attempt.failed:
--> 187         raise self.last_attempt.result()
    188     raise self

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py:451, in Future.result(self, timeout)
    449     raise CancelledError()
    450 elif self._state == FINISHED:
--> 451     return self.__get_result()
    453 self._condition.wait(timeout)
    455 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/tenacity/__init__.py:480, in Retrying.__call__(self, fn, *args, **kwargs)
    478 if isinstance(do, DoAttempt):
    479     try:
--> 480         result = fn(*args, **kwargs)
    481     except BaseException:  # noqa: B902
    482         retry_state.set_exception(sys.exc_info())  # type: ignore[arg-type]

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/google/genai/_api_client.py:1201, in BaseApiClient._request_once(self, http_request, stream)
   1193 else:
   1194   response = self._httpx_client.request(
   1195       method=http_request.method,
   1196       url=http_request.url,
   (...)
   1199       timeout=http_request.timeout,
   1200   )
-> 1201   errors.APIError.raise_for_response(response)
   1202   return HttpResponse(
   1203       response.headers, response if stream else [response.text]
   1204   )

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/google/genai/errors.py:121, in APIError.raise_for_response(cls, response)
    118 else:
    119   response_json = response.body_segments[0].get('error', {})
--> 121 cls.raise_error(response.status_code, response_json, response)

File ~/git/prompts/hamel/.venv/lib/python3.10/site-packages/google/genai/errors.py:148, in APIError.raise_error(cls, status_code, response_json, response)
    146   raise ClientError(status_code, response_json, response)
    147 elif 500 <= status_code < 600:
--> 148   raise ServerError(status_code, response_json, response)
    149 else:
    150   raise cls(status_code, response_json, response)

ServerError: 503 UNAVAILABLE. {'error': {'code': 503, 'message': 'The model is overloaded. Please try again later.', 'status': 'UNAVAILABLE'}}

def generate_annotated_talk_post(
    slide_path,                          # path to the talk's PDF slide deck
    video_source,                        # YouTube link or local MP4 path
    image_dir,                           # directory where slide images are written
    transcript_path=None,                # optional transcript file; transcribed from the video when absent
    example_urls=_annotated_post_urls,   # example posts fetched for format reference
    user_prompt=None,                    # optional extra context/instructions appended to the prompt
):
    "Assemble the prompt for the annotated post and draft it with Gemini. Optional user_prompt provides additional context/examples."
    # A "local video" is a path that exists on disk with an .mp4 suffix;
    # anything else is treated as a YouTube URL.
    is_local_video = Path(video_source).exists() and Path(video_source).suffix.lower() == '.mp4'

    # Gather every piece of context: chapters, a slide outline, the transcript,
    # the example posts, and the slide images (written into image_dir as a side effect).
    video_chapters = yt.yt_chapters(video_source)
    slide_outline = outline_slides(slide_path)
    transcript = Path(transcript_path).read_text() if transcript_path else yt.transcribe(video_source)
    examples = gather_urls(example_urls)
    _ = pdf2imgs(slide_path, output_dir=image_dir)

    # Timestamp guidance differs: a local MP4 cannot be deep-linked, YouTube can.
    if is_local_video:
        video_reference = f"the local video file {video_source}"  # NOTE(review): unused in the prompt below — confirm against the original notebook
        timestamp_note = "Note: For local MP4 files, timestamps cannot be linked directly. Just provide the timestamp in [MM:SS] format."
    else:
        video_reference = f"the YouTube video at {video_source}"  # NOTE(review): unused in the prompt below — confirm against the original notebook
        timestamp_note = f"Additionally, reference the correct timestamp in the form of a timestamped linked to the youtube video that corresponds to the start of each slide. The link to this presentation is {video_source} (so use this when adding timestamps please)."

    prompt = f"""Attached is the transcript (in <transcript> tags) of a technical talk for the attached slides. I'd like to make an annotated presentation blog post as illustratd in <example-posts> tags.

For each slide, provide a detailed synopsis of the information to maximize understanding for the reader for the purposes of educating the reader. Each section should provide enough commentary and info to understand the full context of that particular slide. The idea is that the reader will not have to watch the video and can instead read the material so the writing + slide should stand alone. Do not simply repeat the information on each slide, briefly describe what the slide is about, and capture supplementary information that was provided in the talk that is NOT in the slides. Be thoroughly detailed and capture useful asides or commentary as well, such that the notes you generate should be a legitimate value add on top of the slides.

When writing the article, provide markdown placeholders with appropriate captions where the slides will go. For example, you might have placeholder like this.

![Overview of xyz concpet](slide_1.png)

Note that images for this post will be placed in {image_dir}/

Refer to slides with naming convention (slide_1.png, slide_2.png, etc)

{timestamp_note}

I have included other annotated posts as an example for you to understand the format. These examples are in <example-posts> tags.

Finally, there might be Q&A section of the talk that will not correspond to any slides at all. If that exists, list all those questions with answers in a Q&A section. If there is a Q&A section, it should be drafted to maximize learning such that people who have listened to the talk can understand the full context. Add timestamps if possible to each question in the Q&A as well. The post should be written from the perspective of Hamel Husain (me) who hosted the talk as part of a course on LLM Evals (https://bit.ly/evals-ai). Put a CTA at the beginning and end of the post in a tasteful way that is appropriate for a developer blog that looks something like the example posts, particularly following p1-intro.md.

Example CTA: We are teaching our last and final cohort of our AI Evals course next month (we have to get back to building). Here is a 35% discount code for readers.

Here is the transcript {transcript}

Incase it is helpful, here is here is the video description with chapters from the talk. However, please use timestamps from the transcript when possible when constructing timestamped links. {video_chapters}

Below is a brief slide outline (in addition to the attached pdf) {slide_outline}

Here are example posts that I have previously written: {examples}

When writing the introduction, annotation and Q&A keep the following writing guidelines in mind:

  1. Do not add filler words.
  2. Make every sentence information-dense without repetition.
  3. Get to the point while providing necessary context.
  4. Use short words and fewer words.
  5. Avoid multiple examples if one suffices.
  6. Make questions neutral without telegraphing answers.
  7. Remove sentences that restate the premise.
  8. Cut transitional fluff like “This is important because…”
  9. Combine related ideas into single statements.
  10. Avoid overusing bullet points. Prefer flowing prose that combines related concepts. Use lists only for truly distinct items.
  11. Trust the reader’s intelligence.
  12. Start sections with specific advice, not general statements.
  13. Replace em dashes with periods, commas, or colons.
  14. Cut qualifying phrases that add no concrete information.
  15. Use direct statements. Avoid hedge words unless exceptions matter.
  16. Remove setup phrases like “It’s worth noting that” or “The key point is.”
  17. Avoid unnecessarily specific claims when general statements work.
  18. Avoid explanatory asides and redundant clauses.
  19. Each sentence should add new information.
  20. Avoid “Remember… the goal is not X but Y” conclusions.
  21. No emojis in professional writing.
  22. Use simple language. Present information objectively. Avoid exaggeration.
  23. No formulaic conclusions with labels and prescriptive wisdom.

Please go ahead and draft the post. Please also include front matter similar to the front matter in the examples and select the best slide from the talk as the cover image (which is not the title slide, but instead another interesting slide that is punchy). """

    # BUG FIX: the original appended only the literal label and silently dropped
    # the user's text; interpolate user_prompt itself.
    if user_prompt:
        prompt += f"\n\nAdditional context/instructions from the user: {user_prompt}"

    # BUG FIX: the original conditional built the identical list in both branches,
    # so the branch is dropped; gem() receives the slide PDF plus the video source
    # (local path or URL) either way.
    attachment = [slide_path, video_source]
    draft_post = gem(prompt, attachment, model='gemini-3-pro-preview')
    return draft_post


generate_annotated_talk_post


def generate_annotated_talk_post(
    slide_path, video_source, # YouTube link or local MP4 path
    image_dir, transcript_path:NoneType=None,
    example_urls:L=['https://raw.githubusercontent.com/hamelsmu/hamel-site/refs/heads/master/notes/llm/evals/inspect.qmd', 'https://raw.githubusercontent.com/hamelsmu/hamel-site/refs/heads/master/notes/llm/rag/p1-intro.md', 'https://raw.githubusercontent.com/hamelsmu/hamel-site/refs/heads/master/notes/llm/rag/p2-evals.md', 'https://raw.githubusercontent.com/hamelsmu/hamel-site/refs/heads/master/notes/llm/rag/p3_reasoning.qmd', 'https://raw.githubusercontent.com/hamelsmu/hamel-site/refs/heads/master/notes/llm/rag/p4_late_interaction.qmd', 'https://raw.githubusercontent.com/hamelsmu/hamel-site/refs/heads/master/notes/llm/rag/p5_map.qmd']
):

Assemble the prompt for the annotated post.

Example Post

!open context_rot/
# Example with YouTube URL
post = generate_annotated_talk_post(slide_path='context_rot/context_rot.pdf',
                                    video_source='context_rot/context_rot.mp4',  # Can also be 'path/to/video.mp4'
                                    image_dir='context_rot/context_rot_imgs',
                                    transcript_path='context_rot/transcript.txt')
26.67% [8/30 01:21<03:44]
23.33% [7/30 01:11<03:54]
Path('context_rot/context_rot.qmd').write_text(post)
14573

Example with local MP4 video

lesson6 = generate_annotated_talk_post(
    slide_path='eval_course_examples/lesson6.pdf',
    video_source='eval_course_examples/lesson6.mp4',  # Local MP4 file
    image_dir='eval_course_examples/lesson6_images',
    transcript_path='eval_course_examples/lesson6_transcript.txt'  # Optional
)
33.33% [10/30 01:41<03:23]
30.00% [9/30 01:31<03:33]
Path('eval_course_examples/lesson_6.qmd').write_text(lesson6)
20369
!cursor eval_course_examples
!open eval_course_examples/