Discord Channel Fetcher with Thread Support

Imports and Setup

# For running in notebooks (in_notebook: True when an IPython/Jupyter kernel is active)
def in_notebook() -> bool:
    try:
        from IPython import get_ipython
    except ImportError:
        return False
    return get_ipython() is not None

if in_notebook():
    import nest_asyncio
    nest_asyncio.apply()

Core Functions

Fetch Complete Channel History
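
The original cell for this step isn't shown; a minimal sketch of what it likely looks like, assuming discord.py 2.x, a bot token with the message-content intent enabled, and the hypothetical name `fetch_complete_history`:

```python
# Sketch: pull every message in a channel, including messages inside the
# channel's active and archived threads (discord.py 2.x assumed).
async def fetch_complete_history(channel_id: int, token: str) -> list:
    import discord  # deferred so this cell parses even without discord.py installed

    intents = discord.Intents.default()
    intents.message_content = True  # privileged intent; enable it in the developer portal
    client = discord.Client(intents=intents)
    messages: list = []

    @client.event
    async def on_ready():
        print(f"Connected as {client.user}")
        channel = client.get_channel(channel_id)
        # Top-level channel messages, oldest first
        async for msg in channel.history(limit=None, oldest_first=True):
            messages.append(msg)
        # Also walk active and archived threads attached to the channel
        threads = list(channel.threads)
        async for t in channel.archived_threads(limit=None):
            threads.append(t)
        for t in threads:
            async for msg in t.history(limit=None, oldest_first=True):
                messages.append(msg)
        await client.close()  # returns control to client.start() below

    await client.start(token)
    return messages
```

`client.start` is awaited directly (rather than `client.run`) so the coroutine plays nicely with the nest_asyncio setup above when run inside a notebook.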

Organize Messages with Thread Structure
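
A sketch of the grouping step, assuming each fetched message has already been flattened into a plain dict; `thread_parent_id` (the id of the message that started the thread, `None` for ordinary channel messages) is an assumed field name:

```python
from collections import defaultdict

def organize_messages(messages: list) -> dict:
    """Split a flat message list into top-level messages and per-thread replies."""
    top_level: list = []
    threads: dict = defaultdict(list)
    for msg in messages:
        parent = msg.get("thread_parent_id")  # assumed field: id of the thread-starter message
        if parent is None:
            top_level.append(msg)
        else:
            threads[parent].append(msg)
    # Keep chronological order inside each thread
    for replies in threads.values():
        replies.sort(key=lambda m: m["timestamp"])
    return {"top_level": top_level, "threads": dict(threads)}
```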

Create Thread-Aware Conversations
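
This step produces the shape visible in the `simplified` output further down: a list of conversations, each with a `question` plus optional `thread` (messages in an attached thread) and `replies` (in-channel replies) keys. A sketch, assuming the grouped structure from the previous step and an assumed `reply_to_id` field for Discord's reply feature:

```python
_FIELDS = ("author", "content", "id", "timestamp")

def build_conversations(organized: dict) -> list:
    """Turn grouped messages into {'question': ..., 'thread': [...], 'replies': [...]}
    dicts; the optional keys appear only when non-empty."""
    reply_targets: dict = {}
    questions: list = []
    for msg in organized["top_level"]:
        target = msg.get("reply_to_id")  # assumed field: id of the message replied to
        if target is not None:
            reply_targets.setdefault(target, []).append(msg)
        else:
            questions.append(msg)

    conversations = []
    for msg in questions:
        convo = {"question": {k: msg[k] for k in _FIELDS}}
        thread = organized["threads"].get(msg["id"], [])
        if thread:
            convo["thread"] = [{k: m[k] for k in _FIELDS} for m in thread]
        replies = reply_targets.get(msg["id"], [])
        if replies:
            convo["replies"] = [{k: m[k] for k in _FIELDS} for m in replies]
        conversations.append(convo)
    return conversations
```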

Main Fetch Function
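
The entry point used in the next cell returns the raw messages and the simplified structure. A sketch of the glue: the helper names below (one per section above, plus a `message_to_dict` flattener), the `DISCORD_TOKEN` environment variable, and the `print_summary` default are all assumptions:

```python
import os

async def fetch_discord_msgs(channel_id: int, print_summary: bool = True):
    # Hypothetical glue over the steps sketched in the sections above;
    # helper names and the DISCORD_TOKEN env var are assumptions.
    token = os.environ["DISCORD_TOKEN"]
    original = await fetch_complete_history(channel_id, token)  # raw Message objects
    flat = [message_to_dict(m) for m in original]               # assumed flattening helper
    conversations = build_conversations(organize_messages(flat))
    simplified = {
        "channel": flat[0].get("channel") if flat else None,  # assumed fields set
        "guild": flat[0].get("guild") if flat else None,      # by the flattener
        "conversations": conversations,
    }
    if print_summary:
        print(f"{len(original)} messages -> {len(conversations)} conversations")
    return original, simplified
```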

CHANNEL_ID = 1369370266899185746  # Replace with your channel ID
original, simplified = await fetch_discord_msgs(CHANNEL_ID, print_summary=False)
Connected as hamml#3190
simplified
{'channel': 'lesson-3-implementing-effective-evaluations',
 'guild': 'AI Evals For Engineers & Technical PMs',
 'conversations': [{'question': {'author': 'davidh5633',
    'content': 'Seems relevant to this course topic:\n\nEvaluation Driven Development for Agentic Systems\nhttps://www.newsletter.swirlai.com/p/evaluation-driven-development-for?utm_source=tldrai',
    'id': '1376753833543204936',
    'timestamp': '2025-05-27T02:48:12.766000+00:00'}},
  {'question': {'author': 'davidh5633',
    'content': 'I gather that alignment for the LLM-Judge is a larger up-front process as discussed through chapter 4, and then we need to avoid the pitfall of not revisiting it weekly.  \n> Moreover, even after alignment, many teams fail to revisit the process. Production data can drift. New failure modes may emerge, LLM updates may shift behavior (Chen et al. 2024), and evaluation metrics may evolve (Shankar et al. 2024c). We recommend re-running the alignment process regularly (e.g., weekly): continue labeling a handful of\n> traces, recomputing FPR and FNR, and checking whether confidence intervals remain acceptably tight.\n\nQuestions:\n* (was that a typo, and it should be Re-compute TPR and TNR?)\n* The continued labeling a handful of traces is to go into test set (or could be for the test or train set too)?',
    'id': '1376941305556631573',
    'timestamp': '2025-05-27T15:13:09.577000+00:00'}},
  {'question': {'author': 'davidh5633',
    'content': "Overall, this chapter definitely pointed out a 'hidden cost' of adding features to an LLM application, especially features that need to be evaluated by an LLM-as-Judge (when code-based checks are not feasible/practical)",
    'id': '1376942006063992842',
    'timestamp': '2025-05-27T15:15:56.591000+00:00'}},
  {'question': {'author': 'sh_reyashankar',
    'content': 'Aha yes it should be TPR and TNR, not FPR and FNR. Thank you!',
    'id': '1376942993839357992',
    'timestamp': '2025-05-27T15:19:52.095000+00:00'}},
  {'question': {'author': 'sh_reyashankar',
    'content': 'Fixed the typo about adding traces — see https://discord.com/channels/1368666390185246770/1374477267992051732/1376690011772162099',
    'id': '1376943220042371204',
    'timestamp': '2025-05-27T15:20:46.026000+00:00'}},
  {'question': {'author': 'afogel',
    'content': 'question -- when we fix the prompt, and it risks introducing new failure modes, right? How often does that impact the convergence of failure modes/reaching theoretical saturation?',
    'id': '1376970351526940846',
    'timestamp': '2025-05-27T17:08:34.676000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'love this question. as you iterate on the prompts you will see fewer and fewer _new_ failure modes emerge. theoretical saturation is when you have no _new_ failure modes. the iterative process will eventually converge, always...for simpler domains it might take only a couple of rounds; for complex domains it might take even up to 10',
     'id': '1376990673005445252',
     'timestamp': '2025-05-27T18:29:19.694000+00:00'}]},
  {'question': {'author': 'solomonogram',
    'content': "For the second sql constraint error, why wouldn't you try to fix it by changing the prompt and adding separate instructions?",
    'id': '1376970516883177543',
    'timestamp': '2025-05-27T17:09:14.100000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "i cannot remember which one this is 😂 \n\nbut text to SQL AI evals are consistently far from perfect, < 70% on BIRD and Spider benchmarks, so it's a known failure mode that LLMs cannot generate perfect SQL . when something is a known failure mode, it's useful to design an application-specific eval around 🙂",
     'id': '1376991172085420163',
     'timestamp': '2025-05-27T18:31:18.684000+00:00'}]},
  {'question': {'author': 'dantreasure_42366',
    'content': 'I think they\'re saying in this case you "clearly stated the SQL constraint"',
    'id': '1376970599812829255',
    'timestamp': '2025-05-27T17:09:33.872000+00:00'}},
  {'question': {'author': 'forceten2112',
    'content': 'question - how do we assess that we fixed a specification failure when changing the prompt? do we need to look through another ~100 traces?',
    'id': '1376970736417116341',
    'timestamp': '2025-05-27T17:10:06.441000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': "We are talking about really obvious things.  For example, the user asks what hours your store is open but that clearly isn't in the prompt or available to your LLM because you forgot to give it that info",
     'id': '1376972054347386900',
     'timestamp': '2025-05-27T17:15:20.660000+00:00'},
    {'author': 'hamelh',
     'content': 'These are the "specification" errors',
     'id': '1376972359411433635',
     'timestamp': '2025-05-27T17:16:33.393000+00:00'},
    {'author': 'hamelh',
     'content': 'If you give it the info or if it is available to your model either through retirieval or otherwhise, then this is more of a generalization-ish issue',
     'id': '1376972491817222224',
     'timestamp': '2025-05-27T17:17:04.961000+00:00'},
    {'author': 'forceten2112',
     'content': 'once you provide the info you original missed, should you then assess if there is a generalization issue with that specification?',
     'id': '1376974391815966911',
     'timestamp': '2025-05-27T17:24:37.956000+00:00'},
    {'author': 'forceten2112',
     'content': 'like rerun the agent and go thru traces again',
     'id': '1376974441959129208',
     'timestamp': '2025-05-27T17:24:49.911000+00:00'},
    {'author': 'sh_reyashankar',
     'content': "i like to at least quickly go through 30 traces after rerun to see if the failure mode occurs again. if it does not occur again, i don't prioritize this as an eval",
     'id': '1376991442911887432',
     'timestamp': '2025-05-27T18:32:23.254000+00:00'}]},
  {'question': {'author': 'muraleedaaran',
    'content': "<@525830737627185170> <@893327214685343804> But, how do we even say our prompts are clear in case of Generalization failure? Because with prompts I could see that there is always a room for improvement and I don't see there is any limit to fine tuning the prompts",
    'id': '1376971046699274261',
    'timestamp': '2025-05-27T17:11:20.418000+00:00'},
   'thread': [{'author': 'davidh5633',
     'content': "As we'll see with the real cost of setting up an LLM judge, if we can fix the issue with an improved prompt, then do that first, by all means.  Otherwise, as you rely on the LLM to generalize an answer based on the user input (and input from tools, rag, etc.), you'll need to focus on these Generalization failures.",
     'id': '1376972531998785728',
     'timestamp': '2025-05-27T17:17:14.541000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'I would say make an effort to specify (enough for a human to understand that this is a preference to care about), and then if the LLM doesn\'t "listen", start quantifying as a generalization failure mode',
     'id': '1376987560148205629',
     'timestamp': '2025-05-27T18:16:57.531000+00:00'}]},
  {'question': {'author': 'taaniya_31433',
    'content': '<@525830737627185170> <@893327214685343804> \nQuestion -\nWhen an LLM performs as expected but inconsistently, is it a specification or a generalization failure?\nWhat should we fix in such case - prompt language or add more clear instructions?\nand how do we evaluate LLM outputs when they perform inconsistently?',
    'id': '1376971287070638253',
    'timestamp': '2025-05-27T17:12:17.727000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'LLM inconsistency is a generalization failure in my view. if you have clearly specified the instruction (like a human could understand), then the failure is in the LLM to generalize applying it to infinite data\n\nwe have a section on "group-wise metrics" in chapter 4 in the course reader. use these for evaluating inconsistency 🙂 \n\nas for fixing: you can use a more powerful model, sample many answers and have another LLM pick the best out of the sampled answers, decompose the task into mulitple steps...lots of options and unfortunately no clear answer (need to experiment); it\'s specific to the application',
     'id': '1376992743493996665',
     'timestamp': '2025-05-27T18:37:33.337000+00:00'},
    {'author': 'taaniya_31433',
     'content': 'got it, thanks!',
     'id': '1376996219028770878',
     'timestamp': '2025-05-27T18:51:21.969000+00:00'}]},
  {'question': {'author': 'nate1363',
    'content': 'where do ML models like NLI, etc fall into the mix for evals and guardrails?',
    'id': '1376971673785471016',
    'timestamp': '2025-05-27T17:13:49.927000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'great question -- if they are imperfect i treat them just like LLM as Judge (compute TPR and TNR on a held out test set and correct bias)\n\nif they are perfect or 99%+ (or 99.5%+), then they can be treated like a code based evaluator and run without corrections',
     'id': '1376992057951780904',
     'timestamp': '2025-05-27T18:34:49.891000+00:00'}]},
  {'question': {'author': 'zachdev',
    'content': "Sometimes the code-based evaluators can also be put right in your pipeline to improve performance.\n\nFor example an Agent that mentions file paths sometimes hallucinates paths.\n\nYou can eval against the existence of file paths using file system tools.\n\nAnd then take the code that validates the existence of the file path and strip bad paths right out of the agent's structured output.",
    'id': '1376972137298001994',
    'timestamp': '2025-05-27T17:15:40.437000+00:00'},
   'replies': [{'author': 'drz1535',
     'content': 'Make your system more robust with these code-based evaluators for error handling',
     'id': '1376972433134977165',
     'timestamp': '2025-05-27T17:16:50.970000+00:00'}]},
  {'question': {'author': 'jordanmeyer_93947',
    'content': 'What are your thoughts on measuring response consistency using an embedding model and cosine similiarity across repeated questions?',
    'id': '1376972329199866057',
    'timestamp': '2025-05-27T17:16:26.190000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'great question -- check out section 4.8 in the course reader 🙂',
     'id': '1376992935781990472',
     'timestamp': '2025-05-27T18:38:19.182000+00:00'}]},
  {'question': {'author': 'wgpubs',
    'content': 'Curious to hear from folks what models have worked really well for LLM-as-Judge (or as jury) who have done the alignment in practice?',
    'id': '1376972469088288891',
    'timestamp': '2025-05-27T17:16:59.542000+00:00'},
   'replies': [{'author': 'zeewaheed_24711',
     'content': 'My rule of thumb: start with the  most expensive and sophisticated LLMs you can for your judge models and work on aligning with human annotators',
     'id': '1376972660226920559',
     'timestamp': '2025-05-27T17:17:45.113000+00:00'}]},
  {'question': {'author': 'old_habits_die_screaming',
    'content': 'Can you explain what you meant by using the code-based evals on the "critical path", please?',
    'id': '1376972472460509285',
    'timestamp': '2025-05-27T17:17:00.346000+00:00'},
   'thread': [{'author': 'zachdev',
     'content': '<@1374307606734180446> This usually refers to the "core job" that the software is using. For example a tool that *must* read some file as part of its work would have this "file read" tool capability and use as part of the "critical path"',
     'id': '1376972698017599529',
     'timestamp': '2025-05-27T17:17:54.123000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'yes and i also meant in real time. as soon as the LLM generates an answer, maybe before serving it to the end user, run the code based eval and if it fails, regenerate the LLM answer',
     'id': '1376993281249902825',
     'timestamp': '2025-05-27T18:39:41.548000+00:00'},
    {'author': 'hamelh',
     'content': 'These two might be interesting\n\nhttps://hamel.dev/blog/posts/evals-faq/can-my-evaluators-also-be-used-to-automatically-fix-or-correct-outputs-in-production.html\n\nand\n\nhttps://hamel.dev/blog/posts/evals-faq/whats-the-difference-between-guardrails-evaluators.html',
     'id': '1399848199954501714',
     'timestamp': '2025-07-29T20:16:58.791000+00:00'}]},
  {'question': {'author': 'deweypotts',
    'content': 'Would you recommend using LLM-as-judge for VLMs? For example, I am using a VLM to check if restaurant tables are clean.',
    'id': '1376972473899290757',
    'timestamp': '2025-05-27T17:17:00.689000+00:00'},
   'thread': [{'author': 'afogel',
     'content': 'wondering whether that might be also accomplished more deterministically using a more traditional classifier, since it seems like a binary classifier?',
     'id': '1376972708830642187',
     'timestamp': '2025-05-27T17:17:56.701000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'yes you absolutely can! though if you know you want to run this classifier at scale, it might be cheaper to fine tune an image classifier model to serve as the judge',
     'id': '1376993443431059547',
     'timestamp': '2025-05-27T18:40:20.215000+00:00'}]},
  {'question': {'author': 'jodicasa',
    'content': 'Is it okay to have a related LLM (different version or different model, same developer) serve as the judge? GPT4 vs GPT4o?',
    'id': '1376972654510211092',
    'timestamp': '2025-05-27T17:17:43.750000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "absolutely. the judge is doing a different task than your main LLM pipeline. the judge is doing a narrowly scoped binary classification task. so actually you don't have to worry about the judge not being able to do the main task...since it doesn't have to 🙂",
     'id': '1376993656589910026',
     'timestamp': '2025-05-27T18:41:11.036000+00:00'},
    {'author': 'hamelh',
     'content': 'You might like this note as well\n\nhttps://hamel.dev/blog/posts/evals-faq/can-i-use-the-same-model-for-both-the-main-task-and-evaluation.html',
     'id': '1399848382754848891',
     'timestamp': '2025-07-29T20:17:42.374000+00:00'}]},
  {'question': {'author': 'vbaptista',
    'content': 'The LLM-as-a-judge evaluation/optimization seems perfect for a tool like DSPy',
    'id': '1376972955682345076',
    'timestamp': '2025-05-27T17:18:55.555000+00:00'}},
  {'question': {'author': 'wgpubs',
    'content': 'also sharing a good post on building LLM as judge (at least imo) ... https://www.philschmid.de/llm-evaluation',
    'id': '1376973542431920208',
    'timestamp': '2025-05-27T17:21:15.447000+00:00'}},
  {'question': {'author': 'zachdev',
    'content': 'Has anyone had a good experience using code-based tools like, for example, levenshtein distance to help measure these LLM-as-Judge subjective stuff?',
    'id': '1376973589118455839',
    'timestamp': '2025-05-27T17:21:26.578000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "personally no -- it's hard to make these fuzzy code-based metrics interpretable. at the end of the day i need a binary decision (good or bad). but perhaps others have!!",
     'id': '1376994712564666479',
     'timestamp': '2025-05-27T18:45:22.800000+00:00'}],
   'replies': [{'author': 'jonathanmccoy',
     'content': 'FYI, I found this interesting for objective vs subjective eval strategies. https://docs.google.com/presentation/d/1o5-nyh_alSttFgyiwsSHlIyvMCYkO1r8CCUJHtxAXNI/edit?slide=id.g2f84f77c908_0_457#slide=id.g2f84f77c908_0_457',
     'id': '1376979962485149820',
     'timestamp': '2025-05-27T17:46:46.107000+00:00'}]},
  {'question': {'author': 'darkwrath.',
    'content': 'so you have an LLM prompt for each failure case you want to catch (at scale)?',
    'id': '1376973664830095370',
    'timestamp': '2025-05-27T17:21:44.629000+00:00'},
   'thread': [{'author': 'davidh5633',
     'content': "For each failure case (from the axial coding that we did) where you can't handle with fixing the prompt or a code-based evaluator.",
     'id': '1376974444039245836',
     'timestamp': '2025-05-27T17:24:50.407000+00:00'},
    {'author': 'darkwrath.',
     'content': 'Right - just double checking my undertanding. Seems very expensive',
     'id': '1376974736055337001',
     'timestamp': '2025-05-27T17:26:00.029000+00:00'},
    {'author': 'davidh5633',
     'content': 'Yeah, this is a clear cost to these types of evals...need to push the product manager to see if such features are worth it.',
     'id': '1376977628527067206',
     'timestamp': '2025-05-27T17:37:29.648000+00:00'},
    {'author': 'darkwrath.',
     'content': 'I am the product manager 😉',
     'id': '1376978205646520483',
     'timestamp': '2025-05-27T17:39:47.244000+00:00'},
    {'author': 'davidh5633',
     'content': 'Well then, you need to make sure the business is good with these cost, eh?',
     'id': '1376978444277256193',
     'timestamp': '2025-05-27T17:40:44.138000+00:00'},
    {'author': 'sh_reyashankar',
     'content': "yes correct -- different judge per failure mode. having one judge do more than one failure mode limits the judge's accuracy and is less interpretable. however it may not be that expensive (use cheap judge models), and maybe there are ~10 failure modes. the process is table stakes for shipping products that work!!",
     'id': '1376995088801988659',
     'timestamp': '2025-05-27T18:46:52.502000+00:00'}]},
  {'question': {'author': 'adityakabra9521',
    'content': 'Which LLM do we use as the judge? the mini or the thinking models?',
    'id': '1376973707397828750',
    'timestamp': '2025-05-27T17:21:54.778000+00:00'},
   'thread': [{'author': 'davidh5633',
     'content': 'See discussion above about the recommendation to start with the best models first, then work your way down.',
     'id': '1376974717390684242',
     'timestamp': '2025-05-27T17:25:55.579000+00:00'}]},
  {'question': {'author': 'afogel',
    'content': "It's possible that you'll address this directly later on in the lecture, but in my experience doing qual research, i would perform IRR (and robustness checks of inter-rater agreement [e.g. Shaffer's Rho https://par.nsf.gov/servlets/purl/10162355 ])  for each code in my codebook. Do you recommend these checks for each identified failure mode for the judges?",
    'id': '1376973712674394314',
    'timestamp': '2025-05-27T17:21:56.036000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "good q. we talk about collaborative evaluation on thursday. we will cover computing IRR to make sure humans are aligned on the labels for the test set.\n\nbut in my opinion computing IRR between judge and human doesn't make too much sense, since the human definitions are the ground truth and the judge serves as a classifier...and TPR and TNR tell us how aligned the classifier is with the ground truth",
     'id': '1376995640206430229',
     'timestamp': '2025-05-27T18:49:03.967000+00:00'}]},
  {'question': {'author': 'dantreasure_42366',
    'content': "for what it's worth I think gemini is the best llm provider for structured outputs",
    'id': '1376973860569743571',
    'timestamp': '2025-05-27T17:22:31.297000+00:00'},
   'replies': [{'author': 'wgpubs',
     'content': 'gemini is my goto for everything ... unbeatable from a cost/performance perspective (prove me wrong, haha)',
     'id': '1376974018757918831',
     'timestamp': '2025-05-27T17:23:09.012000+00:00'}]},
  {'question': {'author': 'erle_59641',
    'content': 'oh no... 😞  my rule of thumb was to us the least sophisticated, fastest LLM such that If it worked I could add it to the chain/critical path.',
    'id': '1376973930065301544',
    'timestamp': '2025-05-27T17:22:47.866000+00:00'},
   'thread': [{'author': 'zeewaheed_24711',
     'content': "That's not necessarily bad if you wanna incorporate them into the critical path, but in my experience the biggest value of the judges is when you're using them as part of the iteration or error analysis process in which case you totally want better fidelity vs. latency. For live judging as you said, smaller LLMs or fine-tuned approaches (Eugene Yan's post here is a really cool exploration of finetuned BERT that is more focused and very low latency for a specific task: https://eugeneyan.com/writing/evals/) and other techniques that might encode your intutitions/failure cases derived from bigger judges and then make them specialized for that.",
     'id': '1376975693342310522',
     'timestamp': '2025-05-27T17:29:48.264000+00:00'}]},
  {'question': {'author': 'artste',
    'content': 'Is it better to run one “judgement pass” for each one of the potential issues or make a big prompt with all the judgements together?',
    'id': '1376973941918273589',
    'timestamp': '2025-05-27T17:22:50.692000+00:00'},
   'thread': [{'author': 'davidh5633',
     'content': "One per so that you can get the metrics, which they'll cover. it's in chapter 4 on how to measure the effectiveness of your judge.",
     'id': '1376975202583445735',
     'timestamp': '2025-05-27T17:27:51.258000+00:00'}]},
  {'question': {'author': '.firepaul',
    'content': 'any reason why you would "ask the LLM for json" in the prompt as opposed to using something like `instructor` and the structured outputs mode to "force" conformance to a given schema?',
    'id': '1376974097401253938',
    'timestamp': '2025-05-27T17:23:27.762000+00:00'},
   'thread': [{'author': 'zachdev',
     'content': 'performance on that seems to depend on model/vendor. The formal structured output stuff has become better and better over time though',
     'id': '1376974255371063358',
     'timestamp': '2025-05-27T17:24:05.425000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'you can totally use instructor! i have been trying to make the content tool-agnostic 🙂',
     'id': '1376995852840599707',
     'timestamp': '2025-05-27T18:49:54.663000+00:00'},
    {'author': '.firepaul',
     'content': 'Sounds good makes sense ty',
     'id': '1377043869635706910',
     'timestamp': '2025-05-27T22:00:42.759000+00:00'}]},
  {'question': {'author': '__ii__i',
    'content': 'Silly question, how do folks structure their projects to manage multiple system prompts, multiple tests and judges per system prompt, etc? Mine tend to become a mess.',
    'id': '1376974179605282936',
    'timestamp': '2025-05-27T17:23:47.361000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "not a silly question at all. i have no good answer for this. i try to have a prompts directory so i don't have prompts scattered in my code, but that's about it",
     'id': '1376996175986823269',
     'timestamp': '2025-05-27T18:51:11.707000+00:00'}]},
  {'question': {'author': 'ashikshaffi08',
    'content': 'How about assigning scores like correctness, similarity etc and give it a range (0-1) for the LLM as Judge?',
    'id': '1376974219757486080',
    'timestamp': '2025-05-27T17:23:56.934000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'with an LLM as Judge, you have to align _any_ output it predicts. binary (0 or 1) is hard enough, since you have to align both pass and fail definitions. range scores are much much harder to align and we thus caution against it, unless you have a lot of resources to come up with unambiguous definitions for each score, collect enough data for each score, and maybe even fine tune the LLM Judge',
     'id': '1376996628421935155',
     'timestamp': '2025-05-27T18:52:59.576000+00:00'}],
   'replies': [{'author': 'dantreasure_42366',
     'content': 'I think they recommend simple pass/ fail as you will run into issues the more ambiguous you get e.g. a range score',
     'id': '1376974642920554567',
     'timestamp': '2025-05-27T17:25:37.824000+00:00'}]},
  {'question': {'author': 'britter3116_22491',
    'content': "When writing these llm as a judge criteria, it seems you'll end up with a number of these small evals. \n\nDo you necessarily need a single LLM call for each? Or can you use structured output to guide something like (criteria 1 pass/fail, criteria 2 pass/fail ... etc)",
    'id': '1376974582581559487',
    'timestamp': '2025-05-27T17:25:23.438000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'yeah, folks can end up with ~10 LLM as Judges. you can try to merge criteria together but the alignment process is  tricker',
     'id': '1376996849193324634',
     'timestamp': '2025-05-27T18:53:52.212000+00:00'}],
   'replies': [{'author': 'vbaptista',
     'content': 'It would be great to have a single LLM call to run multiple "judges", but my feeling is that this will increase the errors of the judge, so probably multiple LLM calls would be better. But again, we\'ll only know for sure if we try.',
     'id': '1376974939390738512',
     'timestamp': '2025-05-27T17:26:48.508000+00:00'}]},
  {'question': {'author': 'kobayashi_hasimoto',
    'content': 'How to approach time sensitive jobs? Generating a response takes usually a significant amount of time. Running a LLM judge is a fully sequential work - we need to have the whole reponse stream fetched to even start evaluating. Am I correct?',
    'id': '1376974724252434573',
    'timestamp': '2025-05-27T17:25:57.215000+00:00'},
   'thread': [{'author': 'davidh5633',
     'content': 'My understanding is that you can do this judgement async on all or a subset of traces, instead of on every request in real-time. But for some applications that need really high accuracy even at the expense of the UX to wait for these evals to finish, then some folks do this in real-time on every request.',
     'id': '1376976144036925440',
     'timestamp': '2025-05-27T17:31:35.718000+00:00'}]},
  {'question': {'author': 'pawel.huryn',
    'content': "Once we have enough examples that fail/pass, evaluated also by humans, wouldn't it be more cost efficient to use them to fine-tune LLM judges? Do we fine-tune judges at all? 🙂 Does it, in practice, increase the complexity? (I understood it does). Is the iterative nature of evals (never stop looking at data) a problem here?",
    'id': '1376975045515280498',
    'timestamp': '2025-05-27T17:27:13.810000+00:00'},
   'thread': [{'author': '.firepaul',
     'content': "my mental model is that fine tuning doesn't necessarily grant the model great generalization abilities e.g. https://x.com/AndrewLampinen/status/1918347839232721123",
     'id': '1376975836481192087',
     'timestamp': '2025-05-27T17:30:22.391000+00:00'},
    {'author': '.firepaul',
     'content': 'which is part of why you want to iterate on the prompt or the few shot examples rather than fine tune, in addition to the extra complexity that FT adds',
     'id': '1376975936355827773',
     'timestamp': '2025-05-27T17:30:46.203000+00:00'},
    {'author': '.firepaul',
     'content': "john schulman's talk on RLHF from 2023 also has some interesting stuff on this https://www.youtube.com/watch?v=hhiLw5Q_UFg",
     'id': '1376976154505908327',
     'timestamp': '2025-05-27T17:31:38.214000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'yeah you can absolutely fine tune LLM judges! and if you are running at enough scale it definitely makes sense to 🙂 \n\nit increases complexity in that you now have to fine tune the model, constrantly retrain it when a new base version comes out, and figure out how to serve the model.\n\nregardless of the type of evaluator (fine tune or off the shelf LLM judge), you should never stop looking at data 🙂',
     'id': '1376997283618619473',
     'timestamp': '2025-05-27T18:55:35.787000+00:00'}],
   'replies': [{'author': 'zanetworker',
     'content': 'Fine-tuning can be expensive as well.',
     'id': '1376975478363000913',
     'timestamp': '2025-05-27T17:28:57.009000+00:00'}]},
  {'question': {'author': 'zanetworker',
    'content': 'Do we go through the whole Analyze process again for the Evalulator? its a new use-case or?',
    'id': '1376975548454010900',
    'timestamp': '2025-05-27T17:29:13.720000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "you typically don't have to go through the entire Analyze process again because the LLM Judge is much tighter scoped (just evaluating pass or fail for a specific failure mode). however if you find that you are unable to get good TPR and TNR on the LLM judge on your dev set, you can go through a short error analysis to understand why / improve your LLM judge prompt",
     'id': '1376997763447001088',
     'timestamp': '2025-05-27T18:57:30.187000+00:00'}]},
  {'question': {'author': 'wgpubs',
    'content': "Chapter 1 of the fast.ai fastbook has some good discussion on building train/eval/test datasets ... I've summarized bits here but I'd recommend reading thru the book!\n\nhttps://ohmeow.com/posts/2020-11-06-ajtfb-chapter-1.html#training-validation-and-test-datasets",
    'id': '1376975630868025376',
    'timestamp': '2025-05-27T17:29:33.369000+00:00'}},
  {'question': {'author': 'wgpubs',
    'content': 'I want a <@525830737627185170> short video on pros/cons of DSPy 🙂',
    'id': '1376975844077207592',
    'timestamp': '2025-05-27T17:30:24.202000+00:00'}},
  {'question': {'author': 'dantreasure_42366',
    'content': 'in case anyone missed the definition of TPR/TNR (page 45 of the reader)',
    'id': '1376975951384154193',
    'timestamp': '2025-05-27T17:30:49.786000+00:00'}},
  {'question': {'author': 'afogel',
    'content': 'what are reasonable accuracy thresholds to shoot for? What are the implications for robustness in production?',
    'id': '1376976473952489582',
    'timestamp': '2025-05-27T17:32:54.376000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'i try to shoot for at least 80%+ TPR and TNR. the implications are that you need to correct the observed success rate in production. if you have a low TNR, you overestimate the success rate. if you have a low TPR (which happens more rarely with LLM judges, since LLMs seem to be trained to say yes a lot), you drastically underestimate the success rate',
     'id': '1376998300401537114',
     'timestamp': '2025-05-27T18:59:38.207000+00:00'},
    {'author': 'afogel',
     'content': 'a quick followup -- wrt to implications for robustness in production, I supposed 80% accuracy yields a defensible f1 score, but...in a critical system, this suggests that your error margins are still pretty high, no? I suppose this also has implications wrt what are reasonable usecases for LLMs 🙃',
     'id': '1376999085051084830',
     'timestamp': '2025-05-27T19:02:45.282000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'ah, i meant 80% for the LLM as Judge. not the LLM pipleine we are evaluating with the judge. for accuracy for the main LLM pipeline, every use case has a different acceptance criteria 🙂',
     'id': '1377008799902208100',
     'timestamp': '2025-05-27T19:41:21.483000+00:00'}]},
  {'question': {'author': '__ii__i',
    'content': "Ok, you've trained a judge, you've used the test set, and it's awful. Your test data is spoiled, you've seen it. Does that mean you need to go back to hand labelling and constructing a new test set?",
    'id': '1376976608883114045',
    'timestamp': '2025-05-27T17:33:26.546000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "you can make a judgement call here...if you've seen too much of the test set i recommend making a new one, or at least creating a different split of the labeled data you already have",
     'id': '1376998456081514517',
     'timestamp': '2025-05-27T19:00:15.324000+00:00'}]},
  {'question': {'author': 'artste',
    'content': 'It’s fun that we’re OVERFITTING even if no real training is happening… we’re overfitting the prompt to the dev set 😂',
    'id': '1376976616856490025',
    'timestamp': '2025-05-27T17:33:28.447000+00:00'},
   'thread': [{'author': 'sundeep_00965',
     'content': '> trained a judge \nas a follow up, in this context  a "judge" is the combination of the (prompt, the LLM model), correct?',
     'id': '1376977141195079742',
     'timestamp': '2025-05-27T17:35:33.459000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'correct',
     'id': '1376998566463144067',
     'timestamp': '2025-05-27T19:00:41.641000+00:00'}]},
  {'question': {'author': 'wgpubs',
     'content': 'so basically we are optimizing for recall ... "the LLM judge came to the same conclusion as the human(s)"',
    'id': '1376976867814412358',
    'timestamp': '2025-05-27T17:34:28.280000+00:00'},
   'thread': [{'author': '__ii__i',
     'content': 'Definition of recall and relationship to TPR/TNR',
     'id': '1376977458196250624',
     'timestamp': '2025-05-27T17:36:49.038000+00:00'}]},
  {'question': {'author': 'muraleedaaran',
    'content': '<@525830737627185170> <@893327214685343804> I have one more question. When using LLM as a judge, should we use the same model that we use for dev or should we use someother model?',
    'id': '1376976955945254963',
    'timestamp': '2025-05-27T17:34:49.292000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'you can totally use the same model--the LLM Judge is doing a different task than your main LLM in your LLM pipeline',
     'id': '1376999013013786774',
     'timestamp': '2025-05-27T19:02:28.107000+00:00'},
    {'author': 'hamelh',
      'content': 'Shreya is correct. Here are related thoughts in case it helps, from [this section of my blog post on llm judges](https://hamel.dev/blog/posts/llm-judge/#faq)',
     'id': '1377010748315668584',
     'timestamp': '2025-05-27T19:49:06.021000+00:00'},
    {'author': 'muraleedaaran',
     'content': 'thanks <@525830737627185170> and <@893327214685343804>',
     'id': '1377149321425977395',
     'timestamp': '2025-05-28T04:59:44.425000+00:00'}],
   'replies': [{'author': 'zanetworker',
     'content': 'One thing to add here is that not all models are good evaluators, and are typically tuned for evals.',
     'id': '1376977632012271668',
     'timestamp': '2025-05-27T17:37:30.479000+00:00'}]},
  {'question': {'author': 'artste',
     'content': 'Given that results are often non-deterministic, even though temperature is set to zero - How many times should we run the model and the judges? Once, or say 3/5 times and average?',
    'id': '1376977555088998513',
    'timestamp': '2025-05-27T17:37:12.139000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
      'content': 'up to you, i think either should be fine (experiment with the error / variance of your application)',
     'id': '1376999149781647531',
     'timestamp': '2025-05-27T19:03:00.715000+00:00'}],
   'replies': [{'author': 'wgpubs',
     'content': 'maybe a LLM-as-Jury with multiple models in the same family with different hypers might be helpful???',
     'id': '1376978341147578460',
     'timestamp': '2025-05-27T17:40:19.550000+00:00'}]},
  {'question': {'author': 'zeewaheed_24711',
    'content': 'For some more reading on evaluating judges, Eugene wrote up something I found really useful here: https://eugeneyan.com/writing/llm-evaluators/',
    'id': '1376978809399803914',
    'timestamp': '2025-05-27T17:42:11.190000+00:00'}},
  {'question': {'author': 'anup007_07188',
     'content': 'I have seen people using the Spearman coefficient to compare LLM Judge responses and human responses to check for alignment',
    'id': '1376978857806270536',
    'timestamp': '2025-05-27T17:42:22.731000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
      'content': "Spearman coefficient and inter-rater reliability are good to measure alignment between two human raters, before coming up with ground truth. we'll discuss thursday\n\nBut once you have ground truth labels, the LLM as Judge should be treated as a classifier. no need for inter-rater reliability metrics there\n\nA separate question is using the LLM to generate the ground truth (or a jury of LLMs to create ground truth), but we do not discuss this in the course because it's very easy to do it incorrectly. human labeling should come first and is the highest ROI for AI products",
     'id': '1377001104772042762',
     'timestamp': '2025-05-27T19:10:46.821000+00:00'}]},
  {'question': {'author': '__ii__i',
    'content': 'Now what about the Human Judge True Success Rate, vs the Ideal Human 🙂',
    'id': '1376978863917109259',
    'timestamp': '2025-05-27T17:42:24.188000+00:00'}},
  {'question': {'author': 'pawel.huryn',
    'content': 'If LLMs struggle to analyze multiple failure modes at once, how can we be sure that humans can do it reliably?\nHave you ever tested humans for TPR and TNR, for example, by asking people to evaluate just one simple rule at a time?',
    'id': '1376979079282167818',
    'timestamp': '2025-05-27T17:43:15.535000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "really good question -- we will talk about this in thursday's lecture + how to align humans!",
     'id': '1377001267640930334',
     'timestamp': '2025-05-27T19:11:25.652000+00:00'}]},
  {'question': {'author': '.firepaul',
    'content': '👏 for emphasizing that these metrics should be presented with uncertainty estimates',
    'id': '1376979237998690334',
    'timestamp': '2025-05-27T17:43:53.376000+00:00'},
   'thread': [{'author': '.firepaul',
     'content': 'have seen way too many A/B test results presented as "these conversion metrics are ironclad truths from a higher power"',
     'id': '1376979498704310372',
     'timestamp': '2025-05-27T17:44:55.533000+00:00'}]},
  {'question': {'author': 'afogel',
    'content': 'there was recently an article published that claims that there are systemic judgement errors within LLM-as-judge (Don’t Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation - https://arxiv.org/pdf/2505.16222). Presumably, IRR sufficiently controls for these issues?',
    'id': '1376979430555254924',
    'timestamp': '2025-05-27T17:44:39.285000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "we'll talk about IRR thursday 🙂",
     'id': '1377001906769105006',
     'timestamp': '2025-05-27T19:13:58.032000+00:00'}]},
  {'question': {'author': 'charlesc_78601',
    'content': 'why is it important to only run the judge once on the new data?',
    'id': '1376980162628812842',
    'timestamp': '2025-05-27T17:47:33.825000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'you can run it multiple times and take a majority vote. up to you!',
     'id': '1377002414955298836',
     'timestamp': '2025-05-27T19:15:59.193000+00:00'}]},
  {'question': {'author': 'wgpubs',
     'content': "What is the next step if you notice in step 2 the raw observed success rate over time is getting worse and worse .. or maybe even the opposite, it's getting unreasonably better and better?",
    'id': '1376980473120690329',
    'timestamp': '2025-05-27T17:48:47.852000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "the model behavior might be drifting!! foundation model providers push new updates to models all the time. it's so important to have evals you run on production traces for this reason. ideally the success rate goes up 🙂 but if it goes down, there's another failure mode to debug more",
     'id': '1377002759680819250',
     'timestamp': '2025-05-27T19:17:21.382000+00:00'}]},
  {'question': {'author': 'guilhermehge',
     'content': "What's the use of the training set? I thought we were going to place it in the prompt as examples, but it isn't\n\nAlso, since i'm not really training anything and each LLM call is singular i.e. if I make the call again, the results might change, how can these values be trusted?",
    'id': '1376980514673524927',
    'timestamp': '2025-05-27T17:48:57.759000+00:00'},
   'thread': [{'author': '__ii__i',
     'content': 'My understanding is the training set is used when drafting and tweaking the LLM judge prompt',
     'id': '1376980767657295955',
     'timestamp': '2025-05-27T17:49:58.075000+00:00'},
    {'author': 'davidh5633',
     'content': "The test set is labeled, so it will help you get these metrics on data that your model hasn't seen yet.",
     'id': '1376980957239840860',
     'timestamp': '2025-05-27T17:50:43.275000+00:00'},
    {'author': 'guilhermehge',
     'content': 'So we get examples from that training set for the prompt',
     'id': '1376980988613230702',
     'timestamp': '2025-05-27T17:50:50.755000+00:00'},
    {'author': 'wgpubs',
     'content': 'yah think of the training set as a means to improve the application prompt and/or the judge prompt ... so in a sense you\'re "fine-tuning" the prompts',
     'id': '1376981077809303552',
     'timestamp': '2025-05-27T17:51:12.021000+00:00'},
    {'author': 'guilhermehge',
     'content': 'Then, I run inferences on the validation set to get the TPR and TNRs',
     'id': '1376981261070893167',
     'timestamp': '2025-05-27T17:51:55.714000+00:00'},
    {'author': 'guilhermehge',
     'content': 'and then on the test set and get the metrics and do all the formulas, right?',
     'id': '1376981313189449899',
     'timestamp': '2025-05-27T17:52:08.140000+00:00'},
    {'author': 'guilhermehge',
      'content': 'But if I run the LLM again, the results might differ, even with low temperature',
     'id': '1376981367937695784',
     'timestamp': '2025-05-27T17:52:21.193000+00:00'},
    {'author': 'guilhermehge',
     'content': 'How can that be trusted?',
     'id': '1376981389919916263',
     'timestamp': '2025-05-27T17:52:26.434000+00:00'},
    {'author': 'wgpubs',
      'content': 'That sounds right to me ... as for trusting the results, I think that is where the human alignment bits come in and also the recommendation to run through this process regularly',
     'id': '1376981785375932458',
     'timestamp': '2025-05-27T17:54:00.718000+00:00'},
    {'author': 'sh_reyashankar',
      'content': 'yes the training set examples are used as few shot examples for the LLM Judge prompt\n\n> if I run the LLM again, the results might differ, even with low temperature\n\nthis is true. you can set temperature to 0, or you can take a majority vote across multiple trials so the results are more stable. if the judge prompt and evaluation criterion are well-specified, the variance should be pretty low. however if you want to measure the robustness even with LLM judge labeling uncertainty, you can adapt the bootstrapping process described in Section 4.6 in the course reader to relabel in every iteration (or a fraction of the iterations), and have a more robust confidence interval',
     'id': '1377003550416310303',
     'timestamp': '2025-05-27T19:20:29.908000+00:00'}],
   'replies': [{'author': 'bruno_petersen',
     'content': 'The way we have been doing this is to indeed run the same question multiple times to understand the variance on individual questions. If the results change a lot for the same questions we have a problem and need to get to predictability first. \n\nOver a large sample size, you should get a directionally correct idea in either case.',
     'id': '1376981212194668676',
     'timestamp': '2025-05-27T17:51:44.061000+00:00'}]},
  {'question': {'author': 'bingo_41343',
    'content': 'Could you please share any research paper that I can cite where we have these formulas to evaluate LLMs',
    'id': '1376980716688244878',
    'timestamp': '2025-05-27T17:49:45.923000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'no paper, see the course reader!',
     'id': '1377003619185983509',
     'timestamp': '2025-05-27T19:20:46.304000+00:00'}]},
  {'question': {'author': 'jodicasa',
    'content': 'Is this adjustment (in this use case, for AI evals) documented/published outside of the course reader?',
    'id': '1376980744722845696',
    'timestamp': '2025-05-27T17:49:52.607000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'not that i am aware of',
     'id': '1377003678963335248',
     'timestamp': '2025-05-27T19:21:00.556000+00:00'}]},
  {'question': {'author': 'arianpasquali',
    'content': 'This LLM-as-a-judge prompt learning/optimization process looks like a good use case for DsPY or Zenbase.ai.',
    'id': '1376980750523564082',
    'timestamp': '2025-05-27T17:49:53.990000+00:00'}},
  {'question': {'author': 'adityakabra9521',
    'content': "What if the unlabelled traces do not have the error for which we created the LLM as judge? Won't it be all pass?",
    'id': '1376981219677307012',
    'timestamp': '2025-05-27T17:51:45.845000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'yes it might say 100% success, but we correct this using our TPR and TNR on our test set according to the formula in section 4.6 in the course reader. the correction will drop the success rate',
     'id': '1377003915173953557',
     'timestamp': '2025-05-27T19:21:56.873000+00:00'}]},
  {'question': {'author': 'artste',
    'content': 'Are we now testing the accuracy of the model or of the judge?',
    'id': '1376981526419603598',
    'timestamp': '2025-05-27T17:52:58.978000+00:00'},
   'thread': [{'author': '.firepaul',
     'content': "I understand it as: we are estimating how confident we are in the judge's accuracy",
     'id': '1376981795601518703',
     'timestamp': '2025-05-27T17:54:03.156000+00:00'}]},
  {'question': {'author': '.firepaul',
    'content': 'for the bootstrap CI estimates, if our test set is, say, 40 examples, for the resampling, what amount of those 40 examples should be in each resample? Each resample should be a random 30 subset of the 40?',
    'id': '1376981539614756904',
    'timestamp': '2025-05-27T17:53:02.124000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
      'content': 'you can resample a set of 40. we resample with replacement (https://medium.com/data-science/understanding-sampling-with-and-without-replacement-python-7aff8f47ebe4), so the resampled set is different in every iteration of the bootstrap',
     'id': '1377004218950352907',
     'timestamp': '2025-05-27T19:23:09.299000+00:00'}]},
  {'question': {'author': 'solomonogram',
    'content': 'Looks like the delta between observed and corrected very slowly goes to zero.',
    'id': '1376981743071920212',
    'timestamp': '2025-05-27T17:53:50.632000+00:00'}},
  {'question': {'author': 'bruno_petersen',
    'content': 'I also found this super helpful to understand the idea of trying to align the judge initially. \n\nWhat was not super clear to me in the lecture was that we want to do the alignment of the judge with Human labels first, in addition to focusing on adjusting for the bias',
    'id': '1376982095980920993',
    'timestamp': '2025-05-27T17:55:14.772000+00:00'},
   'thread': [{'author': 'bruno_petersen',
     'content': 'This is from: https://forestfriends.tech/',
     'id': '1376982200553312326',
     'timestamp': '2025-05-27T17:55:39.704000+00:00'}]},
  {'question': {'author': 'gwenc_35306',
    'content': 'Ahh I am more familiar with precision and recall. That comparison is helpful',
    'id': '1376982190214086676',
    'timestamp': '2025-05-27T17:55:37.239000+00:00'}},
  {'question': {'author': '.firepaul',
    'content': 'why are the corrected `theta` estimates so stable in that increasing TNR plot? I would think those would jump around more',
    'id': '1376982230143864943',
    'timestamp': '2025-05-27T17:55:46.759000+00:00'}},
  {'question': {'author': 'darkwrath.',
    'content': 'if TPR % and TNR % are both 100%, why is the success rate 80%?',
    'id': '1376982442161737771',
    'timestamp': '2025-05-27T17:56:37.308000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'ah success rate is how often the traces pass the criterion evaluated by the LLM Judge\n\nthe TPR and TNR are how accurate the LLM Judge is at evaluating the criterion',
     'id': '1377004452740862073',
     'timestamp': '2025-05-27T19:24:05.039000+00:00'},
    {'author': 'hamelh',
      'content': '<@446164230463553537> the 80% is the underlying success rate of the application, NOT the llm as a judge. Conversely, TNR/TPR measures how well your LLM as a judge can do its job. \n\nSuccess Rate = Your AI product (e.g. a learning assistant) makes mistakes 20% of the time and is successful 80% of the time\n\nTPR = 100% your judge has perfect ability to recall successes (_Of all the actual successes, how many did we correctly label as successes?_)\nTNR = 100% your judge has perfect ability to recall errors (_Of all the actual failures, how many did we correctly label as failures?_)\n\nBecause TNR=TPR=100% your judge is perfect, so no correction is needed in that scenario. \n\nHere is a spreadsheet with a simulation of TNR=TPR=100%, Success Rate = 80% in case it is helpful \n\n(cc: <@893327214685343804> )\n\nhttps://docs.google.com/spreadsheets/d/1JNRwoBCcG0Ho65vWheo8NQ6LKeIkVNIZeNWH3RLf8UA/edit?usp=sharing',
     'id': '1377034293632831590',
     'timestamp': '2025-05-27T21:22:39.662000+00:00'}],
   'replies': [{'author': 'bruno_petersen',
     'content': 'the 80% is the "true" success rate - we are trying to get to know that ground truth.',
     'id': '1376982704356196374',
     'timestamp': '2025-05-27T17:57:39.820000+00:00'},
    {'author': 'darkwrath.',
      'content': 'it would mean 100% of positives are identified; 100% of negatives are identified.',
     'id': '1376982717857660988',
     'timestamp': '2025-05-27T17:57:43.039000+00:00'}]},
  {'question': {'author': 'solomonogram',
    'content': "It's amazing how the corrections look really good, really close to 80% 🙂",
    'id': '1376982537980612678',
    'timestamp': '2025-05-27T17:57:00.153000+00:00'},
   'thread': [{'author': '.firepaul',
     'content': 'totally, that is wild to me',
     'id': '1376982605454377012',
     'timestamp': '2025-05-27T17:57:16.240000+00:00'}]},
  {'question': {'author': 'siddani09',
     'content': "is anyone feeling overwhelmed? i was aware of many of these concepts, but don't know clearly where to get started",
    'id': '1376982684475064392',
    'timestamp': '2025-05-27T17:57:35.080000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'start reading chapter 4 in the course reader and ask questions along the way 🙂',
     'id': '1377004612069888000',
     'timestamp': '2025-05-27T19:24:43.026000+00:00'}],
   'replies': [{'author': 'rawwerks',
     'content': 'ask chatgpt 😉',
     'id': '1376982782319923210',
     'timestamp': '2025-05-27T17:57:58.408000+00:00'}]},
  {'question': {'author': '.firepaul',
    'content': 'fwiw, when you are reading back through the Discord later <@893327214685343804> , this material is awesome and I appreciate you teaching it!',
    'id': '1376982724459626559',
    'timestamp': '2025-05-27T17:57:44.613000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'thank you!!',
     'id': '1377004942753267974',
     'timestamp': '2025-05-27T19:26:01.867000+00:00'}]},
  {'question': {'author': 'bmoney1267',
     'content': 'I\'ve spent the past 6 months at work struggling with this "how well is our LLM classifier / LLM-as-judge working, can we calibrate with human annotations". The Rogan-Gladen estimator didn\'t really work very well for me as a correction. I dove into calibration methods like Platt scaling and isotonic regression, which have been more helpful for what I\'m doing.',
    'id': '1376982748551450786',
    'timestamp': '2025-05-27T17:57:50.357000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "Thanks for sharing this—it's a good instinct to look into calibration, but there's a distinction that might help clarify.\n\nIf your LLM-as-Judge outputs a score or probability (e.g., 0–1 or 0–100), then yes, applying a calibrator like Platt scaling or isotonic regression can help make those scores more reflective of true likelihoods! So if you're using a correction, that correction should be applied after converting your calibrated scores into binary predictions (e.g., thresholding at 0.5), and after measuring the judge's accuracy (TPR/FPR) on a human-labeled test set.\n\nIf instead your judge is directly outputting 0 or 1 (like in the setup we taught in class), then calibration doesn’t apply, and you should just go straight to estimating accuracy and applying correction",
     'id': '1377007141352767509',
     'timestamp': '2025-05-27T19:34:46.054000+00:00'},
    {'author': 'bmoney1267',
     'content': "Yeah, the issue was that we have a lot of low prevalence categories. So when sampling chats for human annotation we have to upsample using the LLM-as-Judge to focus on categories it thinks were picked up. This means I have a biased sample for prevalence, which calibration helps correct for (after fitting the isotonic regression, I can predict with that model on random unlabeled samples). Without it, I'd have to grab a much larger random sample to get enough signal for the least prevalent categories. When I tried using Rogan-Gladen, it was unstable or giving me huge numbers (e.g. 232% prevalence). I know you recommend clipping between 0-1, but I think it's more an issue with the quality of our classifier (pretty low precision).",
     'id': '1377042394570293349',
     'timestamp': '2025-05-27T21:54:51.076000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'interesting -- yeah that is tough, if your TPR and TNR are _really_ off, or if your LLM Judge identified success rate is ~100%, the estimation can be >> 1',
     'id': '1377167362075267193',
     'timestamp': '2025-05-28T06:11:25.651000+00:00'}]},
  {'question': {'author': '__ii__i',
    'content': "A thought on how to chart these types of metrics - XMR with predefined limits (using bootstrap) - so you don't jump at shadows (i.e. success rate moves a few percentage point within limits). A longer discussion: https://commoncog.com/becoming-data-driven-first-principles/#the-trick",
    'id': '1376982788204658779',
    'timestamp': '2025-05-27T17:57:59.811000+00:00'},
   'replies': [{'author': 'bruno_petersen',
     'content': 'can also recommend this post https://entropicthoughts.com/statistical-process-control-a-practitioners-guide',
     'id': '1376983277507706941',
     'timestamp': '2025-05-27T17:59:56.470000+00:00'}]},
  {'question': {'author': 'brwndynamight',
    'content': 'not sure if this has been asked, are the guest speaker meetings also recorded and posted?',
    'id': '1376982955469049969',
    'timestamp': '2025-05-27T17:58:39.690000+00:00'}},
  {'question': {'author': 'nate1363',
    'content': "for the precision/ recall balance, it's a really interesting point on the trade offs for each -- it's ultimately a business decision on which bias that it's okay to lean into - especially for LLM-as-judges.",
    'id': '1376983005838704831',
    'timestamp': '2025-05-27T17:58:51.699000+00:00'},
   'thread': [{'author': 'nate1363',
      'content': "in the e-commerce context, at my company we generally try to bias towards precision to not 'over-merchandise' products (exaggerate or over promise).\n\nSo for llm-as-judges and other QA practices, I normally bias towards precision since it's okay to under-specify product capabilities -- versus the other side of it which leads to upset customers",
     'id': '1376983705373114471',
     'timestamp': '2025-05-27T18:01:38.481000+00:00'}]},
  {'question': {'author': 'noire73',
     'content': 'Do you think we should have more than one type of LLM-Judge, like 4o, Gemini, Llama, etc, or do you go with the same LLM that you use in production?',
    'id': '1376983686834159728',
    'timestamp': '2025-05-27T18:01:34.061000+00:00'},
   'thread': [{'author': 'nate1363',
      'content': "I've seen cases for some of the more complex judges where the LLM judge prompt is sent to multiple frontier LLMs -- and then there's an ensemble / voting on if it's a pass or fail",
     'id': '1376984404609728522',
     'timestamp': '2025-05-27T18:04:25.192000+00:00'},
    {'author': '907resident',
     'content': 'Here is a thread to a similar discussion: \nhttps://discord.com/channels/1368666390185246770/1376979535110738050',
     'id': '1376990977272844471',
     'timestamp': '2025-05-27T18:30:32.237000+00:00'}],
   'replies': [{'author': 'bruno_petersen',
     'content': "I think the model doesn't matter as much as ensuring that the judge is consistent. Running a known test set of queries will help you understand whether things are consistent. \n\nIf you start noticing that the judge is suddenly misclassifying known questions you start to have a problem. \n\nWhat matters is that you get close to the ground truth of what human labellers would do.",
     'id': '1376985244875362467',
     'timestamp': '2025-05-27T18:07:45.527000+00:00'}]},
  {'question': {'author': 'markmanolas_96500',
    'content': 'Hey <@525830737627185170>, will the guest speaker sessions be recorded?',
    'id': '1376983770074189985',
    'timestamp': '2025-05-27T18:01:53.907000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'Yes everything is recorded.  Always recorded.  All sessions are recorded!',
     'id': '1399856227399372830',
     'timestamp': '2025-07-29T20:48:52.683000+00:00'}],
   'replies': [{'author': 'siddani09',
     'content': 'i can attend the first one, but may be missing the next one.',
     'id': '1376984025507561634',
     'timestamp': '2025-05-27T18:02:54.807000+00:00'}]},
  {'question': {'author': 'dvasquez5769',
     'content': 'I didn\'t understand the connection between this lecture and the previous one. If I understood correctly, in the previous lecture we ended up with a labeled database where in the end, the labels were some kind of categories generated by axial coding.\n\nIn today\'s lecture we need a labeled database for training and evaluating the LLM-as-judge.\n\nAre the database in the previous lecture and the database in today\'s lecture related?\n\nShould we continue labeling the database in the previous lecture with "PASS" and "FAILED" and use it for training and testing the LLM-as-judge?',
    'id': '1376995343773732934',
    'timestamp': '2025-05-27T18:47:53.292000+00:00'}},
  {'question': {'author': 'afogel',
    'content': 'I know that we teased this idea, but I would love to hear more about building processes to measure and manage data drift and how that impacts evaluator efficacy',
    'id': '1377002348521586808',
    'timestamp': '2025-05-27T19:15:43.354000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "we'll talk about it in CI/CD lecture next week 🙂",
     'id': '1377007386841321483',
     'timestamp': '2025-05-27T19:35:44.583000+00:00'},
    {'author': 'hamelh',
     'content': 'We also have a guest lecture from someone at Grok that works extensively with CI/CD and testing 🎉',
     'id': '1377020229028286584',
     'timestamp': '2025-05-27T20:26:46.399000+00:00'}]},
  {'question': {'author': 'labdmitriy',
     'content': "<@525830737627185170> <@893327214685343804> Thanks a lot for the lesson!\nI have two questions:\n - I saw on the slide that we can use 2-3 examples for each case (bad/good) as few shot examples, but on the same slide you mention that the training set can be about 20 rows - is it larger in size than 4-6 few shot examples to be able to choose more representative examples later?\n- What are your thoughts/recommendations about how to correctly split the dataset into train/dev/test datasets to be as representative as possible, especially if we don't have labels and can't make something like stratified splitting?",
    'id': '1377014499651096586',
    'timestamp': '2025-05-27T20:04:00.409000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': "I actually don't know how many examples you need as few-shot examples. Think of few-shot examples as additional instructions to your LLM. You're trying to show it something that's hard to describe with a prompt. Each example should add something different in terms of your exposition of the way you would like a model to behave. Sometimes I add 5 examples, sometimes I have 25 examples.  It takes a bit of experimentation to get correct at times. \n\nWe were just going off of the rough guidance of having a hundred traces, in which case 20% of them are in the training set, 40% in the dev set, and 40% in the test. \n\nYou really should be able to get a hundred labels. As you get more data, you can ramp up the complexity and do things like stratified sampling inside various dimensions and so forth. But start simple",
     'id': '1377020004796465293',
     'timestamp': '2025-05-27T20:25:52.938000+00:00'},
    {'author': 'labdmitriy',
     'content': 'Thank you!',
     'id': '1377026816392560652',
     'timestamp': '2025-05-27T20:52:56.949000+00:00'}]},
  {'question': {'author': 'rodocite',
    'content': 'is the homework designed to get us to really understand how to calibrate our llm-as-judge success rate (TPR/TNR) ? i had trouble with that part of the lecture.',
    'id': '1377049743674314815',
    'timestamp': '2025-05-27T22:24:03.239000+00:00'}},
  {'question': {'author': 'nic.02539',
     'content': "Thank you <@525830737627185170> and <@893327214685343804> for introducing FastHTML and Phoenix during this workshop. I've been exploring AI eval frameworks and noticed Weights & Biases also offers W&B Weave (wandb.ai/site/weave) for monitoring and evaluating AI apps. Since I haven't personally used any of these frameworks yet, I wonder if you could provide additional insights on the main differences, specific strengths, and potential limitations (e.g. learning curve) for each of these tools/frameworks?",
    'id': '1377053815190061057',
    'timestamp': '2025-05-27T22:40:13.964000+00:00'},
   'thread': [{'author': 'hamelh',
      'content': "At the end of the day, you have to try them and see which one you like. They all change so fast that it's kind of futile to write a blog post that compares all of them fairly. \n\nWe will have a guest speaker from Weights & Biases later on in the course. Although they will not be speaking much about Weights & Biases. \n\nMy advice is to figure out what your requirements are. For example, do you need something that's open source or are you willing to pay? Can you send your data to a hosted solution or do you need it on-prem? Do you need a Python SDK or is it okay to only have TypeScript? Are you using Langchain? Etc.",
     'id': '1377056215066083328',
     'timestamp': '2025-05-27T22:49:46.139000+00:00'},
    {'author': 'hamelh',
     'content': 'Based on your answers to those questions, that will help you narrow down the space of what tools are feasible for you.',
     'id': '1377056282028150886',
     'timestamp': '2025-05-27T22:50:02.104000+00:00'},
    {'author': 'hamelh',
     'content': 'In my consulting engagements, I direct my clients towards a few choices that I have lots of experience with. And I also have relationships with particular vendors. However, this course is not about advertising a particular vendor, so that is why I am trying to stay neutral.',
     'id': '1377056638841786410',
     'timestamp': '2025-05-27T22:51:27.175000+00:00'},
    {'author': 'nic.02539',
      'content': "Makes absolute sense, thanks Hamel. I'd need a python SDK (I'm not very familiar with TypeScript), don't use Langchain (but LlamaIndex), prefer hosted solutions and would like to keep costs low (open source would be most ideal). Given this context, would you be able to recommend one or two that fit these criteria best? To get started?",
     'id': '1377058770303062088',
     'timestamp': '2025-05-27T22:59:55.355000+00:00'},
    {'author': 'hamelh',
      'content': 'If you want OSS, Phoenix is my favorite',
     'id': '1377061401499140136',
     'timestamp': '2025-05-27T23:10:22.681000+00:00'},
    {'author': 'hamelh',
     'content': 'For paid - Braintrust, Langsmith, Arize (Paid), W&B are all decent options.',
     'id': '1377061566037364737',
     'timestamp': '2025-05-27T23:11:01.910000+00:00'},
    {'author': 'hamelh',
     'content': "It's hard to narrow it down more without learning a lot more about you and your application",
     'id': '1377064070053761034',
     'timestamp': '2025-05-27T23:20:58.914000+00:00'}]},
  {'question': {'author': 'nic.02539',
    'content': '<@893327214685343804> <@525830737627185170>, on the topic of calculating TPR & TNR as part of the error analysis, would it make sense to also use F1 scores (the harmonic mean of precision & recall) as a metric in this context?',
    'id': '1377056058081804399',
    'timestamp': '2025-05-27T22:49:08.711000+00:00'},
   'replies': [{'author': 'sh_reyashankar',
     'content': "F1 cannot substitute for TPR and TNR directly but you can rework the formula to use F1 instead of TPR and TNR if you'd like...definitely some math & derivation to this",
     'id': '1377167673946804225',
     'timestamp': '2025-05-28T06:12:40.007000+00:00'}]},
  {'question': {'author': 'cheenu_29255',
    'content': '<@893327214685343804> <@525830737627185170> - would it be possible to post the slides and the recordings for all the sessions in a common location? Thanks',
    'id': '1377073431849205882',
    'timestamp': '2025-05-27T23:58:10.940000+00:00'},
   'replies': [{'author': 'hamelh',
      'content': "The common location is Maven. It's already there. It will always be there within 48 hours. But it's already there now!",
     'id': '1377073661411852489',
     'timestamp': '2025-05-27T23:59:05.672000+00:00'}]},
  {'question': {'author': 'noire73',
    'content': 'does anyone have the link to the dropbox for the new chapters',
    'id': '1377179071808540682',
    'timestamp': '2025-05-28T06:57:57.469000+00:00'},
   'thread': [{'author': 'hamelh',
      'content': 'What is your email? Are you getting the messages through the course platform?',
     'id': '1377257783631810572',
     'timestamp': '2025-05-28T12:10:43.830000+00:00'},
    {'author': 'noire73',
      'content': "I do get the event notifications, but I haven't received any course material.",
     'id': '1377259763519197267',
     'timestamp': '2025-05-28T12:18:35.872000+00:00'},
    {'author': 'hamelh',
      'content': 'See the announcements in the student portal',
     'id': '1377259973427593228',
     'timestamp': '2025-05-28T12:19:25.918000+00:00'}]},
  {'question': {'author': 'jav_sc',
     'content': 'Hi <@893327214685343804> , hi <@525830737627185170> , \nI have some questions regarding the LLM-as-Judge development/"training" cycle from Lesson 3: \n- Regarding the splitting of data (slide: “Split Your Labeled Data”), the explanations are provided for the LLM-as-Judge lesson, but I understand we should split the data both for Agent Development as well as for LLM-as-Judge development, right? Are you referring to the same labeled data (for Agent dev and for LLM-as-Judge dev), or are these different sets of labeled data? I was confused because to properly calibrate the judge (calculate the TPR and TNR) we will need examples of only the failure mode “Helpful response”, where the input is an ambiguous query and then the output should be a follow-up question to clarify, right? \n- Assuming these datasets are two different ones (one for Agent Development and one for the LLM-as-Judge of ambiguous inputs), does this mean we will have several datasets in our evals, and this dataset will be only for “Ambiguous inputs”, and then we will build metrics specific to those failure modes? I can imagine some metrics being applicable to every trace (e.g. JSON Schema correctness) and other metrics such as "Correct_answer_to_ambiguous_questions" being only applicable to the inputs that trigger this failure mode. \n- Is the split a randomized split, or should it be a targeted split for balance between pass and fail labels? \n\nIf the question is confusing I can record a short video explaining what I mean.',
    'id': '1377206274420117525',
    'timestamp': '2025-05-28T08:46:03.077000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
      'content': 'Hey, thanks for the question. Yes, you should split data for agent development if you have labeled data for the agent, but the data is going to be different and labeled differently, because your agent and LLM Judge are doing two different things.\n\nHere\'s an example, in the context of Recipe Bot:\n\nThe agent is Recipe Bot. You might not have labeled data here. If you do, the labeled data will be examples of recipes that are good overall (that pass all your criteria).\n\nThe LLM Judge evaluates a specific criterion. Maybe this criterion is "helpfulness if the user submits an ambiguous query." The labeled data here is yes/no or pass/fail on the traces, specific to whether the trace represents a helpful conversation. TPR and TNR make sense in the binary setting, i.e. the LLM Judge setting, not necessarily Recipe Bot.',
     'id': '1377506898571694153',
     'timestamp': '2025-05-29T04:40:37.460000+00:00'}]},
  {'question': {'author': 'rawwerks',
    'content': '<@893327214685343804> <@525830737627185170>  - could you elaborate on this note from the book? \n\n“Cohen’s Kappa is intended for measuring agreement\nbetween two human annotators who are peers. It is not used to evaluate\nLLM-as-Judge outputs against human labels.”',
    'id': '1377381795422539827',
    'timestamp': '2025-05-28T20:23:30.545000+00:00'},
   'thread': [{'author': 'rawwerks',
     'content': 'From a purely statistical perspective, it seems equally valid.',
     'id': '1377381995629252661',
     'timestamp': '2025-05-28T20:24:18.278000+00:00'},
    {'author': 'rawwerks',
     'content': '(Also - my AI agent is upset that you are discriminating against non-humans.)',
     'id': '1377382214555275409',
     'timestamp': '2025-05-28T20:25:10.474000+00:00'},
    {'author': 'sh_reyashankar',
      'content': "This is a good question. \n\nCohen's Kappa is used to quantify the **agreement** between two raters, typically in a subjective situation. It measures agreement relative to the chance that the two raters would agree if they were random variables.\n\nYou can use Cohen's Kappa to quantify whether LLM Judges agree with each other, or whether an LLM Judge agrees with a human judge. This will tell you how likely the human and the LLM Judge are to agree. However, in practice I see most people misuse Cohen's Kappa. **If your human ratings are the ground truth, you should treat the LLM Judge as a classifier that you are trying to align with the ground truth,** and follow standard ML metrics like TPR and TNR, as we have been teaching in the course.",
     'id': '1377507930022088826',
     'timestamp': '2025-05-29T04:44:43.377000+00:00'}]},
  {'question': {'author': 'joker_73329',
     'content': "Do you find the 'system' role has a significant impact on how a prompt is interpreted by the model? Working on a custom-instructions-like feature, the user preferences are currently formatted as a user message *after* the actual user message. The ordering seems like an easy change (not sure if worth evaluating), but intuitively it also seems like preferences should just be in a system message; not sure how much this matters in practice.",
    'id': '1377402343683002368',
    'timestamp': '2025-05-28T21:45:09.632000+00:00'},
   'replies': [{'author': 'rawwerks',
      'content': 'Depends on the model, but the short answer is: yes, many models treat system prompts totally differently.\n\nOne interesting example is in aider, where additional system prompts are sent to remind the LLM of specific formatting requirements (after an initial system prompt and the few-shot examples). \n\nYour question made me think of this because it’s rare to find multiple system prompts, but it’s possible and it happens (although a few APIs will simply reject more than one system prompt).',
     'id': '1377423185519185922',
     'timestamp': '2025-05-28T23:07:58.713000+00:00'}]},
  {'question': {'author': 'stibbs.',
    'content': 'Just a heads up that the recording has several minutes of you chatting ahead of the lesson starting',
    'id': '1377441718898528302',
    'timestamp': '2025-05-29T00:21:37.415000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': '<@525830737627185170>',
     'id': '1377505338378551398',
     'timestamp': '2025-05-29T04:34:25.481000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'thanks for letting us know!',
     'id': '1377505365079490590',
     'timestamp': '2025-05-29T04:34:31.847000+00:00'}]},
  {'question': {'author': 'outlierhunter',
    'content': '<@893327214685343804>  I have a question regarding the "Split Your Labeled Data" slide.\n\nSince we\'re refining the prompt, the "Dev set" feels more like a "Train set" to me.\n\n**Questions:**\n\n1. I would appreciate it if you could elaborate a bit more on what makes a set specifically a ‘*training set’* in this context.\n\n2. I’d also be grateful if you could help me confirm whether I’m understanding the concept of a ‘training set’ correctly. For example, if I select 3 examples where the LLM judge made accurate decisions, and 3 where it made incorrect ones, would these 6 examples be considered the training set? And would they then be used as few-shot examples in the prompt for the LLM judge?',
    'id': '1377488153111429201',
    'timestamp': '2025-05-29T03:26:08.194000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
      'content': 'Thanks for the questions! Here\'s my attempt at answers.\n\n> I would appreciate it if you could elaborate a bit more on what makes a set specifically a ‘training set’ in this context.\n\nMaybe "training" set is not the right term, because we are not necessarily changing the model parameters / doing parametric training. However, we are using this training set to select the examples that should be in the LLM Judge prompt. When I say training set, I mean anything in this dataset is fair game to include in your LLM Judge prompt.\n\n> For example, if I select 3 examples where the LLM judge made accurate decisions, and 3 where it made incorrect ones, would these 6 examples be considered the training set? And would they then be used as few-shot examples in the prompt for the LLM judge?\n\nThe training set might have more than 6 examples. You\'ll just select a fraction of them to include in your LLM Judge prompt --- so that your prompt gives good TPR and TNR on the dev set. Examples in the prompt = few-shot examples.',
     'id': '1377508742643449916',
     'timestamp': '2025-05-29T04:47:57.121000+00:00'},
    {'author': 'outlierhunter',
     'content': 'Thanks for your answer. So, the training set consists of candidate examples that may be included as few-shot examples in the LLM Judge prompt. Am I understanding this correctly?',
     'id': '1377542650445693009',
     'timestamp': '2025-05-29T07:02:41.371000+00:00'}]},
  {'question': {'author': 'rawwerks',
    'content': 'i know we\'re not supposed to get excited about tools but just found out that with DSPy.SIMBA you can combine qualitative and quantitative evals.\n\n`dspy.Prediction(score=0.0, feedback="....")`\n\nhttps://discord.com/channels/1161519468141355160/1365203930430181386',
    'id': '1377692653860163715',
    'timestamp': '2025-05-29T16:58:44.972000+00:00'},
   'thread': [{'author': 'rawwerks',
     'content': "but don't worry, it's only in service of helping me look at my data...",
     'id': '1377692802749567087',
     'timestamp': '2025-05-29T16:59:20.470000+00:00'},
    {'author': 'rawwerks',
     'content': 'DSPy.SIMBA = quantitative + qualitative evals',
     'id': '1377692942684127302',
     'timestamp': '2025-05-29T16:59:53.833000+00:00'},
    {'author': 'rawwerks',
     'content': 'https://dspy.ai/tutorials/tool_use/?h=simba',
     'id': '1377694793374629938',
     'timestamp': '2025-05-29T17:07:15.072000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'That’s awesome',
     'id': '1377827148232130650',
     'timestamp': '2025-05-30T01:53:10.929000+00:00'}]},
  {'question': {'author': 'lobot6',
    'content': 'In Homework 3 in the labeled_traces.csv file there is a new column introduced called Confidence with values LOW, MEDIUM, and HIGH.  What is this column?  I don’t think we talked about this in the labeling phase in the lectures or the reading.',
    'id': '1377745577197961387',
    'timestamp': '2025-05-29T20:29:02.879000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': '<@456226577798135808>',
     'id': '1378127647317557260',
     'timestamp': '2025-05-30T21:47:15.494000+00:00'}]},
  {'question': {'author': 'joker_73329',
     'content': 'Not sure if this is surprising, but I\'m finding gpt-4.1 to be a better grader than o3-mini and even o1. Maybe I\'m just not prompting well for a reasoning model.',
    'id': '1378122634859909141',
    'timestamp': '2025-05-30T21:27:20.431000+00:00'}},
  {'question': {'author': 'dizzydapper',
    'content': 'For LLM-as-judge evaluations, does it matter if I use the same model for both production and evaluation? For example, if my AI product uses GPT-4, would there be any issues with also using GPT-4 as the judge model?\n\nAre there any criteria for selecting which model to use as a judge? For example, if the judge model has to be smarter and/or bigger compared to the model we are using in production , etc.',
    'id': '1378422067833339904',
    'timestamp': '2025-05-31T17:17:10.815000+00:00'},
   'replies': [{'author': 'hamelh',
     'content': 'https://hamel.dev/blog/posts/evals-faq/#q-can-i-use-the-same-model-for-both-the-main-task-and-evaluation',
     'id': '1378434673990172683',
     'timestamp': '2025-05-31T18:07:16.357000+00:00'},
    {'author': 'hamelh',
     'content': 'https://hamel.dev/blog/posts/llm-judge/#what-model-do-you-use-for-the-llm-judge',
     'id': '1378434787383185479',
     'timestamp': '2025-05-31T18:07:43.392000+00:00'}]},
  {'question': {'author': 'pauloleary_34506',
    'content': "Apologies if this has been asked before. Would you use this approach (identifying dimensions, generating prompts, open coding, evaluators) for any Production use of an LLM, even if it's doing something relatively simple? For example, our product has a function that takes several fields of user-inputted text and summarises them into a single string. The prompt is currently only six or seven sentences long, and we're using a standard model with no tools. If you were tasked with ensuring that the LLM is and remains Production-ready, would you use this full approach here, or a simplified version, or is it not appropriate?",
    'id': '1378797937441636465',
    'timestamp': '2025-06-01T18:10:45.113000+00:00'},
   'replies': [{'author': 'sh_reyashankar',
     'content': 'I would do some form of systematic synthetic data generation and error analysis before shipping the product. Otherwise you will ship something totally blind. I would also have at least one LLM Judge evaluator for the most important failure mode I identify in error analysis, so I can quickly run it against production traces and iterate in production very fast\n\nCc <@525830737627185170>',
     'id': '1379289526740058265',
     'timestamp': '2025-06-03T02:44:09.135000+00:00'},
    {'author': 'intellectronica',
     'content': "Some thoughts from my experience:\n1. No need to be fanatical, if this isn't critical / user-facing, sometimes it's OK to just wing it. But I learnt the hard way that if users and stakeholders depend on it in production, you really want to test.\n2. If it's simple then the eval set and harness is also simple. You can quite easily generate synthetic data and then continue updating the set as you get examples from production. Especially if anything goes wrong. And if the output string is predictable, you don't even need an LLM judge, you can do string comparison (for example), or a simple text comparison method.\n3. It's worth going through an exercise of thinking defensively - what might go wrong? For example, since it's taking in user-inputted text ... what kinds of weird inputs might trip the implementation? What happens if the text is in the wrong language? If it's misspelled? If it contains confusing instructions to the LLM, etc ...",
     'id': '1379862777375494306',
     'timestamp': '2025-06-04T16:42:02.737000+00:00'}]},
  {'question': {'author': 'sugilauw_28419',
    'content': 'Hi <@893327214685343804> <@525830737627185170>, in Homework 3 you guys created label_data.py to label ground truth. Is that your way of automating open coding to do it at scale?',
    'id': '1379083303218450484',
    'timestamp': '2025-06-02T13:04:41.616000+00:00'},
   'replies': [{'author': 'sh_reyashankar',
      'content': 'Sometimes I use GPT-4o to label traces; however, for any traces I keep as part of my LLM Judge alignment sets, I always hand-validate every ground-truth label and correct the wrong ones myself. For higher-stakes applications (e.g. for some of the police records work at UC Berkeley) there’s no GPT assistance.',
     'id': '1379289027911356509',
     'timestamp': '2025-06-03T02:42:10.205000+00:00'}]},
  {'question': {'author': 'brwndynamight',
    'content': "when it comes to synthetic queries, is that mostly useful when you don't have a wealth of Production data regarding user queries? Does it make sense to create synthetic queries for your training set and then test against actual data as your developer or unbiased set? Or is it better practice to sift through actual data and use that for this process instead?",
    'id': '1379532449154990140',
    'timestamp': '2025-06-03T18:49:26.352000+00:00'},
   'replies': [{'author': 'sh_reyashankar',
      'content': 'If you have production data, that is ideal — in the ideal world you would never have to generate any synthetic data! Prod data can be used for train, dev, and test, as long as there’s no overlap or leakage, meaning no query exists in more than one split.',
     'id': '1379639642282000517',
     'timestamp': '2025-06-04T01:55:23.185000+00:00'}]},
  {'question': {'author': 'moonii1250',
     'content': "Are there any Python libraries you recommend that make it easy for folks without a data science background to calculate metrics such as TPR/TNR and confidence intervals? I know the script (judgy) you provided leverages numpy, but I'm not a numpy arrays expert. I would love to just take the LLM judge's output file, give it to a script, and get these calculations.",
    'id': '1379836982162886857',
    'timestamp': '2025-06-04T14:59:32.679000+00:00'},
   'replies': [{'author': 'intellectronica',
     'content': "A bit of a tangent, but one thing I sometimes like doing, instead or before using a library, is ask AI to show me how to do a simple implementation. Even if I end up picking up a library later, I learn something on the way so the concept is a bit clearer to me. And often I'm surprised to find that the code is actually simple and readable / understandable. See, for example https://chatgpt.com/share/68407510-5268-800e-8f8a-17eb2a0117bc",
     'id': '1379853586896392282',
     'timestamp': '2025-06-04T16:05:31.556000+00:00'}]},
  {'question': {'author': 'sugilauw_28419',
     'content': "Hey <@893327214685343804> <@525830737627185170>, if I have some false positives in my dev set (Homework 3), do I take those false positives from the dev set and use them in my judge's prompt to correct it (as few-shot examples)? And also remove those examples from my dev set?",
    'id': '1380428481380487229',
    'timestamp': '2025-06-06T06:09:57.082000+00:00'},
   'thread': [{'author': 'hamelh',
      'content': 'No, only from the training set. Try to find a similar example from your training set that is an exemplar for the failure in the dev set.\n\nAlso, you can clarify your prompt to do better on the dev set. It\'s best not to "train on the test/dev set" 😅 -> I get what you are asking, it\'s just a slippery slope if you go in this direction. \n\ncc: <@893327214685343804>',
     'id': '1380429344245153855',
     'timestamp': '2025-06-06T06:13:22.805000+00:00'},
    {'author': 'sugilauw_28419',
     'content': 'I’ve inspected the dev-set errors and uncovered a failure mode that isn’t represented anywhere in my current training split.\n\nGiven our goal of keeping the dev set completely unseen by the judge prompt, would you prefer that I:\n\n1) Move that single dev example into the training set and replace it with a fresh, unlabeled example so the split boundaries stay clean? or\n\n2) Collect (or synthesize) new traces that exhibit the same failure pattern, label them, and add those to the training pool instead?\n\nEither approach would let me include a representative few-shot example in the prompt without leaking dev data. Which one would you recommend?',
     'id': '1381778499764879481',
     'timestamp': '2025-06-09T23:34:26.547000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'Would do (2) so you aren’t just rearranging train and dev to pass dev…even if you rearrange, perhaps there is something in the new dev set that is also not covered in train 🤣',
     'id': '1381809525207732379',
     'timestamp': '2025-06-10T01:37:43.589000+00:00'}]},
  {'question': {'author': 'roshie0465',
     'content': "I have a question regarding the plot in lesson 3 slide 35 (estimated success rate vs number of labeled examples). The red dots appear to be the primary observation by the judge, which seems to be fixed, and the same parameter is used in the success rate estimate formula. Since it is fixed, I am assuming it is not zero even at zero examples. So how come the blue dots (corrected theta) are at 100% at zero samples? My calculations don't add up.",
    'id': '1381901058485391461',
    'timestamp': '2025-06-10T07:41:26.823000+00:00'},
   'thread': [{'author': 'intellectronica',
      'content': "I think it's a bit confusing because the beginning of the blue line is partially hidden by the legend. Hamel and Shreya may have a clearer explanation of what the plot signifies exactly, but my general understanding is: when you have no labelled examples at all, you can't calculate the correction, so you're at 100% with no confidence. As you start adding real samples and plugging them into the formula, you can start correcting; the first X, marking a corrected theta, is with few examples (maybe 10?), and then as you increase the number of examples it starts converging towards 80%.",
     'id': '1381913246641946696',
     'timestamp': '2025-06-10T08:29:52.706000+00:00'},
    {'author': 'intellectronica',
     'content': '<@525830737627185170> <@893327214685343804> ^^^^^',
     'id': '1381913647307034717',
     'timestamp': '2025-06-10T08:31:28.232000+00:00'},
    {'author': 'roshie0465',
      'content': 'This is what I thought at first, but I realized the primary observation (P_obs is a parameter in the corrected success rate formula) is k/m, which is defined as "running the judge on m **new, unlabelled traces** and letting k be the number it labels “Pass.”" (according to the reader). So it seems to me it has a value from the beginning, since it is not dependent on the number of labelled examples.',
     'id': '1381921064522747904',
     'timestamp': '2025-06-10T09:00:56.634000+00:00'},
    {'author': 'intellectronica',
     'content': "So you can calculate p_obs but for the corrected value you need TPR/TNR which depend on having samples to compare to. See the shaded confidence interval, I think it's more helpful than the blue line at the very beginning.",
     'id': '1381928265769422849',
     'timestamp': '2025-06-10T09:29:33.545000+00:00'},
    {'author': 'pastor1571',
      'content': "I think it doesn't reach zero; it's like a t-distribution: when you label 1 example you have high variability, but as you label more examples the judge estimate generalizes better and therefore reaches the true success rate with lower uncertainty.",
     'id': '1381962189853823047',
     'timestamp': '2025-06-10T11:44:21.677000+00:00'},
    {'author': 'pastor1571',
      'content': 'So, if you label 1 example and that example passes, it will be 100%. See the line at zero: it was never reached, so I think we can assume k > 0.',
     'id': '1381963377932238848',
     'timestamp': '2025-06-10T11:49:04.937000+00:00'},
    {'author': 'roshie0465',
      'content': "Yes, I don't think k is zero, and thanks for pointing out the zero line; you are right that it is never reached.\nI am mostly trying to understand the graph with respect to the estimated success rate formula. Given the TPR and TNR values on the graph, for corrected theta to be 100% (with very few samples, maybe 10), the only way is p_obs = 0.9, i.e. (0.9 + 0.85 - 1)/(0.9 + 0.85 - 1).\n\nAlso, fixed TPR and TNR is a bit confusing here; aren't they going to change when the number of examples increases? Otherwise how is the corrected theta going to change?",
     'id': '1381981388240191568',
     'timestamp': '2025-06-10T13:00:38.929000+00:00'},
    {'author': 'pastor1571',
      'content': 'TPR is the true positive rate: 90% of the passing cases are correctly classified as pass by the judge, and of the failing cases (TNR), 85% are correctly classified by the judge. Think of this as the predictive power of the judge (not the right concept, but the same idea). As you label more data, the estimate gets closer to the judge\'s actual "power" with less variability; it gets closer to the "truth". \n\nThe whole point of this slide is to show that labeling more data helps to align the judge estimate with its actual capabilities, and after 100 observations you might be good enough.',
     'id': '1381990575703851139',
     'timestamp': '2025-06-10T13:37:09.391000+00:00'},
    {'author': 'sh_reyashankar',
     'content': "ah thanks all for clarifying. the behavior at 0 is undefined; don't read into it\n\n> Also Fixed TPR and TNR is a bit confusing here, aren't they going to change when number of examples increase, otherwise how the corrected theta is going to change?\n\ngood question, this will never happen in practice. the plot just shows a hypothetical scenario of a fixed TPR and TNR, so we can clearly see the impact of other variables",
     'id': '1382062613852328067',
     'timestamp': '2025-06-10T18:23:24.624000+00:00'}],
   'replies': [{'author': 'roshie0465',
     'content': '',
     'id': '1381901954837512263',
     'timestamp': '2025-06-10T07:45:00.530000+00:00'}]},
  {'question': {'author': 'amitkumar01570',
     'content': "Hi everyone! 👋\nJust wanted to share a quick summary of my understanding of LLM-as-a-Judge so far.\nPlease feel free to correct me if I’ve misunderstood anything. Thank you! 🙏\n\n\n✅ 1. Define a Rubric\nWe first create a clear and specific rubric that answers the question: 👉 What counts as a hallucination in this use case?\nThe rubric includes:\n* A 1-line definition (e.g., “Generated content not supported by source context.”)\n* A few Pass examples (faithful outputs) and Fail examples (hallucinated content)\n* Notes for edge cases (e.g., partial hallucinations, ambiguous summaries)\n🛠️ This ensures humans and LLM judges evaluate consistently.\n\n✅ 2. Label a Gold Dataset\nNext, humans annotate a set of model outputs using the rubric.\nWe build a balanced set of:\n* ✅ Human-labeled Pass cases (no hallucination)\n* ❌ Human-labeled Fail cases (hallucination observed)\n📦 Typically, we label 50–100 examples for a reliable evaluation baseline.\n\n✅ 3. Split the Data\nWe split this labeled dataset into:\n* Train: ~10% used as few-shot examples inside the LLM judge prompt\n* Dev: ~40–45% to tune the prompt and test early results\n* Test: ~40–45% held out to evaluate the judge's true accuracy\n🎯 Key Rule: Test examples should never be reused in the prompt, to avoid overfitting.\n\n✅ 4. Write the LLM-as-Judge Prompt\n* Write a prompt instructing the LLM to evaluate whether the output contains hallucinations.\n* The prompt includes clear instructions and few-shot examples.\n* This helps the LLM judge learn how to classify Pass/Fail decisions.",
    'id': '1384195796035043439',
    'timestamp': '2025-06-16T15:39:54.889000+00:00'},
   'replies': [{'author': 'amitkumar01570',
      'content': '✅ 5. Evaluate Judge Accuracy (TPR/TNR)\n* Run the LLM-as-Judge on the test set.\n* Compare its decisions to human labels.\n* Calculate:\n    * True Positive Rate (TPR): Judge correctly identifies Pass cases.\n    * True Negative Rate (TNR): Judge correctly identifies Fail cases.\nExample:\n    * True Positive Rate (TPR) = 90 / (90 + 10) = 0.90\n    * True Negative Rate (TNR) = 80 / (80 + 20) = 0.80\n🎯 This tells us how good the judge is at catching correct Passes/Fails.\n\n✅ 6. Run on Production Data & Observe Raw Success Rate\nNow we sample real (unlabeled) production outputs.\nSuppose:\n* 1,000 outputs judged by LLM-as-Judge\n* 850 judged as “Pass”\nWe calculate p_obs = 850/1000 = 0.85\n⚠️ But since the judge is imperfect, this raw number is biased.\n\n✅ 7. Correct the Observed Pass Rate\nTo adjust for the judge’s bias, we apply the correction formula:\nθ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1)\nUsing the example:\n* TPR = 0.90\n* TNR = 0.80\n* p_obs = 0.85\nθ̂ = (0.85 + 0.80 - 1) / (0.90 + 0.80 - 1) = 0.65 / 0.70 = 0.928\n✅ So, the estimated true hallucination-free rate is 92.8%\n\n✅ 8. Calculate 95% Confidence Interval (CI)\nWe now use bootstrapping to measure how confident we are in our estimate.\nProcess:\n* Randomly resample the test data (1000 times)\n* Recalculate TPR, TNR, and θ̂ each time\n* Sort the θ̂ values → get:\n    * Lower bound (2.5th percentile)\n    * Upper bound (97.5th percentile)\nExample:\nCI = [85.4%, 97.6%]\n🎯 This means we are 95% confident that the true hallucination-free rate lies within this range.',
     'id': '1384196108334399498',
     'timestamp': '2025-06-16T15:41:09.347000+00:00'}]},
  {'question': {'author': 'skylarpayne_33994',
    'content': "I like working out equations and formulas to help my own retention/understanding. I didn't see a proof / derivation of the AI judge bias correction, so I put together a quick latex doc in case its helpful for others.\n\nI tried to put the probability rules I used there, but it will require some basic background in understanding probability (eg if you've taken an introductory probability and statistics course in the past).\n\nHappy to answer any questions about it if it doesn't make sense too",
    'id': '1386455505806819370',
    'timestamp': '2025-06-22T21:19:11.682000+00:00'},
   'replies': [{'author': 'sh_reyashankar',
     'content': 'This is great. We will put a derivation in the reader as well for cohort #2',
     'id': '1386540979959365673',
     'timestamp': '2025-06-23T02:58:50.308000+00:00'},
    {'author': 'amitkumar01570',
     'content': 'This is great and extremely helpful. Thank you so much!',
     'id': '1387031620258762834',
     'timestamp': '2025-06-24T11:28:28.071000+00:00'}]},
  {'question': {'author': 'powerful_raccoon_36894',
     'content': "Hey <@893327214685343804> <@525830737627185170> I'm struggling to come up with the right evals strategy for a project, and would love your thoughts. \n\nSo I'm building a financial podcast curator that searches a list of podcasts to find the most relevant episodes in the last 7 days, based on pre-defined keywords (e.g. gold, inflation, etc.). The curator agent scores podcast metadata (title, description, and web search contexts) and lists the top 10 most relevant episodes by score.\n\nBy looking at the final list, I can tell some results are relevant (because I listened to the episodes before), but I have no idea if they are truly the most relevant out of all ~200 episodes weekly (I'm deeply worried that it might miss some truly good episodes). How might I develop tests/evals to improve this product? It seems quite laborious to go through all 200 episodes to find the top 10 and give this to the agent as a test case. Thanks, and I would appreciate it if you could point me in the right direction.",
    'id': '1399411182456275054',
    'timestamp': '2025-07-28T15:20:25.700000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'If you are trying to figure out how to build a recommendation system, which I think this is what it is, then you need to go through and label some data if you can\'t rely on real user signals.  It sounds like there is a fair amount of ambiguity in terms of what "good" means to you. \n\nBefore you embark on evals, you should really spend some time looking through those 200 episodes and refine your rubric around what good means. 200 episodes doesn\'t really seem like a lot, to be honest, and it could be a good exercise to start with open coding. Take notes about what episodes are good and what aren\'t good. If you\'re unable to decide what episodes are good and not good, this is a smell that your product is ill-defined or you don\'t have a good idea of what you want in the first place. This means you have to go back to the drawing board and think carefully about what you want to build and what\'s in scope.',
     'id': '1399421322139930654',
     'timestamp': '2025-07-28T16:00:43.189000+00:00'},
    {'author': 'powerful_raccoon_36894',
     'content': "Thanks, this is helpful. At the back of my head I guess I knew the hard work looking at data is needed... I'm planning to do open encoding to collect two signals: 1) rate an episode on a scale of relevance (i.e. possibly 0-2, 0 being not relevant, 1 being 20-50% content relevant, 2 being >50% content relevant) 2) Collect/ideate on features that explain relevance (i.e. guest speaker authority, topic, etc). Any comments/recommendations on this open encoding plan?",
     'id': '1399438894797164694',
     'timestamp': '2025-07-28T17:10:32.837000+00:00'},
    {'author': 'hamelh',
     'content': 'Try to have a binary label, "relevant" or "not relevant", and be very discerning about what you mean by relevant',
     'id': '1399457842611748884',
     'timestamp': '2025-07-28T18:25:50.348000+00:00'}]},
  {'question': {'author': 'dustincoates',
    'content': 'Is it possible to get a view on what changed in the course reader? For those of us who have already read past chapter 5.',
    'id': '1399421575643795654',
    'timestamp': '2025-07-28T16:01:43.629000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'Tagging <@893327214685343804>',
     'id': '1399421968331046994',
     'timestamp': '2025-07-28T16:03:17.253000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'generally the edits are all about clarity. i added more detail in chapter 3 on what a trace is & answering some of the FAQ from lesson 2 channel',
     'id': '1399453164066836580',
     'timestamp': '2025-07-28T18:07:14.896000+00:00'}]},
  {'question': {'author': 'britter3116_22491',
    'content': "I think I understand how to build an eval system now, but I still have questions as to when to make an eval for a failure mode. \nFinding/curating data and aligning a judge for a failure mode takes some time, so presumably just one failed trace isn't enough. It might never show up again, especially if you're iterating on the main prompt.\n\nI realize this is getting into the financial and project management/ expectations of product performance, but do you have a heuristic about how many times you need to notice a failure before you build out an eval for it?",
    'id': '1399478381124255928',
    'timestamp': '2025-07-28T19:47:27.111000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "Great q. Typically when doing error analysis (before building LLM judges); you'll notice that there are a few one-off failure modes that don't fit cleanly into any category determined by axial coding. There can be a long tail of such failure modes. As you pointed out, it doesnt make sense to build an automated evaluator for such failure modes (not enough labeled data). I tend to ignore such failures (as long as they make up < 5% of errors).",
     'id': '1399538634759340093',
     'timestamp': '2025-07-28T23:46:52.697000+00:00'},
    {'author': 'hamelh',
     'content': 'Great question Brendan!  I documented it here as well https://hamel.dev/blog/posts/evals-faq/#q-should-i-build-automated-evaluators-for-every-failure-mode-i-find',
     'id': '1399541248754188420',
     'timestamp': '2025-07-28T23:57:15.922000+00:00'},
    {'author': 'britter3116_22491',
     'content': 'Oops! Should have searched more! Thank you guys for the response!',
     'id': '1399752774459199558',
     'timestamp': '2025-07-29T13:57:47.580000+00:00'},
    {'author': 'hamelh',
     'content': 'NO need to search more.  I just enjoyed the question so I documented it!',
     'id': '1399831736791208036',
     'timestamp': '2025-07-29T19:11:33.667000+00:00'}]},
  {'question': {'author': 'mcs4774',
    'content': "How should I be thinking about the difference between offline and online evaluations?  Can an LLM judge also be an online eval if it's running on sampled production data in near-real time? OR is that still technically offline?",
    'id': '1399666497760137306',
    'timestamp': '2025-07-29T08:14:57.612000+00:00'},
   'thread': [{'author': 'intellectronica',
     'content': "It's usually impractical (slow, expensive) to run these evals in real time.\n\nMore common is to capture traces of everything, then use these traces for evaluations. Initially, in development or soon after releasing a new version, you might really want to evaluate everything, or at least a large sample.\n\nAfter a while, when the system is stable, it makes more sense to wait for failures and work with those.\n\nWhat's important is - capture everything if you can. You'll never regret having real data to examine and evaluate later.",
     'id': '1399667782383304715',
     'timestamp': '2025-07-29T08:20:03.890000+00:00'},
    {'author': 'mcs4774',
     'content': "That makes sense from an observability perspective. I'm wondering how to define a clear line between what is deemed an offline eval versus an online eval? Or does no such line exist?",
     'id': '1399668317643604020',
     'timestamp': '2025-07-29T08:22:11.506000+00:00'},
    {'author': 'intellectronica',
     'content': "Typically in a production system you won't run the kind of thing we refer to as evals here. You might have guardrails and checks that are part of the actual system, but that's a different thing.",
     'id': '1399668618609819750',
     'timestamp': '2025-07-29T08:23:23.262000+00:00'},
    {'author': 'intellectronica',
     'content': "There's no rule that says you definitely shouldn't run evals in run-time, but I'm not sure there's much value in that. Maybe when observing complex agentic systems at work?",
     'id': '1399668875993415690',
     'timestamp': '2025-07-29T08:24:24.627000+00:00'},
    {'author': 'mcs4774',
     'content': 'I think code-based evals could run fine in real-time?',
     'id': '1399668961359954060',
     'timestamp': '2025-07-29T08:24:44.980000+00:00'},
    {'author': 'mcs4774',
     'content': "I guess what I was wondering is, do all 'online-evals' have to run in the same request?",
     'id': '1399669199252750376',
     'timestamp': '2025-07-29T08:25:41.698000+00:00'},
    {'author': 'intellectronica',
     'content': 'Yes, but do you mean using them for improving the system, or for influencing what the system does in real time? The latter I wouldn\'t really call "evals", although technically there\'s no difference.',
     'id': '1399669338767753286',
     'timestamp': '2025-07-29T08:26:14.961000+00:00'},
    {'author': 'mcs4774',
     'content': 'I would say the online evals are more for monitoring (alerting etc.) and sampling for downstream error analysis.',
     'id': '1399669519353647135',
     'timestamp': '2025-07-29T08:26:58.016000+00:00'},
    {'author': 'mcs4774',
     'content': "So the things I'm thinking are online evals atm are:\n- Product metrics (e.g. if we A/B test a system)\n- Customer feedback signals (implicit/explicit)\n- Guardrails (that can block responses)\n- Fast running evals that run on real-time to provide signal (sampling, and again monitoring)??? <-- I guess this bit i'm unsure",
     'id': '1399670009982226463',
     'timestamp': '2025-07-29T08:28:54.991000+00:00'},
    {'author': 'intellectronica',
     'content': "If you're sampling for downstream error analysis, there's no need to run the eval in real-time, just to capture traces.\nUsing cheap and fast evals for monitoring, yes, I can see how that can be a useful thing to do. Again, regarding terminology, I think that's on the borderline between eval and guardrail, and maybe more the latter.",
     'id': '1399670091859234869',
     'timestamp': '2025-07-29T08:29:14.512000+00:00'},
    {'author': 'intellectronica',
     'content': 'Basically:\n- If you need the output of the "eval" to influence the work of the system in real time (for example to disengage in a catastrophic failure), run it in real time (and I would probably call it a guardrail)\n- If you are collecting data for improving the system (A/B testing, error analysis, etc...) there\'s no need to run in real time, just record traces.',
     'id': '1399670562644819988',
     'timestamp': '2025-07-29T08:31:06.756000+00:00'},
    {'author': 'mcs4774',
     'content': "Hmmm, feels like there is a gap here for evals that run in real-time (but don't block a response)? https://arize.com/llm-evaluation/\n\nArize for example seperate guardrails and online evals.",
     'id': '1399671512163684404',
     'timestamp': '2025-07-29T08:34:53.139000+00:00'},
    {'author': 'mcs4774',
     'content': '> AI engineers don’t want to block or revise the output, but want to know immediately if something isn’t right.',
     'id': '1399671585765326880',
     'timestamp': '2025-07-29T08:35:10.687000+00:00'},
    {'author': 'mcs4774',
     'content': 'In their example, it seems to imply that are running an eval in real-time on traces to flag issues for immediate review (which seems to be an llm judge or sorts).',
     'id': '1399672229532274748',
     'timestamp': '2025-07-29T08:37:44.173000+00:00'},
    {'author': 'intellectronica',
     'content': "Definitely you wouldn't want to block the app while running evals.\nAnd it's probably nice to have everything evaluated and available immediately for review, if that's part of your process.",
     'id': '1399672522840211606',
     'timestamp': '2025-07-29T08:38:54.103000+00:00'},
    {'author': 'mcs4774',
     'content': 'https://docs.smith.langchain.com/observability/how_to_guides/online_evaluations\n\nAlso mention using llm judges in online evals',
     'id': '1399672757242953748',
     'timestamp': '2025-07-29T08:39:49.989000+00:00'},
    {'author': 'intellectronica',
     'content': "Cool. I have no experience working in this way, so maybe I'm lacking imagination a bit. But nice to see that this kind of process is supported by tools like LangSmith and Arize.",
     'id': '1399673475186032763',
     'timestamp': '2025-07-29T08:42:41.160000+00:00'},
    {'author': 'mcs4774',
     'content': "It seems that the core differentiation between online vs offline is primarily 'when' it runs. Online running in near-real time to flag issues immediately for investigation. Also online evals can only be reference-free. \n\nBe good to hear <@525830737627185170> <@893327214685343804> thoughts. \n\nIt seems many evals can run as both offline, guardrail, and online.",
     'id': '1399674203241713674',
     'timestamp': '2025-07-29T08:45:34.742000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'I agree with everything mentioned in this thread. Strategy-wise, you may want to think about what the purpose of your online eval is going to be. A guardrail that runs “in the critical path” of the user query (if the guardrail fails you retry)? Or a background eval for your dashboards? For guardrails I don’t recommend using LLM judges unless (a) your judge runs fast, and (b) your judge has low false positive rate. That is, your judge should not fail & cause a retry if the output is actually good. So my recommendation is to determine the cost of your LLM judge guardrail (not just money; consider latency and unnecessary retries due to false positives), and if the cost outweighs the benefit, then it would make sense to deploy. \n\nWe will talk about CI CD in week 3!',
     'id': '1399778773817360447',
     'timestamp': '2025-07-29T15:41:06.310000+00:00'},
    {'author': 'gsin6792',
     'content': 'Great thread! i found this article very informative on evals driven design: https://cookbook.openai.com/examples/partners/eval_driven_system_design/receipt_inspection',
     'id': '1399806621332279517',
     'timestamp': '2025-07-29T17:31:45.675000+00:00'},
    {'author': 'hamelh',
     'content': '<@536265259472912385> Hi took notes here as well incase it is helpful.  Thanks for your question\n\nhttps://hamel.dev/blog/posts/evals-faq/whats-the-difference-between-guardrails-evaluators.html',
     'id': '1399831992585027705',
     'timestamp': '2025-07-29T19:12:34.653000+00:00'}]},
  {'question': {'author': 'tech_appsmith',
    'content': 'While evaluating a coding agent—where the output is code generated based on user requirements—how much of this guidance and process (mentioned in the reader) would still be applicable? Are there any best practices or recommendations specifically tailored for evaluating such agents, or would the existing approach apply to them as well?',
    'id': '1399671514470809652',
    'timestamp': '2025-07-29T08:34:53.689000+00:00'},
   'thread': [{'author': 'intellectronica',
     'content': "The principles are the same. But the kind of evaluation will be different, because to evaluate code generation you'll want to verify that the generated code does exactly what it's supposed to do, so that's different from evaluating a linguistic response.",
     'id': '1399673110361280575',
     'timestamp': '2025-07-29T08:41:14.179000+00:00'},
    {'author': 'tech_appsmith',
     'content': 'Are there any guides available on how to approach this—such as which metrics to use and how to effectively evaluate the generated code?\nTagging <@893327214685343804> & <@525830737627185170> as well.',
     'id': '1399689305428983810',
     'timestamp': '2025-07-29T09:45:35.384000+00:00'},
    {'author': 'kimsia',
     'content': 'Hey <@1397263078818844773> eval coding agent is something a couple of us are interested as well\n\n\nI wrote a separate thread 🧵 will tag you there as well',
     'id': '1399721535995904150',
     'timestamp': '2025-07-29T11:53:39.750000+00:00'},
    {'author': 'tech_appsmith',
     'content': 'Thank you.',
     'id': '1399722281701347508',
     'timestamp': '2025-07-29T11:56:37.540000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'Yep open and axial coding (error analysis) applies and the process is the same. Your failure modes that emerge from your error analysis will be different though compared to something like nurtureboss or recipe bot',
     'id': '1399783894735388753',
     'timestamp': '2025-07-29T16:01:27.232000+00:00'},
    {'author': 'hamelh',
     'content': 'You might be interested in this https://youtu.be/LwLxlEwrtRA?si=w475IZq5eG04YI1N\n\n<@407821522758008852>',
     'id': '1399784106010874008',
     'timestamp': '2025-07-29T16:02:17.604000+00:00'},
    {'author': 'kimsia',
     'content': 'Yup Eleanor drew my attention to this as well',
     'id': '1399918805685112934',
     'timestamp': '2025-07-30T00:57:32.509000+00:00'}]},
  {'question': {'author': 'yayat4032',
    'content': "Following the reading, I have two questions about inter-annotator metrics that I'd be open to hearing perspectives on:\n\n1. As a scientist in a former life, it’s clear to me why we would need to adjust a metric to account for chance. however, there are members of my product team that insist on percent agreement being the preferred metric for measuring agreement between annotators.  Does anyone have suggestions or good explanations for less technical stakeholders for why metrics like Cohen’s or Fleiss’ Kappa are important? Or maybe conversely, examples where the percent agreement may be superior?\n\n2. In the case of multi-modal data (e.g. images accompanying text), are there standard or commonl y used metrics for measuring agreement when it’s not simply binary or categorical?\n\nThanks!",
    'id': '1399715774188490783',
    'timestamp': '2025-07-29T11:30:46.028000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': "Explanations for your stakeholders;\n\n- If your product only has 10% errors, random guessing can produce 90% agreement. \n- CK helps adjust for imbalanced data so you aren't misleading yourself. \n- % agreement not really superior strictly ever, but ok if data is balanced (which is usually not). \n\n2. Why can't you have binary judgements on images?  I see no reason you cannot!",
     'id': '1399849386216915174',
     'timestamp': '2025-07-29T20:21:41.618000+00:00'},
    {'author': 'yayat4032',
     'content': 'Thanks for the explantations!\n\nI’ll clarify my original question about images. We have a lot of customers who use bounding box annotations on images, and I’m looking for ways to calculate a score or evaluation metric that, given a pair of images with bounding boxes on them, are there ways to calculate agreement when geometry/IoU is involved in addition to classification agreement?',
     'id': '1400109885710143510',
     'timestamp': '2025-07-30T13:36:49.537000+00:00'}]},
  {'question': {'author': 'tech_appsmith',
    'content': '<@525830737627185170>, <@893327214685343804> While building a multi-agent pipeline in its early stages, we tend to make frequent changes almost daily. How often should I run error analysis in such a dynamic environment? Running it after every significant change (which happens quite often) feels excessive.',
    'id': '1399743840847659008',
    'timestamp': '2025-07-29T13:22:17.641000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': "Answering in thread here:\n\nLet your system stabilize a bit before jumping into the evals process. Meaning you don't want to start the evals process too early. You still need to have an idea of what you want to build and iterate on it. So it's totally fine to go from zero to one with no evals. Evals are important when you are trying to make things work better. But if you are just prototyping, it may not make sense.\n\nHowever, what you do want to do is occasionally look at some data and do some open coding. You don't need to go all the way to axial coding or whatever it's up to you, but you will get value looking at the data. As your system stabilizes, you will intuitively know when it's time to optimize things",
     'id': '1399849589531738254',
     'timestamp': '2025-07-29T20:22:30.092000+00:00'}],
   'replies': [{'author': 'hamelh',
     'content': "Let your system stabilize a bit before jumping into the evals process. Meaning you don't want to start the evals process too early. You still need to have an idea of what you want to build and iterate on it. So it's totally fine to go from zero to one with no evals. Evals are important when you are trying to make things work better. But if you are just prototyping, it may not make sense.\n\nHowever, what you do want to do is occasionally look at some data and do some open coding. You don't need to go all the way to axial coding or whatever it's up to you, but you will get value looking at the data. As your system stabilizes, you will intuitively know when it's time to optimize things",
     'id': '1399782399981781093',
     'timestamp': '2025-07-29T15:55:30.855000+00:00'}]},
  {'question': {'author': 'handraisedbison',
    'content': "<@525830737627185170> <@893327214685343804> General question.  Is it possible to get a version of the course notes that's loaded into NotebookLM?  It's my default learning assistant.  I could certainly parse and upload it, but seems like a universally helpful resoure.",
    'id': '1399778477242060821',
    'timestamp': '2025-07-29T15:39:55.601000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'Feel free to.  We are constantly updating the notes, and we are not ready to share the notes outside the course (We also ask you keep the course notes private to yourself)',
     'id': '1399849825717190737',
     'timestamp': '2025-07-29T20:23:26.403000+00:00'},
    {'author': 'handraisedbison',
     'content': "Understood.  Wasn't asking to share broadly, just wondering  out loud if others in the course would also value having the NotebookLM version.",
     'id': '1399871919154659360',
     'timestamp': '2025-07-29T21:51:13.889000+00:00'},
    {'author': 'hamelh',
     'content': "Yeah feel free to share in the discord.  There are lots of notebook LMs I've seen already!",
     'id': '1400139475270762638',
     'timestamp': '2025-07-30T15:34:24.238000+00:00'}]},
  {'question': {'author': 'barbarianlibarian',
    'content': 'What amount of hours should we plan on asking the SMEs to dedicate on evaluation? Obviously it would change by complexity but if we had to ballpark it?',
    'id': '1399794203613200511',
    'timestamp': '2025-07-29T16:42:25.060000+00:00'},
   'thread': [{'author': 'intellectronica',
     'content': "It's really hard to give a number of hours, it's so context dependent. But I think the important idea to bring into the organisation is that reviewing is core activity and the main contributor to the quality of the AI product, not a nice-to-have or something you do a little bit on top of everything else. And also that it's repeated, not just a one off, because over time you have more data, better precision on categorisation, the system evolves, etc...",
     'id': '1399797199621324942',
     'timestamp': '2025-07-29T16:54:19.364000+00:00'},
    {'author': 'barbarianlibarian',
     'content': 'Makes sense. Just trying to plan for the ask since our SMEs are working full time on their current work and it will be different SMEs for every project. Perhaps we will have to go through a few rounds and get a good idea. Do you have any data on time spent on a project you’ve worked on?',
     'id': '1399799787993432124',
     'timestamp': '2025-07-29T17:04:36.480000+00:00'}]},
  {'question': {'author': 'alp1na0141',
    'content': '<@525830737627185170> <@893327214685343804> Suggestions on how one may eval pulling out / summarizing data from a transcript?  How might the trace be best setup?',
    'id': '1399799484476690623',
    'timestamp': '2025-07-29T17:03:24.116000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "I like to build a custom dashboard for document LLM output review. left half is the document w/highlighted terms that are key words in the LLM output. right half is the llm output. if the llm's job is extraction, you should be able to highlight the full extracted content in the document on the left!",
     'id': '1399815634690375780',
     'timestamp': '2025-07-29T18:07:34.627000+00:00'}]},
  {'question': {'author': 'sirjfdidymus',
    'content': "I think that's the field in the spreadsheet where she explains the failure mode",
    'id': '1399801458391781486',
    'timestamp': '2025-07-29T17:11:14.734000+00:00'}},
  {'question': {'author': 'bigcake3',
    'content': 'Open coding is the process of taking notes that are called znotes?',
    'id': '1399801668337668136',
    'timestamp': '2025-07-29T17:12:04.789000+00:00'},
   'replies': [{'author': 'chiiz8724',
     'content': 'using openai',
     'id': '1399801860600369316',
     'timestamp': '2025-07-29T17:12:50.628000+00:00'}]},
  {'question': {'author': 'handraisedbison',
    'content': 'I assume <@525830737627185170> is referencing a jupyter notebook',
    'id': '1399802070902636544',
    'timestamp': '2025-07-29T17:13:40.768000+00:00'},
   'thread': [{'author': 'nlp_mischief',
     'content': 'Could well be something like Marimo as well',
     'id': '1399804072315584633',
     'timestamp': '2025-07-29T17:21:37.942000+00:00'},
    {'author': 'handraisedbison',
     'content': 'Or generically:   an interactive document that combines live code, equations, visualizations, and explanatory text in a single shareable file that executes in a web browser.',
     'id': '1399804538847760454',
     'timestamp': '2025-07-29T17:23:29.172000+00:00'}]},
  {'question': {'author': 'gythaogg',
    'content': "This is ZNote, it's a Notion kind of thing: https://znote.io/",
    'id': '1399802258786488320',
    'timestamp': '2025-07-29T17:14:25.563000+00:00'}},
  {'question': {'author': 'josh.9764',
    'content': "<@525830737627185170> <@893327214685343804>  Do y'all use meta prompts to allow your agents outputs notes about issues they have?",
    'id': '1399802334132830361',
    'timestamp': '2025-07-29T17:14:43.527000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'lol! no. not that meta (yet)',
     'id': '1399815952908161157',
     'timestamp': '2025-07-29T18:08:50.496000+00:00'},
    {'author': 'josh.9764',
     'content': 'Two examples',
     'id': '1399821824354816161',
     'timestamp': '2025-07-29T18:32:10.358000+00:00'}]},
  {'question': {'author': 'manisnesan',
    'content': 'I faced a similar issue yesterday which <@893327214685343804> faced today with claude code in Lesson 3 Lecture. Had to steer the AI to manually review the annotation along with the example failure modes. First one at a time to make sure it aligns with my understanding of failure mode atleast for 5 samples. Then once it aligns with my understanding, then I asked it to complete the whole Axial Coding.',
    'id': '1399802842016911497',
    'timestamp': '2025-07-29T17:16:44.616000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'yes! LLMs never work perfectly for me 😦',
     'id': '1399816035879751751',
     'timestamp': '2025-07-29T18:09:10.278000+00:00'}]},
  {'question': {'author': 'chandra_80705',
    'content': '<@525830737627185170> - can we get the distribution of how many times the sub categories show up - to help focus on the key nodes? thx',
    'id': '1399803544437264445',
    'timestamp': '2025-07-29T17:19:32.086000+00:00'}},
  {'question': {'author': 'chiiz8724',
    'content': 'Why arent we doing one hot encoding for axial coding like it said in the textbook?',
    'id': '1399803955734909029',
    'timestamp': '2025-07-29T17:21:10.147000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'You can do 1 hot or you can just note the axial codes & run a script to convert the axial codes to 1 hot codes! I just wanted to stay within braintrust',
     'id': '1399816209649897503',
     'timestamp': '2025-07-29T18:09:51.708000+00:00'}]},
  {'question': {'author': 'alp1na0141',
    'content': 'Does the axial code need to match the taxonomy exactly, <@893327214685343804>  used a phrase that was not exactly the same.',
    'id': '1399804358300012624',
    'timestamp': '2025-07-29T17:22:46.126000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "it should match so you can programmatically group by the axial codes & collect the traces that exhibit that failure mode. it's totally fine to change the axial code names but you should update all references to it (which i did not do, oops)",
     'id': '1399816897389789204',
     'timestamp': '2025-07-29T18:12:35.678000+00:00'}]},
  {'question': {'author': 'zeldrinn',
    'content': "<@893327214685343804> you could also try using o3 for scripting up the json parsing / axial coding. i find 4.1 makes more errors when it comes to coding (it's also worse on swe bench).",
    'id': '1399804374531964979',
    'timestamp': '2025-07-29T17:22:49.996000+00:00'}},
  {'question': {'author': 'amicoin',
    'content': '<@893327214685343804> in the annotations for BT could we somehow further annote the text with more information we could use for analysis later in addition to Axial coding i.e. have values that relate to - Error Type: Is it a hallucination, factual inaccuracy, logical inconsistency, grammatical error, safety concern, or something else?\n\nSeverity: How critical is the error? Does it just sound wrong, or does it lead to misinformation, is it harmful to our brand? \nSource of Error: Is it likely due to prompt engineering, fine-tuning data, model architecture limitations, or inference issues or needs investigation?\nContext of Error: Under what conditions did the error occur (e.g., specific prompt structure, domain, length of output, context window issues)?\nUser Impact: How does or will the error affect the user experience or downstream application?\nEtc..\n\nThat can be searched on or analysed later. Thanks.Ami',
    'id': '1399804773317869568',
    'timestamp': '2025-07-29T17:24:25.074000+00:00'},
   'thread': [{'author': 'wayde_bt',
     'content': 'Yes.\n\nhttps://www.braintrust.dev/docs/guides/human-review#writing-to-expected-fields',
     'id': '1399808100101390427',
     'timestamp': '2025-07-29T17:37:38.241000+00:00'},
    {'author': 'hamelh',
     'content': "While you can definitely annotate all of these things, I would not recommend doing this in the open coding phase. Remember, open coding is for you to just observe the error, not to troubleshoot it, not to sort of ruminate on the trace for too long. You need to get a sense of what are the most common types of errors you have and not get bogged down in any one individual trace with too many reflections.\n\nWe will discuss this later but It's important to try to have binary judgments and make a call on whether or not something is passable rather than trying to hedge too much",
     'id': '1399850736430616577',
     'timestamp': '2025-07-29T20:27:03.534000+00:00'}]},
  {'question': {'author': 'annuaugustine_97667',
    'content': 'I think you want to hang on for fixing until you have done a 1 - 2 iterations',
    'id': '1399804903303549131',
    'timestamp': '2025-07-29T17:24:56.065000+00:00'}},
  {'question': {'author': 'timschukar_27054',
    'content': 'Do you have a recorded walk-through of your Julius approach?',
    'id': '1399805163081826384',
    'timestamp': '2025-07-29T17:25:58.001000+00:00'}},
  {'question': {'author': 'margaritafakih_01112',
    'content': 'There where a few options per error type, does it matter which one do we use for the axial coding exercise?',
    'id': '1399805350810620025',
    'timestamp': '2025-07-29T17:26:42.759000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "i like to use the fine grained label--in the next lecture we'll talk about LLM judges & LLM judges work better on fine grained labels",
     'id': '1399820369472524319',
     'timestamp': '2025-07-29T18:26:23.487000+00:00'}]},
  {'question': {'author': 'zeldrinn',
    'content': "one thing i often get hung up on with axial coding is deciding what the appropriate categorization principles are. there are ~infinitely many ways to categorize errors, and a given error could easily be put in bucket A or bucket X depending on how you've defined the buckets. relatedly, it can be difficult to define MECE (mutually exclusive, collectively exhaustive) axial codes. there often ends up being a fair bit of overlap / blurred lines between categories.",
    'id': '1399805539789045760',
    'timestamp': '2025-07-29T17:27:27.815000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "Really good point. this is why you will end up refining the axial codes as you go back and apply axial codes to the traces...if you find that it is really hard to bucket the trace into a code or two codes have too much overlap, you'll want to refine the codes to be more MECE. MECE is a great term btw",
     'id': '1399820883417104516',
     'timestamp': '2025-07-29T18:28:26.021000+00:00'},
    {'author': 'zeldrinn',
     'content': 'Makes sense, thanks Shreya! So don’t worry about creating the perfect buckets, just iterate toward better ones',
     'id': '1399854700999741543',
     'timestamp': '2025-07-29T20:42:48.761000+00:00'}],
   'replies': [{'author': 'erincode.org_64293',
      'content': 'I\'m curious if instructors will say this is best practice, but when I have a hierarchy and am conflicted between multiple "buckets" I default to the category highest in the hierarchy.',
     'id': '1399806954750218241',
     'timestamp': '2025-07-29T17:33:05.168000+00:00'}]},
  {'question': {'author': 'annuaugustine_97667',
    'content': 'Interesting, Shreya is talking about an approach similar to usability testing',
    'id': '1399805625135005718',
    'timestamp': '2025-07-29T17:27:48.163000+00:00'}},
  {'question': {'author': 'alp1na0141',
    'content': '<@525830737627185170> The Perturb trace example... I want to make sure I understand the example, it\'s taking a real-initial trace, then you manually added a known failure mode "synthetically"?',
    'id': '1399807312545321020',
    'timestamp': '2025-07-29T17:34:30.473000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'Correct. we should call this data augmentation',
     'id': '1399821347789738006',
     'timestamp': '2025-07-29T18:30:16.736000+00:00'}]},
  {'question': {'author': 'annuaugustine_97667',
    'content': 'Chatgpt says - To perturb a trace means to tweak part of the input (or sometimes the output or prompt structure) without changing the core intent — to test how robust, consistent, or brittle the model is.',
    'id': '1399807437485244486',
    'timestamp': '2025-07-29T17:35:00.261000+00:00'}},
  {'question': {'author': 'powerful_raccoon_36894',
    'content': "What's your view on this multiturn eval approach from Langsmith  - simulating a conversation with 2 LLMs? Is it practical or would you always recommend real user testing/at least based on real data/perturb a trace? \n\nSounds like you're saying we should never use fully synthetic data for evals, ideally real data or at least partially synthetic https://github.com/langchain-ai/langsmith-cookbook/blob/main/testing-examples/chatbot-simulation/chatbot-simulation.ipynb",
    'id': '1399807606830272713',
    'timestamp': '2025-07-29T17:35:40.636000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'I think having 2 agents simulating conversation can make sense as long as the agents are grounded in some real insights observed from real traces. I recommend starting with real traces first, and then when you understand how these real traces vary and important dimensions of these real traces, you can describe them in the system prompt when simulating conversation',
     'id': '1399843046824349810',
     'timestamp': '2025-07-29T19:56:30.189000+00:00'}]},
  {'question': {'author': 'chiiz8724',
     'content': 'i feel this error analysis can get political internally 😛',
    'id': '1399807779849506958',
    'timestamp': '2025-07-29T17:36:21.887000+00:00'}},
  {'question': {'author': 'vishal_learner',
    'content': "When perturbing a trace is the desired result that the bot sticks to its original constraints (i.e. it's robust against distractions) or that it is flexible and gives relevant responses (i.e. it adapts to changes in user needs)? Or is that use case dependent?",
    'id': '1399807848422047857',
    'timestamp': '2025-07-29T17:36:38.236000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'The whole point of perturbing traces or data augmentation is to trigger errors in your system while trying to be as realistic as possible. It requires a fair amount of judgment and is use case dependent.',
     'id': '1399851994189004901',
     'timestamp': '2025-07-29T20:32:03.407000+00:00'}]},
  {'question': {'author': 'gsin6792',
    'content': 'Evals is to AI what observability is to distributed systems — essential, continuous, and feedback-driven. Tests may help you ship, but evals help you learn and adapt continuously. \n\nhttps://cookbook.openai.com/examples/partners/eval_driven_system_design/receipt_inspection',
    'id': '1399807996460011542',
    'timestamp': '2025-07-29T17:37:13.531000+00:00'}},
  {'question': {'author': 'annuaugustine_97667',
    'content': 'the bigger the organisation, the more political this is going to be, appointing this "benevolent dictator"',
    'id': '1399808580617638019',
    'timestamp': '2025-07-29T17:39:32.805000+00:00'},
   'thread': [{'author': 'nlp_mischief',
     'content': 'In the spirit of this, how do you best resolve such an appointment?',
     'id': '1399808994058440825',
     'timestamp': '2025-07-29T17:41:11.377000+00:00'},
    {'author': 'hamelh',
     'content': "It doesn't have to be political at all. You should actually leverage the politics and find out who actually makes the decision. Usually, there are only a few people. With that, you can appoint somebody that they trust as the dictator.\n\nBut yes, there are always politics for any technology decision 😛",
     'id': '1399852293872160870',
     'timestamp': '2025-07-29T20:33:14.857000+00:00'}]},
  {'question': {'author': 'nlp_mischief',
    'content': 'Have you generally found people with prior exposure to open and/or axial coding make for better "Benevolent Dictators"?',
    'id': '1399808770179338371',
    'timestamp': '2025-07-29T17:40:18+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'Yes. People who have data experience are really good at being a benevolent dictator but it’s most important to have the domain expertise',
     'id': '1399843294300733512',
     'timestamp': '2025-07-29T19:57:29.192000+00:00'}]},
  {'question': {'author': 'julio.paulillo',
    'content': "When I create Evals around Multi-turn conversations, if I change the data structure to make corrections (like tool's names, arguments, tools outputs, user/assistant messages structure), should I discard this evals dataset and create a new one?",
    'id': '1399808782103478463',
    'timestamp': '2025-07-29T17:40:20.843000+00:00'},
   'thread': [{'author': 'wayde_bt',
      'content': "I don't think you need to discard the original so much as curate a new one to use in building evals for whatever error(s) you find in the full trace.",
     'id': '1399809124534976554',
     'timestamp': '2025-07-29T17:41:42.485000+00:00'},
    {'author': 'gsin6792',
      'content': 'If structural changes affect output behavior or interpretation, start a new eval set, but we can keep the old one for comparison and longitudinal learning. For example anthropic/open ai have noted in their papers/blogs that structure and task framing changes require new gold labels, but they maintain older evals for trend analysis across model improvements.\n\nSo I guess knowing what the impact of the change is, is the key? Maybe, to be sure, do a sample test on the previous version of the eval set?',
     'id': '1399810550728364183',
     'timestamp': '2025-07-29T17:47:22.516000+00:00'},
    {'author': 'julio.paulillo',
     'content': 'Hey <@1384234134288863443> <@1110812918230032458>, thanks for collaborating!\n<@1110812918230032458> do you have links for these papers/blogs?',
     'id': '1399812812238356582',
     'timestamp': '2025-07-29T17:56:21.702000+00:00'},
    {'author': 'gsin6792',
     'content': 'Some ideas here: https://platform.openai.com/docs/guides/evals-design',
     'id': '1399816306399772772',
     'timestamp': '2025-07-29T18:10:14.775000+00:00'},
    {'author': 'gsin6792',
     'content': 'Some more (not explicit but you would get the idea): https://www-cdn.anthropic.com/files/4zrzovbb/website/bd2a28d2535bfb0494cc8e2a3bf135d2e7523226.pdf?utm_source=chatgpt.com',
     'id': '1399816717605142538',
     'timestamp': '2025-07-29T18:11:52.814000+00:00'},
    {'author': 'julio.paulillo',
     'content': 'Thank you!',
     'id': '1399817448286326825',
     'timestamp': '2025-07-29T18:14:47.022000+00:00'},
    {'author': 'gsin6792',
     'content': 'I found this blog too: https://addyosmani.com/blog/ai-evals/',
     'id': '1399818252569284832',
     'timestamp': '2025-07-29T18:17:58.778000+00:00'},
    {'author': 'hamelh',
     'content': 'Wayde is right here. You don\'t need to throw everything away. You can always adapt to what you have or curate a new dataset. You just have to make sure that the dataset is always representing the current state of your system faithfully.\n\nWhen I say "adapting what you have," what I mean is that sometimes you can programmatically fix the structure of your data, and sometimes that\'s not worth the effort though, and it can be easier just to curate new data sets.',
     'id': '1399852644692131891',
     'timestamp': '2025-07-29T20:34:38.499000+00:00'},
    {'author': 'julio.paulillo',
     'content': '> You just have to make sure that the dataset is always representing the current state of your system faithfully.\nThat\'s a key point for me. Really appreciate <@525830737627185170> !\nBuilding on that, maybe storing all traces as JSON (or other easily parseable formats), would be a better approach to make this transformation easier? Like embedding a "middleware/interface" within my Evals Framework so I will always have this parsing layer for data structure portability?',
     'id': '1400166905746034808',
     'timestamp': '2025-07-30T17:23:24.173000+00:00'},
    {'author': 'hamelh',
      'content': 'Yeah sure - how you store your data and how you choose to render it are totally different.  Always save all your data if possible in a lossless format',
     'id': '1400167206343671959',
     'timestamp': '2025-07-30T17:24:35.841000+00:00'}]},
  {'question': {'author': 'chiiz8724',
    'content': 'Steve Jobs was a "benevolent dictator" sometimes not so benevolent',
    'id': '1399808977856106547',
    'timestamp': '2025-07-29T17:41:07.514000+00:00'}},
  {'question': {'author': 'aztristian',
     'content': 'Would each team have a PDE when split into teams, or is it too much here?',
    'id': '1399809207288598530',
    'timestamp': '2025-07-29T17:42:02.215000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'Having 1 PDE spanning all teams might be the least friction bc there is less coordination required. But if you have so many teams that 1 PDE cannot do it all, then you need multiple',
     'id': '1399843747608399883',
     'timestamp': '2025-07-29T19:59:17.269000+00:00'}]},
  {'question': {'author': 'kunal_38373',
    'content': "What are people's thoughts on doing labeling of traces vs coming up with gold standard examples of output from scratch and then evaluating the LLM against those examples?",
    'id': '1399809303417716901',
    'timestamp': '2025-07-29T17:42:25.134000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'Great question. Coming up with gold standard examples can be very time consuming, especially for complex agents or multi agent systems. But if it’s easy to come up with gold standard labels, then definitely you can use them to identify errors more easily',
     'id': '1399844192116670585',
     'timestamp': '2025-07-29T20:01:03.248000+00:00'}]},
  {'question': {'author': 'zeldrinn',
    'content': "<@893327214685343804> occasionally you unfurl the last few bullets of a slide after you've finished talking and then quickly change slides so there isn't enough time to read the last few bullets. would help if you give the last few bullets a tad more time.",
    'id': '1399809351513800847',
    'timestamp': '2025-07-29T17:42:36.601000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'Thank you for this feedback!! I’ll adapt next lecture',
     'id': '1399843841980239992',
     'timestamp': '2025-07-29T19:59:39.769000+00:00'}]},
  {'question': {'author': 'wayde_bt',
    'content': "hardest thing in my life has been getting people who say they care about the AI product I'm building to actually care enough to be involved in error analysis.",
    'id': '1399809396032143442',
    'timestamp': '2025-07-29T17:42:47.215000+00:00'},
   'thread': [{'author': 'hamelh',
      'content': 'That\'s why you are ahead.  🙂',
     'id': '1399809570112671957',
     'timestamp': '2025-07-29T17:43:28.719000+00:00'},
    {'author': 'annuaugustine_97667',
     'content': "That is a real challenge, imagine a company building custom software for a client, and having to convince the client for an SME's time",
     'id': '1399809658092257403',
     'timestamp': '2025-07-29T17:43:49.695000+00:00'}],
   'replies': [{'author': 'robertfranklin_20252',
     'content': 'If you need a volunteer love to help / learn by doing!   rfranklin66@gmail.com',
     'id': '1399809825885650995',
     'timestamp': '2025-07-29T17:44:29.700000+00:00'}]},
  {'question': {'author': 'letty_32264',
    'content': 'I would think we can use the dimensions to create the rubric?',
    'id': '1399809508225581147',
    'timestamp': '2025-07-29T17:43:13.964000+00:00'}},
  {'question': {'author': 'neelimaputturu_78963',
     'content': '<@525830737627185170> For some reason, I am not able to get to "Slides" for any lesson. It says "can\'t reach the page". Is it because I am connected through my company\'s VPN?',
    'id': '1399809554006544405',
    'timestamp': '2025-07-29T17:43:24.879000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'yeah probably.  Try outside the vpn',
     'id': '1399809684705251558',
     'timestamp': '2025-07-29T17:43:56.040000+00:00'},
    {'author': 'neelimaputturu_78963',
     'content': 'I was able to get to BrainTrust Tutorial slides. Just not the lesson slides.',
     'id': '1399809980864921681',
     'timestamp': '2025-07-29T17:45:06.650000+00:00'},
    {'author': 'hamelh',
     'content': 'Try from your phone?  Might be your company network',
     'id': '1399847235281485996',
     'timestamp': '2025-07-29T20:13:08.795000+00:00'},
    {'author': 'hamelh',
     'content': "I just tested this, and it works. So, I'm not quite sure what the issue could be.",
     'id': '1399852837386588331',
     'timestamp': '2025-07-29T20:35:24.441000+00:00'},
    {'author': 'hamelh',
     'content': 'If you have trouble accessing links, please email Maven support. support@maven.com',
     'id': '1399852900263395451',
     'timestamp': '2025-07-29T20:35:39.432000+00:00'}]},
  {'question': {'author': 'combustability',
    'content': 'Are these annotations happening with each annotator separately reading through a csv?',
    'id': '1399809988280586351',
    'timestamp': '2025-07-29T17:45:08.418000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'Typically yes (or a dashboard for each annotator)',
     'id': '1399844367044182116',
     'timestamp': '2025-07-29T20:01:44.954000+00:00'},
    {'author': 'sh_reyashankar',
     'content': '<@733447888788652102> will discuss in his guest lecture',
     'id': '1399844450196525096',
     'timestamp': '2025-07-29T20:02:04.779000+00:00'}]},
  {'question': {'author': 'nishantjain_27996_83974',
     'content': 'Will the program talk about alternatives to annotations to derive low-cost models',
    'id': '1399810000469360701',
    'timestamp': '2025-07-29T17:45:11.324000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'Not sure I follow your question, can you elaborate? Thanks!',
     'id': '1399844607293919392',
     'timestamp': '2025-07-29T20:02:42.234000+00:00'},
    {'author': 'hamelh',
     'content': 'You have to put your eyes on some amount of data manually, meaning a human. Ultimately, you have to trust the model, and the only way to do that is to check its outputs. We are not proposing that you look at every single trace that is generated by your system, but rather you sample those traces. You might be interested in reading more about this here. We talk about smart sampling strategies.\n\nhttps://hamel.dev/blog/posts/evals-faq/how-do-i-surface-problematic-traces-for-review-beyond-user-feedback.html\n\nand\n\nhttps://hamel.dev/blog/posts/evals-faq/how-can-i-efficiently-sample-production-traces-for-review.html',
     'id': '1399853287112708247',
     'timestamp': '2025-07-29T20:37:11.664000+00:00'}]},
  {'question': {'author': 'powerful_raccoon_36894',
     'content': "Is it helpful to capture human/LLM confidence in annotations? This is especially for edge cases/grey areas where things could go either way, and we want a way to flag these for further review. Or is that a sign we haven't developed criteria to be clear enough?",
    'id': '1399810039325265982',
    'timestamp': '2025-07-29T17:45:20.588000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'You can try but capturing confidence may be too much overhead if confidence doesn’t correlate with accuracy. Confidence is yet another thing for annotators to reason about',
     'id': '1399844846541340857',
     'timestamp': '2025-07-29T20:03:39.275000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'I would say — once annotators have experience and you have experience coordinating annotators, feel free to use confidence',
     'id': '1399845008051277906',
     'timestamp': '2025-07-29T20:04:17.782000+00:00'},
    {'author': 'hamelh',
      'content': "I would generally shy away from capturing confidence because that's a very messy metric and it's not a good idea. Like Shreya was saying in lectures, Likert scales (which are 1-5 ratings) are generally a bad idea. We discussed why they are a bad idea in the course reader, so I would encourage reading through that carefully.\n\nIf your team is an expert in evals, then like sh_reya says, you can maybe use other approaches, but do not do this at the outset.",
     'id': '1399853567581618186',
     'timestamp': '2025-07-29T20:38:18.533000+00:00'}]},
  {'question': {'author': 'vikingh27',
    'content': 'Does the workflow change if we are dealing with images? (such as document scanned/digital/invoices/receipts?)',
    'id': '1399810160154906665',
    'timestamp': '2025-07-29T17:45:49.396000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'Nope! In a way images can be easier to evaluate bc you don’t have to read a long query or wall of text',
     'id': '1399845128083996764',
     'timestamp': '2025-07-29T20:04:46.400000+00:00'}]},
  {'question': {'author': 'nishantjain_27996_83974',
     'content': 'can we share some real life examples to back up the 0.6 threshold',
    'id': '1399810548845252799',
    'timestamp': '2025-07-29T17:47:22.067000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'Hard to share in detail bc we don’t have all the logs of all annotators; it’s just my rule of thumb but textbook numbers are > 0.6 for ok alignment and 0.8 for good alignment',
     'id': '1399845669124178071',
     'timestamp': '2025-07-29T20:06:55.394000+00:00'}]},
  {'question': {'author': 'henry_62300',
    'content': 'Re: cohen\'s kappa, I couldn\'t really find anything supporting the categorization of the values into poor/fair/substantial etc... Was just wondering why it would be valid to "accept" a cohen\'s kappa greater than .6 (seems arbitrary)',
    'id': '1399810586576949340',
    'timestamp': '2025-07-29T17:47:31.063000+00:00'}},
  {'question': {'author': 'letty_32264',
    'content': '<@525830737627185170> is the other session being recorded, Taming Diffusion QR...?',
    'id': '1399810937191399578',
    'timestamp': '2025-07-29T17:48:54.656000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'All sessions all courses always are recorded.  Always.',
     'id': '1399846832812855417',
     'timestamp': '2025-07-29T20:11:32.839000+00:00'}]},
  {'question': {'author': 'amicoin',
    'content': 'One bedroom apartments surely more questions need to be asked by the bot',
    'id': '1399811430949191752',
    'timestamp': '2025-07-29T17:50:52.377000+00:00'}},
  {'question': {'author': 'gojira3673',
     'content': 'Can you discuss interplay of error analysis & IAA with preference ranking i.e. LMSys style',
    'id': '1399811514277298268',
    'timestamp': '2025-07-29T17:51:12.244000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'We discuss this in the course reader.  LMK if you have questions',
     'id': '1399856064387748051',
     'timestamp': '2025-07-29T20:48:13.818000+00:00'},
    {'author': 'gojira3673',
     'content': 'Ah ok so IAA for this is still just "agree on preference ranking"',
     'id': '1399940791073247253',
     'timestamp': '2025-07-30T02:24:54.234000+00:00'}]},
  {'question': {'author': 'gythaogg',
    'content': "But this isn't about liking yes/no...this is about the user journey that this bot should follow. That's something that's typically part of upfront design.",
    'id': '1399811804686979223',
    'timestamp': '2025-07-29T17:52:21.483000+00:00'}},
  {'question': {'author': 'alexharling',
     'content': 'The person said "soon"; the response has to be time-bound somehow',
    'id': '1399812073919221760',
    'timestamp': '2025-07-29T17:53:25.673000+00:00'}},
  {'question': {'author': 'gythaogg',
     'content': "Kai, that's typically what shouldn't be there 🙂 But because the user interface of most GenAI apps is natural language, we can't help but use our primal language instincts and have opinions 🙂",
    'id': '1399812389557371040',
    'timestamp': '2025-07-29T17:54:40.927000+00:00'}},
  {'question': {'author': 'tyler_23238',
    'content': "Different re: Cohen's Kappa—this is pretty typical IAA but it seems to me that Krippendorff's alpha (mentioned in the text) is a much better default for everyone. Alpha generalizes better: it works with more than two annotators, handles missing data, and supports ordinal/interval data too. Cohen’s Kappa is really just for two raters doing exactly the same annotations on nominal data — and even then, it can be tricky.",
    'id': '1399812450580304043',
    'timestamp': '2025-07-29T17:54:55.476000+00:00'}},
  {'question': {'author': 'pjadester',
     'content': '"Trace" seems overloaded -- seems to \n(1) indicate the trace of a conversation between the User and the Prompt\n(2) trace of the request\'s traversal thru all the components of the pipeline\nIf my understanding is right, is there a way to draw a distinction between the two',
    'id': '1399812539801407579',
    'timestamp': '2025-07-29T17:55:16.748000+00:00'},
   'thread': [{'author': 'aztristian',
      'content': 'I think trace is appropriate cause, not just involves messages but also tool calls, so similar to traditional service traces it shows hops between components/service when fulfilling the "request"',
     'id': '1399812804751392808',
     'timestamp': '2025-07-29T17:56:19.917000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'Ah good point. When we say trace we refer to (2) but sometimes for brevity we will show (1) on the slide',
     'id': '1399845337769705512',
     'timestamp': '2025-07-29T20:05:36.393000+00:00'}]},
  {'question': {'author': 'vikingh27',
    'content': "The bot should have addressed the 'soon' part as well. Right? There may be something good, but it's not available soon. And how soon - 1 week or 2 weeks?",
    'id': '1399812791359246487',
    'timestamp': '2025-07-29T17:56:16.724000+00:00'}},
  {'question': {'author': 'chetanarajabhoj2023',
    'content': 'I got disconnected from zoom. Can someone please send me the link , I don’t see joining link anymore on maven',
    'id': '1399812980627210501',
    'timestamp': '2025-07-29T17:57:01.849000+00:00'},
   'replies': [{'author': 'chiiz8724',
     'content': 'in your email?',
     'id': '1399813047148613815',
     'timestamp': '2025-07-29T17:57:17.709000+00:00'}]},
  {'question': {'author': 'zeldrinn',
    'content': 'so does splitting the criterion essentially lead to one-hot encoding?',
    'id': '1399813066232697016',
    'timestamp': '2025-07-29T17:57:22.259000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'Not quite. Splitting a criterion simply means make it 2 different criteria to individually evaluate',
     'id': '1399846001140830408',
     'timestamp': '2025-07-29T20:08:14.553000+00:00'}]},
  {'question': {'author': 'letty_32264',
    'content': 'https://us06web.zoom.us/j/82514595488?pwd=7PTjJ2lVufq1r1mjfmTDcM51FCWkRc.1',
    'id': '1399813081466409152',
    'timestamp': '2025-07-29T17:57:25.891000+00:00'}},
  {'question': {'author': 'anton002962',
     'content': "what if developers often have the most context anyway?\n\ne.g. it's not uncommon for the developers / engineers to look at the error modes first, and fix them as a consequence\n\na bit of a hot take from you <@525830737627185170>; many teams can't easily get access to a dedicated human labeller either",
    'id': '1399813146620727380',
    'timestamp': '2025-07-29T17:57:41.425000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': 'Yes developers can have the most context!! Esp for code generation ai assistants',
     'id': '1399846549168586874',
     'timestamp': '2025-07-29T20:10:25.213000+00:00'},
    {'author': 'hamelh',
      'content': 'My comment around developers was to say that - developers usually do not have the most context, but of course, there are exceptions. For example, coding assistants. In fact, that is why coding assistants were one of the first products because those were the easiest things to develop given that there was no overhead required for separate domain experts.',
     'id': '1399856983221207040',
     'timestamp': '2025-07-29T20:51:52.885000+00:00'},
    {'author': 'anton002962',
      'content': 'Sure, but there are many other fields / use-cases beyond coding assistants where the developers are not just paratrooped into the project, but still have the context needed for evaluation.\n\nthanks for clarification though',
     'id': '1400190688049627188',
     'timestamp': '2025-07-30T18:57:54.316000+00:00'},
    {'author': 'hamelh',
     'content': 'Yes.  I would still argue on a macro level developers rarely have the needed context, so its just something to be very skeptical of unless you feel otherwise - because it is the default way people operate to "outsource" the entire thing to developers.  If you are a startup solo founder and a developer the entire product hinges on you being a domain expert (not just the AI).  So its not to say that developers are bad (I am a developer myself).  I\'m just saying that its an antipattern to throw it over the fence at a developer.',
     'id': '1400278500518400002',
     'timestamp': '2025-07-31T00:46:50.440000+00:00'}],
   'replies': [{'author': 'anton002962',
      'content': '<@893327214685343804> curious re your take too, wasn\'t sure if tagging is good/bad habit here 🙂',
     'id': '1399816370207850668',
     'timestamp': '2025-07-29T18:10:29.988000+00:00'}]},
  {'question': {'author': 'mckngbrd_',
    'content': 'is the Rubric eventually going to be used for automated evals??/ LLM as a judge',
    'id': '1399813171790872586',
    'timestamp': '2025-07-29T17:57:47.426000+00:00'},
   'replies': [{'author': 'sirjfdidymus',
     'content': 'Yes. At least that is what I would do and what I read in the reader.',
     'id': '1399813319023657030',
     'timestamp': '2025-07-29T17:58:22.529000+00:00'}]},
  {'question': {'author': 'mugdha_64069',
    'content': 'The Rubric has multiple criteria. Do we have one final result for each trace -- good/bad or pass/fail? Or there is pass/fail per criterion?',
    'id': '1399813408173326437',
    'timestamp': '2025-07-29T17:58:43.784000+00:00'}},
  {'question': {'author': 'migs01398',
    'content': "Would you recommend revisiting old traces after you've implemented improvements from the collaborative error analysis?",
    'id': '1399813443250425919',
    'timestamp': '2025-07-29T17:58:52.147000+00:00'}},
  {'question': {'author': 'haris.jaliawala',
    'content': 'I think the times in Maven are wrong for the lecture today (which was slated for 45 mins) and the Taming diffusion one was listed for 1:45pm ET - 2:15pm ET',
    'id': '1399813715922260048',
    'timestamp': '2025-07-29T17:59:57.157000+00:00'},
   'thread': [{'author': 'haris.jaliawala',
     'content': 'No problem today, but I do rely on these times to block off my calendar from work! Would appreciate if they can be updated if they are meant to be hour long for future lectures too!',
     'id': '1399814005148749985',
     'timestamp': '2025-07-29T18:01:06.114000+00:00'},
    {'author': 'hamelh',
     'content': 'Yeah sorry about that  we will check the times',
     'id': '1399816064757534730',
     'timestamp': '2025-07-29T18:09:17.163000+00:00'},
    {'author': 'sh_reyashankar',
     'content': 'Sorry!',
     'id': '1399846309929947286',
     'timestamp': '2025-07-29T20:09:28.174000+00:00'}]},
  {'question': {'author': 'tech_appsmith',
    'content': 'Can Braintrust be used as an annotation tool? I noticed <@525830737627185170> emphasizes building our own, but just out of curiosity—would it be feasible to use Braintrust for this purpose as well?',
    'id': '1399822411037151373',
    'timestamp': '2025-07-29T18:34:30.234000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'Absolutely. You will see examples of this in the recorded workshops on tools that are coming.  <@1384234134288863443> will not let you down!',
     'id': '1399857522172625008',
     'timestamp': '2025-07-29T20:54:01.381000+00:00'},
    {'author': 'wayde_bt',
     'content': 'oh yah its there.  see here for a sneak peek ... https://www.braintrust.dev/docs/guides/human-review',
     'id': '1399872836780425257',
     'timestamp': '2025-07-29T21:54:52.668000+00:00'},
    {'author': 'tech_appsmith',
     'content': 'Thank you.',
     'id': '1399969315658137741',
     'timestamp': '2025-07-30T04:18:15.025000+00:00'}]},
  {'question': {'author': 'an_onel',
     'content': "This might be a more generic issue, but how do you handle error analysis when you are expecting different formatting for the output?\nLet's say we have a question and the answer might be formatted in 2 different ways: bullet points or dashes.\nBoth are correct, but where should you put the request to the model for the output format? System prompt or few-shot?\n\nIf it's in the system prompt then it makes sense to do error analyses for both cases and catch the errors.\nBut if using few-shot, then your error analysis looks different, right?\nHow do you start testing that? Is this a decision that you should make now at this stage?",
    'id': '1399840000962723871',
    'timestamp': '2025-07-29T19:44:23.999000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'In this particular case, I would say the "output format is not correct format...." for open code\n\n What you do in axial coding depends on the diversity of your open codes. If you\'re truly open coding, you will have an intuition on how fine-grained you want the axial codes to be, such as to split between different formats or just to say that there is a formatting problem. The idea is that you get an intuition of where your product fails so you can focus on that. It doesn\'t have to be perfect.',
     'id': '1399858033600888963',
     'timestamp': '2025-07-29T20:56:03.315000+00:00'}]},
  {'question': {'author': 'stylish_avocado_66151',
    'content': '<@525830737627185170> what other observability tools to get traces that you would recommend?',
    'id': '1399854076354625616',
    'timestamp': '2025-07-29T20:40:19.834000+00:00'},
   'thread': [{'author': 'hamelh',
      'content': "This is an excellent question. I get this so much, and it's understandable why people want to know. I have documented my answer here. I hope this helps.\n\nhttps://hamel.dev/blog/posts/evals-faq/seriously-hamel-stop-the-bullshit-whats-your-favorite-eval-vendor.html",
     'id': '1399858230963994645',
     'timestamp': '2025-07-29T20:56:50.370000+00:00'}]},
  {'question': {'author': 'tech_appsmith',
    'content': '<@893327214685343804>, <@525830737627185170> Is collaborative evaluation or annotation primarily useful when using LLM-as-judge evaluators? I assume that if we’re using programmatic evaluators with clear binary pass/fail criteria, collaborative review might not add much value. Is that understanding correct?',
    'id': '1399993493212106812',
    'timestamp': '2025-07-30T05:54:19.403000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'One of the purposes of the annotations is help you create a test set against which you can verify reference free evals (llm judges). \n\nWait until future lectures we will be covering this!',
     'id': '1400027820163334195',
     'timestamp': '2025-07-30T08:10:43.586000+00:00'}]},
  {'question': {'author': 'chandra_06891',
    'content': "hello <@893327214685343804> , <@525830737627185170> - I was reading the notes on Cohen's Kappa metric and wanted to check if there are any good examples for using multiple LLM Judges and how the rubric should be designed.  I am using GPT and Claude to evaluate the business scenario and it is taking a ton of time iterating on the rubric, prompts and scoring model (High, medium, low vs. numerical score). Do you have any example papers or business implementations for using multiple LLM as a judge and best practices on define the rubric, prompts and scoring model. Happy to provide more info on the scenario- thx",
    'id': '1400011161574838284',
    'timestamp': '2025-07-30T07:04:31.869000+00:00'},
   'thread': [{'author': 'hamelh',
     'content': 'What do you mean you are using GPT to iterate on the rubric and prompt and scoring model \n\nDid you do error analysis?\nHow is the model going to come up with a good rubric, did you write one down according to your preferences already or are you telling a model to just make one? \n\nWe haven’t covered implementing the evaluators yet that’s in an upcoming lecture.  Also see the course reader if you want to learn ahead of time\n\nHigh / medium / low score makes me think that you could be going off track in the exact way we tell you not to in the course reader!\n\nMaybe I can convince you to put the course reader into chatgpt and ask it to critique your approach if you are eager to experiment ahead',
     'id': '1400027243622694974',
     'timestamp': '2025-07-30T08:08:26.128000+00:00'}]},
  {'question': {'author': 'chandra_06891',
    'content': 'Text from the book - *"Cohen’s Kappa is intended for measuring agreement\n between two human annotators who are peers. It is not used to evaluate\n LLM-as-Judge outputs against human labels that we consider “ground\ntruth.”36 Once human ground truth is established, we evaluate the LLM\n as a classifier, using metrics like TPR, TNR, FPR, and FNR (Section 5.3).\n In cases with more than two annotators, use Fleiss’ Kappa (Fleiss\n 1971). For ordinal, interval, or missing data, Krippendorff’s Alpha (Krip\npendorff 2011) may be more appropriate. But for most binary or categori\ncal labeling tasks with two raters, Cohen’s Kappa remains a solid choice." 6 Shreya’s Note: You can use Cohen’s\n Kappa to quantify whether LLM Judges\n agree with each other, or, between LLM\n and human judge. This will tell you how\n likely the human and LLM Judge are\n to agree. *',
    'id': '1400011686164824135',
    'timestamp': '2025-07-30T07:06:36.941000+00:00'}},
  {'question': {'author': 'maruti707',
    'content': "Hi <@893327214685343804> <@525830737627185170> - I am kind of excited to start using the learnings from the course in my work: 1/ Asked my annotation team at work to write detailed feedback on their manual validation task (for a classification problem) while creating ground truth. I'll aggregate, analyze, and categorize all the issues before impriving the system further.  2/ Implemented phoenix to capture traces in my open-source project. These traces is what i'll use throught this course https://github.com/marutilai/Katalyst 🙂",
    'id': '1400110381858422884',
    'timestamp': '2025-07-30T13:38:47.828000+00:00'}},
  {'question': {'author': 'osirhis',
    'content': '<@893327214685343804> <@525830737627185170>  When you say "do 2-3 rounds of open + axial coding" does it mean to follow the process for another 100 traces or should we first change the system prompt based on what we have learned and then look at another 100 traces? I\'m wondering what is the point of coding twice the same traces.',
    'id': '1400168477486223441',
    'timestamp': '2025-07-30T17:29:38.905000+00:00'},
   'thread': [{'author': 'sh_reyashankar',
     'content': "ah; sometimes if you find that there are very few failures found in your ~100 traces, you may want to generate more traces to do open coding for (and refine the axial coding taxonomy when applying axial codes). but if you find enough errors you don't need to keep doing open coding",
     'id': '1400190258552766565',
     'timestamp': '2025-07-30T18:56:11.916000+00:00'}]},
  {'question': {'author': 'noviceai',
    'content': 'We have an existing slack support channel for our products . We are working on converting the support channel into a AI chatbot . Our approach is to index the last 1 years worth of conversation on the slack channel  into a RAG pipeline . We are not using any synthetic data since we already have real data (Ground truth) .  How to do error analisys / evals  ?',
    'id': '1400192229334712452',
    'timestamp': '2025-07-30T19:04:01.787000+00:00'},
   'thread': [{'author': 'intellectronica',
     'content': "So, the data you are indexing is real and you don't need to augment it with synthetic data. But you still want to evaluate the indexing and retrieval, and for that you'll need queries.\n\nYou may want to generate queries (either manually or synthetically). Or, you can just use the queries the customers asked originally on the chat - in that case you don't even need to generate queries.\n\nLook closely at the data (both quesions and answers), though, to make sure that the original questions really give you enough coverage. It might be a good idea to augment them with generated queries in addition, if you suspect that they are not covering all possibilities.\n\nAnother thing to consider, assuming you are not evaluating _all_ the questions you captured in the chat, is that you sample from the complete dataset strategically, so that you get a representative sample.\n\nThen, once you have a set of questions, and ground truth answers to compate to, error analysis is no different from the process you learnt here. But you have the benefit of being able to compare to ground truth answers you trust. And after you cover enough with manual reviews, you'll also be able to construct and LLM-judge that can do the rest.",
     'id': '1400203650751533136',
     'timestamp': '2025-07-30T19:49:24.865000+00:00'}]}]}

CLI Interface
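A sketch of what a CLI entry point for this notebook could look like, wrapping the `fetch_discord_msgs` coroutine defined above with `argparse` and `asyncio.run`. The flag names, output behavior, and script filename are illustrative assumptions, not the notebook's actual interface.

```python
# Illustrative CLI wrapper around fetch_discord_msgs (defined above).
# Flag names and the JSON output format are assumptions.
import argparse
import asyncio
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Fetch a Discord channel's full history, threads included"
    )
    parser.add_argument("channel_id", type=int, help="numeric Discord channel ID")
    parser.add_argument("-o", "--output", help="write simplified JSON here instead of stdout")
    parser.add_argument("--summary", action="store_true", help="print a fetch summary")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Outside a notebook we need our own event loop; inside one,
    # call `await fetch_discord_msgs(...)` directly instead.
    original, simplified = asyncio.run(
        fetch_discord_msgs(args.channel_id, print_summary=args.summary)
    )
    text = json.dumps(simplified, indent=2, default=str)
    if args.output:
        with open(args.output, "w") as f:
            f.write(text)
    else:
        print(text)
```

Invoked as e.g. `python fetch_discord.py 1369370266899185746 -o channel.json` (filename hypothetical).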

Helper Functions for Channel Discovery
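The discovery helpers might look something like this sketch, which assumes the notebook uses `discord.py` and reads the bot token from a `DISCORD_TOKEN` environment variable (both assumptions): connect once, record every text channel the bot can see, then disconnect. `format_channel_table` is a hypothetical pure helper for printing the results.

```python
import os


async def list_channels(token=None):
    """Return (guild_name, channel_name, channel_id) for every visible text channel.

    Sketch only: assumes discord.py and a DISCORD_TOKEN env var.
    """
    import discord  # third-party; imported lazily so format_channel_table works without it

    token = token or os.environ["DISCORD_TOKEN"]
    client = discord.Client(intents=discord.Intents.default())
    found = []

    @client.event
    async def on_ready():
        for guild in client.guilds:
            for channel in guild.text_channels:
                found.append((guild.name, channel.name, channel.id))
        await client.close()  # one-shot: disconnect as soon as we have the list

    await client.start(token)
    return found


def format_channel_table(rows):
    """Render discovery results as one 'guild / #channel (id)' line per channel."""
    return "\n".join(f"{guild} / #{name}  ({cid})" for guild, name, cid in rows)
```

The one-shot pattern (close the client inside `on_ready`) keeps the helper usable from a script without leaving a gateway connection open.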

Interactive
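For the interactive path, a small matcher like this (the name `pick_channel` is hypothetical) lets you select a channel by partial name from discovery results before fetching it; the trailing comments show the notebook-style top-level `await` usage that `nest_asyncio` enables.

```python
def pick_channel(rows, name_fragment):
    """Return the first (guild, channel, id) row whose channel name contains name_fragment."""
    matches = [row for row in rows if name_fragment.lower() in row[1].lower()]
    if not matches:
        raise ValueError(f"no channel matching {name_fragment!r}")
    return matches[0]


# In a notebook (nest_asyncio makes top-level await safe):
# rows = ...  # (guild, channel, id) tuples from the discovery helpers above
# _, _, channel_id = pick_channel(rows, "lesson-3")
# original, simplified = await fetch_discord_msgs(channel_id)
```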