LLMadness – March Madness Model Evals

I wanted to play around with the non-coding agentic capabilities of the top LLMs, so I built a model eval that predicts the March Madness bracket. After playing around a bit with the format, I went with the following setup:

  • 63 single-game predictions vs. a full one-shot bracket
  • A maximum of 10 tool calls per game
  • An upset-specific instruction in the system prompt
  • Exponential scoring by round (1, 2, 4, 8, 16, 32)

There were some interesting learnings:

  • Unsurprisingly, most brackets are close to chalk. Very few significant upsets were predicted.
  • There was a HUGE cost and token disparity under the exact same setup and constraints. Both Claude models spent over $40 to fill in the bracket, while MiMo-V2-Flash spent $0.39. I spent a total of $138.69 on all 15 model runs.
  • There was also a big disparity in speed. Claude Opus 4.6 took almost two full days to finish the 2 play-ins and 63 bracket games; Qwen 3.5 Flash took under 10 minutes.
  • Even when given the tournament year (2026), multiple models pulled in information from previous years. Claude seemed to be the biggest offender, really wanting Cooper Flagg to be on this year's Duke team.

This was a really fun way to combine two of my interests, and I'm excited to see how the models perform over the coming weeks. You can click into each bracket node to see the full model trace and rationale behind the picks.

The stack is TypeScript, Next.js, React, and raw CSS. There's no DB; everything is stored in static JSON files. After each game, I update the actual results and re-deploy via GitHub Pages. I wanted to work as fast as possible since the brackets lock today, so almost all of the code was AI-generated (shocker).

Hope you enjoy checking it out!
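The exponential scoring described above (1, 2, 4, 8, 16, 32 points per round) can be sketched in TypeScript; the function and type names here are my own illustration, not the project's actual code:

```typescript
// Hypothetical sketch of the exponential per-round scoring:
// round 1 (Round of 64) = 1 point, doubling each round up to 32 for the championship.
function pointsForRound(round: number): number {
  if (round < 1 || round > 6) throw new RangeError("round must be between 1 and 6");
  return 2 ** (round - 1);
}

// A bracket pick: which round the game is in, and whether the prediction was correct.
interface Pick {
  round: number;
  correct: boolean;
}

// Total score is the sum of per-round points over all correctly predicted games.
function scoreBracket(picks: Pick[]): number {
  return picks.reduce((sum, p) => sum + (p.correct ? pointsForRound(p.round) : 0), 0);
}
```

Under this scheme each round is worth 32 points in total (32 games × 1, 16 × 2, …, 1 × 32), so a perfect 63-game bracket scores 192.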

  • AI Agent
  • Data Analytics
  • LLM
Mar 19, 2026

AI Summary

LLMadness is a project that evaluates the non-coding agentic capabilities of various LLMs by having them predict a March Madness bracket, comparing their performance, cost, speed, and accuracy.

Best For

LLM researchers and evaluators, AI enthusiasts interested in model comparisons, sports analytics hobbyists

Why It Matters

It provides a practical, real-world comparison of LLMs' prediction abilities, cost efficiency, and speed in a structured tournament format.

Key Features

  • Evaluates LLM agentic capabilities through March Madness bracket predictions
  • Compares model performance across cost, speed, and prediction accuracy
  • Provides detailed model traces and rationales for each game pick
  • Uses a static site with no database, updating results via GitHub Pages
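The database-free design mentioned above could store each game as a small static JSON record that gets edited and redeployed after every real result. This is a sketch under assumed field names; the project's actual JSON schema isn't shown in the source:

```typescript
// Hypothetical shape for one game's static JSON record (field names are assumptions).
interface GameRecord {
  gameId: string;           // e.g. "R1-G07" for round 1, game 7
  round: number;            // 1 (Round of 64) through 6 (championship)
  teams: [string, string];  // the two teams in this matchup
  winner: string | null;    // null until the game has been played
}

// Before a game is played, the record carries no winner; after the game,
// the file is updated with the result and the site is redeployed.
const pending: GameRecord = {
  gameId: "R1-G07",
  round: 1,
  teams: ["Duke", "Vermont"],
  winner: null,
};

const decided: GameRecord = { ...pending, winner: "Duke" };
```

Keeping results in flat JSON means a static host like GitHub Pages is enough: each update is just a file edit and a redeploy, with no server or database to run.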

Use Cases

  • A data science instructor creates interactive classroom exercises where students compare LLM prediction methodologies, using the bracket predictions to demonstrate cost-performance tradeoffs in model selection.
  • A sports analytics consultant uses the platform to benchmark different AI models' ability to forecast tournament upsets, providing clients with insights about which models handle edge cases best for their betting algorithms.
  • A product manager at an AI startup references the speed and cost comparisons when deciding which foundation models to integrate for real-time prediction features in their sports app.