1) Model metrics (harmonic mean of speedups across all tasks) 📊
2) Metrics change depending on budget 💰. You can see that smaller models level off after hitting $0.50, with the biggest gains happening early on. Honestly, even with a budget of $0.40, you can still get some solid improvements! 🔥
Fun fact: o4-mini / R1 scores better at $0.10 than Opus does at a full dollar! 💡
Budget-constrained benchmarks are super interesting, though they can be a bit limiting in real-world use. But hey, the biggest changes usually come from pricier models. Still, this is definitely a solid direction for student projects! 🎓✨
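
For anyone who wants to play with the metric from point 1, here is a minimal sketch of a harmonic mean over per-task speedups. The function name and the sample numbers are made up for illustration and are not taken from the benchmark itself; the benchmark's actual scoring code may handle failed tasks or edge cases differently.

```python
def harmonic_mean_speedup(speedups):
    """Harmonic mean of per-task speedups (hypothetical helper, not the benchmark's code)."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    # n divided by the sum of reciprocals: slow tasks pull the score down hard
    return len(speedups) / sum(1.0 / s for s in speedups)


# Illustrative values only: 2x on one task, 4x on another, no gain on a third
example = [2.0, 4.0, 1.0]
print(round(harmonic_mean_speedup(example), 2))  # about 1.71
```

Note how the single task with no gain drags the score well below the arithmetic mean (about 2.33), which is why this metric rewards consistent speedups over a few big wins.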