Join the Snorkel AI Reading Group, a recurring forum to explore the latest frontier developments in AI while building meaningful connections within the community.
In this afternoon session, Yiyou Sun & Xinyang Han, Postdoctoral Researchers at UC Berkeley, will cover their recent paper: Agents' Last Exam (ALE).
Agenda:
4 pm - doors open
4:30 pm - talk begins
Boba tea & other refreshments will be provided!
Among other things, you'll learn:
ALE is a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. It was developed in collaboration with 250+ industry experts & covers 1,000+ tasks across 55 subfields in 13 industry clusters.
Widely used benchmarks lack sustained performance measurement on real, economically valuable workflows, creating a systematic gap between benchmark success & meaningful deployment across professional domains.
ALE grounds task coverage in O*NET / SOC 2018, the U.S. federal occupational taxonomy, ensuring systematic, reproducible coverage of non-physical job categories at scale.
The hardest task tier remains far from saturated: across mainstream harness & backbone configurations, the average full pass rate is just 2.6%, underscoring the substantial headroom that remains.
ALE's task pool grows continuously as new workflows & industries are onboarded, enabling longitudinal tracking of agent capabilities rather than one-time snapshot comparisons.
ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark performance & GDP-relevant economic impact.
Agents' Last Exam is a collaboration between UC Berkeley's RDI (Center for Responsible Decentralized Intelligence), Snorkel AI, & 250+ industry experts across academia & industry.