DETAILS
Evaluating large language models (LLMs) can be a daunting task, and when it comes to agentic systems, the complexity grows considerably. In this second part of our community series with Arize AI, we will explore why traditional LLM evaluation metrics fall short when applied to agents and introduce modern LLM evaluation techniques built for this new paradigm.
From code-based evaluations to LLM-driven assessments, human feedback, and benchmarking your metrics, this session will equip you with the tools and practices to assess agent behavior effectively. You will also get hands-on experience with Arize Phoenix and learn how to run your own LLM evaluations using both ground truth data and LLM judges.
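To give a flavor of what "LLM evaluations using ground truth data and LLMs" means in practice, here is a minimal sketch of an LLM-as-judge evaluation: a second model grades an agent's answer against a reference answer. The prompt wording, the `gpt-4o-mini` model name, and the `judge_correctness` helper are illustrative assumptions rather than session materials; Phoenix ships its own evaluation templates and helpers, which the workshop uses.

```python
# Minimal sketch of an LLM-as-judge evaluation against a ground-truth reference.
# Assumes the OpenAI Python SDK (1.x); model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an AI agent's answer against a reference answer.
Question: {question}
Reference answer: {reference}
Agent answer: {answer}
Reply with exactly one word: "correct" or "incorrect"."""

def judge_correctness(question: str, reference: str, answer: str) -> str:
    """Ask a judge model whether the agent's answer matches the reference."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,  # deterministic grading
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in {"correct", "incorrect"} else "unparseable"

if __name__ == "__main__":
    print(judge_correctness(
        question="What is the capital of France?",
        reference="Paris",
        answer="The capital of France is Paris.",
    ))
```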
What We Will Cover:
- Why standard metrics like BLEU, ROUGE, or even hallucination detection aren't sufficient for evaluating agents.
- Core evaluation methods for agents: code-based evaluations, LLM-driven assessments, human feedback and labeling, and ground truth comparisons.
- How to write high-quality LLM evaluations that align with real-world tasks and expected outcomes.
- Building and benchmarking LLM evaluations using ground truth data to validate their effectiveness.
- Best practices for capturing telemetry and instrumenting evaluations at scale.
- How OpenInference standards (where applicable) can improve interoperability and consistency across systems.
- Hands-on Exercise: Judge a sample agent run using both code-based and LLM-based evaluations with Arize Phoenix (a minimal sketch follows this list).
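The sketch below mirrors the spirit of that exercise under simplifying assumptions: a deterministic code-based check and a stand-in LLM judge (`llm_judge_stub`, which you would replace with a real judge such as the one sketched above) are both benchmarked against hand-written ground-truth labels. The sample data and function names are illustrative, not the workshop notebook.

```python
# Sketch: compare a code-based evaluator with an LLM judge, then benchmark
# both against ground-truth human labels. The sample data is made up.
from typing import Callable

samples = [
    {"question": "What is 2 + 2?", "reference": "4",
     "agent_answer": "2 + 2 equals 4.", "human_label": "correct"},
    {"question": "Who wrote Hamlet?", "reference": "William Shakespeare",
     "agent_answer": "Hamlet was written by Christopher Marlowe.",
     "human_label": "incorrect"},
]

def code_based_eval(sample: dict) -> str:
    """Deterministic check: does the agent answer contain the reference string?"""
    return ("correct"
            if sample["reference"].lower() in sample["agent_answer"].lower()
            else "incorrect")

def llm_judge_stub(sample: dict) -> str:
    """Stand-in for an LLM judge; swap in a real judge when running live."""
    return "correct" if "4" in sample["agent_answer"] else "incorrect"

def agreement(evaluator: Callable[[dict], str], data: list[dict]) -> float:
    """Fraction of samples where the evaluator matches the human label."""
    return sum(evaluator(s) == s["human_label"] for s in data) / len(data)

print(f"code-based eval agreement with humans: {agreement(code_based_eval, samples):.0%}")
print(f"LLM-judge agreement with humans:       {agreement(llm_judge_stub, samples):.0%}")
```

Benchmarking your evaluators against a small labeled set like this is how you validate that an LLM judge can be trusted before scaling it across your traces.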
Ready for Part 3 of the series? Find it here!