COMING UP

Farcon NYC
(Apr 30 - May 04)

NYC Health Innovation Week
(May 05 - May 09)

NYCxDESIGN
(May 14 - May 21)

NY Tech Week
(May 31 - Jun 08)

Venue, Online
May 14 (Wed), 2025 @ 05:00 PM
FREE
 
Register

DETAILS

Evaluating large language models (LLMs) can be a daunting task, and when it comes to agentic systems, the complexity only grows. In this second part of our community series with Arize AI, we will explore why traditional LLM evaluation metrics fall short when applied to agents and introduce modern evaluation techniques built for this new paradigm.


From code-based evaluations to LLM-driven assessments, human feedback, and benchmarking your metrics, this session will equip you with the tools and practices needed to assess agent behavior effectively. You will also get hands-on experience with Arize Phoenix and learn how to run your own LLM evaluations using both ground truth data and LLMs.
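To give a flavor of what this looks like in practice, here is a minimal sketch (illustrative only, not the session's actual material) contrasting a simple code-based check with an LLM-as-judge evaluation. It assumes the openai Python package with an API key set in the environment; the judge prompt, model choice, and function names are placeholders, not a prescribed setup.

    # Minimal sketch: two complementary styles of agent evaluation.
    # Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()

    def code_based_eval(agent_output: str, expected_tool: str) -> bool:
        # Deterministic check: does the agent's output mention the expected tool call?
        return expected_tool in agent_output

    def llm_judge_eval(question: str, agent_output: str) -> str:
        # LLM-as-judge: ask a model to label the agent's answer.
        prompt = (
            "You are grading an AI agent's answer.\n"
            f"Question: {question}\n"
            f"Answer: {agent_output}\n"
            "Reply with exactly one word: correct or incorrect."
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder judge model
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip().lower()

The session goes further, running evaluations like these over real agent traces with Arize Phoenix.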


What We Will Cover:



  • Why standard metrics like BLEU, ROUGE, or even hallucination detection aren't sufficient for evaluating agents.

  • Core evaluation methods for agents: code-based evaluations, LLM-driven assessments, human feedback and labeling, and ground truth comparisons.

  • How to write high-quality LLM evaluations that align with real-world tasks and expected outcomes.

  • Building and benchmarking LLM evaluations against ground truth data to validate their effectiveness (see the sketch after this list).

  • Best practices for capturing telemetry and instrumenting evaluations at scale.

  • How OpenInference standards (where applicable) can improve interoperability and consistency across systems.

  • Hands-on Exercise: Judge a sample agent run using both code-based and LLM-based evaluations with Arize Phoenix.
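As a rough illustration of the benchmarking idea above (not the session's actual exercise), the sketch below scores an LLM judge against hand-labeled ground truth using simple agreement; the label names and sample data are made up.

    # Minimal sketch: benchmark an LLM judge against human ground-truth labels.
    from typing import List

    def judge_agreement(judge_labels: List[str], ground_truth: List[str]) -> float:
        # Fraction of examples where the LLM judge matches the human label.
        assert len(judge_labels) == len(ground_truth)
        matches = sum(j == g for j, g in zip(judge_labels, ground_truth))
        return matches / len(ground_truth)

    # Hypothetical judge outputs vs. hand-labeled ground truth.
    judge = ["correct", "incorrect", "correct", "correct"]
    truth = ["correct", "incorrect", "incorrect", "correct"]
    print(f"Agreement: {judge_agreement(judge, truth):.0%}")  # 75%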


Ready for Part 3 of the series? Find it here!