Google Site Reliability Engg Tech Talks | NYC Tech Events - GarysGuide


Google Site Reliability Engg Tech Talks

Paul Jaffre (Developer Experience Engineer, Sentry), Thiara Ortiz (Cloud Gaming SRE Manager, Netflix), Andrew Espira (Platform & Site Reliability Engineer, Kustode).

	Google HQ, 75 9th Ave
	Dec 16 (Tue), 2025 @ 06:00 PM
	FREE

DETAILS

Google SRE NYC proudly announces our last Google SRE NYC Tech Talk for 2025.

This event is co-sponsored by sentry.io. Thank you Sentry for your partnership!

Let's bid farewell to 2025 with three amazing interactive short talks on Site Reliability & DevOps topics! As always, the event will include an opportunity to mingle with the speakers & attendees over some light snacks & beverages after the talks.

The Meetup will take place on Tuesday, 16th of December 2025 at 6:00 PM at our Chelsea Market office in NYC. The doors will open at 5:30 PM. Please RSVP only if you're able to attend in person; there will be no live streaming.

When RSVP'ing to this event, please enter your full name exactly as it appears on your government-issued ID. You will be required to present your ID at check-in.

Agenda:
Paul Jaffre - Senior Developer Experience Engineer, sentry.io
One Trace to Rule Them All: Unifying Sentry Errors with OpenTelemetry tracing
SREs face the challenge of operating reliable observability infrastructure while avoiding vendor lock-in from proprietary APM (Application Performance Monitoring) solutions. OpenTelemetry has become the standard for instrumenting applications, allowing teams to collect traces, metrics, & logs. But raw telemetry data isn't enough. SREs need tools to visualize, debug, & respond to production incidents quickly. Sentry now supports OTLP, enabling teams to send OpenTelemetry data directly to Sentry for analysis. This talk covers how Sentry's OTLP support works in practice: connecting frontend & backend traces across services, correlating logs with distributed traces, & using tools to identify slow queries & performance bottlenecks. We'll discuss the practical benefits for SREs, like faster incident resolution, better cross-team debugging, & the flexibility to change observability backends without re-instrumenting code.
Paul's background spans engineering, product management, UX design, & open source. He has a soft spot for dev tools & loses sleep over making things easy to understand & use.
Paul has a dynamic professional background, from strategy to stability. His time at Krossover Intelligence established a strong foundation by blending Product Management with hands-on development, & he later focused on core reliability at MakerBot, where he implemented automated end-to-end testing & drove performance improvements. He then extended this expertise in stability & scale at Cypress.io, where he served as a Developer Experience Engineer, focusing on improving workflow, contribution, & usability for their widely adopted open-source community.
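
The abstract above mentions connecting frontend & backend traces across services. In OpenTelemetry setups this is commonly done by passing a W3C Trace Context `traceparent` header from caller to callee so both sides share one trace ID. The following is a minimal standard-library sketch of that header format only. It is not Sentry's or OpenTelemetry's implementation, and the function names (`new_traceparent`, `parse_traceparent`, `child_traceparent`) are hypothetical.

```python
import re
import secrets

# Version 00 layout of the W3C "traceparent" header:
#   00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
# This sketch handles version 00 only.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace, e.g. in the frontend (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the Trace Context spec.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 1),  # lowest bit is the sampled flag
    }


def child_traceparent(header):
    """Build the header a backend would send downstream: same trace, new span."""
    parent = parse_traceparent(header)
    if parent is None:
        return new_traceparent()  # no usable context, so start a fresh trace
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

Because every hop keeps the same `trace_id` and only swaps the span ID, a backend that receives spans from several services can stitch them into one trace regardless of which observability vendor stores them.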

Thiara Ortiz - Cloud Gaming SRE Manager, Netflix
Managing Black Box Systems
SREs often face ambiguity when managing black box systems (LLMs, games, poorly understood dependencies). We will discuss how Netflix monitors service health as black boxes using multiple measurement techniques to understand system behavior, aligning with the need for robust observability tools. These strategies are crucial for system reliability & user experience. By proactively identifying & resolving issues, we ensure a smoother playback experience & maintain user trust, even as the platform continues to evolve & mature. The principles shared in this talk can be extended to other applications, such as AI reliability in data quality & model deployments.

Thiara has worked at two of the largest internet companies in the world, Meta & Netflix. During her time at Meta, Thiara found a passion for distributed systems & bringing new hardware into production. Always curious to explore new solutions to complex problems, Thiara developed Fleet Scanner, internally known as Lemonaid, to perform memory, compute, & storage benchmarks on each Meta server in production. This service runs on over 5 million servers & continues to be utilized at Meta. Since Meta, Thiara has been working at Netflix as a Senior CDN Reliability Engineer, & now, Cloud Gaming SRE Manager. When incidents occur & Netflix's systems do not behave as expected, Thiara can be found working & engaging the necessary teams to remediate these issues.

Andrew Espira - Platform & Site Reliability Engineer, Founding Engineer, Kustode
ML-Powered Predictive SRE: Using Behavioral Signals to Prevent Cluster Inefficiencies Before They Impact Production
SREs managing ML clusters often discover resource inefficiencies & queue bottlenecks only after they've impacted production services. This talk presents a machine learning approach to predict these issues before they occur, transforming SRE from reactive firefighting to proactive system optimization.
We demonstrate how to build predictive models using production cluster traces that identify two critical failure modes: (1) GPU under-utilization relative to requested resources, & (2) abnormal queue wait times that indicate impending service degradation.
SRE practitioners will learn how to extract early warning indicators from standard cluster logs, build ML models that provide actionable confidence scores for operational decisions, & take practical steps to integrate predictive analytics into existing SRE toolchains to achieve a 50%+ reduction in resource waste & queue-related incidents.
This talk bridges the gap between traditional SRE observability & modern predictive analytics, showing how teams can evolve from reactive monitoring to intelligent, forward-looking reliability engineering.
Andrew has over 8 years of experience architecting & maintaining large-scale distributed systems. He is the Founding Engineer of Kustode (kustode.com), where he develops cutting-edge reliability & observability solutions for modern infrastructure in the insurance & health care space. Currently pursuing graduate studies in Data Science at Saint Peter's University, he specializes in the intersection of reliability engineering & artificial intelligence. His research focuses on applying machine learning to operational challenges, with publications in peer-reviewed venues including ScienceDirect. He's passionate about making complex systems more predictable & maintainable through data-driven approaches.
When not optimizing cluster performance or building the next generation of observability tools, Andrew enjoys contributing to open-source projects & mentoring early-career engineers in the SRE community.
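
The two failure modes named in Andrew's abstract, GPU under-utilization relative to requested resources & abnormal queue wait times, can be illustrated with a small baseline check. This is a minimal sketch using fixed thresholds & a z-score, not the ML models the talk describes. The function names, the 0.5 utilization threshold, & the z-score cutoff of 3.0 are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def gpu_underutilized(requested_gpus, used_gpu_samples, threshold=0.5):
    """Flag a job whose mean observed GPU usage is below `threshold`
    of what it requested. `threshold` is a hypothetical cutoff."""
    if requested_gpus <= 0 or not used_gpu_samples:
        return False  # nothing requested or nothing measured: no verdict
    return mean(used_gpu_samples) / requested_gpus < threshold


def abnormal_wait(history_secs, latest_secs, z_cutoff=3.0):
    """Flag a queue wait that sits more than `z_cutoff` standard
    deviations above the historical mean (simple z-score heuristic)."""
    if len(history_secs) < 2:
        return False  # too little history to estimate spread
    mu = mean(history_secs)
    sigma = pstdev(history_secs)
    if sigma == 0:
        return latest_secs > mu  # flat history: any increase stands out
    return (latest_secs - mu) / sigma > z_cutoff


# Example with made-up numbers: a job that asked for 8 GPUs but used about 2,
# and a queue whose waits hovered near 10 seconds before jumping to 60.
job_flag = gpu_underutilized(8, [1.0, 2.0, 3.0])
queue_flag = abnormal_wait([10, 12, 11, 9, 10], 60)
```

A predictive approach like the one in the abstract would replace these fixed cutoffs with a model trained on cluster traces. A static baseline of this kind is still a useful point of comparison when judging whether such a model adds value.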

Our Tech Talks series is for professional development & networking: no recruiters, sales or press please! Google is committed to providing a harassment-free & inclusive conference experience for everyone, & all participants must follow our Event Community Guidelines. The event will be photographed & video recorded.

Event space is limited! A reservation is required to attend. Reserve your spot today & share the event details with your SRE/DevOps friends.