We are excited to welcome Suman Debnath, Technical Lead in Machine Learning at Anyscale, for a practical and intuitive introduction to distributed training.
Talk Description:
As modern AI models continue to grow, single-GPU training is no longer enough. Distributed training has become essential, but scaling models introduces challenges that require understanding communication patterns, system bottlenecks, and key trade-offs.
In this session, we will break down distributed training from first principles. We will explore why single-GPU training hits limits, how transformer models manage memory, and what techniques like gradient accumulation, checkpointing, and data parallelism actually do.
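To give a flavor of one of these techniques ahead of the session, here is a minimal PyTorch sketch of gradient accumulation. It is illustrative only and not material from the talk; the model, loader, optimizer, and accum_steps names are hypothetical placeholders.

    # Minimal gradient-accumulation sketch (illustrative; not from the talk).
    import torch

    def train_epoch(model, loader, optimizer, accum_steps=4):
        model.train()
        optimizer.zero_grad()
        for step, (inputs, targets) in enumerate(loader):
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, targets)
            # Scale the loss so the accumulated gradients average over
            # accum_steps micro-batches, mimicking one larger batch.
            (loss / accum_steps).backward()
            if (step + 1) % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()

The idea is simply that several small backward passes add their gradients together before a single optimizer step, trading extra compute time for a smaller per-step memory footprint.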
We will also demystify communication primitives, walk through ZeRO-1, ZeRO-2, ZeRO-3, and FSDP, and show how compute and communication can be overlapped for better efficiency. Finally, we will connect these concepts to real-world tooling used in frameworks like Ray and PyTorch. Attendees will gain a clear, grounded understanding of how distributed training works and when to apply different strategies.
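For the curious, here is a minimal PyTorch sketch of wrapping a model with FSDP, which shards parameters, gradients, and optimizer state across workers in the spirit of ZeRO-3. It is a sketch under stated assumptions, not a definitive recipe: it assumes a distributed process group has already been initialized (e.g. via torchrun), a GPU is available, and the layer sizes are arbitrary.

    # Minimal FSDP sketch (illustrative; assumes torch.distributed is
    # already initialized, e.g. when launched with torchrun).
    import torch
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def build_sharded_model():
        model = nn.Sequential(
            nn.Linear(1024, 4096),
            nn.ReLU(),
            nn.Linear(4096, 1024),
        )
        # FSDP's default full-sharding mode splits parameters, gradients,
        # and optimizer state across ranks, similar in spirit to ZeRO-3.
        sharded = FSDP(model.cuda())
        optimizer = torch.optim.AdamW(sharded.parameters(), lr=1e-4)
        return sharded, optimizer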
Bio:
Suman Debnath is a Technical Lead in Machine Learning at Anyscale, where he works on large-scale distributed training, fine-tuning, and inference optimization in the cloud. His expertise spans Natural Language Processing, Large Language Models, and Retrieval-Augmented Generation.
He has spoken at more than one hundred conferences and events worldwide, including PyCon, PyData, and ODSC, and has previously built performance benchmarking tools for distributed storage systems.
We look forward to seeing you!
#DataScience #MachineLearning #DistributedTraining #Ray #PyTorch #LLM #RAG #DeepLearning #USFCA #USFMSDSAI #DataInstitute #AIEngineering #TechTalk