DETAILS
Welcome back to another year of open source programming with PyData NYC! We've got a new venue partnership with St. John's at 101 Astor Pl, New York, NY 10003. Join us on Feb 17th at 6:30 pm for a talk/demo night with Ming Zhao (IBM) & Andy Walner/Chandra Krishnan from Onehouse.
Please bring your laptop to code & sign up with the name on your government-issued ID.
Pizza & drinks sponsored by IBM - thank you!
Agenda:
Unlocking Document Intelligence with Docling
Speaker: Ming Zhao (Developer Advocate at IBM)
Most organizational knowledge is still locked inside complex documents, making it difficult to extract & use the information effectively. Traditional tools often fail when working with real-world document formats, particularly PDFs. Tables lose their structure, figures get separated from captions, & multi-column layouts become unreadable text. These failures make it difficult to bring AI to document-heavy workflows. Docling is an open-source project that takes a different approach, using deep learning models to parse documents the way humans read them. It preserves hierarchy, extracts structured data through a consistent API, & supports 15+ file formats out of the box. In this session we'll explore how you can leverage Docling in your own AI workflows.
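For a feel of the workflow ahead of the session, here is a minimal sketch based on Docling's quick-start usage (`DocumentConverter`, `convert`, `export_to_markdown`). The helper name `document_to_markdown` is a hypothetical wrapper added for illustration, not something from the talk, and running it assumes Docling is installed (`pip install docling`).

```python
def document_to_markdown(source: str) -> str:
    """Convert a document (local path or URL) to Markdown using Docling.

    `document_to_markdown` is an illustrative helper name, not part of Docling.
    """
    # Imported lazily so this file still loads where Docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)  # parse layout, tables, headings
    return result.document.export_to_markdown()
```

Calling `print(document_to_markdown("report.pdf"))` would print the Markdown rendering, where `report.pdf` is a placeholder for one of your own files.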
From OLAP to AI: How Hudi Brings Vector Search Directly to the Data Lakehouse
Speakers: Andy Walner (Product Manager at Onehouse) & Chandra Krishnan (Sales Engineering at Onehouse)
Vector search is rapidly becoming table stakes for AI workloads, but most teams are forced to bolt a separate vector database onto their lakehouse. In this talk, we introduce a new capability in Apache Hudi that brings vector support directly into the data lake, merging large-scale analytics & AI workloads in a single system.
We will demo native vector search on Hudi tables using PySpark, including a new vector search function that runs directly on lake data. You will see how swapping the base file format from Parquet to Lance unlocks better support for unstructured data & faster vector retrieval, while preserving warehouse-style analytics on the same tables.
This approach enables use cases like RAG, similarity search, & AI training directly on existing OLAP data, without duplicating data or introducing new storage systems. The design is engine-agnostic. While the demo uses PySpark, the same data can be processed with Ray, Daft, or other compute engines, pointing toward a single lakehouse architecture that supports both structured & unstructured data for analytics & AI.
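To make "similarity search" concrete, the sketch below ranks rows by cosine similarity to a query vector in plain Python. It is a conceptual illustration only: it does not use Hudi, Lance, or PySpark, and the names `cosine_similarity`, `top_k`, and the sample rows are assumptions made up for this example rather than the function shown in the demo.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], rows: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Brute-force search: score every row, keep the k most similar."""
    scored = (
        (cosine_similarity(query, vector), row_id)
        for row_id, vector in rows.items()
    )
    return [(row_id, score) for score, row_id in heapq.nlargest(k, scored)]


# Hypothetical table: row id -> embedding column.
rows = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [0.7, 0.7],
}
results = top_k([1.0, 0.1], rows, k=2)
print(results)  # doc_a ranks first, then doc_c
```

A production system would replace the brute-force scan with an index, but the ranking idea is the same.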
Networking
Connect with fellow data enthusiasts, professionals, & community leaders. Build meaningful connections & forge collaborations.
----------------------------------------------------------------
Doors open @ 6 pm
Doors close @ 7 pm
Event @ 6:30 - 8:30 pm
Venue provided by St. John's: 101 Astor Pl, New York, NY 10003
----------------------------------------------------------------
The building requires a government-issued photo ID for entrance. This, like all PyData NYC events, is an all-level event. Newcomers & beginners are welcome. This & all NumFOCUS-affiliated events & spaces, both in-person & online, are governed by a Code of Conduct.
----------------------------------------------------------------
This event may be recorded.