NYC Tech Events - GarysGuide | The #1 Resource for NYC Tech

EVENT DETAILS

Register at https://us02web.zoom.us/webinar/register/WN_k0nytglMTLSp-UJJEfAXLw

We have two exciting talks. Eric Sun from LinkedIn will discuss the missing half of the columnar format story: how to take full advantage of a columnar format. Kaige from Kyligence will discuss the Apache Kylin query engine with sub-second response time.

Agenda
12:00 -- 12:05 pm Introduction
12:05 -- 12:40 pm Talk 1 + Q&A
12:40 -- 1:15 pm Talk 2 + Q&A
1:15 -- 1:30 pm Closing

Talk 1: Are We Taking Only Half Of The Advantage Of Columnar File Format?

The offline data ecosystem is mainly for batch (small ~ huge) ETL/analytics/ML/DL, which means the majority of useful data files are ingested once & then scanned/read hundreds of thousands of times. More than 90% of the workload on HDFS/S3/ADLS is reads, so it is very important to optimize for read operations. Simply keeping the data format & schema identical to the online upstreams (Kafka, RDBMS, Cassandra & MongoDB) in Avro or JSON can actually prevent us from leveraging modern compute engines & all the related optimizations. Switching to a columnar format (such as Parquet or ORC) is only about halfway to getting more done for less; this meetup talk will explain the other half. Among the many areas related to storage, the following optimizations can give a Data Lake the most significant ROI with relatively low investment:
sorting by the most frequently filtered field (w/ low~medium cardinality)
bucketing the big dimension/lookup tables (to remove the shuffle stage for joins)
or simply distributing the records by the almost-unique field w/o bucketing
sub-partitioning (multi-level partitions) for big & frequently used tables
rolling hourly partitions into daily ones instead of daily compaction
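
To make the first item concrete, here is a toy sketch in plain Python (not Parquet or ORC itself, and not material from the talk). It assumes a reader that can skip a chunk of rows whenever the filter value falls outside that chunk's min/max statistics, which is how columnar readers typically prune data. The names ROW_GROUP_SIZE, build_row_groups, groups_to_scan & country_ids are hypothetical, chosen only for illustration.

```python
import random

ROW_GROUP_SIZE = 1000  # hypothetical number of rows per chunk ("row group")


def build_row_groups(values, size=ROW_GROUP_SIZE):
    """Split values into fixed-size chunks and record min/max per chunk."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def groups_to_scan(groups, target):
    """Count the chunks whose [min, max] range could contain target."""
    return sum(1 for g in groups if g["min"] <= target <= g["max"])


random.seed(0)
# 100,000 rows with a medium-cardinality field (200 distinct values).
country_ids = [random.randrange(200) for _ in range(100_000)]

unsorted_groups = build_row_groups(country_ids)
sorted_groups = build_row_groups(sorted(country_ids))

# An equality filter such as "country_id = 42" must read every chunk
# whose min/max range covers 42.
unsorted_scanned = groups_to_scan(unsorted_groups, target=42)
sorted_scanned = groups_to_scan(sorted_groups, target=42)

print(f"chunks in file:          {len(unsorted_groups)}")
print(f"chunks read (unsorted):  {unsorted_scanned}")
print(f"chunks read (sorted):    {sorted_scanned}")
```

With unsorted data almost every chunk spans the full value range, so nearly all of them must be read; after sorting, the matching rows sit in one or two adjacent chunks & the rest can be skipped.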

Speaker: Eric Sun (LinkedIn)

Talk 2: Apache Kylin: Achieve Exact COUNT DISTINCT with Sub-Second Latency at PB Scale

With over 450 million customers, Didi (the world's largest rideshare company) conducts complex user behavior analysis on huge datasets daily. Exact Count Distinct is one of Didi's most critical metrics, but it is known for being computationally heavy & notoriously slow. The difference between exact Count Distinct & approximate Count Distinct can cost Didi millions of dollars. In this talk, Kaige Liu of the Apache Kylin project will explain how Didi uses Apache Kylin to return exact Count Distinct on billions of rows of data with sub-second latency to generate the most accurate picture of its business.

You will also learn about the latest developments in modern OLAP technologies. Kaige will share how Didi & Truck Alliance (a truck-hailing company that processes $100 billion worth of goods yearly) use Apache Kylin to power analytics platforms that allow hundreds of analysts to achieve sub-second latency on petabyte-scale data.

Speaker: Kaige Liu (Kyligence)

Kaige is a senior solutions architect at Kyligence, where he works on building the next-generation big data analytics platform. Previously, he worked on the OpenStack & Bluemix team at IBM, focusing on cloud computing & virtualization technology. Kaige loves the open source community & is an active Apache Kylin committer.