Course Overview
This is a 6-week evening program providing a hands-on introduction to the Hadoop and Spark ecosystem of Big Data technologies. The course will cover these key components of Apache Hadoop: HDFS, MapReduce with streaming, Hive, and Spark. Programming will be done in Python. The course will begin with a review of Python concepts needed for our examples. The course format is interactive. Students will need to bring laptops to class. We will do our work on AWS (Amazon Web Services); instructions will be provided ahead of time on how to connect to AWS and obtain an account.
Instructor: Dr. Sam Kamin
Sam Kamin was a tenured professor of Computer Science at the University of Illinois Urbana-Champaign for over 30 years, doing research in the areas of programming languages, high-performance computing, and educational technology. He is the author or co-author of several books, including a textbook on Java programming. After leaving the U of I, he joined Google's New York office as a Senior Software Engineer in the Tech Infrastructure department, the team responsible for the tools used to launch nearly all computations at Google. After gaining intimate knowledge of Google's cluster, Sam left Google to return to what he loves best: teaching people about computers. Sam leads the data engineering educational programs at NYC Data Science Academy and supports its consulting practice.
What is Hadoop?
Hadoop is a set of open-source programs running on computer clusters that simplify the handling of large amounts of data. Originally, Hadoop consisted of a distributed file system tuned for large data sets and an implementation of the MapReduce parallelism paradigm, but it has since expanded in many directions. It now includes database systems, languages for parallelism, libraries for machine learning, its own job scheduler, and much more. Furthermore, MapReduce is no longer the only parallelism framework; Spark is an increasingly popular alternative. In summary, Hadoop is a very popular and rapidly growing set of cluster computing solutions, and it is becoming an essential tool for data scientists.
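To preview the MapReduce paradigm mentioned above, here is a minimal sketch in ordinary Python: a map phase that emits (word, 1) pairs and a reduce phase that groups by key and sums. The function names and the tiny input are illustrative only, not course material.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by word and sum the counts."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["big data", "big clusters"])))
# counts is {"big": 2, "clusters": 1, "data": 1}
```

In real Hadoop jobs the shuffle (the sort-and-group step) is done by the framework across machines; the programmer supplies only the map and reduce logic.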
Syllabus
Week 1: Introduction to Hadoop, MapReduce, and Python
Overview of Big Data and the Hadoop ecosystem
The concept of MapReduce
HDFS: the Hadoop Distributed File System
Python for MapReduce
Week 2: MapReduce
More Python for MapReduce
Implementing MapReduce with Python streaming
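A Hadoop streaming job of the kind covered in this week boils down to two small scripts that read tab-separated records on stdin and write records to stdout. This is a minimal word-count sketch of that record format, with the logic factored into functions; it is an illustration of the streaming protocol, not code from the course.

```python
def mapper(lines):
    """Map step: emit one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"

def reducer(lines):
    """Reduce step: Hadoop sorts records by key, so all counts
    for a given word arrive on adjacent lines."""
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# In an actual job, each function would be the body of its own script,
# reading sys.stdin and printing each record; Hadoop performs the sort
# between the two stages. The invocation looks roughly like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input <in> -output <out>
```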
Week 3: Hive, a database for Big Data
Hive concepts, Hive query language (HiveQL)
User-defined functions in Python (using streaming)
Accessing Hive from Python
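A Python UDF of the kind Hive runs via streaming is again a stdin-to-stdout filter over tab-separated rows. Below is a minimal sketch; the table, column names, and script name are hypothetical, chosen only to show the shape of such a function.

```python
def to_minutes(rows):
    """Hypothetical UDF: turn tab-separated (user, seconds) rows
    into (user, minutes) rows."""
    for row in rows:
        user, seconds = row.rstrip("\n").split("\t")
        yield f"{user}\t{int(seconds) // 60}"

# Hive would run the script holding this function with TRANSFORM,
# streaming table rows through it, along the lines of:
#   ADD FILE to_minutes.py;
#   SELECT TRANSFORM (user, seconds)
#   USING 'python to_minutes.py' AS (user, minutes)
#   FROM sessions;
```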
Weeks 4 & 5: Spark
Intro to Spark using PySpark
Basic Spark concepts: RDDs, transformations, actions
PairRDDs and aggregating transformations
Advanced Spark: partitions; shared variables
SparkSQL
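The core Spark ideas listed for these weeks (RDDs built by lazy transformations, computed by a single action) can be previewed with a PySpark word count. The driver lines are shown as comments because they need a running Spark installation, and the input file name is a placeholder.

```python
def tokenize(line):
    """Split a line into lowercase words."""
    return line.lower().split()

def add(a, b):
    """Associative, commutative combiner passed to reduceByKey."""
    return a + b

# With Spark available, the same word count is a chain of lazy
# transformations ending in one action that triggers the work:
#   from pyspark import SparkContext
#   sc = SparkContext("local", "wordcount")
#   counts = (sc.textFile("input.txt")   # RDD of lines
#               .flatMap(tokenize)       # transformation: lines -> words
#               .map(lambda w: (w, 1))   # transformation: words -> pair RDD
#               .reduceByKey(add)        # aggregating transformation
#               .collect())              # action: runs the job
```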
Week 6: Project Week
Case studies/Final projects