DETAILS
Limited to 8 seats.
Dates:
Mondays & Wednesdays | October 5, 7, 14, 19, 21, 26, 28, November 2, 4, 9, 11, 16
(Twelve Classes, Monday and Wednesday Nights)
Time:
7:00-9:30pm
Length of class: 30 hours
Instructor:
Sam Kamin is Associate Professor Emeritus at the University of Illinois at Urbana-Champaign,
where he taught computer science. Most recently, he was an engineer at Google before joining NYC Data Science
Academy as VP of Engineering.
Venue:
205 E 42nd Street, New York, NY 10017 (5 min from Grand Central)
Course Overview
An intensive, hands-on introduction to the Hadoop ecosystem of Big Data technologies.
The emphasis in this course is on learning several of the major components of Apache Hadoop (HDFS, MapReduce, Hive, Pig, and Streaming) by doing exercises of increasing complexity. Programming will be done in Python.
Students are expected to be familiar with using an operating system from the command line. Knowledge of Python is helpful; the material in "Learn Python the Hard Way" is sufficient background.
The course format is mixed lecture/lab. Students will need to bring their own laptops to connect to our server; instructions on how to install any required software will be provided ahead of time.
What is Hadoop?
Hadoop is an open-source framework for storing and processing large data sets in parallel across clusters of machines. Built on the MapReduce programming model introduced by Google and on the Hadoop Distributed File System (HDFS), Hadoop offers scalability, flexibility, and fault tolerance. It is designed to handle massive quantities of data, whether structured, semi-structured, or unstructured, which makes it well suited to Big Data.
As part of the Apache ecosystem, Hadoop is complemented by a host of related Apache projects, such as Hive, Pig, and ZooKeeper, that further extend its applications and usability.
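To give a flavor of the MapReduce style used in the course, here is a minimal word-count sketch in Python. It is only an illustration, not course material: the names mapper, reducer, and run_local are hypothetical, and the shuffle/sort step that Hadoop performs between the two phases is simulated in memory with sorted(). In an actual Hadoop Streaming job, the mapper and reducer would typically be separate scripts that read lines from standard input and write tab-separated key/value lines to standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(pairs):
    """Sum the counts for each word.

    Assumes the pairs arrive sorted by key, which is how Hadoop
    delivers mapper output to a reducer.
    """
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["big data big ideas", "data at scale"]
    for word, count in sorted(run_local(sample).items()):
        print(f"{word}\t{count}")
```

The mapper and reducer are kept as plain generator functions so the same logic can be tried on a laptop before being adapted to run on a cluster.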
SYLLABUS
Week 1 Introduction: MapReduce
Overview of Big Data and the Hadoop ecosystem
The concept of MapReduce
HDFS: Hadoop Distributed File System
MapReduce with Python streaming
Week 2 More on MapReduce
More on Big Data, the Hadoop ecosystem, and MapReduce
Mixed case studies and exercises using MR with Python streaming
Week 3 Hive: A database for Big Data
Hive concepts
HiveQL
User-defined functions in the Hive language
User-defined functions in Python (using streaming)
Advanced topic: Hive queries in Python code
Week 4 Pig: Simplified MapReduce
Basic concepts
Pig Latin
Pig functions and macros
User-defined functions
Week 5 Spark
Intro to Spark
Intro to Mahout
Week 6 Project day
The Hadoop ecosystem
Brief intro to Spark
Brief intro to Mahout
Case studies/Final projects