Introduction to Data Science
Overview
Data science has become the central approach to tackling data-heavy problems in both business & academia. In this course, students learn how data science is done in the wild, with a focus on data acquisition, cleaning, & aggregation, exploratory data analysis & visualization, feature engineering, & model creation & validation. Students use the Python scientific stack to work through real-world examples that illustrate these concepts. Concurrently, students learn some of the statistical & mathematical foundations that power the data-scientific approach to problem solving.
Who is this course for?
Introduction to Data Science is for anyone with a basic understanding of data analysis techniques & anyone interested in improving their ability to tackle problems involving multi-dimensional data in a systematic, principled way. Familiarity with a programming language is helpful but not required, provided the pre-work for the course is completed (more on that below). No mathematical training beyond an introductory statistics course is necessary.
Students should have some experience with Python & have some familiarity with basic statistical & linear algebraic concepts such as mean, median, mode, standard deviation, correlation, & the difference between a vector & a matrix. In Python, it will be helpful to know basic data structures such as lists, tuples, & dictionaries, & what distinguishes them (that is, when they should be used).
Students should skip the pre-work if they can accomplish all of the following:
Write a program in Python that finds the most frequently occurring word in a given sentence.
Explain the difference between correlation & covariance, & why the difference between the two terms matters.
Multiply two small matrices together (e.g., a 3×2 matrix & a 2×4 matrix).
Otherwise, students should complete the following pre-work (approximately 8 hours) before the first day of class:
Exercises 1-7, 13, 18-21, 27-35, 38, & 39 of Learn Python The Hard Way.
Videos 1-6 of the Linear Algebra Review from Andrew Ng's Machine Learning course (labeled "III. Linear Algebra Review (Week 1, Optional)").
The exercises in Chapters 2 & 3 of OpenIntro Statistics.
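As a rough gauge of the level expected by the three self-check tasks listed earlier, they might look something like the sketch below. This is only one possible approach, not official course material: the function names (most_frequent_word, covariance, correlation, matmul) are hypothetical, the code uses only the Python standard library, and it assumes that "word" means a run of letters or apostrophes compared case-insensitively.

```python
from collections import Counter
import math
import re


def most_frequent_word(sentence):
    """Return the most common word in a sentence, ignoring case and punctuation."""
    # Assumption: a "word" is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", sentence.lower())
    # most_common(1) returns a one-element list holding a (word, count) pair;
    # ties go to the word that appears first.
    return Counter(words).most_common(1)[0][0]


def covariance(xs, ys):
    """Sample covariance; its size depends on the units of xs and ys."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled so it lies between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


def matmul(a, b):
    """Multiply an m x n matrix by an n x p matrix, each given as a list of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    # zip(*b) iterates over the columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
```

For example, multiplying every value in ys by 10 multiplies covariance(xs, ys) by 10 but leaves correlation(xs, ys) unchanged, which is why the distinction between the two terms matters. Passing a 3×2 matrix & a 2×4 matrix to matmul yields a 3×4 result.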
Upon completing the course, students will have:
An understanding of problems solvable with data science & an ability to attack those problems from a statistical perspective.
An understanding of when to use supervised & unsupervised statistical learning methods on labeled & unlabeled data-rich problems.
The ability to create data analytical pipelines & applications in Python.
Familiarity with the Python data science ecosystem & the various tools one can use to continue developing as a data scientist.