Section 1: Introduction
- What is data science?
- Computation locality and the memory hierarchy.
Section 2: Distributed computation using Map Reduce
- Map-reduce, RDDs
- Word-count example: loading, processing, collecting.
- Spark SQL and DataFrames
Section 3: Analysis based on squared error
- PCA with NaN entries
- The weather database and its analysis using PCA.
- Combining effects and Percentage Variance Explained
Section 4: Clustering and intrinsic dimension
- K-means++ and intrinsic dimension.
- The graph Laplacian
Section 5: Classification
- Logistic regression
- Tree-based regression
- Ensemble methods for classification
- Random forests
- Gradient boosted trees
- Boosting and resampling.