Section 1: Introduction

  • What is data science?
  • Computation locality and the memory hierarchy

Section 2: Distributed computation using Map Reduce

  • Map-reduce, RDDs
  • Word-count example: loading, processing, collecting
  • Spark-SQL and DataFrames

Section 3: Analysis based on squared error

  • PCA
  • PCA with NaN entries
  • The weather database and its analysis using PCA
  • Combining effects and Percentage Variance Explained
  • Regression

Section 4: Clustering and intrinsic dimension

  • K-means
  • K-means++ and intrinsic dimension
  • The graph Laplacian

Section 5: Classification

  • Logistic regression
  • Tree-based regression
  • Ensemble methods for classification
    • Random forests
    • Gradient-boosted trees
    • Boosting and resampling