Section 1: Basics

  1. What is data science?
  2. Computation locality and the memory hierarchy.
  3. Map-reduce, RDDs.
  4. Word-count example: loading, processing, collecting.
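
The word-count item above can be sketched in plain Python before moving to Spark. This is a minimal illustration and not the course's own code: the input `lines` and the helper `add_pair` are hypothetical, and an in-memory list stands in for an RDD. In Spark, the same three steps would roughly correspond to `flatMap`/`map`, `reduceByKey`, and `collect`.

```python
from functools import reduce

# Load: hypothetical input; in Spark this would be a text file read into an RDD.
lines = [
    "the quick brown fox",
    "the lazy dog",
    "the fox",
]

# Map: split each line into words and emit one (word, 1) pair per word.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce: add up the counts that share the same key (word).
def add_pair(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(add_pair, pairs, {})

# Collect: gather the results, most frequent first, ties broken alphabetically.
top = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(top)
```

The sketch runs on a single machine; the point of map-reduce is that the map and reduce steps can be distributed across partitions without changing this logic.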

Section 2: DataFrames and PCA

  • DataFrames, Spark-SQL, Parquet.
  • PCA, working with NaN entries.
  • The weather database and its analysis using PCA.
  • Combining effects and percentage of variance explained.

Section 3: Clustering and intrinsic dimension

  • K-means
  • K-means++ and intrinsic dimension.
  • Non-linear dimensionality reduction.
    • Locally linear embeddings
    • Spectral analysis - The graph Laplacian

Section 4: Classification

  • Logistic regression
  • Tree-based regression
  • Ensemble methods for classification
    • Random forests
    • Gradient boosted trees
    • Boosting and resampling.

Section 5: Deep Neural Networks and TensorFlow

  • DNNs: the good, the bad and the ugly.
  • TensorFlow.
  • Convolutional Networks.
  • Auto-Encoders.