Introduction

  1. Slides What is Data Science?
  2. Slides Memory Latency and Distributed data analysis
  3. Slides MapReduce (slideshow in browser) (pdf)

Spark Basics

  1. Slides RDDs Plain and (Key,Value) pairs (slideshow in browser) (pdf)
  2. Slides Spark intro (slideshow in browser) (pdf)
  3. Notebook Spark Basics 1 (slideshow in browser) (pdf)
  4. Notebook Spark Basics 2 (slideshow in browser) (pdf)

Spark Architecture

  1. Slides Word Count using Spark (slideshow in browser) (pdf)
  2. Slides Distributed sort (pdf)
  3. Slides Spark Architecture (slideshow in browser) (pdf)
  4. Slides Partitioners and Glom (slideshow in browser) (pdf)
  5. Notebook Execution Plans, Lazy Evaluation, Caching and Glomming (slideshow in browser) (pdf)

Advanced Spark

  1. Notebook More RDD operations (slideshow in browser) (pdf)
  2. Notebook Spark SQL (slideshow in browser) (pdf)

Methods Based on Squared Error

  1. PCA and SVD
  2. Regression and SGD
  3. K-means

Classification Methods

  1. Decision Trees
  2. Boosting
  3. Ensembles
  4. Robust-Boost
  5. Active Learning (pdf)

Dimensionality and Low-Dimensional Embeddings

  1. Notions of Dimensionality