This class uses Python 2.7 for all homework. Make sure you use python2 and not python3. The easiest way to install everything is to use Anaconda on a Linux or Mac computer.
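To confirm which interpreter you are actually running, a short check like the one below can help. It is only a sketch: the function name is_python27 is made up for illustration and is not part of the course materials.

```python
from __future__ import print_function

import sys


def is_python27(version_info=None):
    """Return True when the given version tuple is Python 2.7.x.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only the major and minor version numbers.
    return tuple(version_info[:2]) == (2, 7)


if is_python27():
    print("OK: running Python 2.7")
else:
    print("Warning: running Python %d.%d, but the class expects 2.7"
          % (sys.version_info[0], sys.version_info[1]))
```

Save this to a file and run it with the same python command you plan to use for the homework.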

Install and configure git and clone the course repository.

If you are already familiar with git, just clone the repository:

  git clone

If you are new to github, follow these directions

Install jupyter

If you install Anaconda, jupyter and almost all the necessary packages are installed for you.

Otherwise, follow the directions for DSE200 software installation: skip the startup directions for github and choose the installation directions that are right for your computer.

Install notebook extensions

This step is not required, but extensions can make your work on notebooks significantly easier.

To install a bunch of useful extensions, together with a configurator for managing these extensions, follow the directions on:

Install python packages

Make sure to install the python package findspark. Typing either of the following commands in the terminal installs the package (the first uses conda, the second uses pip):

  conda install -c conda-forge findspark=1.0.0
  sudo pip install findspark

If you are using pip instead of anaconda, you also must install the following packages:

  • numpy
  • matplotlib
  • pandas
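Once the packages are installed, one way to verify that they can be imported is sketched below. The helper missing_packages and the list REQUIRED are illustrative names, not part of the course code. The list assumes that each package's import name matches its install name, which holds for the four packages mentioned above.

```python
from __future__ import print_function

import importlib

# Packages named in this section (assumed import names).
REQUIRED = ["findspark", "numpy", "matplotlib", "pandas"]


def missing_packages(names):
    """Return the subset of names that cannot be imported.

    Hypothetical helper for illustration only.
    """
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages can be imported.")
```

If anything is reported missing, install it with conda or pip and run the check again.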

Test Drive jupyter notebooks

After you have cloned this class's public github repository (the first step in this section), cd into the directory called Classes and then start jupyter by running the command jupyter notebook in the terminal. This should automatically launch jupyter in one of your internet browsers. Try exploring the directory. In the sub-directory 00.Background there are some useful python notebooks that introduce the pandas package. You could also try to get started on the first small homework in the sub-directory 0.MemoryLatency.

Install Spark on your computer