• This class uses Python 3, not Python 2.
  • The easiest way to install everything is to use the Docker image that we provide. If for any reason you cannot install Docker, we also provide instructions for downloading and installing the required Python distribution and associated libraries.

Setup Using Docker

The purpose of this part is to ensure that everyone has a working, compatible Python and PySpark installation. To avoid compatibility issues caused by students using versions other than the expected ones, we provide a Docker image with a barebones Ubuntu 16.04 and a clean Anaconda 4.3 installation with Python 3.6, Jupyter 5.4, and Spark 2.2. We also provide a script to run the Docker image and start Jupyter in it so that you can program in it directly.
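
One way to compare a local installation with the versions listed above is a short check like the following sketch. It assumes that "Jupyter 5.4" refers to the notebook package and that each package exposes a __version__ attribute; both are assumptions, not something the course materials specify.

```python
import importlib
import sys

# Versions shipped in the course Docker image, taken from the text above.
# Mapping "jupyter 5.4" to the "notebook" package is an assumption.
EXPECTED = {"python": "3.6", "notebook": "5.4", "pyspark": "2.2"}


def installed_version(module_name):
    """Return module.__version__, or None if the module is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def matches(found, expected):
    """True if `found` equals `expected` or starts with `expected` plus a dot."""
    if found is None:
        return False
    return found == expected or found.startswith(expected + ".")


def report():
    """Map each component to (found_version, matches_expected)."""
    found = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in ("notebook", "pyspark"):
        found[name] = installed_version(name)
    return {name: (found[name], matches(found[name], EXPECTED[name]))
            for name in EXPECTED}


if __name__ == "__main__":
    for name, (version, ok) in report().items():
        status = "ok" if ok else "differs from the course image"
        print("{}: {} ({})".format(name, version, status))
```

A mismatch does not necessarily mean the notebooks will fail, but it is the first thing to look at if your results differ from ours.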

In this guide, we provide instructions on how to install Docker and pull the Docker image. If you are not able to use Docker, you will have to install Python and PySpark manually.

- Warning! Read Carefully

  • Using the provided Docker container requires installing Docker, 5-6 GB of free space, and root access on the host machine (admin rights on Windows).
  • If you are not able to use the provided container, you can install Python and PySpark on your own. Make sure you follow the instructions given below. We will expect your results to match ours.
  • Docker containers are not intended to store data. We highly recommend you develop your solutions locally and only use Docker to compile and run. The following guides show you how to do that. Store any results locally as well. If you develop within the container, you are at risk of losing your work. You have been warned.
  • When you work within the provided container (interactively or not) you are automatically logged in as a user named jovyan. Your homework notebook is mounted in the directory /home/jovyan/work. If you delete the mounted directory containing your work it will be deleted from the host system. Make sure your work is secure at all times. We recommend you use some sort of version control such as git.

Installing Docker

Installing Docker should be straightforward for Windows and Mac OS users, who can download it from the Docker website.

Linux users will have to use this guide. The Linux guides essentially try to upgrade your system to a compatible version (for example, upgrading to Ubuntu 16.04). Be careful not to break your current system. If you are working with Linux, having an Ubuntu 16.04 system should result in an easier Docker installation.

For Windows users, Docker will require you to enable Hyper-V and restart your computer. Some Windows 10 versions do not have Hyper-V. If you face any issues with installing Docker on Windows, installing Docker Toolbox instead of Docker should be the easiest way out.

Pulling the Docker image

After successful installation of Docker, open a command prompt or shell and execute the following command (Windows users should skip the “sudo” part):

Linux:

$ sudo docker pull pupster90/cse255-18

Windows (PowerShell preferred):

$ docker pull pupster90/cse255-18

Docker should automatically start downloading and extracting the provided image. If you skip this step, the image will be downloaded automatically the first time you attempt to start it. Once it has finished, you can verify that you have the image by typing:

Linux:

$ sudo docker images

Windows:

$ docker images
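
If you would rather check for the image from a script than read the listing by eye, the following is a minimal sketch. It relies on the --format option of docker images to print one repository name per line; the function names are hypothetical, and on Linux you may need to run the script with sudo, as with the commands above.

```python
import subprocess

# Image name taken from the pull command above.
IMAGE = "pupster90/cse255-18"


def repositories(listing):
    """Parse `docker images --format '{{.Repository}}'` output into a set of names."""
    return {line.strip() for line in listing.splitlines() if line.strip()}


def image_is_pulled(listing, image=IMAGE):
    """True if `image` appears among the repository names in `listing`."""
    return image in repositories(listing)


def list_local_repositories():
    """Ask the Docker CLI for local repository names (requires Docker)."""
    result = subprocess.run(
        ["docker", "images", "--format", "{{.Repository}}"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # return text rather than bytes
        check=True,               # raise if the docker command fails
    )
    return result.stdout


if __name__ == "__main__":
    if image_is_pulled(list_local_repositories()):
        print("Found", IMAGE)
    else:
        print(IMAGE, "is not pulled yet")
```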

Students who use Docker Toolbox because their Windows version does not support Hyper-V now need to execute an additional command to identify their Docker IP address.

$ docker-machine ip

Note down the IP address that is returned as the output. You will need to use this in the next section.

Running Docker Images

Next, download the CSE255-DSE230-2018 GitHub repository to some location on your computer, for example /local/path/to/CSE255-DSE230-2018.

If you are already familiar with git, just clone the repository:

  git clone https://github.com/ucsd-edx/CSE255-DSE230-2018.git

If you are new to GitHub, follow these directions.

Then run the following line of code in your terminal (the first time might take a while).

$ docker run -it -p 8888:8888 -v /local/path/to/CSE255-DSE230-2018:/home/jovyan/work pupster90/cse255-18 /bin/bash
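
The command above combines three options: -it opens an interactive terminal, -p 8888:8888 publishes the container's port 8888 on the same port of the host, and -v mounts your local copy of the repository at /home/jovyan/work inside the container. The sketch below assembles the same command from a local path, which can help avoid mistakes in the -v argument. The function name is hypothetical, and it assumes you pass the path to your own clone.

```python
import os

# Image name and container directory taken from the command above.
IMAGE = "pupster90/cse255-18"
CONTAINER_WORKDIR = "/home/jovyan/work"


def docker_run_command(repo_path, port=8888, image=IMAGE):
    """Build the `docker run` argument list for a local repository path."""
    # Use an absolute host path for the bind mount.
    host_dir = os.path.abspath(repo_path)
    return [
        "docker", "run",
        "-it",                                              # interactive terminal
        "-p", "{0}:{0}".format(port),                       # host port : container port
        "-v", "{}:{}".format(host_dir, CONTAINER_WORKDIR),  # host dir : container dir
        image,
        "/bin/bash",                                        # start a shell in the container
    ]


if __name__ == "__main__":
    # Prints the command for the current directory; copy it into your terminal.
    print(" ".join(docker_run_command(".")))
```

Printing the command rather than running it keeps the interactive session in your own terminal. Paths that contain spaces would need quoting when pasted into a shell.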

Notice that the terminal prompt has changed: you are now inside the Docker container. Start Jupyter at http://localhost:8888 by issuing the command

$ jupyter notebook

Now you can view notebooks and work on homework at localhost:8888. Students using Docker Toolbox can access the Jupyter notebook running in their Docker container at DockerIP:8888, where DockerIP is the IP address returned in the previous section.
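
If the page does not load, it helps to find out whether anything is listening on the port at all. The following is a minimal sketch using only the Python standard library; the function names are hypothetical. Run it on the host machine, not inside the container, and pass your Docker IP as the host if you use Docker Toolbox.

```python
import socket


def jupyter_url(host="localhost", port=8888):
    """Return the address at which Jupyter should be reachable."""
    return "http://{}:{}".format(host, port)


def port_is_open(host="localhost", port=8888, timeout=2.0):
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Jupyter should be reachable at", jupyter_url())
    else:
        print("Nothing is listening at", jupyter_url())
```

An open port only shows that something accepts connections there. It does not confirm that the listener is Jupyter.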

Any changes you make are also applied to /local/path/to/CSE255-DSE230-2018 on your computer. So when it comes time to submit homework, just copy the files from there.

You can test your setup using the instructions here.

Setup From Scratch

First, install Python 3.6 using the Anaconda distribution.

Install and configure git and clone the course repository.

If you are already familiar with git, just clone the repository:

  git clone https://github.com/ucsd-edx/CSE255-DSE230-2018.git

If you are new to GitHub, follow these directions.

Install jupyter

If you install Anaconda, Jupyter and almost all the necessary packages are installed for you.

Otherwise, follow the directions for DSE200 software installation: skip the startup directions for GitHub and choose the installation directions that are right for your computer.

Install notebook extensions

This step is not required, but extensions can make your work on notebooks significantly easier.

To install a bunch of useful extensions, together with a configurator for managing these extensions, follow the directions at:

https://github.com/Jupyter-contrib/jupyter_nbextensions_configurator

Install python packages

Make sure to install the Python package findspark. Typing the following command in the terminal installs the package:

  Anaconda:
  conda install -c conda-forge findspark=1.0.0
  
  pip:
  sudo pip install findspark

If you are using pip instead of Anaconda, you must also install the following packages:

  • numpy
  • matplotlib
  • pandas

  • Some notebooks require additional packages, or packages of a later version. If an import command in a notebook fails, use pip or conda to install the missing package.
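
To see which of the packages named above still need installing, a short check such as the following can be run before opening the notebooks. This is a sketch; the function name is hypothetical, and the package list covers only the packages mentioned in this section.

```python
import importlib.util

# Packages named in this section.
REQUIRED = ["findspark", "numpy", "matplotlib", "pandas"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found for import."""
    # find_spec locates a top-level module without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with pip or conda:", ", ".join(missing))
    else:
        print("All required packages were found.")
```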

Install Spark on your computer
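
Once Spark is installed, a small smoke test can confirm that Python is able to find and start it. The sketch below uses findspark to locate Spark and then runs a trivial local job. The function name, the local[2] master setting, and the application name are illustrative choices, not course requirements.

```python
def spark_smoke_test():
    """Sum the integers 0..9 with a local Spark job.

    Returns the sum, or None if Spark could not be located or started.
    """
    sc = None
    try:
        import findspark
        findspark.init()  # locate Spark, for example via SPARK_HOME
        from pyspark import SparkContext

        # Run Spark locally with two worker threads.
        sc = SparkContext(master="local[2]", appName="setup-check")
        return sc.parallelize(range(10)).sum()
    except Exception as error:
        # Missing findspark or pyspark, Spark not found, or Java unavailable.
        print("Spark is not usable yet:", error)
        return None
    finally:
        if sc is not None:
            sc.stop()


if __name__ == "__main__":
    print("Spark result:", spark_smoke_test())
```

If the function returns None, the printed message usually indicates whether a Python package is missing or Spark itself could not be located.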

Test drive the class notebooks

  • Clone the public github repository (described above)
  • cd into the root directory of the github repository
  • Start Jupyter by running the command jupyter notebook & in the terminal. This should automatically launch Jupyter in your web browser. Explore the notebooks.
  • When you watch the videos on edX that are based on a notebook, follow along on your own copy. The video skips most of the detailed cells. We recommend you study those cells on your own.