Homework submission website

The URLs of the homework testing/submission websites will be released on Friday 04/07/2017.

Please read the Help page before testing/submission.

Self-contained applications

Homework submissions must be self-contained applications. The PySpark shell and Jupyter Notebook create a SparkContext sc for you, so you don’t have to create one there. A self-contained application, however, needs to create its own SparkContext object.
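
For illustration, a minimal self-contained application could look like the sketch below; the file name and the computation are placeholders, not part of any homework.

# minimal_app.py -- a hypothetical minimal self-contained application
from pyspark import SparkContext

# A self-contained application creates its own SparkContext
# (leave master and appName unset; see below).
sc = SparkContext()

# Placeholder computation: sum of the squares of 1..10
result = sc.parallelize(range(1, 11)).map(lambda x: x * x).sum()
print(result)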

In most cases, you can create a self-contained application by following these instructions.

(1) Start a PySpark shell with Jupyter Notebook as its front end. We provide a script in the Installation Guide for Linux/Mac OS X users. Windows users can start the PySpark shell by following the Windows installation guide.

if [[ $# -eq 0 ]] ; then
    echo 'Error: Please specify the work directory.'
    exit 1
fi

export SPARK_PATH=<PATH_TO_SPARK_DIRECTORY>
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
# Uncomment next line if the default python on your system is python3
# export PYSPARK_PYTHON=python3
cd "$1"
$SPARK_PATH/bin/pyspark --master local[2]

(2) Write your Spark application using the SparkContext object sc created by PySpark shell.
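
For example, in the notebook you can use the pre-created sc directly; the data here is just a placeholder.

# sc already exists in the PySpark shell / notebook -- do not create it here
words = sc.parallelize(["spark", "hadoop", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())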

(3) Export your application by clicking File -> Download as -> Python. Now you should have your code in a .py file.

(4) Open the .py file. Add the following code at the very beginning.

# Name: <Your Name>
# Email: <Your email>
# PID: <Your PID>
from pyspark import SparkContext
sc = SparkContext()

(5) Submit this .py file to the homework server (to be announced later).

It is important that you don’t set the master and appName parameters when you create a SparkContext. We will set them using the bin/spark-submit script during evaluation.
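
In other words, a sketch of what to do and what to avoid:

from pyspark import SparkContext

sc = SparkContext()                     # correct: master and appName left unset
# sc = SparkContext("local[2]", "hw1")  # wrong: do not hardcode master/appName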

What is SparkContext?

The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.

The `appName` parameter is a name for your application to show on the cluster UI. `master` is a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode. In practice, when running on a cluster, you will not want to hardcode `master` in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local” to run Spark in-process.

– Quote from the Spark Programming Guide.
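
As a rough sketch of the API described in the quote (for local experimentation only; as explained above, your submitted code must not set these values):

from pyspark import SparkConf, SparkContext

# Local experimentation only -- do NOT do this in your submission
conf = SparkConf().setAppName("my-local-test").setMaster("local[2]")
sc = SparkContext(conf=conf)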

If you follow the installation guide and write your code in the PySpark shell with Jupyter Notebook as its front end, you don’t have to worry about the SparkContext object yet, because the PySpark shell creates it for you.

To submit your code, however, you need to create the SparkContext object sc yourself. It is important that you don’t set the master and appName parameters when you create it; the homework submission server will set them in the submission script.

If you want to test your .py file on your computer, you can use the bin/spark-submit script:

$SPARK_PATH/bin/spark-submit --master local[2] <your_python_file>

This way, Spark knows you want to run your code locally. The program will crash if you simply run it with python <your_python_file>, because no master URL is set.

Environment

The testing environment uses Spark 1.6.1, Hadoop 2.4.0, and Python 2.6.9. Besides the Python standard library, it also has numpy, scipy, pandas, and ujson installed.

Master URL

We use bin/spark-submit to submit your programs to the Spark cluster. We will set the master URL as a parameter of bin/spark-submit, so please don’t set the master URL in your program.

Input/Output

Input

The input will be read from HDFS on the homework submission server. We will provide a URL for the input file, which should replace the local file path you used in local testing. The input URL for homework 1 is announced on Piazza. For later homeworks, we will provide it with the homework description.
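
For example, if your local test read from a local path, swap it for the input URL we announce; the path and URL below are placeholders, not real values.

# Local testing (placeholder path):
# rdd = sc.textFile("./data/input.txt")

# Submission: use the input URL announced for the homework (placeholder):
rdd = sc.textFile("<INPUT_URL_FROM_ANNOUNCEMENT>")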

Output

Please print your output to stdout. We will compare the stdout of your program to our solution and grade your homework.

You can send debug messages to stderr, and we will provide you with your program’s stderr output. Note that Spark also sends its logs to stderr, so you will see much more than what your program prints. The grading system ignores output to stderr.
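
A minimal sketch of the two streams (the values are placeholders):

import sys

result = 42                         # placeholder value
print(result)                       # graded output: goes to stdout
sys.stderr.write("debug: done\n")   # ignored by the grader: goes to stderr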