Homework submission website
The URLs of the homework testing/submission websites will be released on Friday 04/07/2017.
Please read the Help page before testing/submission.
Self-contained applications
Homework submissions must be self-contained applications. The PySpark shell and Jupyter Notebook create a SparkContext `sc` for you, so you don't have to create it. But a self-contained application needs to create its own SparkContext object.
In most cases, you can create a self-contained application by following these steps.
(1) Start a PySpark shell with Jupyter Notebook as its front end. We provided a script in the Installation Guide for Linux/Mac OS X users (reproduced below). Windows users can start the PySpark shell by following the Windows installation guide.
#!/bin/bash
# Start a PySpark shell with Jupyter Notebook as its front end.
# Usage: <this_script> <work_directory>
if [[ $# -eq 0 ]] ; then
    echo 'Error: Please specify the work directory.'
    exit 1
fi

export SPARK_PATH=<PATH_TO_SPARK_DIRECTORY>
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

# Uncomment next line if the default python on your system is python3
# export PYSPARK_PYTHON=python3

cd "$1"
$SPARK_PATH/bin/pyspark --master local[2]
(2) Write your Spark application using the SparkContext object `sc` created by the PySpark shell.
(3) Export your application by clicking File -> Download as -> Python. Now you should have your code in a .py file.
(4) Open the .py file and add the following code at the very beginning.
# Name: <Your Name>
# Email: <Your email>
# PID: <Your PID>
from pyspark import SparkContext
sc = SparkContext()
(5) Submit this .py file to the homework server (the URL will be announced later).
It is important that you don't set the `master` and `appName` parameters when you create a SparkContext. We will set them using the `bin/spark-submit` script during evaluation. A complete minimal example is sketched below.
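Putting the steps together, a minimal self-contained submission might look like the sketch below. This is only an illustration under assumed details: the word-count logic and the input file name `input.txt` are placeholders, not part of any actual homework.

# Name: <Your Name>
# Email: <Your email>
# PID: <Your PID>
from pyspark import SparkContext

# No master or appName here; bin/spark-submit supplies them during grading.
sc = SparkContext()

# Placeholder computation: count words and print the result to stdout.
lines = sc.textFile("input.txt")
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
for word, count in counts.collect():
    print("%s\t%d" % (word, count))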
What is SparkContext?
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.
The `appName` parameter is a name for your application to show on the cluster UI. `master` is a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode. In practice, when running on a cluster, you will not want to hardcode `master` in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local” to run Spark in-process.
– Quote from the Spark Programming Guide.
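For purely local experiments, a SparkConf can be built and handed to the SparkContext as in the short sketch below. The application name "local-test" and the `local[2]` master are illustrative values only; do not put them in the file you submit.

from pyspark import SparkConf, SparkContext

# Local-testing configuration only. The submitted file must call SparkContext()
# with no arguments so the grading script can set master and appName itself.
conf = SparkConf().setAppName("local-test").setMaster("local[2]")
sc = SparkContext(conf=conf)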
If you follow the installation guide and write your code in the PySpark shell with Jupyter Notebook as its front end, you don't have to worry about the SparkContext object yet, because the PySpark shell will create it for you.
To submit your code, you need to create a SparkContext object `sc`. It is important that you don't set the `master` and `appName` parameters when you create the SparkContext object. The homework submission server will set them in the submission script.
If you want to test your .py file on your computer, you can use the `bin/spark-submit` script:
$SPARK_PATH/bin/spark-submit --master local[2] <your_python_file>
In this way, Spark will know you want to run your code locally. It will crash if you simply run it with `python <your_python_file>`, because it cannot find the master server.
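For example, assuming your solution is saved as hw1.py (a placeholder file name), you can run it locally and keep your program's output separate from Spark's logs:

# stdout (what will be graded) goes to output.txt;
# stderr (Spark logs plus your debug messages) goes to log.txt
$SPARK_PATH/bin/spark-submit --master local[2] hw1.py > output.txt 2> log.txt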
Environment
The testing environment uses Spark 1.6.1, Hadoop 2.4.0, and Python 2.6.9.
Besides the Python standard library, it also has `numpy`, `scipy`, `pandas`, and `ujson` installed.
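These libraries can be imported directly in your program. As a rough sketch (the file name records.json and the "score" field are hypothetical), parsing JSON lines with ujson and averaging a value with numpy might look like:

import ujson
import numpy as np

# Assumes sc was created as shown above; each input line holds one JSON object.
records = sc.textFile("records.json").map(ujson.loads)
scores = records.map(lambda r: r["score"]).collect()
print(np.mean(scores))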
Master URL
We use `bin/spark-submit` to submit your programs to the Spark cluster. We will set the master URL as a parameter of `bin/spark-submit`, so please don't set the master URL in your program.
Input/Output
Input
The input will be read from HDFS on the homework submission server. We will provide a URL for the input file, which should be used to replace the local file path you used in local testing. The input URL for homework 1 is announced on Piazza. For later homeworks, we will provide it with the homework description.
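In code, switching from local testing to submission usually just means changing the argument of `sc.textFile`. Both paths below are placeholders, not the real homework URLs:

# Local testing: read from a file on your own machine.
rdd = sc.textFile("data/hw1-input.txt")

# Submission: replace the local path with the announced HDFS URL, e.g.
# rdd = sc.textFile("hdfs://<namenode>:<port>/path/to/hw1-input")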
Output
Please print your output to `stdout`. We will compare the `stdout` of your program to our solution and grade your homework.
You can send debug messages to `stderr`. We will provide you with the output in `stderr`.
Note that Spark will also send its logs to `stderr`, so you are going to see a lot more there than what your program prints. The grading system will ignore output to `stderr`.
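A minimal sketch of this convention, assuming Python 2 as in the grading environment (the value of result is a placeholder):

from __future__ import print_function
import sys

result = 42  # placeholder for your computed answer

# Graded output: stdout only.
print(result)

# Debug messages: stderr, which the grader ignores.
print("done computing the answer", file=sys.stderr)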