Instructions tested with Windows 10 64-bit.

It is highly recommended that you use Mac OS X or Linux for this course; these instructions are only for people who cannot run Mac OS X or Linux on their computer.

Install and Setup

Spark provides APIs in Scala, Java, Python (PySpark), and R. We use PySpark and Jupyter, previously known as IPython Notebook, as the development environment. There are many articles online about Jupyter and what a great tool it is, so we won't introduce it in detail here.

This guide assumes you already have Anaconda and Gnu On Windows installed. See https://mas-dse.github.io/startup/anaconda-windows-install/

1. Go to http://www.java.com and install Java 7+.

2. Get Spark pre-built package from the downloads page of the Spark project website.

3. Open PowerShell by pressing ⊞ Win-R, typing “powershell” in the Run dialog box, and clicking “OK”. Change your working directory to the folder where you downloaded the Spark package.

4. Type the following commands to uncompress the Spark download (gzip and tar are provided by Gnu On Windows). Alternatively, you can use any other decompression software of your preference.

> gzip -d spark-2.1.0-bin-hadoop2.7.tgz
> tar xvf spark-2.1.0-bin-hadoop2.7.tar

5. Type the following commands to move Spark to the C:\opt\spark\ directory.

> mkdir C:\opt\
> move spark-2.1.0-bin-hadoop2.7 C:\opt\spark\

6. Type the following commands to download winutils.exe for Spark. (winutils.exe is a Hadoop helper binary that Spark needs to run correctly on Windows.)

> cd C:\opt\spark\bin\
> curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe?raw=true

7. Create an environment variable with variable name SPARK_HOME and variable value C:\opt\spark. There are many guides online that describe how to set environment variables in Windows; one way to do it from PowerShell is shown below.
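For example, the built-in setx command sets a persistent user-level variable (note that you will need to open a new PowerShell window afterwards for the change to be visible):

> setx SPARK_HOME C:\opt\spark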

8. Type the following commands to create a temporary working directory for your notebooks.

> mkdir ~/Documents/jupyter-temp/
> cd ~/Documents/jupyter-temp/

9. Type the following commands to install, configure, and run Jupyter Notebook. Jupyter Notebook will launch in your default web browser.

> conda install jupyter -y
> ipython kernelspec install-self
> jupyter notebook
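Note: the first notebook cell in the next section imports the findspark package, which locates your Spark installation through the SPARK_HOME variable you set in step 7. If findspark is not already part of your Anaconda installation, you can install it with pip before launching the notebook:

> pip install findspark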

First Spark Application

In our first Spark application, we will run a Monte Carlo experiment to find an estimate for $\pi$.

Here is how we are going to do it. The figure below shows a circle with radius $r = 1$ inscribed within a 2×2 square. The area of the circle is $\pi r^2 = \pi$ and the area of the square is $4$, so the ratio between the two areas is $\frac{\pi}{4}$. If we sample enough points uniformly at random inside the square, approximately a fraction $\rho = \frac{\pi}{4}$ of them will lie inside the circle, so we can estimate $\pi$ as $4\rho$.

[Figure: a circle of radius 1 inscribed in a 2×2 square]
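If you want to convince yourself that the estimator works before involving Spark, here is a minimal pure-NumPy sketch of the same idea (the variable names are illustrative, not part of the exercise):

import numpy as np

n = 100000
points = 2.0 * np.random.random((n, 2)) - 1.0           # uniform points in the 2x2 square
inside = np.sum(np.linalg.norm(points, axis=1) <= 1.0)  # count points inside the unit circle
print(4.0 * float(inside) / n)                          # should print something close to 3.14

The Spark version below does exactly the same computation, but distributes the points across an RDD.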

1. Create a new notebook by selecting Python 2 from the New drop-down list at the right of the page.

[Screenshot: the New drop-down menu for creating a notebook]

2. First we will create the SparkContext. Copy and paste the following code into the first cell, then click the run cell button:

import os
import sys

# findspark locates the Spark installation via SPARK_HOME and adds it to sys.path
import findspark
findspark.init()

from pyspark import SparkContext

# create a local Spark context using 4 worker threads
sc = SparkContext(master="local[4]")
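To check that the context is working before moving on, you can run a trivial job (this line is just a sanity check and is not part of the exercise):

print(sc.parallelize(range(100)).sum())  # should print 4950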

3. Next, we draw a large number of points uniformly at random inside the square. Copy and paste the following code into the next cell, then click the run cell button:

import numpy as np

TOTAL = 1000000
# one million uniform random points in [-1, 1] x [-1, 1];
# cache() keeps the RDD in memory because we reuse it below
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()
print("Number of random points:", dots.count())

stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())

Output:

('Number of random points:', 1000000)
('Mean:', array([-0.0004401 , 0.00052725]))
('stdev:', array([ 0.57720696, 0.57773085]))
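As a quick sanity check on these numbers: each coordinate is uniform on $[-1, 1]$, so its mean should be close to 0 and its standard deviation should be close to $\frac{1}{\sqrt{3}} \approx 0.5774$, which matches the output above:

import math
print(1.0 / math.sqrt(3))  # 0.5773..., the stdev of a uniform [-1, 1] variable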

4. We can sample a small fraction of these points (here 1%, without replacement) and visualize them. Copy and paste the following code into the next cell, then click the run cell button:

%matplotlib inline
from operator import itemgetter
from matplotlib import pyplot as plt

plt.figure(figsize = (10, 5))

# Plot 1
plt.subplot(1, 2, 1)
plt.xlim((-1.0, 1.0))
plt.ylim((-1.0, 1.0))

sample = dots.sample(False, 0.01)  # withReplacement=False, fraction=0.01
X = sample.map(itemgetter(0)).collect()
Y = sample.map(itemgetter(1)).collect()
plt.scatter(X, Y)

# Plot 2
plt.subplot(1, 2, 2)
plt.xlim((-1.0, 1.0))
plt.ylim((-1.0, 1.0))

inCircle = lambda v: np.linalg.norm(v) <= 1.0
dotsIn = sample.filter(inCircle).cache()
dotsOut = sample.filter(lambda v: not inCircle(v)).cache()

# inside circle
Xin = dotsIn.map(itemgetter(0)).collect()
Yin = dotsIn.map(itemgetter(1)).collect()
plt.scatter(Xin, Yin, color = 'r')

# outside circle
Xout = dotsOut.map(itemgetter(0)).collect()
Yout = dotsOut.map(itemgetter(1)).collect()
plt.scatter(Xout, Yout)

Output:

<matplotlib.collections.PathCollection at 0x17a78780>

[Figure: two scatter plots of the sampled points; in the right-hand plot, points inside the circle are drawn in red]

5. Finally, let’s compute the estimated value of $\pi$. Copy and paste the following code into the next cell, then click the run cell button:

# fraction of all points that fall inside the circle, times 4
pi = 4.0 * (dots.filter(inCircle).count() / float(TOTAL))
print("The estimation of \pi is:", pi)

Output:

('The estimation of \\pi is:', 3.142204)
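How close should we expect this estimate to be? Each point lands inside the circle independently with probability $\rho = \frac{\pi}{4}$, so the number of hits is binomial and the standard error of the estimate $4 \cdot \frac{\text{hits}}{N}$ is $4\sqrt{\rho(1-\rho)/N}$. A quick back-of-the-envelope check in plain Python (no Spark needed):

import math
rho = math.pi / 4.0
se = 4.0 * math.sqrt(rho * (1.0 - rho) / 1000000.0)
print(se)  # about 0.0016, so an estimate within ~0.005 of pi is typical

The result above, 3.142204, is well within one standard error of the true value.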
