Posted on 22 Sep 2018
This post contains the steps that I followed to install Spark on my Ubuntu 18.04 (Bionic). I first tried the pre-built version available on their site, but it threw an error when I tried to play with it. After that, I tried the more serious alternatives: building with Maven, which did not work either, and with SBT, which worked after some hours of struggle.
The first thing is to download and extract the .tgz file from the official website:
wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1.tgz
tar -zxvf spark-2.3.1.tgz
cd spark-2.3.1
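Check that we have Java 8 or higher and that the JAVA_HOME environment variable is set correctly (the single-dash form -version works on Java 8 as well as on newer releases, whereas --version only exists from Java 9 on):
java -version
echo $JAVA_HOME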
Java 9 and later (Ubuntu 18.04 ships a default newer than Java 8) can cause problems with Spark 2.3, so many people install Java 8 instead. Oracle Java 8 is not in the official repositories of Ubuntu 18.04, so I added a PPA and then installed it:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt install oracle-java8-set-default
Check that both java and javac point to Oracle Java 8:
sudo update-alternatives --config java
sudo update-alternatives --config javac
And then check that the environment variable JAVA_HOME is also set to Java 8. Modify it if necessary:
echo $JAVA_HOME
export JAVA_HOME="/usr/lib/jvm/java-8-oracle/"
Add the last line to the ~/.bashrc file to make it permanent.
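The Java check above can also be scripted, which is handy before launching a long build. The following is a minimal sketch, not part of the original steps: the helper names `parse_java_major` and `installed_java_major` are my own, and it assumes the usual `version "..."` line that `java -version` prints (to stderr).

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier report "1.8.0_181"; Java 9+ report "9.0.4", "11.0.2", ...
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version` and parse whatever it prints (hypothetical helper)."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None  # no java executable on the PATH
    # `java -version` writes to stderr; stdout is included just in case.
    return parse_java_major(result.stderr + result.stdout)
```

Calling `installed_java_major()` should give 8 once the alternatives and JAVA_HOME point to Java 8; anything else suggests the shell is still picking up another JDK.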
Install pyspark, the Python interface to Spark, to write and execute PySpark scripts (Python calling Spark):
pip install pyspark
My first attempts to build Spark with Maven and with SBT did not work. Some of the compilation errors looked like they had something to do with Scala, so I installed SBT (the Scala build tool) following the instructions from their site:
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
sudo apt-get update
sudo apt-get install sbt
Finally, I could install spark with
Then I moved the complete Spark directory to the /opt/ directory:
sudo mv spark-2.3.1 /opt/spark-2.3.1
sudo ln -s /opt/spark-2.3.1 /opt/spark
With this symbolic link, we can install multiple Spark versions in the /opt/ directory, and the link (that we will use in the environment variables) will point to the active one. Finally, we set the SPARK environment variables in our ~/.bashrc file:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
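To illustrate why the symbolic link is convenient, here is a small sketch of switching the active version by repointing the link. It is only an illustration under my own assumptions: the function `activate_version` is hypothetical, and for the real /opt/ directory it would need root privileges (the `ln -s` command above does the same job by hand).

```python
import os


def activate_version(opt_dir, version_dir_name, link_name="spark"):
    """Point opt_dir/link_name at opt_dir/version_dir_name.

    Any previous link is replaced; SPARK_HOME can keep pointing at the link.
    """
    target = os.path.join(opt_dir, version_dir_name)
    link = os.path.join(opt_dir, link_name)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    # Create the new link under a temporary name, then rename it over the
    # old one, so the link never disappears in between (POSIX behaviour).
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, link)
    return os.path.realpath(link)
```

With a second Spark directory next to spark-2.3.1, calling `activate_version("/opt", "spark-2.3.1")` or the same with the other directory name flips the active installation without touching ~/.bashrc.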
After these steps, I could start playing with the quick start tutorial. I entered the pyspark shell:
And then everything seemed to work fine:
>>> textFile = spark.read.text("/opt/spark/README.md")
>>> textFile.count()
103
>>> textFile.first()
Row(value='# Apache Spark')
However, I don’t like shells; I prefer Jupyter notebooks. Let’s see how we can use pyspark from a Jupyter notebook. A first option is to install findspark so that we can find Spark from any Jupyter notebook:
pip install findspark
Then we can start our code in a Jupyter notebook as
import findspark
findspark.init()
import pyspark
A second alternative is to let Jupyter notebook be the default shell for pyspark, adding to our ~/.bashrc file:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Now we can call pyspark from the command line and write our first program in a notebook. The following code approximates $\pi$ using a Monte Carlo algorithm. We draw random points within the unit square and count how many of them fall inside the curve $x^2 + y^2 = 1$. We are working in one of the quadrants of a circle with radius 1. The area of the whole circle is, by definition, $\pi$, so the quadrant has area $\pi/4$ and the fraction of points that land inside it approximates $\pi/4$. Therefore $\pi$ can be approximated as $4 \cdot \text{count} / \text{num\_samples}$. The code is
import random
num_samples = 100000000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
# 3.14109824
Each parallel worker draws a sample and reports whether it fell inside or outside the circle. Spark counts the number of positive responses from the workers, and then we compute the proportion of the total to approximate $\pi$.
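To see the estimator without a cluster, the same idea can be run in plain Python on a single core. This is a sketch for illustration only: the function `estimate_pi` and its `seed` parameter are my own additions, not part of the Spark example, and it trades the parallel `filter(...).count()` for an ordinary loop.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Monte Carlo estimate of pi from points drawn in the unit square."""
    rng = random.Random(seed)  # private generator, so a seed gives repeatable runs
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # The point lies inside the quarter circle of radius 1.
        if x * x + y * y < 1:
            inside += 1
    # The quarter circle covers pi/4 of the square, hence the factor 4.
    return 4 * inside / num_samples


if __name__ == "__main__":
    print(estimate_pi(200_000, seed=0))
```

The error of this kind of estimate shrinks only like $1/\sqrt{N}$ (the standard error is roughly $1.64/\sqrt{N}$ for $N$ samples), which is why the Spark version throws so many samples at it and why distributing the draws across workers pays off.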
Some links I visited during the installation: