Collaborative Filter based Recommendation System using Spark on AWS EC2
In this article I’ll go over how to implement a collaborative filtering based recommendation system using Spark. We will set up an AWS EC2 instance to run our model and connect to it via SSH. Before we get started with building the recommendation system on Spark, we need to set up the environment.
Environment Setup
1. First, create an EC2 instance on AWS
2. Connect to the EC2 instance via SSH
3. Set up Spark and Jupyter Notebook on the EC2 instance
Creating an EC2 Instance:
Step 1: Choose an AMI — Select Ubuntu Server 18.04 LTS as the server image.
Step 2: Choose Instance Type — Select the free-tier t2.micro instance type.
Step 3: Configure Instance Details — Keep everything at the default. This setup is done on AWS Educate, and we want to keep charges to a minimum. For larger datasets, the number of instances can be increased for a bigger cluster.
Step 4: Add Storage — Keep everything to default
Step 5: Tag Instance — Add a Name tag to the instance and put any name under Value.
Step 6: Security Group — Set up the security group as shown below. You can restrict the source IP to your own IP for extra security, but we will also be adding a password to Jupyter.
Step 7: Review and Launch — Create a new key pair if this is the first time you’re creating an instance, or select an existing key pair.
Note: The key pair can only be downloaded once.
SSH supports two ways of authenticating:
1. Username and password
2. Key pair (SSH on AWS servers is configured by default to disallow username/password login, so a key pair must be used)
A key pair consists of a private key and a public key.
The private key stays with us and is used to log in; the public key is used to set up the AWS instance.
Using SSH to connect to EC2 Instance:
Download PuTTY from https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
We will use it to connect to our instance. Installing PuTTY gives us three applications:
1. PuTTY — used for connecting to EC2 via SSH
2. PuTTYgen — used for generating a .ppk key
3. Pageant — used for loading the key
The SSH key AWS gave us is in .pem format, which PuTTY doesn’t recognize, so use PuTTYgen to convert it into a usable format (.ppk).
Add a passphrase when saving the private key.
Before connecting to the server via SSH, load the .ppk key into Pageant using the passphrase.
Copy the public IPv4 address of the server, paste it into PuTTY as the host name, and set the username to ubuntu.
Click Open, and the EC2 instance should now be connected via SSH.
Setting up Spark and Jupyter on EC2 Instance
Tasks on EC2 Instance:
1. Setup Jupyter Notebook
2. Download and Install Spark
3. Connect with PySpark
4. Access EC2 Jupyter Notebook using local computer
Setting up Jupyter Notebook:
Once the instance is connected, start by executing sudo apt-get update
sudo apt install python3-pip
to install pip3
sudo apt-get install jupyter
to install jupyter
wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh
to download the Anaconda installer, then
bash Anaconda3-2020.11-Linux-x86_64.sh
to run the file. Then accept the terms and the default location /home/ubuntu/anaconda3
The default Python will be the one we installed with apt, but we want to use the Python from Anaconda. To check which Python is being used, run the command which python3
and it should output /usr/bin/python3
If we use ls -a
we can see the hidden file .bashrc
. We need to edit it and add our default Python path from anaconda3. To do this run sudo nano .bashrc
and add the following line: export PATH="/home/ubuntu/anaconda3/bin:$PATH"
. Then save the changes and exit the editor using CTRL + X.
Then run source .bashrc
to have the changes take effect. Now running which python3
should output the following: /home/ubuntu/anaconda3/bin/python3
Alternatively, add the path directly in the terminal: export PATH=/home/ubuntu/anaconda3/bin:$PATH
Set up a password to secure the Jupyter notebook.
In the terminal type ipython
and then type:
[1]. from IPython.lib import passwd
[2]. passwd()
Then enter a passphrase and save the hash that is output.
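For reference, a minimal sketch of the kind of hash passwd() produces: a salted digest in the form algorithm:salt:digest. The helper below only illustrates that format and is not IPython’s exact implementation (the real salt length differs):

```python
import hashlib
import random

def sha1_passwd(passphrase):
    """Illustrative salted hash in Jupyter's 'sha1:<salt>:<digest>' format."""
    salt = "%012x" % random.getrandbits(48)  # random hex salt
    digest = hashlib.sha1((passphrase + salt).encode("utf-8")).hexdigest()
    return "sha1:%s:%s" % (salt, digest)

print(sha1_passwd("my-secret-phrase"))
```

The resulting string is what gets pasted into the Jupyter configuration file later.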
Secure the Server with SSL certificate
Since our server will be open to the web, we will use OpenSSL to create an SSL certificate as an added security layer.
Start by making a new directory in home using mkdir certs, then cd into it and type the following:
sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
Then change the ownership of the file from root to ubuntu using:
sudo chown ubuntu:ubuntu mycert.pem
Configuring Jupyter Notebook
Type the following command into terminal to generate a configuration file for Jupyter notebook:
jupyter notebook --generate-config
Then cd into .jupyter
and type sudo nano jupyter_notebook_config.py
and type the following:
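A minimal configuration, assuming the certificate at /home/ubuntu/certs/mycert.pem from the earlier step and the password hash saved from ipython, would look something like this (your paths and hash will differ):

```python
# jupyter_notebook_config.py -- illustrative values; substitute your own hash
c = get_config()

c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'  # SSL certificate created earlier
c.NotebookApp.ip = '0.0.0.0'        # listen on all interfaces, not just localhost
c.NotebookApp.open_browser = False  # headless server: don't try to open a browser
c.NotebookApp.password = u'sha1:<your-hash-here>'  # hash saved from passwd()
c.NotebookApp.port = 8888
```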
Launching Jupyter notebook
The Jupyter notebook should be set up now and we can access it by typing jupyter notebook
in terminal. Copy the public DNS of your AWS EC2 instance from the AWS console. Then go to a browser and type: https://publicDNSname:8888. After entering the passphrase, you should be able to access the Jupyter notebook on AWS.
An example : https://3.95.24.115:8888
Installing PySpark
Before installing PySpark we need to install Java and Scala.
Use sudo apt install default-jdk
to install Java. You can check the installed version using java -version.
sudo apt-get install scala
to install Scala. Its version at the time of this setup was 2.11.12
wget http://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
to download Spark prebuilt for Hadoop 3.2
sudo tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz
to extract the package
pip3 install py4j
— py4j enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine.
pip3 install findspark
— a library that helps Jupyter notebooks find the PySpark path.
Set the SPARK_HOME environment variable to the Spark installation directory and update the PATH environment variable by executing the following commands:
[1]. export SPARK_HOME=/home/ubuntu/spark-3.0.0-bin-hadoop3.2
[2]. export PATH=$SPARK_HOME/bin:$PATH
[3]. export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
The third line adds the PySpark classes to the Python path. Alternatively, these lines can be added to the .bashrc file; then save, exit, and run source .bashrc
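These exports can also be mirrored from inside Python, which is roughly what findspark.init() does for us. A minimal sketch, assuming the Spark path used in this article:

```python
import os
import sys

# Spark installation directory from the steps above (adjust if yours differs)
spark_home = "/home/ubuntu/spark-3.0.0-bin-hadoop3.2"

# Equivalent of the three export lines, done from Python
os.environ["SPARK_HOME"] = spark_home
os.environ["PATH"] = os.path.join(spark_home, "bin") + os.pathsep + os.environ.get("PATH", "")
sys.path.insert(0, os.path.join(spark_home, "python"))

print(os.environ["SPARK_HOME"])
```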
Launch the Jupyter notebook again and, if no PySpark module is found, type the following to connect to PySpark:
1. import findspark
2. findspark.init('/home/ubuntu/spark-3.0.0-bin-hadoop3.2')
3. import pyspark
Note: To save all the work done so far, we can save the instance as an image (AMI). This way, even if the instance gets deleted, we can restore it quickly.
Now that the environment setup is out of the way, we can build our recommendation system in a Jupyter notebook connected via SSH.
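As a preview of what the notebook will do, the core idea behind collaborative filtering can be sketched in plain Python with a toy ratings matrix. The data and helper names below are made up for illustration; Spark’s ALS implementation replaces this at scale:

```python
import math

# Toy user-item ratings (0 = not rated); rows are users, columns are items
ratings = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
]

def cosine(u, v):
    """Cosine similarity between two rating vectors (unrated treated as 0)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(user, item):
    """Predict a rating as a similarity-weighted average over users who rated the item."""
    num = den = 0.0
    for other, row in enumerate(ratings):
        if other == user or row[item] == 0:
            continue  # skip the user themselves and non-raters
        sim = cosine(ratings[user], row)
        num += sim * row[item]
        den += abs(sim)
    return num / den if den else 0.0

# User 0 has not rated item 2; the most similar user (user 1) rated it low
print(round(predict(0, 2), 2))  # → 2.09
```

Each prediction weights other users’ ratings by how similar their taste is to the target user, which is the intuition ALS captures via learned latent factors.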