Collaborative Filter based Recommendation System using Spark on AWS EC2
In this article I’ll go over how to implement a collaborative filtering based recommendation system using Spark. We will set up an AWS EC2 instance to run our model and connect to it via SSH. Before we get started with building the recommendation system on Spark, we need to set up the environment.
Environment Setup
1. First, create an EC2 instance on AWS
2. Connect to the EC2 instance via SSH
3. Set up Spark and Jupyter Notebook on the EC2 instance
Creating an EC2 Instance:
Step 1: Choose an AMI — Select Ubuntu Server 18.04 LTS as the server image.
Step 2: Choose Instance Type — Select the free-tier t2.micro instance type.
Step 3: Configure Instance Details — Keep everything at the default. This setup is done on AWS Educate, and we want to keep charges to a minimum. For larger datasets, the number of instances can be increased for a bigger cluster.
Step 4: Add Storage — Keep everything to default
Step 5: Tag Instance — Add a Name tag to the instance and put any name under Value.
Step 6: Security Group — Set up the security group as shown below. You can restrict the source IP to your own IP for extra security, but we will also be adding a password to Jupyter.
Step 7: Review and Launch — Create a new key pair if this is the first time you’re creating an instance, or select an existing key pair.
Note: The key pair can only be downloaded once.
SSH supports two ways of authenticating:
1. Username and password
2. Key pair (SSH on AWS servers is configured by default to disallow username/password login, so a key pair must be used)
A key pair consists of a private key and a public key.
The private key stays with us and is used to log in; the public key is used to set up the AWS instance.
Using SSH to connect to EC2 Instance:
Download PuTTY from https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
We will use it to connect to our instance. Installing PuTTY gives us three applications:
1. PuTTY — used for connecting to EC2 via SSH
2. PuTTYgen — used for generating a .ppk key
3. Pageant — used for loading the key
The SSH key AWS gave us is in .pem format, which PuTTY doesn’t recognize, so use PuTTYgen to convert it into a usable format (.ppk).
Add a passphrase when saving the private key.
Before connecting to the server via SSH, load the .ppk key into Pageant using the passphrase.
Copy the public IPv4 address of the server, paste it into PuTTY as the host name, and set the username to ubuntu.
Click Open, and the EC2 instance should now be connected via SSH.
Setting up Spark and Jupyter on EC2 Instance
Tasks on EC2 Instance:
1. Setup Jupyter Notebook
2. Download and Install Spark
3. Connect with PySpark
4. Access EC2 Jupyter Notebook using local computer
Setting up Jupyter Notebook:
Once the instance is connected, start by executing sudo apt-get update
sudo apt install python3-pip
to install pip3
sudo apt-get install jupyter
to install jupyter
wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh
to download the Anaconda installer, then
bash Anaconda3-2020.11-Linux-x86_64.sh
to run the file. Then accept the terms and the default location /home/ubuntu/anaconda3
The default Python will be the one we installed with apt, but we want to use the Python from Anaconda. To check which Python is being used, run the command which python3
and it should output /usr/bin/python3
If we use ls -a
we can see the hidden file .bashrc
. We need to edit it and add our default Python path from anaconda3. To do this run sudo nano .bashrc
and add the following line: export PATH="/home/ubuntu/anaconda3/bin:$PATH"
. Then save the changes and exit the editor using CTRL + X.
Then run source .bashrc
to have the changes take effect. Now running which python3
should output the following: /home/ubuntu/anaconda3/bin/python3
Alternatively, add the path directly in the terminal: export PATH=/home/ubuntu/anaconda3/bin:$PATH
Set up a password to secure the Jupyter notebook.
In the terminal type ipython
and then type:
[1]. from IPython.lib import passwd
[2]. passwd()
Then enter a passphrase and save the hash that is output.
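For reference, a minimal sketch of the kind of hash passwd() produces: a salted digest in the form algorithm:salt:digest. The helper below only illustrates that format and is not IPython’s exact implementation (the real salt length differs):

```python
import hashlib
import random

def sha1_passwd(passphrase):
    """Illustrative salted hash in Jupyter's 'sha1:<salt>:<digest>' format."""
    salt = "%012x" % random.getrandbits(48)  # random hex salt
    digest = hashlib.sha1((passphrase + salt).encode("utf-8")).hexdigest()
    return "sha1:%s:%s" % (salt, digest)

print(sha1_passwd("my-secret-phrase"))
```

The resulting string is what gets pasted into the Jupyter configuration file later.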
Secure the Server with SSL certificate
Since our server will be open to the web, we will use OpenSSL to create an SSL certificate as an added security layer.
Start by making a new directory in home using mkdir certs, then cd into it and type the following:
sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
Then change the ownership of the file from root to ubuntu using:
sudo chown ubuntu:ubuntu mycert.pem
Configuring Jupyter Notebook
Type the following command into terminal to generate a configuration file for Jupyter notebook:
jupyter notebook --generate-config
Then cd into .jupyter
and type sudo nano jupyter_notebook_config.py
and type the following:
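A minimal configuration, assuming the certificate at /home/ubuntu/certs/mycert.pem from the earlier step and the password hash saved from ipython, would look something like this (your paths and hash will differ):

```python
# jupyter_notebook_config.py -- illustrative values; substitute your own hash
c = get_config()

c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'  # SSL certificate created earlier
c.NotebookApp.ip = '0.0.0.0'        # listen on all interfaces, not just localhost
c.NotebookApp.open_browser = False  # headless server: don't try to open a browser
c.NotebookApp.password = u'sha1:<your-hash-here>'  # hash saved from passwd()
c.NotebookApp.port = 8888
```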
Launching Jupyter notebook
The Jupyter notebook should be set up now and we can access it by typing jupyter notebook
in terminal. Copy the public DNS of your AWS EC2 instance from the AWS console. Then go to a browser and type: https://publicDNSname:8888. After entering the passphrase, you should be able to access the Jupyter notebook on AWS.
An example : https://3.95.24.115:8888
Installing PySpark
Before installing PySpark we need to install Java and Scala.
Use sudo apt install default-jdk
to install Java. You can check the installed version using java -version.
sudo apt-get install scala
to install Scala. Its version at the time of this setup was 2.11.12
wget http://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
to download Spark prebuilt for Hadoop 3.2
sudo tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz
to extract the package
pip3 install py4j
— py4j enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine.
pip3 install findspark
— a library that helps Jupyter notebooks find the PySpark path.
Set the SPARK_HOME environment variable to the Spark installation directory and update the PATH environment variable by executing the following commands:
[1]. export SPARK_HOME=/home/ubuntu/spark-3.0.0-bin-hadoop3.2
[2]. export PATH=$SPARK_HOME/bin:$PATH
[3]. export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
The third line adds the PySpark classes to the Python path. Alternatively, these lines can be added to the .bashrc file; then save, exit, and run source .bashrc
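These exports can also be mirrored from inside Python, which is roughly what findspark.init() does for us. A minimal sketch, assuming the Spark path used in this article:

```python
import os
import sys

# Spark installation directory from the steps above (adjust if yours differs)
spark_home = "/home/ubuntu/spark-3.0.0-bin-hadoop3.2"

# Equivalent of the three export lines, done from Python
os.environ["SPARK_HOME"] = spark_home
os.environ["PATH"] = os.path.join(spark_home, "bin") + os.pathsep + os.environ.get("PATH", "")
sys.path.insert(0, os.path.join(spark_home, "python"))

print(os.environ["SPARK_HOME"])
```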
Launch the Jupyter notebook again and, if no PySpark module is found, type the following to connect to PySpark:
1. import findspark
2. findspark.init('/home/ubuntu/spark-3.0.0-bin-hadoop3.2')
3. import pyspark
Note: To save all the work done so far, we can save the instance as an image (AMI). This way, even if the instance gets deleted, we can restore it quickly.
Now that the environment setup is out of the way, we can build our recommendation system in a Jupyter notebook connected via SSH.
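As a preview of what the notebook will do, the core idea behind collaborative filtering can be sketched in plain Python with a toy ratings matrix. The data and helper names below are made up for illustration; Spark’s ALS implementation replaces this at scale:

```python
import math

# Toy user-item ratings (0 = not rated); rows are users, columns are items
ratings = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
]

def cosine(u, v):
    """Cosine similarity between two rating vectors (unrated treated as 0)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(user, item):
    """Predict a rating as a similarity-weighted average over users who rated the item."""
    num = den = 0.0
    for other, row in enumerate(ratings):
        if other == user or row[item] == 0:
            continue  # skip the user themselves and non-raters
        sim = cosine(ratings[user], row)
        num += sim * row[item]
        den += abs(sim)
    return num / den if den else 0.0

# User 0 has not rated item 2; the most similar user (user 1) rated it low
print(round(predict(0, 2), 2))  # → 2.09
```

Each prediction weights other users’ ratings by how similar their taste is to the target user, which is the intuition ALS captures via learned latent factors.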