Background

A while ago, I used CoCalc to complete the Machine Learning in Python course labs with my teammates. Its real-time collaboration is fantastic. However, its modified Jupyter kernel sometimes acts weird, forcing me to fall back to its Plain Jupyter Server mode, which is just a normal Jupyter server without collaboration. Besides, it lacks GPU acceleration, and it doubled its Standard Plan price after my first billing cycle. Therefore, I sought an alternative and found Google Colab, which provides free GPU acceleration and a modern Material Design user interface.

Hello CoLab

Colab is very easy to use, and it stores the .ipynb files in Google Drive, so we can easily share them with friends and sync between multiple devices. Its mechanism is very similar to Google Docs. It provides a decent CPU quota, 12 GB of RAM, and a sufficient GPU quota for a small project. Since it is totally free, it is an excellent product for beginner data scientists!

GPU Playground

As the complexity of a neural network grows, Colab's hosted backend takes an intolerable amount of time for training, especially over many epochs. Moreover, its 12 GB of RAM limits dataset transformations, particularly when using a convolutional neural network.

Fortunately, Professor Dr. Larson approved my access request to SMU ManeFrame II, an HPC cluster that includes 36 accelerator nodes with NVIDIA P100 GPUs, each node with 256 GB of DDR4-2400 RAM. The hardware is blazing fast for this kind of computation. It is based on CentOS 7.x and uses Slurm to manage user workloads.

SMU ManeFrame II

Although we can use port forwarding to access the HPC's plain Jupyter server from our local machine, it would be fascinating if we could make the HPC work together with Colab! Thankfully, the Google Colab team added a local runtime support feature, and we can use the jupyter_http_over_ws plugin they published to make it work.
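The extension has to be installed and enabled in the server-side Python environment before Colab can connect. A minimal sketch, using the two commands from the Colab local runtime documentation (run them inside the conda environment we set up in the walkthrough below):

```shell
# Install the WebSocket bridge that Colab's local runtime feature relies on.
pip install jupyter_http_over_ws
# Register it as a notebook server extension so Jupyter loads it on start.
jupyter serverextension enable --py jupyter_http_over_ws
```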

Walkthrough

I will use SMU ManeFrame II as the example, but this procedure should theoretically work on any other HPC cluster that uses Slurm. You can also skip the HPC part and use it on your AWS GPU EC2 instance, etc.

Conda Environment

First of all, we need to create a Conda virtual environment so we can easily maintain our library versions and install packages. On ManeFrame II, the system GPU conda environment is located at /hpc/applications/anaconda/3/envs/tensorflow_gpu. Therefore we can simply clone this environment into our home directory and use the -n parameter to name it tf; you can change the name to whatever you like.

conda create -n tf --clone="/hpc/applications/anaconda/3/envs/tensorflow_gpu"

After setting up the conda env, you can update Python to the latest version with:

conda update -n tf python

You can update the TensorFlow or Keras version the same way. And if you want to install a new package like OpenCV, use:

conda install -y -c conda-forge -n tf opencv

Since Jupyter is running on the server side, you may want to use a password instead of a token for convenience:

jupyter notebook --generate-config
jupyter notebook password

The two commands above will create the Jupyter folder if necessary, create the notebook configuration file, then prompt you for a password and record its hash in your jupyter_notebook_config.json.

Request GPU Node

Then we can write an sbatch file to ask Slurm to request a node for us:

#!/bin/bash
#SBATCH -J jupyter            # job name
#SBATCH -o tf_jupyter_%j.out  # stdout log (%j expands to the job id)
#SBATCH -p gpgpu-1            # GPU partition
#SBATCH --gres=gpu:1          # request one GPU
#SBATCH --mem=250G
#SBATCH --exclusive           # take the whole node

module purge
module load tensorflow
source activate tf

# Login node used for the reverse tunnel; adjust to your cluster.
login_node="login05.m2.smu.edu"

unset XDG_RUNTIME_DIR
port=60099
# Allow Colab's origin so the local-runtime handshake succeeds.
jupyter notebook --no-browser --NotebookApp.allow_origin='https://colab.research.google.com' --port=${port} &
sleep 30s
# Reverse-forward the Jupyter port from the compute node back to the login node.
ssh -N -R ${port}:localhost:${port} ${login_node} &
wait

I saved the sbatch file at ~/script/jupyter_gpgpu.sbatch; you can change the login_node or port number to fit your needs. After that, submit the request with the sbatch command:
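The port number is arbitrary as long as it is free on both the compute node and the login node. If 60099 happens to be taken, one sketch (my own assumption, not part of the original script) is to derive a high port from your numeric user id so that users rarely collide:

```shell
# Derive a port in the 50000-59999 range from the numeric user id,
# so each user tends to get a distinct, unprivileged port.
port=$((50000 + $(id -u) % 10000))
echo "Using port ${port}"
```

You would then use this `port` value consistently in the sbatch file and in the SSH forwarding command later on.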

sbatch ~/script/jupyter_gpgpu.sbatch

The request now sits in the Slurm queue waiting for resources; you can check its status with the following command:

squeue -u `whoami`
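Once the job is running, squeue shows its job id and the node it landed on. Since the sbatch file requests the node with --exclusive, it is worth releasing it when you finish; scancel is the standard Slurm command for that (the job id below is a placeholder):

```shell
# The JOBID column of squeue is what scancel needs.
squeue -u $(whoami)
# Release the GPU node when you are done (123456 is a placeholder job id).
scancel 123456
```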

Connect to CoLab

After Slurm allocates the resources, we can simply use SSH to forward the remote ManeFrame II Jupyter port to our local machine. In this case, since we set the login node to login05 and the port to 60099, we can use this command on our local machine:

ssh -L 60099:localhost:60099 your_username@login05.m2.smu.edu
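Before switching to the browser, you can verify that the tunnel works; a quick sketch (assuming the 60099 port from above) is to ask Jupyter's login page for its HTTP status line:

```shell
# A 200 or 302 status line here means the forwarded Jupyter server is reachable.
curl -sI http://localhost:60099/login | head -n 1
```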

Then open http://localhost:60099 in your web browser and log in with the password you set, so that Jupyter writes the credential cookie. After that, we can open Colab, click the down-arrow icon in the CONNECT button, select Connect to local runtime, and input our port 60099 in the prompted modal:

Input Port

Click the CONNECT button; if everything goes well, it will connect to the HPC runtime, and you are all set.

CoLab X SMU ManeFrame II

Colab will work like a charm, and you can enjoy your training!